RWKV-papers | EleutherAI | Page 2

last mauve May 17, 2023, 4:17 PM

#

Yes or no: Are you "Xiangru Tang" in overleaf?

zealous snow May 17, 2023, 4:17 PM

#

yes

last mauve May 17, 2023, 4:18 PM

#

zealous snow yes

ok. Please lemme finish setting up before you edit

tropic minnow May 17, 2023, 4:47 PM

#

added points in contributions section for that and rephrased/shortened mine.

#

created a motivation section after intro with some bullet points. feel free to edit

last mauve May 17, 2023, 4:51 PM

#

Thanks @tropic minnow! I'll take a look at these

#

I just added an author list. Lemme know if I missed anyone.

It's formatted kinda ugly rn. If @zealous snow or anyone wants to take a crack at making it less ugly, feel free.

pale nexus May 17, 2023, 4:52 PM

#

tropic minnow created a `motivation` section after intro with some bullet points. feel free to...

could be interesting to say that RWKV has linear attention without any approximation, unlike Linformer and co.?

tropic minnow May 17, 2023, 4:57 PM

#

pale nexus could be interesting to say that RWKV has linear attention without any approxima...

see this (draft) sentence in 2.motivation: => Address quadratic cost of attention by a reformulation to get "scalar attention" with linear cost

pale nexus May 17, 2023, 4:58 PM

#

maybe add "with no approximation involved". I think this is important because I believe that when you scale your model, approximations start to matter a lot

#

while here, if there is no approximation, scaling shouldn't be a "problem" (at least you are not limited by your attention calculation)

zealous snow May 17, 2023, 4:59 PM

#

last mauve I just added an author list. Lemme know if I missed anyone. It's formatted kin...

thanks, let me try to make it more ugly, or hopefully chatgpt could help me make it less

tropic minnow May 17, 2023, 5:01 PM

#

pale nexus maybe add "with no approximation involved". I think this is important because I ...

good point. you're free to convert the bullet points into text as you see fit; we can always revise/discuss later. i formulated the current one this way bc i didn't want to give the impression we're computing the QK attention, but our own variant of "scalar attn". So imo it's not an approximation, but it's not the transformer formulation either.
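A minimal sketch of the "no approximation" point discussed above, assuming a single channel and a WKV-style weighted average with a decay `w` and a current-token bonus `u`. The function names and the scalar setup are illustrative assumptions, not code from the draft or from RWKV-LM. Both functions compute the same quantity: one re-sums the whole history at every step, the other carries a running numerator and denominator.

```python
import math

def wkv_quadratic(k, v, w, u):
    """Direct evaluation: every step re-sums over all earlier tokens, O(T^2)."""
    out = []
    for t in range(len(k)):
        # The current token gets the bonus u instead of a decay factor.
        num = math.exp(u + k[t]) * v[t]
        den = math.exp(u + k[t])
        for i in range(t):
            weight = math.exp(-(t - 1 - i) * w + k[i])
            num += weight * v[i]
            den += weight
        out.append(num / den)
    return out

def wkv_linear(k, v, w, u):
    """Same quantity from a running numerator a and denominator b, O(T)."""
    a = 0.0  # decayed sum of exp(k_i) * v_i over past tokens
    b = 0.0  # decayed sum of exp(k_i) over past tokens
    decay = math.exp(-w)
    out = []
    for kt, vt in zip(k, v):
        cur = math.exp(u + kt)
        out.append((a + cur * vt) / (b + cur))
        # Fold the current token into the state for the next step.
        a = decay * a + math.exp(kt) * vt
        b = decay * b + math.exp(kt)
    return out
```

The two agree up to floating-point rounding, so the linear-cost form is a rearrangement of the same sum rather than an approximation of it. It is still not the QK softmax attention of a Transformer, which is the distinction drawn in the message above.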

neon night May 17, 2023, 5:36 PM

#

Some long author list papers from EMNLP https://arxiv.org/pdf/2109.04650.pdf https://arxiv.org/pdf/2104.08200.pdf

spiral minnow May 17, 2023, 5:40 PM

#

Shouldn't we also have author affiliations?

outer vine May 17, 2023, 5:45 PM

#

is there anyone working on in-context learning examples?

pale nexus May 17, 2023, 5:46 PM

#

i think @rustic rivet does

rustic rivet May 17, 2023, 5:48 PM

#

outer vine is there anyone working on in-context learning examples?

I have a short gist demo showing how rwkv can read a paragraph and store the state variable, then you can ask a lot of questions by utilizing the state

#

But this might not be a formal in-context learning example, more like showing off the statefulness of this LM

#

https://gist.github.com/jiamingkong/41d0bcf1f52be104a335bd0fa288c407

Gist

Proof of Concept for State Caching in RWKV

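A toy sketch of the state-caching idea described above, assuming nothing about the linked gist's actual code. `step`, `run`, and `StateCache` are hypothetical names, and the update rule is a stand-in for an RNN cell, not RWKV's real update. The point it illustrates: once a context has been folded into a fixed-size state, any number of questions can resume from that state without re-reading the context.

```python
def step(state, token):
    """Toy recurrent update: fold one token into a fixed-size state."""
    # Hypothetical stand-in for an RNN cell; not RWKV's actual update.
    h, count = state
    h = (h * 31 + sum(ord(c) for c in token)) % 1_000_003
    return (h, count + 1)

def run(tokens, state=(0, 0)):
    """Feed tokens one by one, starting from the given state."""
    for tok in tokens:
        state = step(state, tok)
    return state

class StateCache:
    """Keep the state reached after a shared context, keyed by name."""

    def __init__(self):
        self._states = {}

    def prime(self, name, context_tokens):
        # Read the context once and remember where the model ended up.
        self._states[name] = run(context_tokens)

    def resume(self, name, question_tokens):
        # Continue from the cached state; the context is not re-processed.
        return run(question_tokens, state=self._states[name])
```

Because the state here is an immutable tuple, resuming never disturbs the cached copy; with tensor states one would clone before continuing.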

outer vine May 17, 2023, 5:53 PM

#

i am not quite sure what kind of ICL examples we need in the paper

rustic rivet May 17, 2023, 5:53 PM

#

Microsoft published a paper on why ICL works; they believe the attention mechanism ends up mimicking a meta-optimizer. By tweaking the attention further, they verified the conjecture to some extent

#

But for us there is no attention, yet we can still do few-shot prompting and then save the state, as shown in the gist

outer vine May 17, 2023, 5:54 PM

#

is this something we want to put in the paper? I remember @obsidian quest has mentioned this, not sure if any follow-ups

rustic rivet May 17, 2023, 5:56 PM

#

rustic rivet Microsoft published a paper on why ICL work and they believe it's the attention ...

https://arxiv.org/abs/2212.10559

arXiv.org

Why Can GPT Learn In-Context? Language Models Implicitly Perform Gr...

Large pretrained language models have shown surprising in-context learning
(ICL) ability. With a few demonstration input-label pairs, they can predict the
label for an unseen input without parameter updates. Despite the great success
in performance, its working mechanism still remains an open question. In this
paper, we explain language models a...

#

I am also trying to visualize these inside the state variable, to uncover how time-mix and channel-mix meta-optimize the model, too. However, this feels more like a follow-up blog post than something for the current Overleaf manuscript, which is meant as a major introduction

#

This is as far as I go now

outer vine May 17, 2023, 6:00 PM

#

this is interesting if you could link these two together, meta optimizer in RWKV

rustic rivet May 17, 2023, 6:01 PM

#

Considering the whole meta thingy came from next token prediction training it's really fascinating

outer vine May 17, 2023, 6:02 PM

#

yes, this paper has been accepted to Findings of ACL 2023

#

and this is also a concurrent work, https://arxiv.org/abs/2212.07677

arXiv.org

Transformers learn in-context by gradient descent

Transformers have become the state-of-the-art neural network architecture
across numerous domains of machine learning. This is partly due to their
celebrated ability to transfer and to learn in-context based on few examples.
Nevertheless, the mechanisms by which Transformers become in-context learners
are not well understood and remain mostly an...

tough crane May 17, 2023, 6:09 PM

#

By the way, there seems to be a comment showing up in the compiled PDF as blue-colored text

last mauve May 17, 2023, 6:12 PM

#

spiral minnow Shouldn't we also have author affiliations?

Yes

#

I just wanted to get an author list up so people could add themselves and point out issues before we're deadline-constrained

spiral minnow May 17, 2023, 6:17 PM

#

last mauve I just wanted to get an author list up so people could add themselves and point ...

Okay, maybe we can have folks add their affiliation to the contributions list then and we can add it so it looks nice later

last mauve May 17, 2023, 6:17 PM

#

spiral minnow Okay, maybe we can have folks add their affiliation to the contributions list th...

Yes. Please do this, all

outer vine May 17, 2023, 6:21 PM

#

What is your opinion about ICL examples? @last mauve @spiral minnow

chilly niche May 17, 2023, 6:22 PM

#

Is there anything I can help out with? Y'all need some SuperGLUE fine-tuning experiments? :p

spiral minnow May 17, 2023, 6:24 PM

#

outer vine What is your opinion about ICL examples? @last mauve @spiral minnow

Sorry if I'm a bit out of the loop, haven't been in the office for a few days. What exactly is the question here? Whether to include some case studies showing the ability of RWKV to do ICL?

outer vine May 17, 2023, 6:25 PM

#

yes, as required here

#

#1103039376184852622 message

#

here

spiral minnow May 17, 2023, 6:32 PM

#

Yes, I do think some examples of the model output would be very beneficial to the paper. It currently has a lot of quantitative analysis but is lacking qualitative analysis

#

Do we have some example outputs from LAMBADA? It looks like the paper is very nearly full at the moment, so maybe we can add a bunch of example outputs in the appendix, but just highlight 1-2 of them in the main paper.
It would be really good to have some example continuations that demonstrate the key qualities of this model: fluent and coherent text continuations that maintain quality over long contexts

outer vine May 17, 2023, 6:38 PM

#

in that case, maybe this could go beyond ICL examples. I will try to find something, and I would put it in the appendix first.

#

BTW, do we have a RWKV icon now?

pale nexus May 17, 2023, 6:49 PM

#

blinkdl profile picture lol

young sparrow May 17, 2023, 6:51 PM

#

The kitty cat?

#

burnt gulch May 17, 2023, 6:53 PM

#

outer vine BTW, do we have a RWKV icon now?

yeah we do, it's a raven, it's used in the huggingface integration

#

https://twitter.com/huggingface/status/1658054038879870977

Hugging Face (@huggingface)

The first RNN in transformers! 🤯
Announcing the integration of RWKV models in transformers with @BlinkDL_AI and RWKV community!
RWKV is an attention free model that combines the best from RNNs and transformers.
Learn more about the model in this blogpost: https://t.co/0FQmsaRVZw


outer vine May 17, 2023, 6:58 PM

#

cool, do you have the original image? maybe we could put it in the showcase

burnt gulch May 17, 2023, 6:59 PM

#

@outer vine

tender karma May 17, 2023, 7:05 PM

#

spiral minnow Okay, maybe we can have folks add their affiliation to the contributions list th...

Would you pls provide an example so we can follow a consistent format?

last mauve May 17, 2023, 7:07 PM

#

tender karma Would you pls provide an example so we can follow a consistent format?

Affiliation: EleutherAI

spiral minnow May 17, 2023, 7:08 PM

#

For me it's University of California, Santa Barbara

last mauve May 17, 2023, 7:08 PM

#

outer vine What is your opinion about ICL examples? @last mauve @spiral minnow

I agree with @spiral minnow here

spiral minnow May 17, 2023, 7:09 PM

#

tender karma Would you pls provide an example so we can follow a consistent format?

If you aren't part of an academic institution, you can use your company (if they agree to it). Or if you have no institution, maybe we can ask @young sparrow if it's okay to use the EleutherAI affiliation

young sparrow May 17, 2023, 7:09 PM

#

spiral minnow If you aren't part of an academic institution, you can use your company (if they...

I think the request was more about the general contributions statement than how to word an affiliation
Everyone is welcome to use an EleutherAI affiliation if they wish to

spiral minnow May 17, 2023, 8:27 PM

#

It looks like we're using multiple phrases to refer to the attention used in this work (scalar attention and linear attention). I think it would be a good idea to concentrate on only one of those terms to not confuse readers. I'm not sure why it's referred to as scalar attention though, as far as I can tell it's actually a vector?

last mauve May 17, 2023, 8:49 PM

#

chilly niche Is there anything I can help out with? Y'all need some SuperGLUE fine-tuning exp...

There's nothing I can think of at the moment. We're more focusing on tightening up the storyline for now.

There'll definitely be some followup papers though, so check in after this goes to EMNLP.

tropic minnow May 17, 2023, 8:55 PM

#

does anyone know the author of the last 2 sentences in 5.7 Context? overleaf username: kinetical

tropic minnow May 17, 2023, 9:29 PM

#

"In the context of LLM applications, injecting the context into the model is equivalent to prompt engineering or p-Tuning (Liu et al., 2022). This feature enables one copy of RWKV to serve multiple domains or purposes with an implementation of state cache, minimizing computation overhead." Essentially these lines.

tender karma May 17, 2023, 9:42 PM

#

@tropic minnow I reviewed 5.7 Context, adding a reference to the Appendix for details and clarifying the concept in that sentence

broken moth May 17, 2023, 10:15 PM

#

I don't understand the current Table 2. Actually, there are two tables with the same "tab:model_flop_count" label. Is it just a placeholder for the inference results?

mortal latch May 18, 2023, 1:47 AM

#

broken moth I don't understand the current Table 2. Actually, there are two tables with the ...

Same.

#

I made a pass over the article. It seems that the introduction and the motivation overlap quite a bit. Maybe we should merge them into a more concise section?

neon night May 18, 2023, 3:21 AM

#

I just realized token shift is not exactly a residual connection, but more like the structure of causal convolution in WaveNet 🤯

gusty condor May 18, 2023, 3:50 AM

#

A minor typo: should be "LAMBADA" not "LAMGDA"

#

neon night May 18, 2023, 4:35 AM

#

A revised section 5.6 is available.

neon night May 18, 2023, 5:39 AM

#

I think my part of the work is done. I prefer to use an EleutherAI affiliation. 😁

fickle hare May 18, 2023, 5:40 AM

#

broken moth I don't understand the current Table 2. Actually, there are two tables with the ...

+1. It's unclear what the numbers in the first row represent.

#

besides, the caption of Figure 5 lacks information on what kind of test the figure is representing.

tender karma May 18, 2023, 5:46 AM

#

neon night A revised section 5.6 is available.

I like it; it is more robust for the paper. Still, I think we can keep some soft statement like "% Intuitively, by assigning each token the dual tasks of (1) aggregating all previous information and (2) predicting the next token, shifted channels can focus on the former task, enhancing information propagation." or so

neon night May 18, 2023, 5:59 AM

#

tender karma I like it it is more robust for the paper. Still, I think we can maintain some s...

I guess this is for the old version of token shift that replaces half of the channels with those of the previous token. 🤔

tough crane May 18, 2023, 5:59 AM

#

gusty condor

Fixed.

neon night May 18, 2023, 6:02 AM

#

neon night I just realized token shift is not exactly a residual connection, but more like ...

I think even in the old version, token shift cannot "aggregate all previous information" in a single layer. It relies on multiple layers to do so. Like WaveNet.
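A small sketch of the WaveNet comparison above, assuming token shift is a per-position mix of the current and previous token with a mixing weight `mu` (a scalar here for illustration; `token_shift` and `receptive_reach` are hypothetical names, not functions from the RWKV code). It pushes an impulse through stacked shifts and reports how far it travels.

```python
def token_shift(x, mu):
    """Mix each position with its predecessor: a causal convolution of width 2."""
    # x is a list of per-position scalars; position 0 sees a zero "previous" token.
    return [
        mu * x[t] + (1.0 - mu) * (x[t - 1] if t > 0 else 0.0)
        for t in range(len(x))
    ]

def receptive_reach(num_layers, length=8, mu=0.5):
    """Apply stacked shifts to an impulse at position 0; return the last position it reaches."""
    x = [1.0] + [0.0] * (length - 1)
    for _ in range(num_layers):
        x = token_shift(x, mu)
    return max(t for t, val in enumerate(x) if val != 0.0)
```

Each layer extends the reach by exactly one position in this toy setup, so a single shift only sees the previous token and it takes a stack of layers to look further back.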

tender karma May 18, 2023, 6:13 AM

#

neon night I think even in the old version, token shift cannot "aggregate all previous info...

Fair enough. Agreed 👍🏼

#

I propose cutting section 8. To make it effective we would have to insert comparison graphs for each task in the experiments, and that would not bring significant value in the end. I still like the concept of that section, however.

gusty condor May 18, 2023, 6:23 AM

#

Also, excuse me, but I think that the description of LAMBADA is not accurate enough. AFAIK, there is no "set of candidate words" or anything like that. LAMBADA is an open cloze task where one needs to guess the last word of the target sentence from context, without being given any choices.
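To make the open-cloze point concrete, here is a hedged sketch of last-word scoring. `split_cloze`, `last_word_accuracy`, and the `predict` callable are hypothetical names for illustration, not the evaluation harness used for the paper; whitespace splitting is also a simplification. Note that `predict` receives only the context and no candidate list.

```python
def split_cloze(passage):
    """Split a passage into (context, target), where target is the final word."""
    context, _, target = passage.rstrip().rpartition(" ")
    return context, target

def last_word_accuracy(passages, predict):
    """Fraction of passages whose final word is reproduced exactly by predict(context)."""
    hits = 0
    for passage in passages:
        context, target = split_cloze(passage)
        # The model sees only the context; no choices are offered.
        if predict(context) == target:
            hits += 1
    return hits / len(passages)
```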

tropic minnow May 18, 2023, 7:12 AM

#

broken moth I don't understand the current Table 2. Actually, there are two tables with the ...

One of the tables there is indeed a placeholder for inference results. On the phone now so can't check the numbers. Will be updating it today

fickle hare May 18, 2023, 7:20 AM

#

Is Section 8 unfinished?

tender karma May 18, 2023, 7:23 AM

#

fickle hare Is Section 8 unfinished?

If by section 8 you mean Fundamental Experiments, yes, it is unfinished, as it would take a lot of space to insert graphs comparing to LSTM and GRU without creating significant benefit. I commented it out.

#

Please all, in the Author Contributions use labels and not explicit numbers 🙃

fickle hare May 18, 2023, 7:25 AM

#

#

I mean this one

tender karma May 18, 2023, 7:25 AM

#

fickle hare

ah ok, the new 8 🙂

fickle hare May 18, 2023, 7:29 AM

#

@uneven blade would you mind adding a causal trace for the same example using some transformer model, to provide a comparison against the transformer about the information propagation?

#

And is it LAMBADA or LAMBDA? It's renamed to LAMBDA throughout everywhere now, even including the file name acc_lambda.png

pale nexus May 18, 2023, 7:34 AM

#

lambada

fickle hare May 18, 2023, 7:36 AM

#

LAMBDA without A occurs in Section 6, Figure 4 caption, Appendix H, and several labels and file names

tough crane May 18, 2023, 7:39 AM

#

fickle hare LAMBDA without A occurs in Section 6, Figure 4 caption, Appendix H, and several ...

I replaced them with LAMBADA.

#

Why are section 2 Motivation and section 1 Introduction separated?

tropic minnow May 18, 2023, 8:24 AM

#

fickle hare

yes it is. it needs a plot and a reference to an appendix where a table will capture the numbers

neon night May 18, 2023, 8:41 AM

#

fickle hare @uneven blade would you mind adding a causal trace for the same example ...

I'll DM @uneven blade the plotting script I use.

fickle hare May 18, 2023, 8:52 AM

#

The Scaling Laws figure (currently Figure 6) looks lossy. Could someone replot the three plots as svg/pdf?

neon night May 18, 2023, 8:54 AM

#

I think it is better to use the same color scheme as the referenced paper.

fickle hare May 18, 2023, 8:55 AM

#

maybe gather the plotting script and redo all the plots

broken moth May 18, 2023, 10:00 AM

#

tough crane Why are section 2 Motivation and section 1 Introduction separated?

I am also in favor of combining the Introduction with Motivation. I can take care of this if you think it makes sense. We will save a lot of space.

tropic minnow May 18, 2023, 10:01 AM

#

broken moth I am also in favor of combining the Introduction with Motivation. I can take car...

ask @last mauve

obsidian quest May 18, 2023, 10:10 AM

#

neon night I just realized token shift is not exactly a residual connection, but more like ...

it's like a tiny convolution

neon night May 18, 2023, 10:15 AM

#

You can also call it temporal residual connection, I've searched this term and some video AI papers do use this concept.

obsidian quest May 18, 2023, 10:21 AM

#

gusty condor

all RWKV models are trained with ctx1024 by default, and then some of them are finetuned to longer ctxlens

Note longer ctxlen usually slightly hurts (!) these benchmark tasks because they only care abt short ctxlens

#

Note long ctx models have seen more tokens (1+ epoch)

| model | params (B) | LAMBADA ppl | AVERAGE | LAMBADA acc | PIQA | StoryCloze16 | Hellaswag | WinoGrande | arc_challenge | arc_easy | headQA | openbookQA | sciq | triviaQA | ReCoRD | COPA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RWKV-4,ctx1k | 3 | 5.24 | 57.52% | 63.94% | 73.72% | 70.28% | 59.63% | 59.43% | 31.83% | 64.27% | 28.74% | 37.60% | 85.70% | 11.07% | 80.56% | 81.00% |
| RWKV-4,ctx4k | 3 | 5.25 | 57.93% | 63.96% | 74.16% | 70.71% | 59.89% | 59.59% | 33.11% | 65.19% | 28.45% | 37.00% | 86.50% | 11.68% | 80.87% | 82.00% |
| RWKV-4,ctx1k | 14.2 | 3.81 | 63.54% | 71.05% | 77.42% | 75.57% | 70.24% | 62.98% | 38.31% | 70.71% | 32.28% | 40.60% | 90.10% | 24.06% | 85.73% | 87.00% |
| RWKV-4,ctx4k | 14.2 | 3.88 | 63.46% | 70.10% | 77.64% | 75.52% | 70.66% | 64.17% | 38.82% | 70.29% | 32.35% | 40.40% | 89.90% | 24.42% | 85.67% | 85.00% |
| RWKV-4,ctx8k | 14.2 | 3.86 | 63.71% | 70.83% | 77.48% | 76.06% | 70.65% | 63.85% | 38.99% | 70.24% | 32.64% | 41.80% | 90.40% | 24.58% | 85.67% | 85.00% |

#

However, the 14B ctx8k model seems noticeably better when interacting with users
Unfortunately this cannot be shown with any of the current benchmark tasks

outer vine May 18, 2023, 10:28 AM

#

hi @obsidian quest , do you have some personally preferred cases/examples to be shown in the paper?

obsidian quest May 18, 2023, 10:28 AM

#

@uneven blade has plenty of cool examples

rustic rivet May 18, 2023, 10:32 AM

#

#

For some reason RWKV is somehow very good with math, especially marking-down things @obsidian quest

outer vine May 18, 2023, 10:32 AM

#

cool, just put them here and i will add them to the paper appendix

#

i believe examples with long ctx would be more illuminating

outer vine May 18, 2023, 10:34 AM

#

rustic rivet

what is the model size?

rustic rivet May 18, 2023, 10:36 AM

#

outer vine what is the model size?

14b

tropic minnow May 18, 2023, 11:16 AM

#

obsidian quest However the 14B ctx8k model seems quite better when interacting with users This ...

would be interesting to test RWKV performance on long range arena, but perhaps out of scope for this paper

last mauve May 18, 2023, 11:34 AM

#

broken moth I am also in favor of combining the Introduction with Motivation. I can take car...

I'm not sure what's meant by combine here. If the motivation and intro have lots of overlap, move the intro material from motivation and the motivation material from intro. Then remove duplicates.

neon night May 18, 2023, 11:56 AM

#

broken moth I am also in favor of combining the Introduction with Motivation. I can take car...

This is how to make Introduction and Motivation section not overlap:
Introduction starts with "The rapid advancements in..." basically positive things; Motivation starts with a twist "Despite the significant progress..."

#

However, I do think the total space of Introduction and Motivation needs to be constrained

#

Based on our title "RWKV: Reinventing RNNs for the Transformer Era", the Introduction part should immediately address aspects like the first coming of RNNs and the Transformer Era we are now in.

broken moth May 18, 2023, 12:01 PM

#

I can make a copy of the current intro/motivation somewhere at the end and propose the shorter variant of both without duplicated information

neon night May 18, 2023, 12:48 PM

#

You can also add a contribution part, basically anything that is not introduction and motivation goes into contribution

broken moth May 18, 2023, 1:38 PM

#

Who is atsushi.saito.dec17? I currently work on the introduction/motivation, but I see a lot of changes going on. I am not sure that it is a good idea to remove the names of most recognizable LLMs (GPT-3, GPT-4, ChatGPT, LLaMA) if we want this paper to be easily found on Google Scholar

paper dove May 18, 2023, 1:47 PM

#

fickle hare The Scaling Laws figure (currently Figure 6) seems lossy. May someone plot svg/p...

I have the code, I can replot

#

@rich raptor

rich raptor May 18, 2023, 1:47 PM

#

paper dove I have the code, I can replot

sure

young sparrow May 18, 2023, 1:55 PM

#

Wow this paper is coming along really well, y'all're doing great work.

#

I can go through and do an editing pass, leaving comments and suggestions, later today

tough crane May 18, 2023, 2:09 PM

#

broken moth Who is atsushi.saito.dec17? I currently work on the introduction/motivation, but...

atsushi.saito.dec17 is my account. I did not delete these citations of GPT-3, GPT-4, ChatGPT, LLaMA, but Overleaf is sometimes weird if we are editing at the same time.

broken moth May 18, 2023, 2:21 PM

#

I didn't mean citations, but model names, Google indexes more by the paper's content.

For the moment, I found that it would be difficult to separate Motivation from Introduction so directly. It prolongs the content because it is hard to avoid repeating the information from Introduction in Motivation. I added a paragraph before the contribution to include the most important part of Motivation. It can be extra separated, but let me know if you need it done.

regal basalt May 18, 2023, 2:23 PM

#

Um, how many output instances will be presented?

young sparrow May 18, 2023, 2:26 PM

#

regal basalt Um, how many output instances will be presented?

A page or two is common, but it doesn’t really matter? Like, however much people want to do

tough crane May 18, 2023, 2:31 PM

#

broken moth I didn't mean citations, but model names, Google indexes more by the paper's con...

Thanks!! I'm reading the first two paragraphs in section 1. and I am not editing the first two pages now...

outer vine May 18, 2023, 2:38 PM

#

regal basalt Um, how many output instances will be presented?

for now, feel free to add cases to the appendix

regal basalt May 18, 2023, 2:38 PM

#

tough crane May 18, 2023, 2:44 PM

#

gusty condor Also, excuse me, but I think that the description in the LAMBADA is not accurate...

Fixed as "to predict the most probable target token."

tropic minnow May 18, 2023, 2:50 PM

#

all, pls if you see some issue or conflict or have some suggestion, pls use the comment feature (select a text -> right click -> comment) to provide non-urgent feedback before changing if possible

young sparrow May 18, 2023, 3:15 PM

#

tropic minnow all, pls if you see some issue or conflict or have some suggestion, pls use the ...

Track changes is on, but basically the entire text is marked as changed. If we accept all the changes that’ll make tracking new changes easier as well

regal basalt May 18, 2023, 3:34 PM

#

I'll fix the formatting in the cases later

outer vine May 18, 2023, 3:55 PM

#

duplicate

mortal latch May 18, 2023, 4:28 PM

#

For Figure 4, could it be converted to .pdf format with larger font size? Now it is hard to read the texts in the picture. Same for Figure 5, 6 and 8. I don't mind fixing them if anyone can share the plotting script.

obsidian quest May 18, 2023, 5:59 PM

#

In Table 4, we should only compare RWKV 14B with "GPT-level" 14B (which is an interpolation of Pythia and NeoX numbers)

#

In Cases J, show the last 3 samples + a coding sample + a chat sample

#

Mention RWKV-4 tricks to solve exp(k) overflow

#

Figure 6 needs to be vectorized

#

should we add https://github.com/BlinkDL/RWKV-LM and https://github.com/BlinkDL/ChatRWKV somewhere

rustic rivet May 18, 2023, 6:16 PM

#

obsidian quest Mention RWKV-4 tricks to solve exp(k) overflow

I believe the numerical trick is in the 150 line version of RWKV, right? from line 80 to line 90 that a variable qq is first subtracted from pp and ww.

obsidian quest May 18, 2023, 6:18 PM

#

yes. see https://github.com/BlinkDL/RWKV-LM/blob/main/RWKV-v3-plan.png
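A hedged sketch of the overflow issue and the max-subtraction trick mentioned above. It reuses the variable names `pp`, `qq`, and `ww` from the discussion, but it is written from the recurrence itself and is not a copy of the 150-line implementation. It assumes a single channel, a non-negative decay `w` applied as `exp(-w)`, and a current-token bonus `u`; `wkv_naive` and `wkv_stable` are illustrative names.

```python
import math

def wkv_naive(k, v, w, u):
    """Plain recurrence: math.exp overflows once k gets large."""
    a = b = 0.0
    out = []
    for kt, vt in zip(k, v):
        cur = math.exp(u + kt)
        out.append((a + cur * vt) / (b + cur))
        a = math.exp(-w) * a + math.exp(kt) * vt
        b = math.exp(-w) * b + math.exp(kt)
    return out

def wkv_stable(k, v, w, u):
    """Same recurrence with the state stored relative to a running exponent pp."""
    aa = bb = 0.0
    pp = -math.inf  # true numerator is exp(pp) * aa, true denominator exp(pp) * bb
    out = []
    for kt, vt in zip(k, v):
        # Output: combine the past (exponent pp) with the current token (exponent u + kt).
        ww = u + kt
        qq = max(pp, ww)
        e1 = math.exp(pp - qq)
        e2 = math.exp(ww - qq)
        out.append((e1 * aa + e2 * vt) / (e1 * bb + e2))
        # Update: decay the past by w, then add the current token (exponent kt).
        ww = pp - w
        qq = max(ww, kt)
        e1 = math.exp(ww - qq)
        e2 = math.exp(kt - qq)
        aa = e1 * aa + e2 * vt
        bb = e1 * bb + e2
        pp = qq
    return out
```

Every exponent passed to `math.exp` in the stable version is at most zero, so nothing can overflow, and the common factor `exp(qq)` cancels between numerator and denominator.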

rustic rivet May 18, 2023, 6:20 PM

#

OK let me try to do this in a draft

#

returning the favor for that time you kindly explained the difference between the 100-line version and the 150-line version

obsidian quest May 18, 2023, 6:23 PM

#

How does the RWKV scaling law compare with GPT?

last mauve May 18, 2023, 6:24 PM

#

Ok all of these are complete except for improving high-level coherence and the inference results (@tropic minnow -- Daily check-in here. Can we get these by Friday or Saturday?).

Some new small work items:
1. Figures 4-6 are too late by a page. Can we bring these up closer to their content?
2. Most people haven't included their affiliations to their contributor appendix section (e.g. "Affiliation: EleutherAI"). If you don't have an organization, university, or company that you'd like to link to this work, you're welcome to put EleutherAI. PLEASE GET THIS DONE WITHIN THE NEXT TWO DAYS. Also, if you contributed to this work and haven't put your contribution section, do so within the next two days. If you forget, we won't be able to add your name to the arxiv release until after the EMNLP double-blind deadline.
3. The first-page author block needs affiliations added. If someone could take care of that it'd help
4. The contributions need numbered, and the section describing each contribution should be added to the list item.
5. New paragraphs are all indented. Someone needs to go through and add \noindent or something to remove these.
6. Minor nit, but I think Figure 8 is low resolution?

last mauve May 18, 2023, 6:25 PM

#

obsidian quest In Table 4, we should only compare RWKV 14B with "GPT-level" 14B (which is an in...

All of these items also need handled as well

obsidian quest May 18, 2023, 6:25 PM

#

Bo PENG: built RWKV and scaled it from 0.1B to 14B.
Affiliation: can I write my github link 😉

tropic minnow May 18, 2023, 6:26 PM

#

last mauve Ok all of these are complete except for improving high-level coherence and the i...

yes we can ( 😆 ). have them in CSVs. need to organize and clean

last mauve May 18, 2023, 6:27 PM

#

obsidian quest Bo PENG: built RWKV and scaled it from 0.1B to 14B. Affiliation: can I write my ...

No, affiliation needs to be some entity. If you have your own entity, you're free to point to that, but if it's just a link to your personal github it would read as "Affiliation: Myself".

If you want to go that route, we can just leave affiliation off your name entirely. You're also free to make up a new entity for your RWKV work.

last mauve May 18, 2023, 6:29 PM

#

obsidian quest Bo PENG: built RWKV and scaled it from 0.1B to 14B. Affiliation: can I write my ...

And you came up with the RWKV idea right? I'm thinking:

Bo PENG: Invented, built the model and training code, and trained RWKV model scaling suite.

last mauve May 18, 2023, 6:29 PM

#

tropic minnow yes we can ( 😆 ). have them in CSVs. need to organize and clean

Excellent

tough crane May 18, 2023, 6:32 PM

#

mortal latch For Figure 4, could it be converted to `.pdf` format with larger font size? Now ...

I'm not sure who made the Fig 4. But Fig 5 is based on @obsidian quest 's excel sheet

obsidian quest May 18, 2023, 6:34 PM

#

last mauve No, affiliation needs to be some entity. If you have your own entity, you're fre...

Affiliation: RWKV Foundation (non-existent as of now, will be a nonprofit in the spirit of Linux Foundation)

created RWKV, built the model and training code, optimized its performance, and trained RWKV models from 0.1B to 14B.

last mauve May 18, 2023, 6:34 PM

#

obsidian quest Affiliation: RWKV Foundation (non-existent as of now, will be a nonprofit in the...

Yep this works. We'll add it.

tough crane May 18, 2023, 6:35 PM

#

obsidian quest Affiliation: RWKV Foundation (non-existent as of now, will be a nonprofit in the...

Linux has penguin 🐧 and your one has 🐱

#

Does anyone know the person who made Fig4??

#

BlinkDL probably pasted Fig 5's spreadsheet in this channel. Where??

obsidian quest May 18, 2023, 6:48 PM

#

We need to give more credit to Attention Free Transformer because it is an inspiration of RWKV.

obsidian quest May 18, 2023, 6:50 PM

#

tough crane Probably BlinkDL paste the Fig5's spread sheet in this channel. Where??

#1103039376184852622 message

rustic rivet May 18, 2023, 7:02 PM

#

@obsidian quest I rewrote your recursion into formula as in equation 27 to 33, can you confirm I didn't miss anything?

#

obsidian quest May 18, 2023, 7:04 PM

#

rustic rivet

q = max(k-w, k)

last mauve May 18, 2023, 7:06 PM

#

obsidian quest We need to give more credit to Attention Free Transformer because it is an inspi...

7. Can someone add this to Related Work?

mortal latch May 18, 2023, 7:07 PM

#

last mauve **7.** Can someone add this to Related Work?

Working on it. AFT is already in Related Work, but it is worthwhile to say more about it there.

last mauve May 18, 2023, 7:07 PM

#

mortal latch Working on it. AFT is already in Related Work, but it is worthwhile to say more...

Sure. Just a sentence or two max.

obsidian quest May 18, 2023, 7:09 PM

#

AFT: introduces the sigmoid gate (called receptance in RWKV) in linear attention

#

and the sum(exp(K) V) / sum(exp(K)) formulation
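Written out, the two ingredients listed above look roughly like this. This is a sketch in causal form with illustrative symbols ($Q$, $K$, $V$ for the projected inputs, $\sigma$ for the sigmoid, $\odot$ for the elementwise product), and it leaves out AFT's learned position biases:

```latex
Y_t = \sigma(Q_t) \odot
      \frac{\sum_{i=1}^{t} \exp(K_i) \odot V_i}
           {\sum_{i=1}^{t} \exp(K_i)}
```

In RWKV's vocabulary the gate $\sigma(Q_t)$ plays the role of the receptance, and the fraction is the sum(exp(K) V) / sum(exp(K)) term.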

tropic minnow May 18, 2023, 7:16 PM

#

tough crane Does anyone know the person who made Fig4??

@rich raptor

tropic minnow May 18, 2023, 7:19 PM

#

last mauve **7.** Can someone add this to Related Work?

we have it in Attention Free Models paragraph. i guess we can make more explicit mentions when we talk about components of RWKV throughout the text

rustic rivet May 18, 2023, 7:21 PM

#

@obsidian quest I re-read your shared png and realized that I got it wrong the first time, here is a correction:

#

#

the starting point for the recursion is:

#

obsidian quest May 18, 2023, 7:22 PM

#

yeah now it's correct

rustic rivet May 18, 2023, 7:22 PM

#

compared to your shared new RNN formula, the only difference is that in your PNG the sign for w is positive, and in here it's negative (I followed the notations earlier in the paper)

#

Cool

#

It's in the paper now, in Appendix B right after introducing the RNN cell.

obsidian quest May 18, 2023, 7:28 PM

#

In Cases J, show 2 typical responses for each question using https://huggingface.co/spaces/BlinkDL/ChatRWKV-gradio model

ChatRWKV - a Hugging Face Space by BlinkDL

#

Can mention https://github.com/ridgerchu/SpikeGPT which shows RWKV is good for Spiking Neural Networks too

GitHub

GitHub - ridgerchu/SpikeGPT: Implementation of "SpikeGPT: Generativ...

Implementation of "SpikeGPT: Generative Pre-trained Language Model with Spiking Neural Networks" - GitHub - ridgerchu/SpikeGPT: Implementation of "SpikeGPT: Generative Pr...

last mauve May 18, 2023, 7:33 PM

#

@obsidian quest -- Can you do a pass and make sure there are no technical errors in any figures/equations?

rustic rivet May 18, 2023, 7:34 PM

#

#

I just put the equations I just added into the huggingface link for a code implementation, damn

obsidian quest May 18, 2023, 7:35 PM

#

yeah this will be a cool example

rustic rivet May 18, 2023, 7:36 PM

#

📎 cool_example.md

#

Saved

#

I didn't even have to cherry pick

#

and it just converted latex into torch

grim linden May 18, 2023, 7:46 PM

#

I have a quick newbie question out of curiosity: can RWKV be seen as an instance of a GNN

rustic rivet May 18, 2023, 7:48 PM

#

hmm no. RWKV doesn't featurize "vertices" or "edges", and it doesn't have the strong locality inductive bias of typical GNNs

tender karma May 18, 2023, 9:05 PM

#

Does anyone want to fix this? If not, I can proceed. "RWKV is a large language model (LLM) architecture that can be both trained in GPT mode \cite{Radford2019LanguageMA_gpt2} and formulated as an RNN for lighter and faster inference."

#

RWKV enables the development of LLMs; it is not an LLM per se

#

In addition, this "GPT" mode is something we understand because it is in our own project vocabulary... do you think it is clear when phrased that way?

tropic minnow May 18, 2023, 9:27 PM

#

tender karma RWKV enables the development of LLMs; it is not an LLM per se

see my comment next to that

burnt gulch May 18, 2023, 11:47 PM

#

tender karma In addition, this "GPT" mode is something we understand because it is in our own...

the idea of GPT mode was sorta confusing the first time I looked into this. We should define it. The gist of it, iirc, is that we have all the tokens available to us so we can train in parallel. There's a part of the training that requires the scan operation, though I don't remember the details off the top of my head

misty cedar May 19, 2023, 12:11 AM

#

gpt mode also allows for building the state from a set of tokens in one forward pass

gusty condor May 19, 2023, 4:54 AM

#

This response is incomplete and contains little information.

#

Should it be removed?

#

Also, PIQA, which stands for "Physical Interaction: Question Answering", should be totally capitalized, "PIQA" not "PiQA".

tough crane May 19, 2023, 5:12 AM

#

gusty condor Also, PIQA, which stands for "Physical Interaction: Question Answering", should ...

Fixed as PIQA

gusty condor May 19, 2023, 5:28 AM

#

I wonder whether we need concrete examples to demonstrate this part of the limitation. What do carefully designed prompts look like? How do responses vary with different prompts?

outer vine May 19, 2023, 5:28 AM

#

gusty condor This response is incomplete and contains little information.

this is just a placeholder for now, we will remove it

outer vine May 19, 2023, 5:31 AM

#

gusty condor I doubt whether we need concrete examples to demonstrate this part of the limita...

I think you are right. You could leave a comment on the overleaf.

#

But is this a verified conclusion? Does linear attention make RWKV more sensitive to prompts?

gusty condor May 19, 2023, 5:36 AM

#

outer vine But is this a verified conclusion? The linear attention makes RWKV more sensitiv...

I haven't seen any concrete demonstration that linear attention makes RWKV more sensitive to prompts.
But according to some experiments conducted in the RWKV chat group, RWKV is likely to be sensitive to prompts. We just need several concrete examples.

#

For example:
Prompt 1: Please summarize the following paragraph: <paragraph>

Prompt 2: <paragraph>
Summarize the paragraph above.

gusty condor May 19, 2023, 5:48 AM

#

outer vine I think you are right. You could leave a comment on the overleaf.

Comment added.

fickle hare May 19, 2023, 5:55 AM

#

tender karma Does anyone want to fix this? If not, I can proceed. "RWKV is a large language m...

I would personally describe it as (mostly) parallel training along the temporal dimension. Transformers support such parallelism, while RNNs with a non-linearity in the recursion certainly do not.

#

IMO such parallelism significantly improves the scalability of training, and thus of the model size

tender karma May 19, 2023, 6:00 AM

#

fickle hare I would personally describe it as (mostly) parallel training along temporal dime...

I’ll add a comment shortly

paper dove May 19, 2023, 6:15 AM

#

fickle hare The Scaling Laws figure (currently Figure 6) seems lossy. May someone plot svg/p...

Scaling Laws figure (currently Figure 6) updated !

tough crane May 19, 2023, 6:20 AM

#

fickle hare I would personally describe it as (mostly) parallel training along temporal dime...

I'm not completely grasping your comment. The RWKV recursion seems to contain a \sigma(R) term. Are you saying that RWKV uses only arithmetic (add, dot-prod, sum, etc.) along time?

tough crane May 19, 2023, 6:38 AM

#

paper dove Scaling Laws figure (currently Figure 6) updated !

Could you share a script or notebook to update scaling law plotting?

paper dove May 19, 2023, 6:43 AM

#

sure, the code is based on @rich raptor's code, with some plot settings modified.

#

📎 hack_rwkv_scaling_plots.py

fickle hare May 19, 2023, 7:06 AM

#

tough crane I'm not completely grasping your comment. RWKV recursion seems to contain \sigma...

yes, there is no matmul + activation along time (unlike GRU/LSTM)

#

thus the time-sequential part is negligible during temporally parallel training (and for WKV it can be even further parallelized, though unnecessary at this point)

tender karma May 19, 2023, 7:11 AM

#

@fickle hare @burnt gulch @tropic minnow please check my draft attempt. It is not finished but seeking for approval on the direction (it is in the main as well):

Although RWKV is a general recurrent network, its current implementation focuses on the task of language modeling (RWKV-LM). The current implementation exploits a feature that distinguishes RWKV from other RNNs: the channel-mixing block does not require any information from previous states and can therefore be applied in parallel to all time steps, greatly reducing the total execution time when the sequence is known in advance (benefiting both the training phase and the processing of the sequence before autoregressive generation). The time-mixing block is likewise parallelizable in the computation of the \textit{key}, \textit{value}, and \textit{receptance} vectors, but then requires a sequential scan to update the attention scores \textit{wkv}, \textit{aa}, \textit{bb}, \textit{pp}. We call this approach "GPT"-mode, as the model's temporal context surpasses the inherently sequential nature of recurrent networks that in theory precludes parallelization.

RWKV equipped with a simple softmax linear projection layer on top allows building large language models (LLMs) that can be both trained in GPT mode \cite{Radford2019LanguageMA_gpt2} and formulated as an RNN for lighter and faster inference.

#

I think that part of what I am proposing here can be moved effectively in 4.2 Transformer-like Parallelization so to introduce there this "GPT"-mode

fickle hare May 19, 2023, 7:25 AM

#

Maybe it's worth mentioning that the "sequential" scan is elementwise (thus embarrassingly parallel) in batch samples and channels, thus already exposes sufficient parallelism (though not in the time dimension)

tough crane May 19, 2023, 7:30 AM

#

tender karma I think that part of what I am proposing here can be moved effectively in 4.2 Tr...

Could we describe GPT mode as follows? "First construct the computational graph along the entire time range (i.e. whole sentences and/or documents) for the i-th layer, and then repeat the same construction for the (i+1)-th layer, from i=0 (the bottom) to d (before the LM head)"

If my description is correct, GPT mode could have an alias like "time first graph construction mode".

fickle hare May 19, 2023, 7:37 AM

#

that may miss the point that most computation along the time-range iteration is in parallel

tough crane May 19, 2023, 7:39 AM

#

fickle hare that may miss the point that most computation along the time-range iteration is i...

Ah, one might regard the word "time first" as "not parallel along time-axis"

fickle hare May 19, 2023, 7:39 AM

#

yeah that's my worry

obsidian quest May 19, 2023, 7:43 AM

#

@here The paper looks great now

should we add https://github.com/BlinkDL/RWKV-LM and https://github.com/BlinkDL/ChatRWKV somewhere
Mention https://github.com/ridgerchu/SpikeGPT which shows RWKV is good for Spiking Neural Networks too
Mention that the interpretable and fixed-size RWKV state is beneficial for AGI safety, and we are working on a series of RWKV interpretability & steerability papers
Cases J: add a multi-round chat sample. @uneven blade
Show more time-xxx curves. example: https://www.reddit.com/r/MachineLearning/comments/umq908/r_rwkvv2rnn_a_parallelizable_rnn_with/
Add loss curves to show the training of RWKV is spike-free (https://github.com/BlinkDL/RWKV-LM/blob/main/RWKV-loss.png)
Mention RWKV inference only requires gemv (no need for gemm)

tender karma May 19, 2023, 7:44 AM

#

I would not touch the computational graph @tough crane. Indeed the most efficient implementation just executes operations without a cg

tough crane May 19, 2023, 8:03 AM

#

tender karma I would not touch the computational graph @tough crane. Indeed the mos...

Only for training, I think that CG is still built inside pytorch because of back-prop.

tender karma May 19, 2023, 8:06 AM

#

tough crane Only for training, I think that CG is still built inside pytorch because of back...

we have implementations in Rust, C++, and Go, and model_run.py also executes operations with numpy, without PyTorch

fickle hare May 19, 2023, 8:16 AM

#

I'm personally against computation graph on any algorithmic topic since they are just for autograd and performance optimization, unrelated to the model

#

besides, the construction order of computation graph is unrelated to the execution order, and we are talking about execution order in this context

tough crane May 19, 2023, 8:17 AM

#

I see, it's off-topic here.

fickle hare May 19, 2023, 8:20 AM

#

it's like, we want to say our execution order of both forward and backward is defined by the loop order (layer, t), while loop t is mostly parallel

#

layer by layer, then time-parallel

tough crane May 19, 2023, 8:22 AM

#

fickle hare it's like, we want to say our execution order of both forward and backward is de...

Why do we call it "GPT"? I was a bit confused when I first read this name.

obsidian quest May 19, 2023, 8:22 AM

#

tough crane Why do we call it "GPT"? I was a bit confused when I first read this name.

GPT-like mode

fickle hare May 19, 2023, 8:22 AM

#

because decoder only transformer naturally behaves like this

neon night May 19, 2023, 8:23 AM

#

tender karma @fickle hare @burnt gulch @tropic minnow please check...

Just cite the paper Resurrecting Recurrent Neural Networks for Long Sequences whose main point is precisely that non-linearity activation in the RNN recurrence equation can be removed to enable parallel training

fickle hare May 19, 2023, 8:23 AM

#

and gpt is the representative brand among them :thinkies:

tough crane May 19, 2023, 8:23 AM

#

fickle hare and gpt is the representative brand among them :thinkies:

Ummm, branding...

fickle hare May 19, 2023, 8:24 AM

#

neon night Just cite the paper `Resurrecting Recurrent Neural Networks for Long Sequences` ...

that's another problem though...

fickle hare May 19, 2023, 8:25 AM

#

tough crane Ummm, branding...

uh, I mean, when you want to say "this new mode is like the decoder-only transformers", the first short name that comes to mind will be gpt...

neon night May 19, 2023, 8:26 AM

#

pale nexus May 19, 2023, 8:27 AM

#

burnt gulch the idea of GPT mode was sorta confusing the first time when I looked into this....

the GPT mode will predict seq[i] given seq[:i] for all i in one forward pass iirc
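To make the indexing concrete, here is a tiny sketch of the (context, target) pairs that such a forward pass scores at once. The helper name `next_token_pairs` is hypothetical and only illustrates the bookkeeping; a real implementation would produce all predictions in one batched call rather than building a list.

```python
def next_token_pairs(seq):
    """Return (context, target) pairs where target seq[i] is predicted from seq[:i]."""
    return [(seq[:i], seq[i]) for i in range(1, len(seq))]
```

For a sequence of length T this yields T - 1 training signals from a single pass, which is what makes parallel training efficient.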

fickle hare May 19, 2023, 8:30 AM

#

neon night

that's talking about capability though

#

not the execution order

outer vine May 19, 2023, 8:31 AM

#

update cases template

fickle hare May 19, 2023, 8:32 AM

#

oh I see it mentions a bit about parallelism...

outer vine May 19, 2023, 8:34 AM

#

this is the current template for case study

#

fickle hare May 19, 2023, 8:35 AM

#

maybe a bit smaller inner margin for code blocks?

outer vine May 19, 2023, 8:36 AM

#

i think the final format should be in line with the whole paper, so the research lead should give a final decision by directly changing the first template in Appendix J, and i will help change the rest.

outer vine May 19, 2023, 8:37 AM

#

fickle hare maybe a bit smaller inner margin for code blocks?

sure

tender karma May 19, 2023, 8:37 AM

#

@neon night we need to coordinate a bit to guarantee consistency. we are using rnn mode, gpt mode, parallelization..

neon night May 19, 2023, 8:38 AM

#

@fickle hare He majors in parallel computing. You can coordinate with him about these things.

tender karma May 19, 2023, 8:39 AM

#

Fantastic, thanks

fickle hare May 19, 2023, 8:41 AM

#

:thinkies: The current terminology on these things throughout the RWKV community is quite messy...

#

GPT mode: During training and prompt preprocessing in inference, we do time-parallel execution for all matmuls (stacked along the time axis, thus embeddings (B * T, C) @ weight (C, C)), leaving only the time-sequential WKV (which is fused in a custom CUDA kernel), making it more bandwidth-efficient
RNN mode: During decoding in inference, we process the timesteps one by one, as in Transformer decoding with a KV cache, and thus also do not use the custom CUDA kernel for WKV
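The difference between the two modes for a single projection can be sketched in NumPy as follows. This only illustrates the stacking described above, with toy sizes and a random `weight` standing in for any of the (C, C) projection matrices; it is not code from the RWKV repositories.

```python
import numpy as np

rng = np.random.default_rng(0)
B, T, C = 2, 5, 4                     # batch, time, channels (toy sizes)
x = rng.normal(size=(B, T, C))        # embeddings
weight = rng.normal(size=(C, C))      # one projection matrix, e.g. for keys

# Time-parallel: stack batch and time, then do one matrix-matrix product (gemm).
k_parallel = (x.reshape(B * T, C) @ weight).reshape(B, T, C)

# Time-sequential: one timestep at a time; for each sample this is a
# vector-matrix product (gemv).
k_sequential = np.empty((B, T, C))
for t in range(T):
    for b in range(B):
        k_sequential[b, t] = x[b, t] @ weight

same = np.allclose(k_parallel, k_sequential)
```

Both paths compute the same values; the parallel one simply reads `weight` once for all B * T rows, which is where the bandwidth saving comes from.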

#

Is that clear enough? I don't know if we are to keep the names in the paper though, maybe it's up to @obsidian quest's decision

#

Once we've decided, we will need to go through the paper to make it consistent

regal basalt May 19, 2023, 8:47 AM

#

outer vine

this saved my ocd

tough crane May 19, 2023, 8:47 AM

#

fickle hare :thinkies: The current terminology on these things throughou...

the deep-learning community creates a lot of "fancy" jargon (e.g. "hallucination" 🤣 has been pointed out; according to G. Hinton it should be called "confabulation", as in psychology)

obsidian quest May 19, 2023, 8:48 AM

#

fickle hare Is that clear enough? I don't know if we are to keep the names in the paper thou...

we can use parallel & sequential mode to avoid mentioning GPT

fickle hare May 19, 2023, 8:50 AM

#

Then "time-parallel mode" for training and prompt processing, "time-sequential mode" for decoding?

tough crane May 19, 2023, 8:50 AM

#

fickle hare Then "time-parallel mode" for training and prompt processing, "time-sequential m...

Yes, I agree.

tender karma May 19, 2023, 8:52 AM

#

fickle hare Then "time-parallel mode" for training and prompt processing, "time-sequential m...

LGTM!

#

Please check if you'd like to keep part of my content here (I'm okay with you throwing it away):

Although RWKV is a general recurrent network, its current implementation focuses on the task of language modeling (RWKV-LM)\footnote{https://github.com/BlinkDL/RWKV-LM}. The current implementation exploits a feature that distinguishes RWKV from other RNNs: the channel-mixing block does not require any information from previous states and can therefore be applied in parallel to all time steps, greatly reducing the total execution time when the sequence is known in advance (benefiting both the training phase and the processing of the sequence before autoregressive generation). The time-mixing block is likewise parallelizable in the computation of the \textit{key}, \textit{value}, and \textit{receptance} vectors, but then requires a sequential scan to update the attention scores \textit{wkv}, \textit{aa}, \textit{bb}, \textit{pp}. We call this approach "time-parallel"-mode, as the model's temporal context surpasses the inherently sequential nature of a recurrent network that in theory precludes parallelization.

fickle hare May 19, 2023, 8:54 AM

#

I'd say it depends on the space budget

#

It makes things much clearer, but maybe not really necessary as @neon night has pointed out that the LRU paper already mentioned that

tender karma May 19, 2023, 8:55 AM

#

in my opinion (but I am biased) we can move part of that to the appropriate time-parallel mode section (replacing the Transformer-like..)

tender karma May 19, 2023, 8:56 AM

#

fickle hare It makes things much clearer, but maybe not really necessary as @neon night...

mention that, but we do that differently. Citing LRU is a must of course

#

seriously, I'm already glad that the point I raised about terminology was also taken up by @obsidian quest, as well as the point that RWKV can be used for LMs but is not an LM per se. As for the writing, I know I am long-winded and I don't want to force it in with so little space 🙂

fickle hare May 19, 2023, 8:58 AM

#

I don't really see a difference between the parallelization of LRU and RWKV yet; although RWKV started much earlier than LRU publication

tender karma May 19, 2023, 8:59 AM

#

The difference I see is in the underlying motivation that allows that, which is IMO the model's temporal context

#

(w)

#

btw no questioning that the practical point is that non-linearity activation in the RNN recurrence equation can be removed to enable parallel training.

fickle hare May 19, 2023, 9:05 AM

#

This seems to lack context? I'll try update a version to see if it gets better

tough crane May 19, 2023, 9:06 AM

#

@obsidian quest Fixed table4,5 in appendix.

neon night May 19, 2023, 9:11 AM

#

tender karma in my opinion (but I am biased) we can move part on that on the appropriate time...

We are all biased 😆. Really, it would be much better if we could grab a random person and see whether they can understand the paper.

#

I think the paper is basically finished but I'm also biased

misty cedar May 19, 2023, 9:26 AM

#

obsidian quest we can use parallel & sequential mode to avoid mentioning GPT

there's also the other parallel mode: if you unchain the time-mixes and wkv function and have them point to separate states, you can use the acceleration provided by (BLAS?) to run thousands of rnn threads at the same time, allowing for hyperscaled inference in production environments

#

that might be another paper though

fickle hare May 19, 2023, 9:33 AM

#

4.2 and 4.3 updated a lot. Please check if it reads good, thx

#

My Grammarly is not working on Overleaf right now and my English is not so good, so alter the text at will

pale nexus May 19, 2023, 9:41 AM

#

> The Attention Free Transformer (AFT) (Zhai et al., 2021) replaces dot-product self-attention with a computationally efficient alternative based on factorized attention coefficients that maintains global interactions between inputs and the context
Should we rather say here that AFT is in fact a multi-head attention where 1 feature dimension = 1 head?

obsidian quest May 19, 2023, 9:53 AM

#

pale nexus > The Attention Free Transformer (AFT) (Zhai et al., 2021) replaces dot-product self...

AFT is channelwise linear "attention". However it's using the same W for all channels

fickle hare May 19, 2023, 9:54 AM

#

No I didn't?

pale nexus May 19, 2023, 9:54 AM

#

obsidian quest AFT is channelwise linear "attention". However it's using the same W for all ch...

yeah but saying on top of that that it looks like an MHA with 1 head = 1 feature dimension can help visualize why this works. RWKV uses the same principle as well

fickle hare May 19, 2023, 9:55 AM

#

obsidian quest May 19, 2023, 9:55 AM

#

pale nexus yeah but saying on top of that it looks like a MHA with 1 head = 1 feature dime...

yeah you can mention that

fickle hare May 19, 2023, 9:57 AM

#

As it's already the 4th section, we are expected to talk more about the details I think?

neon night May 19, 2023, 9:59 AM

#

Sorry, I was giving it an unfair prompt. Now it says it depends

fickle hare May 19, 2023, 10:03 AM

#

Yes, I was modifying it to match the subsection name better. But basically, they are talking about the same "advantage"...

neon night May 19, 2023, 10:13 AM

#

gusty condor I doubt whether we need concrete examples to demonstrate this part of the limita...

I think this limitation hurts. It's too strong

fickle hare May 19, 2023, 10:14 AM

#

(Is this really the case?)

neon night May 19, 2023, 10:40 AM

#

I think the section title is better to be "Transformer-like Parallelization" and "RNN-like Inference"

fickle hare May 19, 2023, 10:42 AM

#

Maybe "Transformer-like Parallelization in Time" and "RNN-like Sequential Decoding"?

neon night May 19, 2023, 10:45 AM

#

Figure 6 needs to be png or it is loading very slowly 😩

fickle hare May 19, 2023, 10:46 AM

#

Maybe downsample the points?

neon night May 19, 2023, 11:03 AM

#

tender karma The difference I see is in the underlying motivation that allow that, which is I...

Actually IMO LRU doesn't need to be cited, mainly because it is a recent paper and we don't have time to compare with it in the eval section. If we mention LRU, there is no reason not to compare with it as well.

neon night May 19, 2023, 11:11 AM

#

neon night Just cite the paper `Resurrecting Recurrent Neural Networks for Long Sequences` ...

I changed my mind 🙂 normally I would have already added a sentence or two about LRU into overleaf but in this case I prefer doing nothing

tough crane May 19, 2023, 11:20 AM

#

@rich raptor Could you share the Fig 4 CSV file and the plotting script, if you have them? I'd like to re-plot it with a slightly larger font, per @mortal latch's comment

tender karma May 19, 2023, 11:22 AM

#

neon night I changed my mind 🙂 normally I would have already added a sentence or two about...

no problem but next time I will condescend less 😇

rich raptor May 19, 2023, 11:33 AM

#

tough crane @rich raptor Could you share the Fig 4 CSV file and the plotting script, if y...

DMed

neon night May 19, 2023, 11:43 AM

#

tender karma Please check if you'd like to keep part of my content here (I'm okay with you th...

I'm on your side this time. 😇 I suggest you move these words into section 4.2 and change the title from Transformer-like Parallelization to maybe Efficient Parallelization. They are better in section 4.2 than in 4.4. Better to not cite LRU.

#

And I think in case the paper is more than the page limit, the section 6 "Scaling Laws" needs to be moved into appendix.

tender karma May 19, 2023, 12:08 PM

#

neon night I'm on your side this time. 😇 I suggest you move these words into section 4.2 a...

Moved to 4.2. About the titles for those sections, I see pros and cons in any of our proposals 🙂

paper dove May 19, 2023, 12:09 PM

#

neon night Figure 6 needs to be png or it is loading very slowly 😩

sure, I used PDF just for better resolution. I can also change it to PNG with dpi=300; I think that is good enough

neon night May 19, 2023, 12:16 PM

#

tender karma Moved to `4.2.`. About title for those sections, I see pro and cons in any of ou...

In this case I suggest keeping the original Transformer-like title. It is not exactly like a Transformer, but certainly not unlike a Transformer 😁

tender karma May 19, 2023, 12:18 PM

#

neon night In this case I suggest keep the original `Transformer-like` title. Although it i...

Agree

paper dove May 19, 2023, 12:18 PM

#

neon night Figure 6 needs to be png or it is loading very slowly 😩

Figure 6 has been updated to png

tender karma May 19, 2023, 12:19 PM

#

That's for the section titles; then I think we agreed to call the two modes "time-parallel" mode and "time-sequential" mode.

#

This makes a lot of sense to me and we are all happy: consistent and robust names and we also make the connection with transformers and GPT "style"

#

I must say that "is implemented as a simple offset in the temporal dimension at each block implemented in PyTorch \citep{paszke2019pytorch} library as \texttt{nn.ZeroPad2d((0,0,1,-1))}." Respectfully, with this PyTorch code reference, it seems to me a bit randomly thrown in there

#

@neon night @fickle hare did we intentionally remove the "Context" section?

fickle hare May 19, 2023, 12:29 PM

#

(I'm not online when it got removed so..

#

(I don't know what happened to that section

tender karma May 19, 2023, 12:30 PM

#

Again intrinsic bias, but I liked it. Reason for removal? If too weak, it is a good fit for the RNN-style

fickle hare May 19, 2023, 12:30 PM

#

tender karma I must say that "is implemented as a simple offset in the temporal dimension at ...

I do think just saying "It's a simple offset & add" would be better
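For readers unfamiliar with the quoted `nn.ZeroPad2d((0,0,1,-1))` idiom: on a (B, T, C) tensor it pads one zero row at the start of the time axis and crops the last row, i.e. it shifts the sequence one step forward in time. A NumPy sketch of that offset and the subsequent interpolation follows; the names `time_shift`, `token_shift_mix`, and the mixing vector `mu` are assumptions chosen for illustration.

```python
import numpy as np

def time_shift(x):
    """Shift a (B, T, C) array one step forward in time, zero-filling t = 0."""
    shifted = np.zeros_like(x)
    shifted[:, 1:] = x[:, :-1]
    return shifted

def token_shift_mix(x, mu):
    """Per-channel interpolation between the current and previous token.

    x: (B, T, C); mu: (C,) with entries typically in [0, 1].
    """
    return mu * x + (1.0 - mu) * time_shift(x)
```

With `mu` equal to one in every channel the mix returns `x` unchanged, and with `mu` equal to zero it returns the previous token's features.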

neon night May 19, 2023, 12:50 PM

#

I have to help shorten 4.2 and 4.3 because it's getting longer than I expected again 😅

tender karma May 19, 2023, 12:51 PM

#

tender karma @neon night @fickle hare did we intentionally remove the "Co...

ping @neon night

neon night May 19, 2023, 12:55 PM

#

I didn't remove it, but people wanted to, because I don't have enough time to revise every section. I work way slower

tender karma May 19, 2023, 12:58 PM

#

neon night I don't remove it but people want to, because I don't have enough time to revise...

alright soft pushing for reintegrating probably before talking about the RNN-style -but up to you

last mauve May 19, 2023, 1:52 PM

#

Ok everyone, we're reaching the finish line for the v1 arxiv. A few new temporary rules:
1. No major changes without explicit approval by me or @tropic minnow.
2. If you remove anything, it needs to be commented so that it remains in the latex. No more deleting from the latex outright.
3. No new authors will be accepted for the arxiv version

last mauve May 19, 2023, 1:53 PM

#

last mauve Ok everyone, we're reaching the finish line for the v1 arxiv. **A few new tempor...

last mauve May 19, 2023, 1:54 PM

#

last mauve Ok all of these are complete except for improving high-level coherence and the i...

If any authors are looking for things to do, many of these have not been addressed

obsidian quest May 19, 2023, 1:57 PM

#

last mauve Ok all of these are complete except for improving high-level coherence and the i...

And section 7 Inference Experiments - speed & vram

last mauve May 19, 2023, 2:01 PM

#

Yeah @tropic minnow -- what's the status of inference? I'm targeting a monday morning arxiv submission so that it goes live before the EMNLP anonymity deadline

neon night May 19, 2023, 2:05 PM

#

tender karma Again intrinsic bias, but I liked it -reason for removal? if too week it is a go...

The context section is removed by people because it's already mentioned in gist by the last two paragraphs in sec 4.6.
Although I think it is not ripe to mention cross attention there. We don't have equivalent things for cross attention.

young sparrow May 19, 2023, 2:11 PM

#

Heh, I woke up this morning and went “huh, I guess I don’t have any obligations today I could sit down and seriously contribute to the RWKV paper!”

I’ll still do an editing pass and leave my suggestions, and I want to stress that I’m not asking for special treatment. Congrats everyone on the hard work

tender karma May 19, 2023, 2:22 PM

#

neon night The context section is removed by people because it's already mentioned in gist ...

fair enough! thanks

neon night May 19, 2023, 2:23 PM

#

Although I think that part about cross attention is not justified also. 😩 @obsidian quest Does RWKV have capacity to do things similar to what cross attention can do?

tender karma May 19, 2023, 2:25 PM

#

my point is just that, working with the state, the state itself containing the information e.g. of the prompt eliminates the need for cross-attention

#

look at the BART NLI task for zero-shot classification; this is a case where RWKV skips cross attention intrinsically

young sparrow May 19, 2023, 2:28 PM

#

I think it might be a good idea to make a list of such claims / intuitions, remove them from the arXiv version, and add it with real experimental evidence to the EMNLP version

neon night May 19, 2023, 2:29 PM

#

Yes. I think cross attention is very powerful, can do multimodal things like text2image, text2audio. The phrase "eliminates the need for cross-attention" is too strong

young sparrow May 19, 2023, 2:29 PM

#

A lot of papers like this overclaim, and the rigor of our analysis and the scale of the models trained is one of the biggest factors in our favor

#

We can easily train a small CLIP model with RWKV to see what happens though

#

(just not by monday)

tender karma May 19, 2023, 2:30 PM

#

neon night Yes. I think cross attention is very powerful, can do multimodal things like tex...

You're absolutely right

neon night May 19, 2023, 2:30 PM

#

I'll make the claim softer for now, until further investigation

tropic minnow May 19, 2023, 2:38 PM

#

last mauve Yeah @tropic minnow -- what's the status of inference? I'm targeting a mo...

makes sense. this is what i have so far. i think it makes the point that RWKV is more efficient for inference. will try to bring the number of tokens generated from 100 to 256 (probs more realistic of chat), and complement with RWKV @7B and 14B. Will do similar plots for memory. My idea is to add a plot like this to main text and tables showing details in appendix. sounds like a plan?

Captura_de_Pantalla_2023-05-19_a_las_16.33.42.png

obsidian quest May 19, 2023, 2:38 PM

#

neon night Although I think that part about cross attention is not justified also. 😩 @obsidian quest...

#992372861924823080 message
my idea: I think RWKV can support Encoder-Decoder via this: for each decoder token, use a learned mixture of [previous decoder hidden state] & [encoder final hidden state].

tropic minnow May 19, 2023, 2:39 PM

#

tropic minnow makes sense. this is what i have so far. i think it makes the point that RWKV is...

(sorry the plot needs to be improved, there are some cpu-cuda misplacements, its in the making)

young sparrow May 19, 2023, 2:40 PM

#

tropic minnow makes sense. this is what i have so far. i think it makes the point that RWKV is...

My flaming hot take is that arXiv preprints can be 15 pages if that’s what’s necessary and you can just push things to appendices for the submission. I have no issue reading 15-page papers if they’re good.

obsidian quest May 19, 2023, 2:40 PM

#

tropic minnow makes sense. this is what i have so far. i think it makes the point that RWKV is...

the RWKV 169M data point seems wrong

tropic minnow May 19, 2023, 2:41 PM

#

obsidian quest the RWKV 169M data point seems wrong

yupp. repeating soon. will post an updated version

#

(modulo rwkv-169M, this is roughly the state at 100 tokens generated. will repeat with 256 for all)

#

Captura_de_Pantalla_2023-05-19_a_las_16.42.36.png

fickle hare May 19, 2023, 2:49 PM

#

is >= 1k possible? that might expose a huge difference

tropic minnow May 19, 2023, 2:51 PM

#

fickle hare is >= 1k possible? that might expose a huge difference

yea idk we can try

regal basalt May 19, 2023, 3:03 PM

#

how to fix the huge space gap :NotAmusedCat:

fickle hare May 19, 2023, 3:05 PM

#

On the lately updated 4.2, there are still some issues:
a) 4.1 is still mentioning GPT mode, need to get fixed
b) 3.1 overlaps with the new 4.2, need to dedup at either side
c) 4.2 is mostly explaining the fig 1c, so add a ref would be better

outer vine May 19, 2023, 3:07 PM

#

regal basalt how to fix the huge space gap :NotAmusedCat:

since we would have more examples, i think there is no need for a perfect arrangement for now

regal basalt May 19, 2023, 3:07 PM

#

alright

fickle hare May 19, 2023, 3:08 PM

#

Also, the current fig 1c does not really demonstrate how channel-mix executes (just a long green box)...

subtle oak May 19, 2023, 3:08 PM

#

outer vine since we would have more examples, i think there is no need for a perfect arrang...

I've tried using \vspace to control it, but it does not work

outer vine May 19, 2023, 3:08 PM

#

honestly, i don't understand this figure

#

there is not even the explanation for green color

fickle hare May 19, 2023, 3:09 PM

#

outer vine

me neither (without my prior knowledge)

tropic minnow May 19, 2023, 3:59 PM

#

regal basalt how to fix the huge space gap NotAmusedCat

tried helping, see the float=h added

tropic minnow May 19, 2023, 4:03 PM

#

outer vine

yea i see. i dont see the point of talking so much about rnns (figure and even putting their equations from papers 20yrs ago) when even the formulation of RWKV as an rnn is in the appendix, and our own rnn-like equations are in the appendix. i would look at shortening that section and push some content into appendices. curious to see what others think. we could also do it for EMNLP and have it like this on arxiv

outer vine May 19, 2023, 4:08 PM

#

can't agree more

#

imo, a figure like this one in the AFT paper would illustrate it better

tropic minnow May 19, 2023, 4:16 PM

#

fickle hare On the lately updated 4.2, there are still some issues: a) 4.1 is still mentioni...

yupp fixed. thx for reporting. could you add comments to latex for other things you might see?

regal basalt May 19, 2023, 4:16 PM

#

tropic minnow tried helping, see the `float=h` added

ty

outer vine May 19, 2023, 4:18 PM

#

and i think the key point we should emphasize would be the wkv formulation and its relation with attention and recurrence. things like token shift, custom cuda kernel, specific implementation like nn.ZeroPad2d((0,0,1,-1)) are like tricks to improve the performance and efficiency. all my personal opinions. curious to see what you think of this

tropic minnow May 19, 2023, 4:38 PM

#

outer vine and i think the key point we should emphasis would be the wkv formulation and it...

pushing the zeroPad and cuda Kernel to 4.7 Additional Optimizations seems reasonable. will do soon. In parallel, what do you think about shortening a bit the QRNN section in 3. background, perhaps keeping it more high level (removing equations or pushing them to an appendix) . i think we could also expand a bit on 2. Background -> Attention Free Models for the Attention-Free transformer given its parallelism with RWKV time-mixing block

fickle hare May 19, 2023, 4:44 PM

#

+1 on shortening QRNN

outer vine May 19, 2023, 4:57 PM

#

agree

#

personal view, i would expect a picture like this to better show RWKV (apologies for the poor quality of this drawing.)

#

(the red line is an equals sign)

last mauve May 19, 2023, 5:12 PM

#

young sparrow Heh, I woke up this morning and went “huh, I guess I don’t have any obligations ...

Where are you leaving these suggestions?

young sparrow May 19, 2023, 5:13 PM

#

last mauve Where are you leaving these suggestions?

I haven't done so yet but was going to use the overleaf

young sparrow May 19, 2023, 5:30 PM

#

@paper dove @rich raptor the main argument against using RNNs to my knowledge is this plot from Scaling Laws for Neural Language Models (plus convergence issues?). I think we should have the data to replicate it with Pythia + RWKV? Would that be a light lift to add to the Scaling Laws section?

neon night May 19, 2023, 5:34 PM

#

tropic minnow yea i see. i dont see the point of talking so much about rnns (figure and even p...

That's easy. You shorten the background and move Appendix B to 4.3.

young sparrow May 19, 2023, 5:47 PM

#

young sparrow @paper dove @rich raptor the main argument against using RNN...

This could be totally wrong and I’ve never trained an RNN in my life

tough crane May 19, 2023, 5:48 PM

#

outer vine can't agree more

Including AFT in the background section was rejected when I suggested it. At first, the template had a section skeleton with the titles RNN (3.1) and Transformers (3.2).

tough crane May 19, 2023, 5:50 PM

#

outer vine honestly, i don't understand this figure

The figure is only for comparing old RNNs and new RNNs. A similar comparison figure appears in the QRNN paper (ICLR 2017).

neon night May 19, 2023, 5:50 PM

#

@tender karma Your points about state and cross attention can be added as future work. 😌 and AGI safety

tender karma May 19, 2023, 5:51 PM

#

neon night @tender karma Your points about state and cross attention can be added a...

I appreciate it - do you want me to touch it, or at this point is it just more efficient if you do?

tough crane May 19, 2023, 5:51 PM

#

tropic minnow yupp fixed. thx for reporting. could you add comments to latex for other things ...

Should we replace section 3.1 with AFT instead of RNNs ?

neon night May 19, 2023, 5:54 PM

#

tender karma I appreciate -do you want me to touch or at this point is just more efficient if...

Now I'm going to sleep and I have to deal with 3.1 tomorrow.

outer vine May 19, 2023, 5:55 PM

#

tough crane The figure is only for comparison against old RNNs and new RNNs. Similar compari...

but this figure doesn't even explain the green color ??

tender karma May 19, 2023, 5:55 PM

#

Alright let me take a look

#

Following the pinned note: shall I just write as a comment and then you see, or directly as text?

neon night May 19, 2023, 5:56 PM

#

tough crane It was rejected to include AFT into the background section when I suggested. At ...

I rejected it, but now it seems the background should be about AFT thinkies. I plan to do it tomorrow

tough crane May 19, 2023, 5:56 PM

#

neon night I rejected but now it seems background should be about AFT thinkies...

have a good night or day !!

tough crane May 19, 2023, 5:58 PM

#

outer vine but this figure doesn't even explain the green color ??

I will add an explanation of the green part. If we replace 3.1 with AFT, then we will just remove it and replace it with an AFT-related fig.

neon night May 19, 2023, 5:59 PM

#

tender karma Following the pinned note: shall I just write as a comment and then you see, or ...

@obsidian quest said AGI safety can be added, so write it as text

tender karma May 19, 2023, 6:00 PM

#

Perfect, on it

outer vine May 19, 2023, 6:02 PM

#

I think the figure makes its point in the QRNN paper, but personally i don't think this similar one makes much sense in this paper, since it simply uses different colors to differentiate QRNN and RWKV

neon night May 19, 2023, 6:02 PM

#

Don't add anywhere except 4.6, where I made a draft for you. Don't make new sections @tender karma

outer vine May 19, 2023, 6:03 PM

#

tough crane May 19, 2023, 6:10 PM

#

@obsidian quest Do you want to include AFT's formulation and figure into the background section?

Possible choices are: (1) replacing 3.1(RNN) with AFT, or (2) adding AFT section into background section 3, or (3) not including (current status).

tropic minnow May 19, 2023, 6:17 PM

#

last mauve All of these items also need handled as well

Done👍

young sparrow May 19, 2023, 6:18 PM

#

young sparrow @paper dove @rich raptor the main argument against using RNN...

@tropic minnow do you have insight into this?

neon night May 19, 2023, 6:19 PM

#

tough crane @obsidian quest Do you want to include AFT's formulation and figure into...

He wants. AFT inspired this work. Adding is best, we can delete RNN anytime.

tender karma May 19, 2023, 6:24 PM

#

neon night Don't add anywhere except 4.6, where I made a draft for you. Don't make new sect...

Yes sir 😎

tropic minnow May 19, 2023, 6:25 PM

#

young sparrow @tropic minnow do you have insight into this?

I think rwkv would behave like transformers in this plot. However i dont have the exact data to reproduce this plot. I think behavior could be inferred from nlp benchmarks and current scaling laws. If someone gets the raw data for this im happy to do the plot.

#

(I dont have test loss by sequence position in test data for rwkv)

young sparrow May 19, 2023, 6:29 PM

#

tropic minnow (I dont have test loss by sequence position in test data for rwkv)

Ah yes. If there isn’t time to compute this before arXiv it’s not a big deal

young sparrow May 19, 2023, 6:57 PM

#

I've made it about halfway through the paper, but my editing has been derailed by needing to go find many citations that should be in the paper but aren't. This paper doesn't currently cite:

Pythia
the Pile
the Eval Harness
OPT
BLOOM
to name a few. You cannot use or compare to other people's work in your paper without citing it. The entire paper needs to be reread with an explicit goal of identifying missing citations.

tropic minnow May 19, 2023, 7:02 PM

#

will be cited

young sparrow May 19, 2023, 7:37 PM

#

I left a bunch of comments, I hope they’re helpful.

obsidian quest May 19, 2023, 7:40 PM

#

tropic minnow I think rwkv would behave like transformers in this plot. However i dont have th...

shown in Figure 5 - should use log(ctxlen) too

young sparrow May 19, 2023, 7:47 PM

#

So the elephant in the room is the scaling laws section. This section is wrong as-is because it follows Kaplan et al’s flawed methodology rather than Hoffmann et al’s improved one, and my original plan was to frame this as an initial exploration with more to come. However, the more I think about it, the less I think these are really the right plots to show anyway.

1) The exact parameters of the scaling laws are so context-specific that nobody cares what your numbers are in general.
2) We know that the optimal trade-off of tokens to parameters is likely to change (specifically, shift more in favor of tokens) compared to how it currently is, but not by how much.
3) “Scaling laws for RNNs” is not a novel or interesting thing, and is in the original scaling laws papers.

Based on these three points, I think that the best thing to do for this paper is probably to do the same analysis again (how long did it take?) using Pythia models and plot them on the same axes. Hopefully this will show no gap, and therefore provide additional evidence of good scaling. If that can’t be done, we can still replicate this plot from the Cerebras GPT paper because we have the Pythia test set loss values

#

To be clear by “replicate this plot” what I mean is take this plot and add Pythia to it

#

But I do think that the explicit scaling laws calculations should be:
a) pushed to the appendix
b) clearly labeled as a Kaplan et al-style experiment that we plan on following up on in the next version of the paper

tropic minnow May 19, 2023, 8:34 PM

#

young sparrow To be clear by “replicate this plot” what I mean is take this plot and add Pythi...

shouldn't the x-axis in the comparison to pythia feature params instead of flops?

tropic minnow May 19, 2023, 8:39 PM

#

young sparrow To be clear by “replicate this plot” what I mean is take this plot and add Pythi...

also, have these models (rwkv, pythia) followed the same token count training? bc otherwise the comparison wouldnt be apples to apples right?

young sparrow May 19, 2023, 10:20 PM

#

tropic minnow also, have these models (rwkv, pythia) followed the same token count training? b...

Within 10%… which probably doesn’t matter too much.

You’re right, and ideally it would be exactly 300B tokens but it’s not. If the current evaluation tables use a checkpoint at ~300B tokens (IDK if they do, but this was talked about by BlinkDL at one point) it’s slightly biased against RWKV.

young sparrow May 19, 2023, 10:21 PM

#

tropic minnow should x-axis in the comparison to pythia feature params instead of flops right?

I think FLOPs on the x-axis and params as point labels make more sense, but we can look at both plots and decide.

Acknowledging my last message, if the evals of RWKV are the fully trained model then FLOPs is more justified. If they show mostly trained models that are token matched then params is probably more justified but it’s close

tender karma May 19, 2023, 10:59 PM

#

@obsidian quest @neon night I enjoyed talking about AGI, but here I really let myself go (although reasoned). It is in 4.6. If you use it, well, if you throw it away, I am just fine 🙂

We speculate that exploration of RWKV state-centric designs can enhance AGI safety. The state (or \textit{context}), summarizing past inputs, might offer not only predictability but also an enhancement in interpretability. Its manipulation can guide behavior and enforce safety. Recurrence supported by temporal "awareness" could lead to stable systems and state-initiated generation may boost computational efficiency\footnote{In language models, initiating generation from the final post-prompt state could obviate prompt reprocessing, thereby bolstering both efficiency and data security.}. Despite challenges in managing high-dimensional states, these promising leads merit further investigation.

obsidian quest May 20, 2023, 3:14 AM

#

tender karma @obsidian quest @neon night I enjoyed talking about AGI, but he...

We can show dim(state) = 4 x d_emb x n_layer (namely x, a, b, x for each layer)
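
A minimal sketch of that formula, taking dim(state) = 4 x d_emb x n_layer at face value. The function name and the example sizes (d_emb=1024, n_layer=24) are illustrative assumptions, not figures from the paper or from any released checkpoint:

```python
def rwkv_state_size(d_emb: int, n_layer: int, vectors_per_layer: int = 4) -> int:
    """Number of scalars in the recurrent state, per the formula quoted above.

    Each layer keeps `vectors_per_layer` vectors of width d_emb, so the
    total does not depend on how many tokens have been processed.
    """
    return vectors_per_layer * d_emb * n_layer


# Hypothetical model shape, chosen only for illustration.
example_size = rwkv_state_size(d_emb=1024, n_layer=24)
print(example_size)  # 98304 scalars, regardless of context length
```

The point of the formula is that the state is a fixed-size summary: it grows with model width and depth but not with sequence length.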

paper dove May 20, 2023, 3:15 AM

#

young sparrow @paper dove @rich raptor the main argument against using RNN...

I agree. We have discussed before, and comparing models under different parameters would be more convincing. But at that time, it was uncertain whether there were data available from the same test set. RWKV seems to have been evaluated on the pile test set, which makes it possible to directly compare it with Pythia.

paper dove May 20, 2023, 3:16 AM

#

young sparrow So the elephant in the room is the scaling laws section. This section is wrong a...

Do you have the Pythia data (compute vs loss and parameters vs loss)?

obsidian quest May 20, 2023, 3:20 AM

#

RWKV 14B ctx1024

#

📎 14B_ctx1024.txt

paper dove May 20, 2023, 3:25 AM

#

obsidian quest RWKV 14B ctx1024

I am a little bit confused, is this pythia 14B data or RWKV 14B data?

misty cedar May 20, 2023, 3:26 AM

#

RWKV

#

I suppose

obsidian quest May 20, 2023, 3:26 AM

#

paper dove I am a little bit confused, is this pythia 14B data or RWKV 14B data?

RWKV

young sparrow May 20, 2023, 3:57 AM

#

paper dove Do you have the Pythia data(compute vs loss and parameters vs loss)？

Pile test set loss for Pythia models:
70M -> 2.504
160M -> 2.186
410M -> 1.971
1B -> 1.845
1.4B -> 1.793
2.8B -> 1.720
6.9B -> 1.626
12B -> 1.582

#

For non-embedding param counts vs model label

#

You’ll need to math FLOPs yourself but it’s easy and there’s a calculator pinned in #scaling-laws if you don’t know how
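
A rough sketch of that FLOPs calculation, using the common approximation C ≈ 6·N·D (training FLOPs ≈ 6 × parameters × tokens). The losses are copied from the message above; the parameter counts are simply read off the model labels and the 300B token count is taken from the discussion in this thread, so both are assumptions here. The non-embedding counts and the pinned calculator may give somewhat different numbers:

```python
# Pile test-set losses for Pythia, copied from the message above.
PYTHIA_TEST_LOSS = {
    "70M": 2.504, "160M": 2.186, "410M": 1.971, "1B": 1.845,
    "1.4B": 1.793, "2.8B": 1.720, "6.9B": 1.626, "12B": 1.582,
}

# Assumption: nominal sizes taken from the labels, not exact non-embedding counts.
NOMINAL_PARAMS = {
    "70M": 70e6, "160M": 160e6, "410M": 410e6, "1B": 1e9,
    "1.4B": 1.4e9, "2.8B": 2.8e9, "6.9B": 6.9e9, "12B": 12e9,
}

TRAIN_TOKENS = 300e9  # assumption: roughly 300B training tokens per model


def approx_train_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute with the C ~= 6 * N * D rule of thumb."""
    return 6.0 * n_params * n_tokens


# (flops, loss, label) triples, ready to plot as loss vs compute.
points = [
    (approx_train_flops(NOMINAL_PARAMS[label], TRAIN_TOKENS), loss, label)
    for label, loss in PYTHIA_TEST_LOSS.items()
]

for flops, loss, label in points:
    print(f"{label:>5}  {flops:.3e} FLOPs  loss={loss:.3f}")
```

With these triples, FLOPs can go on the x-axis and the label can annotate each point, or the x-axis can be swapped for parameters to compare both views.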

paper dove May 20, 2023, 4:20 AM

#

young sparrow Pile test set loss for Pythia models: 70M -> 2.504 160M -> 2.186 410M -> 1.971 1...

nice, I will add pythia data and update figure 6 later

neon night May 20, 2023, 5:25 AM

#

young sparrow But I do think that the explicit scaling laws calculations should be: a) pushed ...

I think appendix B can be back to main text if scaling laws calculation is pushed to appendix

young sparrow May 20, 2023, 6:16 AM

#

neon night I think appendix B can be back to main text if scaling laws calculation is pushe...

This is for arXiv, there is no page limit in the main text. That said, I think B and F are the obvious candidates for promotion. B is useful info for many readers and F seems to be a key property of the architecture

neon night May 20, 2023, 6:33 AM

#

Does the Author Contribution section appear on the final paper?

outer vine May 20, 2023, 7:01 AM

#

just out of curiosity, why is Johan S. Wind not a co-first author? I learned a lot from his blog, and he wrote the cuda kernel for RWKV

young sparrow May 20, 2023, 7:06 AM

#

neon night Does the Author Contribution section appear on the final paper?

Not in the version submitted for peer review, but yes it’ll be in the paper once it’s accepted

tender karma May 20, 2023, 8:33 AM

#

obsidian quest We can show dim(state) = 4 x d_emb x n_layer (namely x, a, b, x for each layer)

Waiting for @neon night to first check the current content and give his opinion on whether, and in which section, we should talk more about investigating the different meanings of those state vectors, and also on whether it makes sense for me to do that

tough crane May 20, 2023, 9:21 AM

#

@mortal latch Increased font size in fig:4.

neon night May 20, 2023, 10:00 AM

#

A new 3.2 highlighting AFT 😇 3.2 needs a new title

neon night May 20, 2023, 10:45 AM

#

https://openreview.net/pdf?id=HyUNwulC- found a paper about parallel scan, I think WKV CUDA doesn't use this paper's technique yet? @fickle hare

fickle hare May 20, 2023, 1:40 PM

#

parallel scan is simple but requires an additional sweep over VRAM

#

it's not always helpful

#

I thought about modifying the impl but ended up finding that we already have sufficient channels to parallelize

#

if you want to mention that, maybe cite it and say "with longer sequences there is potential ..."

neon night May 20, 2023, 1:42 PM

#

You know this paper right? LRU also uses this technique

fickle hare May 20, 2023, 1:42 PM

#

I don't know this paper but parallel scan is so simple a technique...

neon night May 20, 2023, 1:43 PM

#

I mean parallel scan over the time dimension

fickle hare May 20, 2023, 1:43 PM

#

yeah I didn't know this specific paper before your message but I always knew the linear recurrent/wkv can be parallel over the time dimension

neon night May 20, 2023, 1:45 PM

#

https://arxiv.org/pdf/1709.02755.pdf in contrast, this is the paper that parallelizes over the channel dimension in a linear RNN. I cited this paper. I think we're using precisely this paper's method

fickle hare May 20, 2023, 1:46 PM

#

(always curious why simply applying some well-known implementation technique will produce a paper in the AI research field, ever since ring-allreduce was introduced to distributed training)

#

thinkies

neon night May 20, 2023, 1:48 PM

#

tender karma Waiting for @neon night to first check the current content and his op...

@tender karma given that we don't really have a page limit on arXiv, I don't have anything particular against this now. Maybe others can revise it from the perspective of interpretability of state.

neon night May 20, 2023, 1:50 PM

#

fickle hare (always curious why simply applying some well-known implementation technique wil...

because this paper knows it and has the word "simple" in its title?

neon night May 20, 2023, 1:53 PM

#

fickle hare if you want to mention that, maybe cite it and say "with longer sequences there ...

accepted. adding reference is always good

fickle hare May 20, 2023, 1:54 PM

#

neon night https://arxiv.org/pdf/1709.02755.pdf in contrast, this is the paper who do paral...

Seems not really the same. The SRU cell has a nonlinearity in its recurrent path, thus it cannot be further parallelized through time

#

WKV, by contrast, can be parallelized through time, but we simply didn't do that, because there is already sufficient parallelism

#

The current status is the same though.
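
To illustrate why linearity in the recurrent path matters for parallelizing over time, here is a minimal sketch for a generic scalar linear recurrence h_t = a_t * h_{t-1} + b_t. This is not the WKV kernel (which also involves exponential decays and a normalizer); the function names are made up for the example. Each step is an affine map, and composing affine maps is associative, so a scan can combine them in O(log T) rounds at the cost of extra passes over memory:

```python
def combine(left, right):
    """Compose two affine maps h -> a*h + b, applying `left` first, then `right`."""
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)


def sequential_recurrence(a, b, h0=0.0):
    """Reference implementation: one step after another over time."""
    h, out = h0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return out


def scan_recurrence(a, b, h0=0.0):
    """Hillis-Steele inclusive scan: the inner loop of each round is
    independent across positions and could run in parallel."""
    elems = list(zip(a, b))
    n, step = len(elems), 1
    while step < n:
        nxt = list(elems)  # extra sweep over the whole sequence per round
        for i in range(step, n):
            nxt[i] = combine(elems[i - step], elems[i])
        elems, step = nxt, step * 2
    # elems[t] now maps h0 directly to h_t.
    return [A * h0 + B for A, B in elems]


a = [0.5, 0.25, 2.0, 1.0, 0.5]
b = [1.0, 2.0, 3.0, 4.0, 5.0]
seq = sequential_recurrence(a, b, h0=1.0)
par = scan_recurrence(a, b, h0=1.0)
print(seq)
print(par)
```

If the update were nonlinear in h_{t-1}, there would be no such associative combine, which is the distinction drawn above. The copy made in every round is the "additional sweep" that makes the scan not always worthwhile when there are already enough channels to parallelize over.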

neon night May 20, 2023, 2:00 PM

#

I have a question, why is the mode called "time-parallel" while it's not really parallel over time thinkies

#

Anyway I wrote this

tropic minnow May 20, 2023, 2:24 PM

#

neon night I have a question, why is the mode called "time-parallel" while it's not really ...

bc you train with all timesteps at the same time

#

anyway i didnt write it

#

yessir

Captura_de_Pantalla_2023-05-20_a_las_16.28.41.png

#

a100 80gb

obsidian quest May 20, 2023, 2:42 PM

#

young sparrow Pile test set loss for Pythia models: 70M -> 2.504 160M -> 2.186 410M -> 1.971 1...

do we have the final training loss for pythia models (20b tokenizer)

fickle hare May 20, 2023, 2:43 PM

#

neon night I have a question, why is the mode called "time-parallel" while it's not really ...

mostly

#

there are really few flops in wkv anyway

paper dove May 20, 2023, 2:48 PM

#

young sparrow Pile test set loss for Pythia models: 70M -> 2.504 160M -> 2.186 410M -> 1.971 1...

The lowest test loss on RWKV is 1.75, while on Pythia, the lowest loss is 1.582. Cerebras paper states, "Pile test loss is crossentropy in nats/token. We correct all crossentropy results for different vocabularies to be comparable to the GPT-2 vocabulary." Is it because of the difference in vocabulary size? If so, direct comparison may not be appropriate. May I ask if you have the uncorrected loss from Pythia?
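
For context, one common way to make per-token losses comparable across tokenizers is to keep the total cross-entropy of the text fixed and re-divide it by the token count under a reference tokenizer. The sketch below shows only that arithmetic; the function name and the token counts are hypothetical, and whether the Cerebras correction is computed exactly this way is an assumption:

```python
def loss_in_reference_tokens(loss_per_token: float,
                             n_tokens_model: int,
                             n_tokens_reference: int) -> float:
    """Re-express a cross-entropy in nats/token using a reference tokenizer.

    The total nats assigned to the text are unchanged; only the number of
    tokens the total is divided by differs.
    """
    total_nats = loss_per_token * n_tokens_model
    return total_nats / n_tokens_reference


# Hypothetical counts: the same text is 100 tokens under the model's
# tokenizer and 125 tokens under the reference tokenizer.
example = loss_in_reference_tokens(2.0, n_tokens_model=100, n_tokens_reference=125)
print(example)
```

If two models share a tokenizer, the two counts are equal and the per-token losses can be compared directly without any correction.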

obsidian quest May 20, 2023, 2:55 PM

#

paper dove The lowest test loss on RWKV is 1.75, while on Pythia, the lowest loss is 1.582....

someone pls test pile loss of the two models

uneven blade May 20, 2023, 3:03 PM

#

Have we got any agreements on naming consistency? For example, using either time mix, Time Mix, Time Mixing, etc. throughout the paper.

tropic minnow May 20, 2023, 3:07 PM

#

obsidian quest someone pls test pile loss of the two models

im happy to run on an a100 - 80gbs if someone has the script

#

whos author of figure 5? could we see it in log scale for y?

#

this

Captura_de_Pantalla_2023-05-20_a_las_17.08.51.png

tough crane May 20, 2023, 3:14 PM

#

I am not completely sure about the following comment:

"Edward Raff: This needs a call-forward that RWKV will have parallels/relation to QRNN's design, otherwise section 3.1 reads very weirdly."

No need to compare QRNNs and RWKV in the context of parallelizing RNNs ?

tough crane May 20, 2023, 3:15 PM

#

tropic minnow this

Not sure but the original data is here

📎 CTXLENvsLOSS.xlsx

uneven blade May 20, 2023, 3:28 PM

#

📎 multi-round.txt

#

An example multi-round dialogue that could add to the paper.

tropic minnow May 20, 2023, 3:32 PM

#

tough crane Not sure but the original data is here

a bit meh but it shows some improvement even in the final part

Captura_de_Pantalla_2023-05-20_a_las_17.31.40.png

#

📎 plot_ctxlen.py

tropic minnow May 20, 2023, 3:39 PM

#

tough crane I am not completely sure about the following comment: "Edward Raff: This needs ...

i think it is already compared in the qrnn paragraph

young sparrow May 20, 2023, 4:11 PM

#

obsidian quest do we have the final training loss for pythia models (20b tokenizer)

I can do this this evening, but if anyone has a couple GPUs there’s a fun life hack here: if you launch training in GPT-NeoX with a checkpoint that is finished it will spit out the train and test set evals and stop.

mortal latch May 20, 2023, 4:20 PM

#

tropic minnow a bit meh but it shows some improvement even in the final part

Updated in the paper. The y-axis of the original plot was in log scale but it didn't make a large difference.

tropic minnow May 20, 2023, 5:12 PM

#

updated to better represent token-shift

Captura_de_Pantalla_2023-05-20_a_las_19.11.44.png

obsidian quest May 20, 2023, 5:14 PM

#

Please update the "Tell me about ravens." result because I have never seen such bad responses on https://huggingface.co/spaces/BlinkDL/ChatRWKV-gradio
a better example:

Ravens are large, black birds with a distinctive white head and neck. They are found in most parts of the world, including North America, Europe, Asia, and Australia. Ravens are known for their intelligence and problem-solving abilities. They are also considered to be symbols of death and transformation in many cultures. Ravens are often associated with the afterlife or death because they have been known to eat carrion or even other birds. In some cultures, ravens are also believed to be messengers of the gods or guardians of treasure.

tropic minnow May 20, 2023, 5:20 PM

#

kk done. examples look quite cool now

outer vine May 20, 2023, 5:20 PM

#

obsidian quest Please update the "Tell me about ravens." result because I have never seen such ...

done

tough crane May 20, 2023, 6:43 PM

#

neon night A new 3.2 highlighting AFT 😇 3.2 needs a new title

A possible title is: "Transformers and Their Attention Free Variant"

tropic minnow May 20, 2023, 9:08 PM

#

3.2 Transformers and an Attention Free Variant: any reason for the equation 8 duplication?

spiral minnow May 20, 2023, 9:09 PM

#

The last 2 paragraphs in Section 4.6 (Harnessing Temporal Structure for Sequential Data Processing) seem like they belong much more in a future work section, or possibly in the conclusion, right?

tropic minnow May 20, 2023, 9:11 PM

#

Peng Zhou, Qihang Zhao, Rui-Jie Zhu, Jiaming Kong, Johan S. Wind, Samuel Arcadinho @bronze frost @snow zealot pls add affiliation to authors section

tender karma May 20, 2023, 9:21 PM

#

spiral minnow The last 2 paragraphs in Section 4.6 (Harnessing Temporal Structure for Sequenti...

I agree but I've been instructed by @neon night to put the content there and not to create new sections. I would rather move the last two paragraphs into a new Future Work section

spiral minnow May 20, 2023, 9:28 PM

#

tender karma I agree but I've been instructed by @neon night to put the content th...

I think because this is just the arxiv deadline it's okay. And looks like Eric ( @tropic minnow ?) is already putting it into a dedicated section.

tender karma May 20, 2023, 9:35 PM

#

perfect and thanks

tropic minnow May 20, 2023, 9:38 PM

#

@mortal latch objections to moving figure 5 to appendix? its currently in scaling laws but i think it illustrates more the long-context side of rwkv rather than scaling?

tropic minnow May 20, 2023, 9:39 PM

#

tropic minnow `Peng Zhou, Qihang Zhao, Rui-Jie Zhu, Jiaming Kong, Johan S. Wind, Samuel Arcadi...

pls tag others if u know their usernames. lets aim at having this done tmrw. want to send to arxiv on monday

mortal latch May 20, 2023, 9:40 PM

#

tropic minnow @mortal latch objections to moving figure 5 to appendix? its currently i...

Sure! No objections.

spiral minnow May 20, 2023, 11:25 PM

#

tropic minnow pls tag others if u know their usernames. lets aim at having this done tmrw. wan...

FYI you're probably aware but it needs to be submitted to Arxiv by 2pm EST on monday in order for Arxiv to post it on Monday night, then we can promote the work on Tuesday and still be ahead of the anonymity deadline

zealous snow May 20, 2023, 11:37 PM

#

Should we add a footnote stating that the order of authors other than the cofirst authors is alphabetical by last name?

#

and can anyone help to add author affiliation?

mortal latch May 21, 2023, 12:41 AM

#

tropic minnow @mortal latch objections to moving figure 5 to appendix? its currently i...

I made a pass and realized that the original figure 5 is an answer to RQ3 in Section 5 Evaluations. Maybe keep it in the main text?

neon night May 21, 2023, 4:21 AM

#

The appendix about gradient is flawed 😅 let me fix it

gusty condor May 21, 2023, 5:13 AM

#

Suggestion: use log2(context_length).
Also, should the x-axis label be 'Context length' instead of 'Token position'?

mortal latch May 21, 2023, 5:38 AM

#

gusty condor Suggestion: use log2(context_length). Also, should the x-axis label be 'Context...

It was context length before. I can fix it. However, using log(context length) can make the difference between lines a bit hard to tell..

gusty condor May 21, 2023, 5:38 AM

#

Also, the first sentence in the abstract:
Transformers have "revolutionalized" almost all natural language processing (NLP) tasks but suffer from memory and computational complexity that scales quadratically with sequence length
should be revolutionized, not revolutionalized

gusty condor May 21, 2023, 5:40 AM

#

mortal latch It was context length before. I can fix it. However, using log(context length) c...

I don't think so. The difference is on y-axis, which is not related to the scaling of x-axis.

mortal latch May 21, 2023, 5:45 AM

#

gusty condor I don't think so. The difference is on y-axis, which is not related to the scali...

gusty condor May 21, 2023, 5:47 AM

#

It's really clear.

#

"Context length" not "content length"

#

Also, in the y-axis, "x 10^0" is not necessary

mortal latch May 21, 2023, 5:52 AM

#

It has been changed to base 2

#

See the latest version

pale nexus May 21, 2023, 5:56 AM

#

still missing some authors affiliation

mortal latch May 21, 2023, 5:57 AM

#

I have added the affiliation info in the main text. However, for authors without affiliations, they are affiliated with EleutherAI for now. This information will be updated later.

gusty condor May 21, 2023, 5:58 AM

#

Thanks! Should I add my contribution?
\paragraph{Ruichong Zhang - Tsinghua University} Proofreading and typo corrections; Advice on \ref{fig:ctxlen_rwkv_loss}.

subtle oak May 21, 2023, 8:27 AM

#

Hi all, I added my affiliation (20, 21), but I found that the space between the author list and the abstract is extremely tight, and using the \vspace command does not solve that. Can anyone help to fix it?

tropic minnow May 21, 2023, 8:36 AM

#

zealous snow Should we add a footnote stating that the order of authors other than the cofirs...

Yes we should

tropic minnow May 21, 2023, 8:36 AM

#

zealous snow and can anyone help to add author affiliation?

I will, once all authors affiliations are there

zealous snow May 21, 2023, 8:39 AM

#

ok i fix it

zealous snow May 21, 2023, 8:39 AM

#

subtle oak Hi all, I add my affiliation institute (20,21), but I found that the space betwe...

i fix it

subtle oak May 21, 2023, 8:40 AM

#

zealous snow i fix it

Thanks a lot!

young sparrow May 21, 2023, 8:41 AM

#

zealous snow Should we add a footnote stating that the order of authors other than the cofirs...

It will look best if you add this to the “equal contribution” footnote. Something like \footnote{Equal first authorship. All other authors are listed alphabetically}

#

Also, in cases where people have the same last name, the standard in English is to alphabetize by first name. So the end of the list should go Jian Zhu, Peng Zhu, Rui-Jie Zhu.
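
A small sketch of that ordering rule: sort by last name, then break ties by first name. It assumes the family name is the final whitespace-separated token of each name, which will not hold for every author; the helper name is made up for the example:

```python
def author_sort_key(full_name: str) -> tuple:
    """Sort key: (last name, first names), compared case-insensitively.

    Assumption: the last whitespace-separated token is the family name.
    """
    first, _, last = full_name.rpartition(" ")
    return (last.casefold(), first.casefold())


# The three names from the message above, deliberately out of order.
authors = ["Rui-Jie Zhu", "Jian Zhu", "Peng Zhu"]
ordered = sorted(authors, key=author_sort_key)
print(ordered)  # ['Jian Zhu', 'Peng Zhu', 'Rui-Jie Zhu']
```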

#

@obsidian quest I know people said you can put whatever affiliation you want, but listing an “organization” that doesn’t exist will cause confusion because people will try to look it up.

obsidian quest May 21, 2023, 8:51 AM

#

young sparrow @obsidian quest I know people said you can put whatever affiliation you wa...

ok can we use RWKV.com which exists

zealous snow May 21, 2023, 8:52 AM

#

young sparrow It will look best if you add this to the “equal contribution” footnote. Somethin...

done

young sparrow May 21, 2023, 8:55 AM

#

obsidian quest ok can we use RWKV.com which exists

It seems like that redirects to GitHub?

#

Is there a reason you don’t want to put either “independent researcher” or “EleutherAI”? I was expecting you to put one of those

obsidian quest May 21, 2023, 8:59 AM

#

young sparrow It seems like that redirects to GitHub?

there will be a landing page very soon

obsidian quest May 21, 2023, 9:00 AM

#

young sparrow Is there a reason you don’t want to put either “independent researcher” or “Eleu...

to promote the nonprofit independent RWKV foundation when it is formed

outer vine May 21, 2023, 9:04 AM

#

zealous snow i fix it

maybe try this? \setlength\titlebox{5.5cm} rather than \small ?

zealous snow May 21, 2023, 9:05 AM

#

outer vine maybe try this? \setlength\titlebox{5.5cm} rather than \small ?

Well, anyway, if you think it's ugly, do your best to improve it

young sparrow May 21, 2023, 9:07 AM

#

obsidian quest to promote the nonprofit independent RWKV foundation when it is formed

When do you expect to create it?

paper dove May 21, 2023, 9:09 AM

#

obsidian quest to promote the nonprofit independent RWKV foundation when it is formed

how to join this foundation

zealous snow May 21, 2023, 9:15 AM

#

By the way, may I ask what is your timeline for scaling RWKV to 100B?

#

and the 1.7T data version

tough crane May 21, 2023, 9:23 AM

#

IMO, training for 100B params could be after 20B(GPT-NeoX), 30B(OPT, LLaMA), 60B(OPT, LLaMA), 70B(Chinchilla)

#

But BlinkDL might have a more aggressive plan.

young sparrow May 21, 2023, 9:32 AM

#

We need correct scaling laws studies before making decisions about substantially larger models

tough crane May 21, 2023, 9:40 AM

#

it's like a RWKV version of Pythia.

young sparrow May 21, 2023, 9:41 AM

#

I would love to see RWKV Pythia

tropic minnow May 21, 2023, 9:55 AM

#

obsidian quest there will be a landing page very soon

@obsidian quest anyone you'd like to acknowledge for the compute? Should we add an acknowledgement to the community in the RWKV server?

plucky crypt May 21, 2023, 10:07 AM

#

Hi. I ran a few experiments comparing RWKV, ChatGPT, and GPT-4. The results are not stunning, but I have included all of them at the end of the appendix with a comment. This is only a draft, so if you agree to include this section in the final version of the article, I will edit it.

tropic minnow May 21, 2023, 10:30 AM

#

author of Figure 9: Effect of small initialization embedding? can we try having it as EPS or PDF format? so quality is preserved under resize

paper dove May 21, 2023, 11:18 AM

#

young sparrow We need correct scaling laws studies before making decisions about substantially...

what's your plan for scaling laws figures

paper dove May 21, 2023, 11:18 AM

#

tropic minnow author of `Figure 9: Effect of small initialization embedding`? can we try havin...

sure, I am the author of Figure 9

obsidian quest May 21, 2023, 11:27 AM

#

tropic minnow <@870137517020688415> anyone you'd like to acknowledge for the compute? Should w...

yeah acknowledge EAI & Stability for compute & support

obsidian quest May 21, 2023, 11:28 AM

#

plucky crypt Hi. I made a few experiments to compare RWKV , ChatGPT and GPT-4. Results are no...

better compare with open source models too

#

@plucky crypt ok you can include [RWKV-4 w/ GPT prompt] & [RWKV-4 w/ optimized prompt] in Table 6

#

And note that P-tuning can be very effective for RWKV because we can directly tune the full state, and we will do this in follow-up papers

obsidian quest May 21, 2023, 11:40 AM

#

young sparrow We need correct scaling laws studies before making decisions about substantially...

I will finish 0.1~14B RWKV-4 "World" (100 langs) and RWKV-5 first

#

I find even 0.1B RWKV-4 "World" can chat in 100 langs

paper dove May 21, 2023, 11:54 AM

#

obsidian quest I find even 0.1B RWKV-4 "World" can chat in 100 langs

wow, amazing finding. because pile dataset contain 100 langs ?

obsidian quest May 21, 2023, 11:57 AM

#

"World" is using some https://huggingface.co/datasets/oscar-corpus/OSCAR-2301

paper dove May 21, 2023, 12:02 PM

#

tropic minnow author of `Figure 9: Effect of small initialization embedding`? can we try havin...

updated

obsidian quest May 21, 2023, 12:13 PM

#

zealous snow and the 1.7T data version

already training PilePlus https://huggingface.co/BlinkDL/rwkv-4-pileplus

neon night May 21, 2023, 12:20 PM

#

neon night The appendix about gradient is flawed 😅 let me fix it

a ridiculous appendix is on the way.. a trailer

tropic minnow May 21, 2023, 12:24 PM

#

> Scientific work published at EMNLP 2023 must comply with the \href{https://www.aclweb.org/portal/content/acl-code-ethics}{ACL Ethics Policy}. We encourage all authors to include an explicit ethics statement on the broader impact of the work, or other ethical considerations after the conclusion but before the references. The ethics statement will not count toward the page limit (8 pages for long, 4 pages for short papers).

we can think about this for the EMNLP submission. monday soft deadline is about arxiv

tropic minnow May 21, 2023, 12:33 PM

#

obsidian quest yeah acknowledge EAI & Stability for compute & support

done👍

gusty condor May 21, 2023, 12:57 PM

#

Excuse me, which specific CPU is used in the experiment of Appendix J?

tropic minnow May 21, 2023, 1:05 PM

#

gusty condor Excuse me, which specific CPU is used in the experiment of Appendix J?

@snow zealot ?

tropic minnow May 21, 2023, 1:06 PM

#

gusty condor Excuse me, which specific CPU is used in the experiment of Appendix J?

no ARM for now, just x86. will try arm and AMD gpu experiments for emnlp if possible

obsidian quest May 21, 2023, 1:25 PM

#

better RNN cell graph 🙂 pls update

gusty condor May 21, 2023, 1:35 PM

#

tropic minnow no ARM for now, just x86. will try arm and AMD gpu experiments for emnlp if poss...

There are plenty of x86 CPUs. Intel? AMD?

#

How many cores did it use?

tropic minnow May 21, 2023, 1:50 PM

#

gusty condor How many cores did it use?

@snow zealot

tropic minnow May 21, 2023, 1:55 PM

#

obsidian quest better RNN cell graph 🙂 pls update

done

📎 rwkv_as_rnn_3.drawio

Captura_de_Pantalla_2023-05-21_a_las_15.52.22.png

#

alright i'll make a pass in a few hours to standardize the remaining rough edges. pls make all planned remaining contributions asap.

regal basalt May 21, 2023, 1:59 PM

#

What's the deadline again?

snow zealot May 21, 2023, 2:02 PM

#

gusty condor How many cores did it use?

Lambda cloud instance with 30 CPUs, 200 GiB RAM, and an A100 with 40 GB

tropic minnow May 21, 2023, 2:07 PM

#

regal basalt What's the deadline again?

we're aiming tmrw for arxiv. EMNLP deadline is mid june

regal basalt May 21, 2023, 2:08 PM

#

tropic minnow we're aiming tmrw for arxiv. EMNLP deadline is mid june

thumbs_up

young sparrow May 21, 2023, 2:11 PM

#

It seems like Appendix F is really important, in that it’s part of what allows us to train RWKV models at large scale. If that’s the case, it should be in the main body

tropic minnow May 21, 2023, 2:15 PM

#

young sparrow It seems like Appendix F is really important, in that it’s part of what allows u...

but it is of little novelty compared to attention free transformer: https://arxiv.org/abs/2105.14103

arXiv.org

An Attention Free Transformer

We introduce Attention Free Transformer (AFT), an efficient variant of
Transformers that eliminates the need for dot product self attention. In an AFT
layer, the key and value are first combined with a set of learned position
biases, the result of which is multiplied with the query in an element-wise
fashion. This new operation has a memory comp...

young sparrow May 21, 2023, 2:19 PM

#

tropic minnow but it is of little novelty compared to attention free transformer: https://arxi...

Are you saying “gradient stability isn’t novel because other models also have stable gradients”?

obsidian quest May 21, 2023, 2:19 PM

#

tropic minnow done

cool pls fix position of [r_t] and color of [sigmoid]. move [sigmoid] slightly rightward
Move [3] and (X) slightly upward

young sparrow May 21, 2023, 2:27 PM

#

obsidian quest cool pls fix position of [r_t] and color of [sigmoid]. move [sigmoid] slightly...

At its core, why does this work and AFT doesn’t?

obsidian quest May 21, 2023, 2:27 PM

#

young sparrow At its core, why does this work and ATF doesn’t?

1. different w for different channel
2. token-shift

young sparrow May 21, 2023, 2:28 PM

#

obsidian quest 1. different w for different channel 2. token-shift

What does the section on Gradient stability show improvement over? What similar models lack stable gradients?

obsidian quest May 21, 2023, 2:29 PM

#

young sparrow What does the section on Gradient stability show improvement over? What similar ...

the w in AFT can be ill-posed

#

while in RWKV it has to be a simple exponential decay
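
For reference, a sketch of the contrast being drawn here, written per channel and ignoring RWKV's extra bonus term for the current token. The notation is an assumption chosen for illustration; it follows the usual way the AFT-style weighted average is written rather than anything quoted in this thread.

```latex
% AFT-style causal weighted average for one channel:
\[
  \mathrm{out}_t \;=\;
  \frac{\sum_{i=1}^{t} e^{\,w_{t,i} + k_i}\, v_i}
       {\sum_{i=1}^{t} e^{\,w_{t,i} + k_i}}
\]
% AFT: each w_{t,i} is a freely learned pairwise position bias,
%      so nothing constrains how it varies with (t, i).
% RWKV: the bias is tied to a single per-channel decay rate w >= 0:
\[
  w_{t,i} \;=\; -(t - i)\, w , \qquad w \ge 0
\]
% i.e. the weight on token i shrinks as e^{-(t-i) w}: a plain exponential decay.
```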

neon night May 21, 2023, 2:31 PM

#

I think AFT is also stable (if w is chosen properly), we are comparing gradient stability against RNNs

#

a new Appendix F shows that AFT's KV operation is stable

obsidian quest May 21, 2023, 2:32 PM

#

both are much better than usual RNNs

neon night May 21, 2023, 2:33 PM

#

Yes we're not so novel against AFT but the AFT paper doesn't prove stability like us

obsidian quest May 21, 2023, 2:37 PM

#

it's natural to arrive at AFT when we linearize QKV attention - the main contribution of AFT is they find sigmoid[Q] & exp[K] is a great combination

young sparrow May 21, 2023, 2:37 PM

#

obsidian quest the w in AFT can be ill-posed

Can you explain this more? Why does it happen? What precisely does it mean?

obsidian quest May 21, 2023, 2:38 PM

#

I think it won't happen in reality when you train an AFT
AFT is stable. It just has less capacity, so the LM performance is not very good

young sparrow May 21, 2023, 2:40 PM

#

Okay, so Eric’s comments about novelty compared to AFT are irrelevant

neon night May 21, 2023, 2:49 PM

#

We can replace 4.5 by Appendix F. Appendix F is more rigorous than 4.5, just a bit scary

tropic minnow May 21, 2023, 3:04 PM

#

young sparrow Okay, so Eric’s comments about novelty compared to ATF are irrelevant

wdym irrelevant? in appendix F we show AFT-like operations are stable (see #1103039376184852622 message ). so my comment about novelty is about us showing more proof that something others did shows nice properties

#

so its basically this conclusion: #1103039376184852622 message

tropic minnow May 21, 2023, 3:11 PM

#

obsidian quest cool pls fix position of [r_t] and color of [sigmoid]. move [sigmoid] slightly...

like it?

Captura_de_Pantalla_2023-05-21_a_las_17.10.55.png

obsidian quest May 21, 2023, 3:12 PM

#

tropic minnow like it?

perfect if find a slightly brighter color for [sigmoid]

tropic minnow May 21, 2023, 3:14 PM

#

obsidian quest perfect if find a slightly brighter color for [sigmoid]

?😆

Captura_de_Pantalla_2023-05-21_a_las_17.14.39.png

neon night May 21, 2023, 3:19 PM

#

tropic minnow wdym irrelevant? in appendix F we show AFT-like operations are stable (see https...

guys, this topic is very hard to argue. I thought about it whole day

tropic minnow May 21, 2023, 3:22 PM

#

neon night guys, this topic is very hard to argue. I thought about it whole day

i think your conclusion is quite fair

tropic minnow May 21, 2023, 3:24 PM

#

plucky crypt Hi. I made a few experiments to compare RWKV , ChatGPT and GPT-4. Results are no...

can we do this pls: #1103039376184852622 message. like that result around 74% you mention, put it in the table

tropic minnow May 21, 2023, 3:27 PM

#

obsidian quest And note that P-tuning can be very effective for RWKV because we can directly tu...

added in 8. Future work paragraph about potential of model state.

tropic minnow May 21, 2023, 3:34 PM

#

plucky crypt Hi. I made a few experiments to compare RWKV , ChatGPT and GPT-4. Results are no...

for chatgpt, we should probably add a footnote with date accessed/retrieved as its a live changing product

#

@bronze frost TODO: proofreading this is a good moment. pls leave latex comments wherever you find something wrong/(that could be improved)/(that needs details)

#

All, we're approaching the soft deadline for monday. Paper is looking very good. Thanks everyone for your contributions. Now it's about improving those rough edges.

Will do a pass later for standardizing affiliations and author contributions to format specified at section start. Will comment the current ones so information is preserved. Pls make sure information is there.

plucky crypt May 21, 2023, 4:08 PM

#

tropic minnow can we do this pls: https://discord.com/channels/729741769192767510/110303937618...

because of limited time and resources I have now only results for 2 datasets with optimized prompts. How many hours do we have?

tropic minnow May 21, 2023, 4:44 PM

#

plucky crypt because of limited time and resources I have now only results for 2 datasets wit...

12-16 would be reasonable

tropic minnow May 21, 2023, 4:44 PM

#

plucky crypt because of limited time and resources I have now only results for 2 datasets wit...

add the current ones and put empty spaces or - for the rest pls

#

how many would you need

plucky crypt May 21, 2023, 4:51 PM

#

ok, I will try to find good prompts for the rest of the datasets and run experiments, for now I will put -

young sparrow May 21, 2023, 6:57 PM

#

I’m trying to run the Pile test set eval from scratch on Pythia but something seems to be very wrong with the runtime. Going to do some debugging and report back

#

Ah I was using a batch size of 1

#

This seems weirdly low? Pythia 70M

|           Task           |Version|    Metric     | Value  |   |Stderr|
|--------------------------|------:|---------------|-------:|---|------|
|json=train:text:test.jsonl|      0|word_perplexity|133.5446|   |      |
|                          |       |byte_perplexity|  2.0859|   |      |
|                          |       |bits_per_byte  |  1.0607|   |      |
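
As a quick sanity check on the table, the two byte-level metrics are tied together: bits per byte is the base-2 logarithm of byte perplexity. The sketch below only illustrates that relationship; the function name is made up here and is not the eval harness's API.

```python
import math


def bits_per_byte(byte_perplexity: float) -> float:
    """Convert byte-level perplexity to bits per byte (log base 2)."""
    return math.log2(byte_perplexity)


# Byte perplexity reported for Pythia 70M in the table above.
pythia_70m_byte_ppl = 2.0859
print(f"{bits_per_byte(pythia_70m_byte_ppl):.4f} bits/byte")
```

The result agrees with the reported bits_per_byte row to rounding, so the table is at least internally consistent; whether the absolute numbers are "weirdly low" is a separate question.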

#

@paper dove @obsidian quest how did you compute Pile test loss for RWKV?

tender karma May 21, 2023, 7:24 PM

#

young sparrow I’m trying to run the Pile test set eval from scratch on Pythia but something se...

I was doing the same on 4xRTX A5000+NvLink and got issues with the runtime as well

#

batch size?

young sparrow May 21, 2023, 7:25 PM

#

Pythia-410M

|           Task           |Version|    Metric     | Value |   |Stderr|
|--------------------------|------:|---------------|------:|---|------|
|json=train:text:test.jsonl|      0|word_perplexity|39.8875|   |      |
|                          |       |byte_perplexity| 1.7397|   |      |
|                          |       |bits_per_byte  | 0.7988|   |      |

young sparrow May 21, 2023, 7:37 PM

#

tender karma I was doing the same on 4xRTX A5000+NvLink and got issues with the runtime as w...

Now that I am using BS > 1, it's RWKV that's extremely slow

young sparrow May 21, 2023, 8:15 PM

#

I removed a ton of \vspace commands. Using \vspace is a very crude method for arranging figures. It is a) strongly discouraged in general and b) the absolute last thing you should do on a paper. The removal of over 50 \vspace commands appears to have made no visually obvious changes to the paper

obsidian quest May 21, 2023, 9:38 PM

#

young sparrow Now that I am using BS > 1, it's RWKV that's extremely slow

HF rwkv is still buggy. avoid

#

@young sparrow do you have raw token loss for pythia models

tropic minnow May 21, 2023, 9:45 PM

#

@young sparrow here's the script i used to benchmark time and memory consumption, which downloads weights from HF and loads using the rwkv pip package. maybe its helpful

📎 inference_time_rwkv.py

tender karma May 21, 2023, 9:49 PM

#

It is! Thank you @tropic minnow

tropic minnow May 21, 2023, 9:50 PM

#

credits to @snow zealot for the development hahah

tropic minnow May 21, 2023, 10:59 PM

#

Should we discuss what happened to the Scaling Laws section? I acknowledge there have been previous objections ( #1103039376184852622 message ) and they are commented now. Any reason?

young sparrow May 21, 2023, 11:36 PM

#

tropic minnow Should we discuss what happened to the `Scaling Laws` section? I acknowledge th...

I commented it out because as I reported previously it is
a) done incorrectly
b) doesn’t seem to provide evidence for any of the paper’s claims (even if it were correct)
c) gives equations and guidance that will mislead people (because they’re incorrect)

#

I’ve spent a lot of time trying to figure out a way to post-hoc correct them and I can’t find one.

#

I think that a) should be disqualifying in and of itself, but even if it’s not then b) and c) seem to refute any alleged usefulness.

#

Let’s do it right and put it in the EMNLP submission. But there’s a basic responsibility to not put incorrect and misleading information in the preprint.

tropic minnow May 21, 2023, 11:49 PM

#

young sparrow I commented it out because as I reported previously it is a) done incorrectly b)...

okay. this is quite a major change. please lets report here for these kinds of modifications

young sparrow May 22, 2023, 12:06 AM

#

tropic minnow okay. this is quite a major change. please lets report here for these kinds of m...

I did. I discussed it with BlinkDL and Quentin ahead of time. I reported that I wanted to do it and had conversations about how it might be rescued with several people.

#

I’m sorry that I didn’t mention it explicitly again when I made the change.

neon night May 22, 2023, 1:16 AM

#

so the conclusion here should change. Also I changed "draw parallelisms" to "draw parallels"

young sparrow May 22, 2023, 1:18 AM

#

neon night so the conclusion here should change. Also I changed "draw parallelisms" to "dra...

We do showcase the scaling

#

We scale the model to 14B params and compare performance with transformers

#

The fact that we don’t derive explicit scaling laws doesn’t mean we don’t showcase scaling

gusty condor May 22, 2023, 1:30 AM

#

The spelling of "behavio(u)r" lacks consistency

young sparrow May 22, 2023, 1:31 AM

#

young sparrow To be clear by “replicate this plot” what I mean is take this plot and add Pythi...

This plot doesn't match what I'm getting for Pile test loss? It makes RWKV look worse though

gusty condor May 22, 2023, 1:33 AM

#

gusty condor The spelling of "behavio(u)r" lacks consistency

Corrected to 'behavior'

gusty condor May 22, 2023, 1:58 AM

#

Table 6 is too small, barely identifiable

neon night May 22, 2023, 3:52 AM

#

the same appendix J ends weirdly, almost like abruptly, and maybe the indentation should change

neat heron May 22, 2023, 3:57 AM

#

Been going down a RWKV deep dive recently while scouting for good base models to work with. Great coincidence that there happens to be so much discussion around it at this same time 🙂

#

I honestly think it's doing a big disservice by referring to itself as just an RNN. I feel like the fact that it's ultimately derived from Apple's attention free transformer is one of the most interesting aspects but seldom talked about e_think

#

Maybe the AFT isn't the most flattering aspect, but I think that it's just very interesting and catches the eye to warrant a deeper dive, at least that's what happened for me

young sparrow May 22, 2023, 4:03 AM

#

neat heron I honestly think it's doing a big disservice by referring to itself as **just** ...

The current paper draft stresses that heavily

#

Have you read it?

neat heron May 22, 2023, 4:03 AM

#

young sparrow Have you read it?

nope, I was trying to scroll up to see if I can find the draft, but so many message lol

young sparrow May 22, 2023, 4:04 AM

#

neat heron nope, I was trying to scroll up to see if I can find the draft, but so many mess...

It’s pinned

neat heron May 22, 2023, 4:04 AM

#

Ok I shall read it now, and i'm happy to hear you guys are stressing that part heavily 🙂

neat heron May 22, 2023, 4:59 AM

#

Gotta go to bed soon so mainly doing slow skimming through the paper, but I'd just like to say that I have one of the most common types of color-blindness, and I approve the colors used for the charts 👍 Very easy for me to distinguish the lines 🙂

#

Overall it's a great looking paper and I love that last couple sentences at the conclusion WICKED
and the fact that it significantly beats ChatGPT in MathQA is seriously impressive, and that's not even the RWKV model trained on 1.7 trillion tokens yet. (or is it?)

uneven blade May 22, 2023, 5:02 AM

#

It's not 🙂

neat heron May 22, 2023, 5:04 AM

#

uneven blade It's not 🙂

So much potential, I wish the paper great success and I'll do a deeper dive on it tomorrow, can't wait to fine-tune some insane models on RWKV-V12-14B once it's fully trained on almost 2T tokens 🔥

gusty condor May 22, 2023, 5:14 AM

#

neat heron Overall it's a great looking paper and I love that last couple sentences at the ...

I think that there might be intrinsic "think-step-by-step" mechanisms in RNNs

neat heron May 22, 2023, 5:16 AM

#

gusty condor I think that there might be intrinsic "think-step-by-step" mechanisms in RNNs

yea e_think

#

Might sound a bit out there in terms of paper discussion, but I saw this mentioned somewhere amongst the HF X Raven announcement a few days ago and found it interesting;
RNNs, or at least the way RWKV does things, seem to be more closely mimicking certain aspects of the brain.

#

specifically in terms of the locality vs non-locality aspects (Transformers being more of the latter, while RWKV and the human brain tend to be more of the former)

#

and then that makes me also think though, I recall some research showing that the hippocampus and learning centers of the human brain actually have a strong thing in common with the transformer architecture, it would be interesting if RWKV turns out to conserve this aspect that has similarities with the hippocampus e_think https://www.quantamagazine.org/how-ai-transformers-mimic-parts-of-the-brain-20220912/

tough crane May 22, 2023, 5:31 AM

#

gusty condor I think that there might be intrinsic "think-step-by-step" mechanisms in RNNs

Do you think RWKV could outperform transformers on CoT tasks because of its architecture?

obsidian quest May 22, 2023, 6:29 AM

#

we should test https://www.pnas.org/doi/10.1073/pnas.2105646118 one day

gusty condor May 22, 2023, 7:31 AM

#

tough crane Do you think RWKV could outperform CoT task's performance against transformers b...

Not necessarily, RWKV does not look back at previous tokens

neon night May 22, 2023, 8:19 AM

#

@obsidian quest I added this because I think time decay (fig. 9) is inductive bias?

obsidian quest May 22, 2023, 8:21 AM

#

neon night <@870137517020688415> I added this because I think time decay (fig. 9) is induct...

yes but the model will learn similar patterns with simple initialization so the initialization is just to speed up convergence

neon night May 22, 2023, 8:34 AM

#

neon night <@870137517020688415> I added this because I think time decay (fig. 9) is induct...

...but also come from parameter initialization as it speeds up training.

pale nexus May 22, 2023, 8:41 AM

#

> While many alternatives Transformers have been proposed with similar , ours is the first to back up those claims with models

What are the proposed alternatives?

tropic minnow May 22, 2023, 9:47 AM

#

pale nexus > While many alternatives Transformers have been proposed with similar , ours i...

AFT transformer? linformer? qrnn? sparse attention? state space models? most papers mentioned in the 2. Related Work

neon night May 22, 2023, 11:18 AM

#

added more citations in 4.5 about tackling gradient problem in RNN

tender karma May 22, 2023, 11:28 AM

#

neon night added more citations in 4.5 about tackling gradient problem in RNN

lgtm

gusty condor May 22, 2023, 12:08 PM

#

Almost deadline?

tropic minnow May 22, 2023, 12:12 PM

#

gusty condor Almost deadline?

yes

last mauve May 22, 2023, 12:31 PM

#

I'm doing a final pass and submitting to arxiv over the next hour

outer vine May 22, 2023, 12:32 PM

#

Hi, I left two comments yesterday, but they haven't been resolved

#

may i just split this into blocks? it seems not a consecutive dialog flow

gusty condor May 22, 2023, 12:41 PM

#

No for reproducibility

last mauve May 22, 2023, 12:42 PM

#

last mauve Ok everyone, we're reaching the finish line for the v1 arxiv. **A few new tempor...

@everyone -- There have been a few new authors added since this deadline:

- ~~Bartłomiej Koptyra~~
- ~~Bolun Wang~~
- ~~Ruichong Zhang~~
- ~~Stanisław Woz~~

If you're on this list, please DM me and prove that you contributed before the deadline and describe what you did. We want the RWKV community to be authors, but we need to guard against people jumping in before the deadline, adding a comma, and claiming authorship.

If I don't hear back you will be removed.

ionic patio May 22, 2023, 12:43 PM

#

Any more room to contribute?

#

Ah guess not

last mauve May 22, 2023, 12:44 PM

#

ionic patio Any more room to contribute?

As a new contributor? We're full for the arxiv version. Followup papers will need more hands though.

outer vine May 22, 2023, 12:54 PM

#

gusty condor No for reproducibility

are these cases from you? you just type these samples in one dialogue flow?

young sparrow May 22, 2023, 12:54 PM

#

tropic minnow a bit meh but it shows some improvement even in the final part

Does RWKV run with arbitrary context length out of the box, or is there some form of adaptation needed, or what? This experiment has almost no explanation in the paper as to how it was done currently.

tropic minnow May 22, 2023, 1:01 PM

#

young sparrow Does RWKV run with arbitrary context length out of the box, or is there some for...

from first principles yes the architecture should run with any length. in practice, this is the setup #1103039376184852622 message

#

for the model evaluated, i think its base rwkv-4 most likely but would be nice to know more @obsidian quest (data comes from: #1103039376184852622 message)

obsidian quest May 22, 2023, 1:03 PM

#

young sparrow Does RWKV run with arbitrary context length out of the box, or is there some for...

can run with arbitrary context length out of the box if trained with correct method
but now i am using a lazy method so limited by training ctxlen

gusty condor May 22, 2023, 1:03 PM

#

outer vine are these cases from you? you just type these samples in one dialogue flow?

The last conversation is from me, which is summarization

#

I intentionally set topp=0

#

This is good for reproducibility
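
A minimal sketch of why top-p = 0 makes generation repeatable, assuming a typical nucleus (top-p) sampler that keeps the smallest set of most likely tokens whose cumulative probability reaches `top_p` and always keeps at least one token. The function names are hypothetical and this is not the sampler from any particular RWKV package; other implementations may handle the edge case differently.

```python
import random


def top_p_filter(probs, top_p):
    """Return indices of the smallest high-probability set reaching top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, total = [], 0.0
    for i in order:
        kept.append(i)          # always keeps at least the most likely token
        total += probs[i]
        if total >= top_p:
            break
    return kept


def sample_top_p(probs, top_p, rng=random):
    """Sample one token index from the renormalized top-p set."""
    kept = top_p_filter(probs, top_p)
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]


probs = [0.1, 0.6, 0.3]
print(top_p_filter(probs, 0.0))   # only the most likely token survives
print(top_p_filter(probs, 0.8))   # the two most likely tokens survive
```

Under this assumption, top-p = 0 reduces sampling to greedy decoding, so the same prompt yields the same continuation on every run.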

young sparrow May 22, 2023, 1:05 PM

#

obsidian quest can run with arbitrary context length out of the box if trained with correct met...

Okay so some important details:

1. How many tokens was it finetuned with?
2. What finetuning settings were used?
3. Are the evals shown in the comparison with Pythia, OPT, etc. done with ctx 1024 or 2048?

obsidian quest May 22, 2023, 1:08 PM

#

young sparrow Okay so some important details: 1. How many tokens was it finetuned with? 2. Wha...

these zeroshot tasks only care abt short ctxlen
the rwkv ctx1k (pile 1-epoch model) has similar numbers

young sparrow May 22, 2023, 1:18 PM

#

obsidian quest these zeroshot tasks only care abt short ctxlen the rwkv ctx1k (pile 1-epoch mod...

I’m not sure which of my questions this is supposed to answer

regal basalt May 22, 2023, 1:23 PM

#

outer vine

Yea sure I was thinking about making this a set of logic questions when I added this lol (but if it’s too cluttered then nah)

obsidian quest May 22, 2023, 1:25 PM

#

young sparrow I’m not sure which of my questions this is supposed to answer

Typical method:

* ctx1k -> 2k [10B tokens] -> 4k [till almost-plateau] for 1B5 / 3B
* ctx1k -> 2k [10B tokens] -> 4k [10B tokens] -> 6k [10B tokens] -> 8k [till almost-plateau] for 7B / 14B

The zero-shot numbers are almost unchanged after these.
I computed Pythia numbers with full test samples, and I think all of them are less than 1k tokens.

boreal atlas May 22, 2023, 1:38 PM

#

last mauve @everyone -- There have been a few new authors added since this deadline: - ~~Ba...

@worn bloom and Stanislaw Wozniak along with us (Przemyslaw Kazienko and Jan Kocon) performed comparative experimental studies on ChatGPT (Appendix J).

#

There is no reference to Fig. 6 (\ref{fig:inference_time}). I would suggest Samuel Arcadinho adding it in Sec. 6.

young sparrow May 22, 2023, 1:53 PM

#

obsidian quest Typical method: * ctx1k -> 2k [10B tokens] -> 4k [till almost-plateau] for 1B5 /...

Do you have a plot showing loss over the course of context length extension training

last mauve May 22, 2023, 1:58 PM

#

boreal atlas <@993484180472209449> and Stanislaw Wozniak along with us (Przemyslaw Kazienko ...

Yep they have both verified with me as well.

#

The paper has been submitted to arxiv.

gusty condor May 22, 2023, 3:08 PM

#

We might be able to see it at 9AM Beijing time tomorrow morning (UTC+8)

obsidian quest May 22, 2023, 3:40 PM

#

young sparrow Do you have a plot showing loss over the course of context length extension trai...

can build from wandb records when i am less busy 😂

obsidian quest May 22, 2023, 3:59 PM

#

last mauve @everyone -- There have been a few new authors added since this deadline: - ~~Ba...

Bolun Wang is sick atm (covid) 🙃

last mauve May 22, 2023, 3:59 PM

#

obsidian quest Bolun Wang is sick atm (covid) 🙃

Can you vouch for his authorship?

obsidian quest May 22, 2023, 4:00 PM

#

last mauve Can you vouch for his authorship?

yes

last mauve May 22, 2023, 4:01 PM

#

obsidian quest yes

Great! Then everyone is accounted for and will keep authorship.

young sparrow May 22, 2023, 4:02 PM

#

@obsidian quest the current paper is set to be announced on arXiv in 8 hours. Do you have a plan regarding a Twitter thread / announcement?

obsidian quest May 22, 2023, 4:02 PM

#

last mauve Can you vouch for his authorship?

Bolun Wang: RuoxinTech

obsidian quest May 22, 2023, 4:03 PM

#

young sparrow <@870137517020688415> the current paper is set to be announced on arXiv in 8 hou...

yes EAI can tweet and I will retweet

young sparrow May 22, 2023, 4:04 PM

#

obsidian quest yes EAI can tweet and I will retweet

Okay, do you want me to write the thread?

obsidian quest May 22, 2023, 4:05 PM

#

young sparrow Okay, do you want me to write the thread?

okay

young sparrow May 22, 2023, 4:13 PM

#

@everyone if you are an author of the paper and are on Twitter, please DM me your Twitter username so I can tag you in the thread when it goes live in six-ish hours.

#

Also, does anyone know what the largest RNN ever trained previous to this is?

torpid token May 22, 2023, 4:23 PM

#

young sparrow Also, does anyone know what the largest RNN ever trained previous to this is?

Including LSTMs?

tough crane May 22, 2023, 4:23 PM

#

young sparrow Also, does anyone know what the largest RNN ever trained previous to this is?

May be ELMO in EMNLP2018??

young sparrow May 22, 2023, 4:24 PM

#

torpid token Including LSTMs?

Sure

young sparrow May 22, 2023, 4:24 PM

#

tough crane May be ELMO in EMNLP2018??

Do you know how big that is?

torpid token May 22, 2023, 4:26 PM

#

100M

#

Iirc

tough crane May 22, 2023, 4:29 PM

#

young sparrow Do you know how big that is?

https://allenai.org/allennlp/software/elmo

93Million

AllenNLP - ELMo — Allen Institute for AI

ELMo is a deep contextualized word representation that models complex word use.

#

Small....

#

> All models except for the 5.5B model were trained on the 1 Billion Word Benchmark, approximately 800M tokens of news crawl data from WMT 2011.

Ummm.... : 🤨

young sparrow May 22, 2023, 4:36 PM

#

tough crane > All models except for the 5.5B model were trained on the 1 Billion Word Benchm...

The 5.5B here refers to the training corpus size 😂

young sparrow May 22, 2023, 4:59 PM

#

First draft (some image attachments are planned but it's work to appropriately interweave them in discord)

Everyone knows that transformers are synonymous with language modeling at scale… but what if they weren’t? Over the past two years @obsidian quest and team has been hard at work figuring out how to scale RNNs to unprecedented scales. Today we are officially announcing a preprint detailing RWKV: a reinvention of the RNN for the transformer era.

Note that this paper is a work in progress, and its release is forced on us by anonymity deadlines. We are planning on continuing to improve and update the paper (including explicitly deriving scaling laws!) and you can come to the discord server for the latest https://discord.gg/z9SGyZE6EE

Claiming that you can match a transformer’s performance is nothing new, and plenty of other papers put forth that claim. What makes RWKV special is that we actually train models up to 14 billion parameters and show consistently competitive performance with token-matched transformers! As far as we know, the largest previous RNN is two orders of magnitude smaller.

RNNs struggle to scale because of how they parallelize, but by making the time decay of each channel data-independent, we are able to parallelize RWKV the same way transformers are during training! After training, it can be used like an RNN for inference.

Our design is largely inspired by the “Attention Free Transformer,” which we realized could be written as an RNN if we use circular matrices as "w" in its formula. AFT alone isn’t able to match GPT’s performance, but inspired by it we continued to make progress on “RNNifying” transformers.

RWKV isn’t without its flaws. While we do approximately match the performance of transformers, our anecdotal experience is that it’s more sensitive to prompts and struggles to incorporate very long range information more than traditional transformers do. We are continuing to work to quantify these phenomena.

Our models are available for download on the @huggingface hub (warning: inference appears to be bugged at time of writing) or you can use our library: https://github.com/BlinkDL/RWKV-LM

[a couple tweets of tags and acknowledgements go here]

tough crane May 22, 2023, 5:07 PM

#

young sparrow The 5.5B here refers to the training corpus size 😂

Log scale... 🤯

chilly niche May 22, 2023, 8:17 PM

#

young sparrow **First draft** (some image attachments are planned but it's work to appropriate...

is "flaws" the right word here? I'm thinking limitations or drawbacks

young sparrow May 22, 2023, 8:21 PM

#

chilly niche is "flaws" the right word here? I'm thinking limitations or drawbacks

I view them as synonymous but if you think others won’t I can use limitations

spiral minnow May 22, 2023, 8:27 PM

#

young sparrow I view them as synonymous but if you think others won’t I can use limitations

I would agree with using limitations instead of flaws. For me, flaw has a connotation of "something is being done incorrectly", rather than just suggesting that this method can be improved

young sparrow May 23, 2023, 12:03 AM

#

@everyone if you are an author of the paper and are on Twitter, please DM me your Twitter username so I can tag you in the thread. You have half an hour or whenever I get around to it, whichever happens second.

sharp sonnet May 23, 2023, 12:08 AM

#

Can anyone find the preprint on arxiv? I thought it should've been out at 8 PM EDT today but I am unable to find it

spiral minnow May 23, 2023, 12:11 AM

#

young sparrow @everyone if you are an author of the paper and are on Twitter, please DM me you...

DM on twitter or discord?

young sparrow May 23, 2023, 12:13 AM

#

spiral minnow DM on twitter or discord?

Discord

last mauve May 23, 2023, 12:15 AM

#

sharp sonnet Can anyone find the preprint on arxiv? I thought it should've been out at 8 PM E...

I don't see it, and it disappeared from my arxiv profile???

#

Is that normal? Maybe it's updating?

young sparrow May 23, 2023, 12:16 AM

#

last mauve I don't see it, and it disappeared from my arxiv profile???

That happened with Pythia and then it appeared like 20 minutes later

last mauve May 23, 2023, 12:17 AM

#

young sparrow That happened with Pythia and then it appeared like 20 minutes later

Why's arxiv gotta play with my heart like that smh

chilly niche May 23, 2023, 12:20 AM

#

arxiv takes a while to update each day

#

if you're impatient you can watch it slowly process in order of arxiv IDs

neat heron May 23, 2023, 12:22 AM

#

last mauve Why's arxiv gotta play with my heart like that smh

arxiv likes playing these games with our emotions pensiveCowboy

#

it's all planned as part of their program to get certain emotional reactions out of authors to train their new emotional sentiment analysis model they've been working on /s

young sparrow May 23, 2023, 12:36 AM

#

Current list of authors with names replaced with twitter tags if I have it

@BlinkDL_AI @eric_alcaide @QuentinAnthon15

@AlbalakAlon, @SSamDav, Huanqi Cao, Xin Cheng, Michael Chung, @GrellaMatteo, @kranthigv, Xuzheng He, Haowen Hou, Przemysław Kazienko, kocon_jan, Jiaming Kong, Bartłomiej Koptyra, @lazercuber, @SriIpsit, @FerdinandMom, Atsushi Saito, @XiangruTang, Bolun Wang, Johan S. Wind, Stanisław Wózniak, Ruichong Zhang, @ZhangZhenyuan3, Qihang Zhao, @zp_pengzhou, @lukeZhu20, @Rudd80856040

last mauve May 23, 2023, 12:37 AM

#

https://arxiv.org/abs/2305.13048

arXiv.org

RWKV: Reinventing RNNs for the Transformer Era

Transformers have revolutionized almost all natural language processing (NLP)
tasks but suffer from memory and computational complexity that scales
quadratically with sequence length. In contrast, recurrent neural networks
(RNNs) exhibit linear scaling in memory and computational requirements but
struggle to match the same performance as Transfo...

young sparrow May 23, 2023, 12:53 AM

#

https://twitter.com/AiEleuther/status/1660811179239849986?s=20

EleutherAI (@AiEleuther)

Everyone knows that transformers are synonymous with large language models… but what if they weren’t? Over the past two years @BlinkDL_AI and team have been hard at work scaling RNNs to unprecedented scales. Today we are releasing a preprint on our work

https://t.co/Odte55iKPU

paper dove May 23, 2023, 2:08 AM

#

young sparrow @everyone if you are an author of the paper and are on Twitter, please DM me you...

Is it still possible to add a tweet tag now? sadge

neon night May 23, 2023, 2:57 AM

#

young sparrow **First draft** (some image attachments are planned but it's work to appropriate...

The twitter post mentioned the word "circulant matrix", and I realized suddenly https://www.bmvc2021-virtualconference.com/assets/papers/0296.pdf this paper is very similar to our methodology, see the images:

#

our circulant matrix looks like this

paper dove May 23, 2023, 3:32 AM

#

neon night The twitter post mentioned the word "circulant matrix", and I realized suddenly ...

interesting, so all the w here are learnable params, whereas in RWKV only one w is a learnable param.

neon night May 23, 2023, 3:32 AM

#

what's more, this "spatial shift" technique from the same lab (v1) https://arxiv.org/pdf/2106.07477.pdf (v2) https://arxiv.org/pdf/2108.01072.pdf @obsidian quest you gotta see this, it's crazy. maybe RWKV for image is on the way

paper dove May 23, 2023, 3:33 AM

#

RWKV universe is coming

obsidian quest May 23, 2023, 4:41 AM

#

Table 5 AFT-simple should be 1.046 1.209
I am training L12-D512 rwkv to check test loss

#

Figure 4 x-axis wrong params scale

gusty condor May 23, 2023, 5:08 AM

#

obsidian quest Figure 4 x-axis wrong params scale

Maybe the unit is in billion parameters

outer vine May 23, 2023, 6:23 AM

#

missing Hadamard product here?

tropic minnow May 23, 2023, 6:38 AM

#

outer vine missing Hadamard product here?

Hmm its an element wise between vectors

tough crane May 23, 2023, 6:40 AM

#

neon night The twitter post mentioned the word "circulant matrix", and I realized suddenly ...

An arbitrary permutation can be expressed as the product of disjoint cycles.

Increasing the depth of RWKV networks might lead to the product of disjoint cycles to represent arbitrary permutations and graphs.

tropic minnow May 23, 2023, 6:46 AM

#

Thoughts?

#

tough crane May 23, 2023, 7:01 AM

#

tropic minnow

Thanks!! This comment suggests that MEGA could have two modes.

tropic minnow May 23, 2023, 7:02 AM

#

@subtle oak ^^^

tough crane May 23, 2023, 7:02 AM

#

If I correctly understand.

subtle oak May 23, 2023, 7:03 AM

#

tropic minnow <@883175046376472606> ^^^

Thanks for correcting! So MEGA also has two modes like RWKV haha

#

I actually did not add MEGA’s space and time complexity; I added the table with Transformer, Performer, Linear Transformer, Reformer and AFT-full🤣

tough crane May 23, 2023, 7:09 AM

#

One of my friends pointed out the missing following reference: https://aclanthology.org/2022.emnlp-main.24.pdf

foggy lake May 23, 2023, 7:10 AM

#

Wow what did i wake up to

#

Their readme is so well filled with awesomeness

tough crane May 23, 2023, 7:11 AM

#

subtle oak I actually do not add the MEGA’s space and time complexity, I add the table with...

I've just modified it to O(cd) but should we expand the complexity order comparison table into two columns: one for training mode and one for inference mode??

subtle oak May 23, 2023, 7:12 AM

#

tough crane I've just modified it to O(cd) but should we expand the complexity order compari...

Thanks for your modifications!

tough crane May 23, 2023, 7:13 AM

#

subtle oak Thanks for your modifications!

Do you suppose that big-O is for inference?

subtle oak May 23, 2023, 7:14 AM

#

tough crane Do you suppose that big-O is for inference?

Hmmm, I think if we separate the inference and training, maybe we need also claim that RWKV has different complexity for training and inference?

subtle oak May 23, 2023, 7:15 AM

#

tough crane Do you suppose that big-O is for inference?

Yeah I guess it is okay

subtle oak May 23, 2023, 7:16 AM

#

subtle oak Hmmm, I think if we separate the inference and training, maybe we need also clai...

RWKV still uses the recurrent mode for training, but if we need to parallelize it, the complexity will change

tough crane May 23, 2023, 7:22 AM

#

subtle oak Although RWKV still use the recurrent mode for training, but if we need to paral...

Ah, are you saying that we could write two rows 1:RWKV(GPT-mode) and 2:RWKV(RNN-mode)?

If RWKV(GPT-mode) inference pass runs in parallel for each layer, it holds O(d) time complexity. Is my understanding right??

#

RWKV(GPT-mode): O(d), O(Td)
RWKV(RNN-mode): current table

subtle oak May 23, 2023, 7:25 AM

#

I think if we use the convolution mode instead of RNN mode, its time complexity will become O(T log(T) d) by FFT, and its space complexity will be O(Td)

tough crane May 23, 2023, 7:26 AM

#

Oh, I am wrong because of ignoring reducing/merging costs.

subtle oak May 23, 2023, 7:26 AM

#

But now in GPT mode, it still uses the RNN backbone (if you check the CUDA code)

#

So the complexity will become O(Td) and O(Td) I guess...

#

The convolution mode is actually just a theoretical best approach for parallelization

#

So maybe if we mention MEGA's two modes, we also need to state the mode for RWKV
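To make the O(Td) figure for the recurrent formulation concrete, here is a minimal NumPy sketch. It assumes the WKV definition from the paper (per-channel decay `w >= 0` and bonus `u`); the function names are hypothetical and this is not the CUDA kernel. It compares a direct evaluation, which re-sums every past position at each step, with the recurrent form, which carries only two vectors of size d as state.

```python
import numpy as np


def wkv_direct(w, u, k, v):
    """Reference evaluation: O(T^2 d) time, re-summing the past at every step."""
    T, d = k.shape
    out = np.empty((T, d))
    for t in range(T):
        # The current token gets the bonus u instead of a decay.
        num = np.exp(u + k[t]) * v[t]
        den = np.exp(u + k[t])
        for i in range(t):
            weight = np.exp(-(t - 1 - i) * w + k[i])
            num = num + weight * v[i]
            den = den + weight
        out[t] = num / den
    return out


def wkv_recurrent(w, u, k, v):
    """RNN mode: O(T d) time and O(d) state (running numerator and denominator)."""
    T, d = k.shape
    a = np.zeros(d)  # decayed sum of exp(k_i) * v_i over past tokens
    b = np.zeros(d)  # decayed sum of exp(k_i) over past tokens
    out = np.empty((T, d))
    for t in range(T):
        e_cur = np.exp(u + k[t])
        out[t] = (a + e_cur * v[t]) / (b + e_cur)
        # Decay the state by one step and absorb the current token.
        a = np.exp(-w) * a + np.exp(k[t]) * v[t]
        b = np.exp(-w) * b + np.exp(k[t])
    return out
```

Both functions take `w` and `u` of shape `(d,)` and `k`, `v` of shape `(T, d)`. The sketch uses raw exponentials, so it is only suitable for small inputs.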

tough crane May 23, 2023, 7:35 AM

#

subtle oak The convolution mode is actually just a theoretical best approach for paralleliz...

To minimize modification of this big-O table, I modified the caption of the table as follows: from "Complexity comparison" to "Inference complexity comparison". Is it okay for you?

Convolution mode (a.k.a. divide-and-conquer mode) could be mentioned in an appendix section as a future TODO.

misty cedar May 23, 2023, 7:36 AM

#

RNN mode is just a subset of gpt mode where the inference batch size is 1

subtle oak May 23, 2023, 7:38 AM

#

tough crane To minimize modification of this big-O table, I modified the caption of the tabl...

Great! It makes sense to me

subtle oak May 23, 2023, 7:38 AM

#

misty cedar RNN mode is just a subset of gpt mode where the inference batch size is 1

Yeah this is the relation between them haha

outer vine May 23, 2023, 7:40 AM

#

tropic minnow Hmm its an element wise between vectors

yes element wise. so it is supposed to be a Hadamard product, just like AFT

tough crane May 23, 2023, 7:46 AM

#

EQ 14 is fixed into "\odot"

misty cedar May 23, 2023, 7:49 AM

#

subtle oak Yeah this is the relation between them haha

Hmm, speedy beam search using disconnected gpt mode

outer vine May 23, 2023, 7:51 AM

#

tough crane EQ 14 is fixed into "\odot"

2 odot i believe?

misty cedar May 23, 2023, 7:54 AM

#

outer vine yes element wise. so it is supposed to be a Hadamard product, just like AFT

Why not exp(k + w + log(v))
What's that equal?

e^(k+w+log(v) - k - w)?
Ehh, I am probably missing something,
Would be interesting if it turned out that
WKV = V

tough crane May 23, 2023, 8:00 AM

#

obsidian quest Figure 4 x-axis wrong params scale

Is it correct to be 0, 1, 10 in Billion??

tough crane May 23, 2023, 8:04 AM

#

obsidian quest Figure 4 x-axis wrong params scale

Assumed this log-Billion-scale

obsidian quest May 23, 2023, 8:05 AM

#

tough crane Assumed this log-Billion-scale

yeah now correct

outer vine May 23, 2023, 8:06 AM

#

misty cedar Why not exp(k + w + log(v)) What's that equal? e^(k+w+log(v) - k - w)? Ehh, I a...

i would suggest keep the notation in line with AFT

tough crane May 23, 2023, 8:08 AM

#

obsidian quest yeah now correct

I added the words "in billions" into the caption of Fig4.

neon night May 23, 2023, 8:10 AM

#

tough crane An arbitrary permutation can be expressed as the product of disjoint cycles. In...

as a special case, we can try to adjust the permutation so that a prompt that ends with the question can be answered as well as a prompt that starts with the question (I'm not sure)

tender karma May 23, 2023, 9:08 AM

#

What are the sections to roll back, to improve or add for EMNLP? Limit 8 pages right plus Appendix.

tough crane May 23, 2023, 9:36 AM

#

subtle oak I think if we use the convolution mode instead of RNN mode, its time complexity ...

By the way, I think that FFT can reduce the time complexity to O(d log T) if the O(T) operations are executed in parallel at each level of the FFT's divide-and-conquer recursion. Does that seem valid to you?

neon night May 23, 2023, 9:52 AM

#

the FFT optimization is mentioned in footnote 3 I guess; while it's faster in theory, in practice O(T) is enough (or not) thinkies

neon night May 23, 2023, 10:39 AM

#

can FFT be useful if we calculate according to this matrix? I think this can mitigate some of RWKV's limitations.
namely, using a circular matrix without causal attention mask for processing prompts to achieve "ring topology" rather than caring about the ordering of the prompt.
just my two cents

fickle hare May 23, 2023, 10:42 AM

#

it has been discussed long ago

#

and is preceded by parallel scan

#

FFT is O(T log T) BTW

#

(O(T) operations in parallel isn't real; you cannot really provide parallelism as large as B*T*C, given that would be millions to billions of elements to compute in parallel)
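For reference, the O(T log T) convolution mode being discussed can be sketched as follows. This is a minimal NumPy illustration under assumptions, not code from the RWKV repository: the function names are hypothetical, and the decay kernel `exp(-n * w)` stands in for the time-decay weights. A causal convolution over T steps is computed once directly in O(T^2 d) and once with zero-padded FFTs in O(T log(T) d).

```python
import numpy as np


def causal_conv_direct(kernel, x):
    """y[t] = sum_{i <= t} kernel[t - i] * x[i], per channel. O(T^2 d)."""
    T = x.shape[0]
    y = np.zeros_like(x, dtype=float)
    for t in range(T):
        for i in range(t + 1):
            y[t] += kernel[t - i] * x[i]
    return y


def causal_conv_fft(kernel, x):
    """Same result via FFT along the time axis. O(T log(T) d)."""
    T = x.shape[0]
    n = 2 * T  # zero-pad so the circular convolution equals the linear one
    K = np.fft.rfft(kernel, n=n, axis=0)
    X = np.fft.rfft(x, n=n, axis=0)
    return np.fft.irfft(K * X, n=n, axis=0)[:T]


def decay_kernel(w, T):
    """Per-channel kernel exp(-n * w) for n = 0..T-1, shape (T, d)."""
    return np.exp(-np.arange(T)[:, None] * w[None, :])
```

The sketch ignores the numerical range problems that arise when the convolved inputs are unnormalized exponentials, so it only illustrates the complexity argument.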

subtle oak May 23, 2023, 10:51 AM

#

tough crane By the way, I think that FFT can decrease the time complexity into O(d log T) i...

I think FFT can only accelerate the convolution, so if we use the RNN mode (including GPT mode), the complexity cannot be reduced

neon night May 23, 2023, 11:09 AM

#

would you be interested in implementing a CNN inference mode?

neon night May 23, 2023, 11:25 AM

#

a FFT implementation by Jianlin Su 🤔

sullen horizon May 23, 2023, 1:07 PM

#

neon night a FFT implementation by Jianlin Su 🤔

like S4/MEGA

neon night May 23, 2023, 1:12 PM

#

neon night can FFT be useful if we calculate according to this matrix? I think this can mit...

I'm planning to implement a dumb O(T^2) for this thinkies just to see if the result is good

obsidian quest May 23, 2023, 4:41 PM

#

@sullen horizon will add Long Range Arena numbers

karmic tree May 23, 2023, 6:03 PM

#

I was talking with Hugging Face a couple months back about writing a HF blog post explainer for RWKV but have been on paternity leave - is anyone doing that? If not, happy to lead and collab on it!

young sparrow May 23, 2023, 6:04 PM

#

karmic tree I was talking with Hugging Face a couple months back about writing a HF blog pos...

https://huggingface.co/blog/rwkv

Introducing RWKV - An RNN with the advantages of a transformer

karmic tree May 23, 2023, 6:04 PM

#

young sparrow https://huggingface.co/blog/rwkv

heeyyy fantastic 😄

obsidian quest May 23, 2023, 9:36 PM

#

Table 5 @last mauve

AFT-simple should be: train 1.046 // test 1.209 according to AFT paper

L12-D512 RWKV: train 1.010 (w/dropout) // test 1.178
trained with AdamW wd 0.1, dropout 0.1, bsz 16, initial LR 6e-4

fickle hare May 24, 2023, 3:03 AM

#

neon night a FFT implementation by Jianlin Su 🤔

I've tested a similar method in PyTorch and it exposes a significant precision issue

#

brute force exp(n*u) won't really work

misty cedar May 24, 2023, 6:40 AM

#

neon night a FFT implementation by Jianlin Su 🤔

Reminds me of the wkv power triangle implementation

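import torch


class wkv_power(torch.nn.Module):
    def __init__(self, dims, T):
        super(wkv_power, self).__init__()
        self.T = T
        # The parameter registration was cut off in the original message. forward()
        # reads self.time_decay and self.time_first, so per-channel parameters of
        # shape (dims,) are assumed here; the zero initialisation is a placeholder.
        self.register_parameter("time_decay", torch.nn.Parameter(torch.zeros(dims)))
        self.register_parameter("time_first", torch.nn.Parameter(torch.zeros(dims)))
        # Lower-triangular causal mask, shape (T, T, 1).
        self.register_buffer("mask", torch.ones(T, T).tril().unsqueeze(-1).to(torch.bool), persistent=False)
        # tri[j, k] = j + 1 - k for k <= j (0 elsewhere): the power applied to the decay.
        self.register_buffer("tri", ((torch.arange(T).expand(T, T)+1).t() -
            torch.arange(T)).tril().unsqueeze(-1), persistent=False)

    def forward(self, k, v, r):
        # k, v, r: (T, dims). Row 0 accumulates the numerator, row 1 the denominator.
        vx_kx = (k).exp().unsqueeze(0).expand(
            2, k.shape[0], k.shape[1]).clone()
        vx_kx[0] *= v
        # Decay raised to the triangular powers, masked to be causal: (T, T, dims).
        t = ((self.time_decay.expand(self.T, self.T, -1)*self.tri).exp()*self.mask)
        # vx_kx[0][0] += state[2]
        # vx_kx[1][0] += state[3]
        rza = torch.einsum("rki,jki->rji", vx_kx, t)
        vx_kx *= self.time_first.exp()
        vx_kx += rza
        vx_kx[0] = r*vx_kx[0]
        vx_kx[1] = 1/vx_kx[1]
        wkv = vx_kx.prod(0)
        # state[2] = rza[0][-1]
        # state[3] = rza[1][-1]
        return wkv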

tropic minnow May 24, 2023, 8:39 AM

#

obsidian quest Table 5 <@367104793292046338> AFT-simple should be: train 1.046 // test 1.209...

kk added👍 . do we have test bpc for RWKV L=6, D=512?

tropic minnow May 24, 2023, 8:41 AM

#

misty cedar Reminds me of the wkv power triangle implementation ```py import torch class wkv...

how fast is torch einsum compared to reorders and matmuls?

#

(i think it might be faster to use more elementary primitives)

obsidian quest May 24, 2023, 9:00 AM

#

tropic minnow kk added👍 . do we have test bpc for RWKV L=6, D=512?

the original L6 D512 model is seriously overfitting because it's not using wd/dropout

tropic minnow May 24, 2023, 9:27 AM

#

obsidian quest the original L6 D512 model is seriously overfitting because it's not using wd/dr...

right. we can say this. still, if we had the number, i think it could be good to report it. do we have it?

young sparrow May 24, 2023, 9:35 AM

#

@obsidian quest if I want to train a RWKV model of X parameters for Y tokens, do you know how I should set the rest of the h params? Is there an approximate formula?

obsidian quest May 24, 2023, 9:38 AM

#

young sparrow <@870137517020688415> if I want to train a RWKV model of X parameters for Y toke...

usual GPT h params are okay - we can search for better h params & lr schedule

young sparrow May 24, 2023, 9:41 AM

#

obsidian quest usual GPT h params are okay - we can search for better h params & lr schedule

Approximately how many A100-hours does it take to train a model with X params and Y tokens?

#

If you don’t know the number but do know the amount of FLOP/second you get during training we can reverse engineer it

obsidian quest May 24, 2023, 9:48 AM

#

young sparrow Approximately how many A100-hours does it take to train a model with X params an...

RWKV-4 14B BF16 ctxlen4096 = 114K tokens/s on 8x8 A100 80G (ZERO2+CP)

young sparrow May 24, 2023, 9:49 AM

#

What about 2048 context?

obsidian quest May 24, 2023, 9:49 AM

#

same training speed regardless of ctxlen

young sparrow May 24, 2023, 9:50 AM

#

What about a model half the size? Does speed increase linearly?

obsidian quest May 24, 2023, 9:50 AM

#

yes

young sparrow May 24, 2023, 9:52 AM

#

Okay, so for every 1B params 1B tokens it takes 34 hours?

#

Does that sound right

#

(It doesn’t to me…)

#

No, that would mean a 1B model trained on the pile would take over a year

#

Did it take 30 days to train the 14B model?

obsidian quest May 24, 2023, 9:55 AM

#

young sparrow May 24, 2023, 9:55 AM

#

What is “Gt/day”? Giga tokens per day?

obsidian quest May 24, 2023, 9:56 AM

#

here efficiency = (Gt/day) * (B params) / (#A100s)

young sparrow May 24, 2023, 9:57 AM

#

So efficiency = B tokens x B params / A100 / Day

#

That’s exactly the number I was looking for ^_^
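The efficiency figure can be reproduced from the throughput quoted above (114K tokens/s for the 14B model on 8x8 A100s). A minimal sketch of that arithmetic, with hypothetical function names and assuming "8x8" means 64 GPUs:

```python
SECONDS_PER_DAY = 86_400


def training_efficiency(tokens_per_second, params_billions, num_gpus):
    """efficiency = (Gt/day) * (B params) / (#A100s)."""
    gt_per_day = tokens_per_second * SECONDS_PER_DAY / 1e9
    return gt_per_day * params_billions / num_gpus


def a100_days(params_billions, tokens_billions, efficiency):
    """A100-days needed to train a given model size on a given token budget."""
    return params_billions * tokens_billions / efficiency


# 114K tokens/s, 14B params, 64 GPUs (assumed reading of "8x8").
eff = training_efficiency(114_000, 14, 64)
```

With these inputs `eff` comes out at roughly 2.15 (B tokens)(B params) per A100 per day, and `a100_days` simply inverts it for a planned run. This relies on the statement above that speed scales linearly with model size.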

obsidian quest May 24, 2023, 9:58 AM

#

probably still have 20% room for optimization

young sparrow May 24, 2023, 10:00 AM

#

So if we want to spend 15 days doing experiments, we have time for 30 (B params) (B tokens) / A100

#

Woah what are you running on 336 A100s right now o.O

obsidian quest May 24, 2023, 10:02 AM

#

young sparrow Woah what are you running on 336 A100s right now o.O

I am tuning PilePlus and training World for 0.1~7B simultaneously

young sparrow May 24, 2023, 10:03 AM

#

So if I assume we can get 64 A100s for scaling laws experiments, we get 2,000 (B tokens) (B params)

young sparrow May 24, 2023, 10:18 AM

#

Okay, can we get all combinations of the following training runs launched @obsidian quest?

Tokens (B): 1, 2, 4, 8, 16, 32
Params (B): 0.025, 0.05, 0.1, 0.2, 0.4, 0.8

#

Should take only 100 A100-days total

#

(Param counts don’t have to be exact, if you give me a list of actual param counts I can adjust the exact token counts to compensate)
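The total cost of the grid above can be checked with a few lines of arithmetic. This is only a sketch: the function names are hypothetical, and the efficiency value (in (B tokens)(B params) per A100 per day) is left as an input because it depends on the hardware setup.

```python
from itertools import product

TOKENS_B = [1, 2, 4, 8, 16, 32]
PARAMS_B = [0.025, 0.05, 0.1, 0.2, 0.4, 0.8]


def grid_cost(tokens_b, params_b):
    """Sum of (B tokens) * (B params) over every combination in the grid."""
    return sum(t * p for t, p in product(tokens_b, params_b))


def grid_a100_days(tokens_b, params_b, efficiency):
    """Convert the grid cost into A100-days at a given efficiency."""
    return grid_cost(tokens_b, params_b) / efficiency


num_runs = len(TOKENS_B) * len(PARAMS_B)
total = grid_cost(TOKENS_B, PARAMS_B)
```

The grid has 36 runs and sums to about 99 (B tokens)(B params). At an efficiency of 1 that is roughly the 100 A100-days mentioned above; a higher efficiency shrinks it proportionally.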

obsidian quest May 24, 2023, 10:42 AM

#

we can train on minipile https://arxiv.org/abs/2304.08442; do we have a 20b-tokenized version?

arXiv.org

The MiniPile Challenge for Data-Efficient Language Models

The ever-growing diversity of pre-training text corpora has equipped language
models with generalization capabilities across various downstream tasks.
However, such diverse datasets are often too large for academic budgets; hence,
most research on Transformer architectures, training procedures, optimizers,
etc. gets conducted on smaller, homogen...

#

can use the method to generate minipiles of different sizes

tropic minnow May 24, 2023, 12:05 PM

#

Xingjian Du, Leon Derczynski, Bolun Wang pls add your contributions to Appendix A: Author Contributions

young sparrow May 24, 2023, 12:11 PM

#

obsidian quest we can train on minipile https://arxiv.org/abs/2304.08442 do we have a 20b-token...

No, we need to train on the same dataset for each of them. It’s okay that we don’t train on the whole Pile; that doesn’t matter

grim linden May 24, 2023, 1:59 PM

#

https://build.microsoft.com/en-US/sessions/db3f4859-cd30-4445-a0cd-553c3304f8e2
RWKV featured in karpathy's talk (at 20:10)

young sparrow May 24, 2023, 3:59 PM

#

grim linden https://build.microsoft.com/en-US/sessions/db3f4859-cd30-4445-a0cd-553c3304f8e2 ...

Link seems to be down

last mauve May 24, 2023, 4:09 PM

#

young sparrow Link seems to be down

It wasn't an RWKV-specific shout-out

last mauve May 24, 2023, 4:11 PM

#

young sparrow Link seems to be down

It was this table: #1103039376184852622 message

grim linden May 24, 2023, 4:34 PM

#

yes, he read the list until koala and then stopped

#

sorry for clickbaiting 🫠

spiral minnow May 24, 2023, 4:54 PM

#

Does it bother anybody else that the contributions section isn't in the same order as author list?

fickle hare May 24, 2023, 4:57 PM

#

someone on Zhihu asked
"Why is time complexity of linear transformers said to be O(Td^2)? Do they assume linear transformers use some d^2 kernel functions?"
I don't understand linear transformers so repost here.

burnt gulch May 24, 2023, 5:21 PM

#

grim linden yes, he read the list until koala and then stopped

still a good link anyways!

tropic minnow May 24, 2023, 5:31 PM

#

spiral minnow Does it bother anybody else that the contributions section isn't in the same ord...

okay will reorder

tropic minnow May 24, 2023, 5:35 PM

#

tropic minnow `Xingjian Du, Leon Derczynski, Bolun Wang` pls add your contributions to `Append...

reminder

karmic tree May 24, 2023, 7:42 PM

#

tropic minnow reminder

thanks. did this yesterday, guess they got clobbered, re-adding

tropic minnow May 24, 2023, 8:32 PM

#

fickle hare someone on Zhihu asked "Why is time complexity of linear transformers said to be...

@subtle oak

#

i think the zhihu guy might be right

young sparrow May 25, 2023, 1:18 AM

#

fickle hare someone on Zhihu asked "Why is time complexity of linear transformers said to be...

Yes because they do?

subtle oak May 25, 2023, 1:26 AM

#

Yeah actually I assume that the kernel complexity is d^2...

#

The formula of the linear transformer can be represented by this

#

#

And some papers just multiply K and V first, without using a kernel, like cosFormer

#

subtle oak May 25, 2023, 1:29 AM

#

subtle oak

I guess they use the same QKV structure, e.g., multiply K and V first and then Q

#

I apologize for simplifying the complexity analysis. If we need a precise estimate, the complexity needs to be replaced with O(Tk^2)

#

but some papers use k=d

#

like cosFormer and Spikformer

#

Maybe we should describe a more general complexity, so we need to use the k?
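The source of the d^2 (or k*d) factor can be seen by comparing the two multiplication orders. The sketch below is a non-causal illustration in NumPy, not taken from any of the papers mentioned: the function names are hypothetical, and the `elu(x) + 1` feature map is an assumption chosen for the example. Multiplying K and V first builds a (k, d) matrix once, so the cost is O(T k d), which becomes O(T d^2) when k = d.

```python
import numpy as np


def feature_map(x):
    """elu(x) + 1: a positive feature map (assumed for this example)."""
    return np.where(x > 0, x + 1.0, np.exp(x))


def attention_quadratic(q, k, v):
    """(phi(Q) phi(K)^T) V: builds a (T, T) matrix, O(T^2) in sequence length."""
    scores = feature_map(q) @ feature_map(k).T
    return (scores @ v) / scores.sum(axis=1, keepdims=True)


def attention_linear(q, k, v):
    """phi(Q) (phi(K)^T V): builds a (k, d) matrix, O(T k d) overall."""
    fq, fk = feature_map(q), feature_map(k)
    kv = fk.T @ v          # (k, d), independent of the query position
    z = fk.sum(axis=0)     # (k,), normaliser
    return (fq @ kv) / (fq @ z)[:, None]
```

Both orders give the same output because matrix multiplication is associative; only the intermediate shape, and therefore the cost, differs.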

tough crane May 25, 2023, 4:54 AM

#

Could we re-upload a hot-fixed version to Arxiv?

mortal latch May 25, 2023, 6:47 AM

#

no, it's anon period now. We can only update it after emnlp review

tropic minnow May 25, 2023, 9:43 AM

#

tough crane Could we re-upload a hot-fixed version to Arxiv?

we will as soon as we can

gusty condor May 25, 2023, 9:55 AM

#

https://arxiv.org/pdf/1901.03429.pdf

#

Should we prove the Turing completeness of RWKV?

young sparrow May 25, 2023, 12:40 PM

#

gusty condor Should we prove the Turing completeness of RWKV?

This kind of thing is completely meaningless

#

And that paper in particular is extra meaningless because the proof hinges on an assumption that’s not actually true of transformers

#

If you use their formal model but change arbitrary precision to finite precision it stops working

last mauve May 25, 2023, 1:08 PM

#

tropic minnow we will as soon as we can

If people get all their fixes in within the next 4 hrs I can submit a revision this afternoon

tropic minnow May 25, 2023, 2:20 PM

#

last mauve If people get all their fixes in within the next 4 hrs I can submit a revision t...

what about anonymity #1103039376184852622 message

Captura_de_Pantalla_2023-05-25_a_las_16.20.18.png

last mauve May 25, 2023, 2:23 PM

#

tropic minnow what about anonymity https://discord.com/channels/729741769192767510/11030393761...

I think that a silent update without announcement to fix some errors should be fine, but if people feel differently we can hold off.

tough crane May 25, 2023, 2:27 PM

#

gusty condor Should we prove the Turing completeness of RWKV?

I think that RNNs with constant numbers of parameters and infinite precision are also universal TM...

young sparrow May 25, 2023, 2:28 PM

#

last mauve I think that a silent update without announcement to fix some errors should be f...

I think that this risks desk rejection for little benefit. *CL can be quite anal about this

tropic minnow May 25, 2023, 2:29 PM

#

young sparrow I think that this risks desk rejection for little benefit. *CL can be quite anal...

yes im of the same opinion. i think we should abide strictly by the rules... having it rejected would be quite annoying and we probably gain little

young sparrow May 25, 2023, 2:47 PM

#

BTW, y'all're featured at the top of eleuther.ai 🙂

Screen_Shot_2023-05-25_at_10.47.08_AM.png

sharp sonnet May 25, 2023, 2:49 PM

#

Yes, I agree that we should hold this off until the end of review period

last mauve May 25, 2023, 3:26 PM

#

Ok we'll wait

last mauve May 25, 2023, 6:45 PM

#

Ok so our next work item is the EMNLP deadline on June 23. We need to:

Condense what we have to 8 pages
Tighten up the storyline
Resolve the scaling laws issues that @young sparrow reported

My current thought on a core team for this would be @last mauve, @tropic minnow, @spiral minnow, @zealous snow, @tender karma, @rich raptor, @broken moth since all have enough academic writing experience to lead this rewrite (to clarify, anyone can contribute, but these are the rewrite leads). If you want added to or removed from this list, DM me. Once the core team is finalized by the end of the week, I'm going to start assigning sections and working on the EMNLP version with a new overleaf project.

last mauve May 26, 2023, 12:39 AM

#

EMNLP Overleaf: https://www.overleaf.com/9624387813psbrpbqypjfc

misty cedar May 26, 2023, 1:35 AM

#

Can someone point me to the part of the paper that references the modified wkv forward function to alleviate overflow errors?
Anyone trying to reproduce from scratch is going to run into that.
The unmodified wkv formula only works in float64
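For anyone reproducing this from scratch, the usual way around the overflow is to keep the running numerator and denominator scaled by a running maximum exponent, so that no `exp` ever sees a large positive argument. The sketch below is a NumPy illustration of that idea under assumptions: the function name is hypothetical, the decay is written as `w >= 0`, and it is not copied from the official kernel.

```python
import numpy as np


def wkv_stable(w, u, k, v):
    """Recurrent WKV with a running max exponent p.

    The true state is (a * exp(p), b * exp(p)); only differences of
    exponents are ever passed to exp, so large k does not overflow.
    """
    T, d = k.shape
    a = np.zeros(d)
    b = np.zeros(d)
    p = np.full(d, -1e38)  # stands in for "no history yet"
    out = np.empty((T, d))
    for t in range(T):
        # Output: combine the stored state with the current token (bonus u).
        cur = u + k[t]
        q = np.maximum(p, cur)
        e1 = np.exp(p - q)
        e2 = np.exp(cur - q)
        out[t] = (e1 * a + e2 * v[t]) / (e1 * b + e2)
        # State update: decay the history by w, then absorb the current token.
        decayed = p - w
        q = np.maximum(decayed, k[t])
        e1 = np.exp(decayed - q)
        e2 = np.exp(k[t] - q)
        a = e1 * a + e2 * v[t]
        b = e1 * b + e2
        p = q
    return out
```

With `k` values around 1000 the raw `exp(k)` overflows even in float64, while this version stays finite because every exponent it evaluates is at most zero.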

steady ether May 26, 2023, 3:08 AM

#

@misty cedar This part?

Key search terms are avoid overflow

Screenshot_2023-05-25_at_11.06.54_PM.png

misty cedar May 26, 2023, 3:24 AM

#

Thanks:)

void quartz May 26, 2023, 6:45 AM

#

One of my friends in SF wants to do a podcast episode on RWKV, specifically to highlight alternatives to transformers

https://www.latent.space/podcast

This is in part, due to the strong positive reception from the paper (and me pestering them on RWKV for weeks)

Anyone interested? They are hosted in SF and prefer to do podcast in person but can be remote.

It is expected to get very technical (time/channel mixing) into how things differ from transformers and the pros and cons (aka the paper)

Overall, I do believe it is good exposure for RWKV

(I have asked blink prior to opening up the question here, I also know the host well, and he can prepare in advance the topics so you do not end up surprised or uncomfortable in the podcast)

Latent Space Podcast | swyx | Substack

The podcast by and for AI Engineers! We are the first place over 50k developers hear news and interviews about Software 3.0 - Foundation Models changing every domain in Code Generation, Computer Vision, Data Science, and more, directly from the founders, builders, and thinkers involved in pushing the cutting edge. Guests from Databricks, Glean, ...

tropic minnow May 26, 2023, 11:53 AM

#

void quartz One of my friend in SF wants to do a podcast episode on RWKV, specifically to hi...

if no one else volunteers i could do this, but unfortunately only remotely 🙏

tropic minnow May 26, 2023, 5:35 PM

#

hey @obsidian quest i could help launching these experiments #1103039376184852622 message on the cluster if you're too busy but i would need the training settings you're using for the other RWKV models

obsidian quest May 26, 2023, 6:37 PM

#

tropic minnow hey <@870137517020688415> i could help launching these experiments https://disco...

ok pls list the experiments you'd like to test

young sparrow May 26, 2023, 6:38 PM

#

obsidian quest ok pls list the experiments you'd like to test

In order to compute scaling laws we need to run these training runs

#

I think that scaling laws would be a big value add to the paper, but we don't currently have the necessary data to do it correctly

tropic minnow May 26, 2023, 7:07 PM

#

obsidian quest ok pls list the experiments you'd like to test

probably these:
Tokens (B): 1, 2, 4, 8, 16, 32
Params (B): 0.025, 0.05, 0.1, 0.2, 0.4, 0.8
Should take only 100 A100-days total (Param counts don’t have to be exact, if you give me a list of actual param counts I can adjust the exact token counts to compensate)
as referenced in #1103039376184852622 message by @young sparrow

obsidian quest May 26, 2023, 7:19 PM

#

tropic minnow probably these:```Tokens (B): 1, 2, 4, 8, 16, 32 Params (B): 0.025, 0.05, 0.1, 0...

how abt the LR schedule

tropic minnow May 26, 2023, 7:22 PM

#

obsidian quest how abt the LR schedule

hmmm thats why we need to know the settings you're using to train current RWKV hah

obsidian quest May 26, 2023, 7:27 PM

#

tropic minnow hmmm thats why we need to know the settings you're using to train current RWKV h...

My method: const LR_init for 10~20G tokens, then exponential decay to LR_final
I think it's actually fine to use original training data for scaling law, because I am decaying LR faster than cosine-decay

tropic minnow May 26, 2023, 7:30 PM

#

obsidian quest My method: const LR_init for 10~20G tokens, then exponential decay to LR_final I...

nice. are config params here https://github.com/BlinkDL/RWKV-LM/blob/main/RWKV-v4/train.py correct for training models similar to current rwkv-4 models on HF?

#

actually @obsidian quest i think it would be way easier for everybody (you as well lol) and way more reliable (consistency etc) if you launched the training runs on eleuther cluster. i can patch it if you're too busy but likelihood of an experimental mistake increases a lot lol

young sparrow May 26, 2023, 7:39 PM

#

obsidian quest My method: const LR_init for 10~20G tokens, then exponential decay to LR_final I...

Is this LR decay rate calibrated to the number of training tokens in any way?

obsidian quest May 26, 2023, 7:44 PM

#

i decay LR when the loss decrease rate is below a threshold

young sparrow May 26, 2023, 7:50 PM

#

obsidian quest i decay LR when the loss decrease rate is below a threshold

What is the threshold

obsidian quest May 26, 2023, 7:55 PM

#

I begin the decaying of LR when the loss decrease rate is less than "3e-4 per 40M tokens" - just a random threshold
This happens when the model is trained for 10~20G tokens (more so for larger models)

young sparrow May 26, 2023, 7:57 PM

#

I'm having trouble following what that means. Can you state it explicitly, like it's an algorithm?

#

Is it something like this?

if |loss(step[current]) - loss(step[current - 40M tokens])| < 3e-4:
    lr is decreased by ???

obsidian quest May 26, 2023, 7:59 PM

#

if smoothed(|loss(step[current]) - loss(step[current - 40M tokens])|) < 3e-4:
    begin the exponential decay of LR
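
A minimal sketch of that trigger, assuming the loss is logged once every 40M tokens and that "smoothed" means an exponential moving average. The smoothing method is not specified in the chat, so `ema_beta` and the function name are assumptions.

```python
def decay_start_index(losses, threshold=3e-4, ema_beta=0.9):
    """Return the first log index where the smoothed loss decrease is below
    `threshold`, or None if that never happens.

    `losses[i]` is assumed to be the loss logged every 40M tokens, so
    consecutive entries are one 40M-token window apart.
    """
    smoothed = None
    for i in range(1, len(losses)):
        delta = abs(losses[i] - losses[i - 1])
        # Exponential moving average of the per-window decrease (assumed choice).
        if smoothed is None:
            smoothed = delta
        else:
            smoothed = ema_beta * smoothed + (1.0 - ema_beta) * delta
        if smoothed < threshold:
            return i
    return None
```

The returned index times 40M tokens gives the point at which the exponential decay would begin.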

young sparrow May 26, 2023, 8:00 PM

#

okay so that's for starting when the decay happens

obsidian quest May 26, 2023, 8:00 PM

#

yeah and it's simple exponential decay after this

#

example (gray = green here)

young sparrow May 26, 2023, 8:01 PM

#

And the decay rate aims to reach the target LR after how many tokens? The size of the remaining dataset?

obsidian quest May 26, 2023, 8:02 PM

#

young sparrow And the decay rate aims to reach the target LR after how many tokens? The size o...

yes size of the remaining dataset
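
Put together, the schedule described above can be sketched as a constant phase followed by an exponential interpolation that reaches LR_final exactly at the end of the data. The defaults (6e-4 and 1e-5) are the L12-D768 values from the RWKV-LM code comments; `lr_at` and its argument names are hypothetical.

```python
import math


def lr_at(tokens_seen, decay_start, total_tokens, lr_init=6e-4, lr_final=1e-5):
    """Constant lr_init until `decay_start`, then exponential decay that hits
    lr_final when `tokens_seen == total_tokens`."""
    if total_tokens <= decay_start:
        raise ValueError("total_tokens must be larger than decay_start")
    if tokens_seen <= decay_start:
        return lr_init
    # Fraction of the remaining dataset consumed so far, clamped to 1.
    progress = min(1.0, (tokens_seen - decay_start) / (total_tokens - decay_start))
    # Linear interpolation in log space is exponential decay in LR.
    log_lr = math.log(lr_init) + progress * (math.log(lr_final) - math.log(lr_init))
    return math.exp(log_lr)
```

For example, `lr_at(20e9, 15e9, 32e9)` gives the learning rate 5G tokens into the decay phase of a 32G-token run whose decay was triggered at 15G tokens.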

young sparrow May 26, 2023, 8:04 PM

#

This is done manually right? I see the following comments in the code currently:


# By default we are using exponential LR decay.
# Here are my suggestions for training.
# Let's say you are training a L6-D512 model.
# 1) Set lr_init = lr_final = 8e-4. Let it run for some mini-epochs, until you feel like reducing LR.
# 2) Check epoch_save_frequency and make sure the partially-trained model is saved. Ctrl+C to stop the run.
# 3) Set lr_init = 8e-4, lr_final = 1e-5, betas = (0.9, 0.999).
# 4) Set EPOCH_BEGIN & LOAD_MODEL to load the partially-trained model. Continue the training.
# 
# For L12-D768, set lr_init = 6e-4. For L24-D1024, set lr_init = 4e-4. For L24-D2048, set lr_init = 3e-4.

obsidian quest May 26, 2023, 8:04 PM

#

yes manually

#

however i think this is mostly useful for small batchsz training. cosine decay is fine for large batchsz

young sparrow May 26, 2023, 8:05 PM

#

bsz = batch size?

#

I see final_tokens=n_epoch*len(train_dataset)*ctx_len

#

If I want to train for a pre-specified number of tokens and then stop, how do I determine how to change this? So my dataset will have more tokens than I actually use
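
One way to frame the arithmetic, as a sketch only: pick the number of ctx_len-sized samples that covers the token budget and derive final_tokens from that instead of from the dataset length. The helpers below are hypothetical and not part of RWKV-LM, and 1024 is just an example context length.

```python
def samples_for_budget(target_tokens, ctx_len):
    """Smallest number of ctx_len-sized samples covering `target_tokens`."""
    return -(-target_tokens // ctx_len)  # integer ceiling division


def final_tokens_for_budget(target_tokens, ctx_len):
    """Token count the LR schedule should decay over for that budget."""
    return samples_for_budget(target_tokens, ctx_len) * ctx_len


budget = 1_000_000_000  # 1G tokens, the smallest run in the proposed grid
print(samples_for_budget(budget, 1024), final_tokens_for_budget(budget, 1024))
```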

obsidian quest May 26, 2023, 8:18 PM

#

The best method will be to work out a formula that can provide good LR schedules for any [ParamSz - DataSz - BatchSz] combination
For example, I believe the best LR schedule for a tiny DataSz is [constant LR]

young sparrow May 26, 2023, 8:19 PM

#

How big is "tiny"

obsidian quest May 26, 2023, 8:20 PM

#

several G tokens

young sparrow May 26, 2023, 8:21 PM

#

Where is the LR decay type actually set? I see the initial and final LRs, but where do you set it to exponential decay

obsidian quest May 26, 2023, 8:21 PM

#

around 10~20G tokens for pile models

young sparrow May 26, 2023, 8:21 PM

#

No, where in the code

obsidian quest May 26, 2023, 8:23 PM

#

young sparrow No, where in the code

manual

young sparrow May 26, 2023, 8:24 PM

#

You support warm-up right? So if I wanted to make the switch from linear to exponential happen automatically, I can set the warm-up lr to your preferred constant?

last mauve May 26, 2023, 8:29 PM

#

void quartz One of my friend in SF wants to do a podcast episode on RWKV, specifically to hi...

do they provide any travel assistance?

I'm in seattle for a bit. @tropic minnow -- where are you located?

young sparrow May 26, 2023, 8:33 PM

#

@obsidian quest I've added (extremely hacky) support for automatically switching from constant LR to exponential decay and custom dataset sizing in my fork. Can you see if it runs as anticipated?

https://github.com/StellaAthena/RWKV-LM

obsidian quest May 26, 2023, 8:35 PM

#

young sparrow You support warm-up right? So if I wanted to make the switch from linear to expo...

I dont use warmup (or only 10 steps) - RWKV is very stable

young sparrow May 26, 2023, 8:36 PM

#

obsidian quest I dont use warmup (or only 10 steps) - RWKV is very stable

Right, but I hacked warm-up to do the constant LR-then-decay strategy

obsidian quest May 26, 2023, 8:38 PM

#

ok i think it can work

#

also need to change sampling (dataset.py)

tropic minnow May 26, 2023, 9:00 PM

#

last mauve do they provide any travel assistance? I'm in seattle for a bit. <@469771066399...

huh europe. you're closer if anything

last mauve May 26, 2023, 10:43 PM

#

tropic minnow huh europe. you're closer if anything

How about you'll do it remote unless they support my travel. Sound good?

young sparrow May 26, 2023, 10:44 PM

#

obsidian quest ok i think it can work

Great! If it does it’ll make scaling experiments a lot easier

void quartz May 27, 2023, 12:36 AM

#

last mauve How about you'll do it remote unless they support my travel. Sound good?

Sounds reasonable to me. Will ask

fickle hare May 27, 2023, 6:00 AM

#

another person on Zhihu commented that the receptance gate is a gate for output instead of for forgetting

#

I agree with his opinion on this; the gate is not even on the time-passing route

uneven blade May 27, 2023, 7:11 AM

#

Agreed.

tough crane May 27, 2023, 7:21 AM

#

fickle hare I agree on his opinion toward this, the gate is not even on the time passing rou...

I agree with this. I think that the following paper's method could be regarded as \sigma(R_i) = 1.0 in RWKV. To consider an extreme case, if R_i is either 0 or 1, then RWKV chooses one of the two: "take" or "skip".

https://arxiv.org/abs/2112.05682

arXiv.org

Self-attention Does Not Need $O(n^2)$ Memory

tropic minnow May 27, 2023, 7:31 AM

#

fickle hare I agree on his opinion toward this, the gate is not even on the time passing rou...

Yes lets change. Indeed the receptance is taken on the residual track to add things instead of removing

#

I mean nothing stops wkv from being negative but yea the “correct” intuition would be “keeping the negative” then
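
A single-channel sketch of the RWKV-4 time-mixing recurrence makes the point concrete: the receptance multiplies what is emitted at each step and never enters the state update. This is a simplified illustration that leaves out token shift, the output projection and the numerical stabilisation used in real kernels; `time_mix_channel` is a hypothetical name.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def time_mix_channel(r, k, v, w, u):
    """Run one channel over a sequence; return (gated outputs, final state).

    r, k, v are per-step scalars; w is the decay rate and u the bonus
    applied to the current token.
    """
    a, b = 0.0, 0.0  # running numerator and denominator of wkv
    outputs = []
    for r_t, k_t, v_t in zip(r, k, v):
        # Weighted average of past values plus the current token with bonus u.
        wkv = (a + math.exp(u + k_t) * v_t) / (b + math.exp(u + k_t))
        # The receptance only scales the output of this step.
        outputs.append(sigmoid(r_t) * wkv)
        # State update: decay by e^{-w}, then add the current token. r_t is absent.
        a = math.exp(-w) * a + math.exp(k_t) * v_t
        b = math.exp(-w) * b + math.exp(k_t)
    return outputs, (a, b)
```

Running the same keys and values with two different receptance sequences yields identical final states, which is the sense in which the gate sits on the output rather than on the time-passing route.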

tropic minnow May 27, 2023, 7:36 AM

#

tough crane I agree with this. I think that the following paper's method could be regard as ...

I dont agree with this reference however. Rabe et al. (paper u link) compute attention: they just chunk the matrix and compute it iteratively, but they end up with attention; we dont. They dont do any reduction with equal weights. Imo the most straightforward reference for receptance is MLPMixer vs gMLP

tropic minnow May 27, 2023, 7:36 AM

#

last mauve How about you'll do it remote unless they support my travel. Sound good?

Great

tough crane May 27, 2023, 8:07 AM

#

tropic minnow I dont agree with this reference however. Raabe et el (paper u link) computes at...

Thanks!!

#

In the context of MLPMixer vs gMLP, does R act like a time-decaying parametrized version of "token mixer"?

tropic minnow May 27, 2023, 8:23 AM

#

tough crane In the context of MLPMixer vs gMLP, does R act like a time-decaying parametrize...

not really, R is the gating ("g") in gMLP basically (see from: https://arxiv.org/abs/2105.08050v2) but there's nothing about time in MLPMixer-like models. they were designed for images (or sequences, but without specific time inductive biases)

Captura_de_Pantalla_2023-05-27_a_las_10.22.17.png

tropic minnow May 27, 2023, 11:32 AM

#

@paper dove do you have the code/settings for the small init embedding test?

young sparrow May 28, 2023, 7:43 AM

#

I’ve gotten feedback from a bunch of people that the current explication is too dense and it’s hard to understand why decisions are being made. The best way to make progress on this would be for someone who is very familiar with the architecture and its design to flesh out the prose, working in tandem with someone who is less familiar but more experienced with writing. I’m not sure who a good candidate for this would be though.

I also think that having Section 4 reorganized and rewritten by one person would be a big boon to accessibility.

#

@obsidian quest have you been able to run my adapted implementation? If it works we can start scaling laws experiments with much less manual work.

#

While doing the aforementioned modifications to the training code I learned several important details that are not described anywhere in the paper currently. I can add them, though I want to note that I’m approaching the level of contribution where I would like to be included as a coauthor (attn: @obsidian quest @tropic minnow @last mauve)

fickle hare May 28, 2023, 7:55 AM

#

young sparrow I’ve gotten feedback from a bunch of people that the current explication is too ...

For someone "very familiar with the architecture", I believe @uneven blade @neon night and myself should be capable. I'm willing to help but cry and cycle should both write better than me.

#

BTW, what exactly wasn't mentioned in the paper?

young sparrow May 28, 2023, 7:57 AM

#

fickle hare For someone "very familiar with the architecture", I believe <@61816061758013441...

I almost feel that a weaker writing ability for the architecture expert is a plus, insofar as it forces good communication between you and someone who is better at writing lol.

young sparrow May 28, 2023, 7:58 AM

#

fickle hare BTW, what exactly weren't mentioned in the paper?

There’s no discussion of the learning rate in the paper, which is pretty problematic as it’s rather non-standard in implementation.

#

The actual trained models also lack the infinite context that the paper claims, per my convo with BlinkDL. If the models don’t have it we shouldn’t claim it even if a “less lazy” (his words, not mine) implementation would have it

fickle hare May 28, 2023, 8:01 AM

#

Oh, I see. Speaking of training stuff, I think there is also some customized data loading order (my_pile_stage, etc.), but I don't think Blink has described that anywhere.

young sparrow May 28, 2023, 8:02 AM

#

There’s also no mention of DeepSpeed or ZeRO in the paper currently

#

Instead there’s a vague “oh this parallelizes easily” assertion

fickle hare May 28, 2023, 8:05 AM

#

As general optimizations on distributed data parallelism, I think just mentioning them while describing the implementation would be okay

#

Also the gradient checkpointing is implemented via DeepSpeed, but I don't know if Blink has been using it in his pretraining

tropic minnow May 28, 2023, 8:19 AM

#

young sparrow While doing the aforementioned modifications to the training code I learned seve...

im ok with that

young sparrow May 28, 2023, 8:23 AM

#

fickle hare As general optimizations on distributed data parallelism, I think just mention t...

I do too (though it would also be nice to explain specifically why they can’t be used with RNNs as well)

#

The point isn’t that it’s a lot of work, simply that these are important details currently missing

tropic minnow May 28, 2023, 8:25 AM

#

i would say the current manuscript focuses on RWKV as a component used to later build a language model and prove that it is effective for it. if i understand correctly, you want to: add more details/specs about RWKV-LM (learning rate, frameworks, training setup, etc) and unify/harmonize/simplify architecture explanation

tropic minnow May 28, 2023, 8:28 AM

#

young sparrow The actual trained models also lack the infinite context that the paper claims, ...

this is not really true? they dont lack infinite context length. they are just not trained with that. Nothing prevents you from taking a trained RWKV model and generating sequences of 30K tokens. The problem is that it was not trained on such long sequences, so the output might not be very useful. But the good thing about RWKV is that the number of parameters doesnt depend on sequence length, so the same model can be used for very long or very short sequences, just like an RNN.

young sparrow May 28, 2023, 8:29 AM

#

tropic minnow i would say the current manuscript focuses on RWKV as a component used to later ...

I don’t know what it means for this to not be a paper about RWKV-LM

#

It’s literally a paper about language modeling. That’s the only benchmark used anywhere in the paper and the primary draw

tropic minnow May 28, 2023, 8:30 AM

#

young sparrow I don’t know what it means for this to not be a paper about RWKV-LM

nothing actually. was just describing the current situation and the proposed changes

young sparrow May 28, 2023, 8:37 AM

#

tropic minnow this is not really true? they dont lack infinite context length. they are just n...

Blink said it did? IDK I’m deferring to him

obsidian quest May 28, 2023, 8:53 AM

#

tropic minnow this is not really true? they dont lack infinite context length. they are just n...

yes "they are just not trained with that"

#

someone in RWKV discord trained with 100k ctxlen without issues

#

see https://wandb.ai/nathanwilce/raccoonlongctx #998539369919025212 message

young sparrow May 28, 2023, 9:00 AM

#

That’s fine, but the point is that the paper doesn’t justify the claims about infinite sequence length. We can include these models, we can add mathematical arguments, we can add scaling tests. We need to add something though

#

You don’t get to appeal to evidence not introduced in the paper to justify claims made in the paper. The fact that evidence exists somewhere doesn’t make the argument correct.

obsidian quest May 28, 2023, 9:02 AM

#

we can train some very long ctxlen models, or improve the cuda to support infinite ctxlen

fickle hare May 28, 2023, 9:04 AM

#

state chaining kernels + temporal gradient checkpoint would work well enough for any long sequence imo, yet we need to do that training

#

(if we want to claim the infinite sequence feature)

#

the simplest thing to do might be weaken "infinite" to "architectural change is not required for extending sequence length", and demonstrate the result from existing models with different supported seqlen

tropic minnow May 28, 2023, 9:12 AM

#

young sparrow That’s fine, but the point is that the paper doesn’t justify the claims about in...

for the paper, i think it would be enough with a mathematical proof, which is "low cost" as it requires no training nor evaluation. and it is basically the level of knowledge we have. bc we havent trained these super long ctx models and rushing them for the paper doesnt seem ideal.
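
One candidate for such an argument, sketched under the RWKV-4 formulation of wkv and not meant as a settled statement of what the paper should prove: the sum over the whole history collapses into a two-number state per channel, so the cost per token and the state size do not grow with position t.

```latex
% wkv as a weighted average over the history, with decay w and bonus u
wkv_t = \frac{\sum_{i=1}^{t-1} e^{-(t-1-i)w + k_i}\, v_i + e^{u + k_t}\, v_t}
             {\sum_{i=1}^{t-1} e^{-(t-1-i)w + k_i} + e^{u + k_t}}

% Define the running sums (a_0 = b_0 = 0):
a_t = e^{-w} a_{t-1} + e^{k_t} v_t, \qquad
b_t = e^{-w} b_{t-1} + e^{k_t}

% By induction, a_{t-1} and b_{t-1} equal the two history sums above, hence
wkv_t = \frac{a_{t-1} + e^{u + k_t}\, v_t}{b_{t-1} + e^{u + k_t}}
```

This only shows that nothing in the architecture bounds the sequence length; whether a model trained on short sequences makes good use of a long history is an empirical question the derivation does not address.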

fickle hare May 28, 2023, 9:17 AM

#

(then the next problem would be what is to be proved)

tough crane May 28, 2023, 9:31 AM

#

IMHO, even if the phrase "infinite context length" is deleted from this manuscript, the linear complexity in Table 1 is a selling point.

obsidian quest May 28, 2023, 12:04 PM

#

young sparrow <@870137517020688415> have you been able to run my adapted implementation? If it...

I will try cosine decay (len = data len) first
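
For comparison, a minimal sketch of a cosine schedule whose period equals the length of the data, reading "len = data len" as the decay spanning the whole run. The defaults reuse the L12-D768 values from the RWKV-LM code comments; the function name is hypothetical.

```python
import math


def cosine_lr(tokens_seen, total_tokens, lr_init=6e-4, lr_final=1e-5):
    """Cosine decay from lr_init at token 0 to lr_final at `total_tokens`."""
    progress = min(1.0, max(0.0, tokens_seen / total_tokens))
    return lr_final + 0.5 * (lr_init - lr_final) * (1.0 + math.cos(math.pi * progress))
```

Unlike the manual constant-then-exponential schedule, this needs only the total token count up front, which is what makes it convenient for a sweep over dataset sizes.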

young sparrow May 28, 2023, 4:41 PM

#

obsidian quest I will try cosine decay (len = data len) first

Can you let me know when it’s launched? Mostly asking because I’m anxious about deadlines 🙂

obsidian quest May 28, 2023, 7:14 PM

#

young sparrow Can you let me know when it’s launched? Mostly asking because I’m anxious about ...

now testing it

last mauve May 28, 2023, 7:42 PM

#

young sparrow While doing the aforementioned modifications to the training code I learned seve...

Added you as an author

obsidian quest May 28, 2023, 7:52 PM

#

https://wandb.ai/blinkdl/RWKV-v4-Scaling L12-D768 1/2/4/8/16/32G tokens

young sparrow May 28, 2023, 8:06 PM

#

obsidian quest https://wandb.ai/blinkdl/RWKV-v4-Scaling L12-D768 1/2/4/8/16/32G tokens

Do you know why the first run crashed?

obsidian quest May 28, 2023, 8:25 PM

#

young sparrow Do you know why the first run crashed?

these are preempted from time to time

#

each crash = about 0.001 increase in loss in early training
