#RWKV-papers

last mauve
#

Yes or no: Are you "Xiangru Tang" in overleaf?

zealous snow
#

yes

last mauve
tropic minnow
#

added points in contributions section for that and rephrased/shortened mine.

#

created a motivation section after intro with some bullet points. feel free to edit

last mauve
#

Thanks @tropic minnow! I'll take a look at these

#

I just added an author list. Lemme know if I missed anyone.

It's formatted kinda ugly rn. If @zealous snow or anyone wants to take a crack at making it less ugly, feel free.

pale nexus
tropic minnow
pale nexus
#

maybe add "with no approximation involved". I think this is important because I believe when you scale your model, approximations start to matter a lot

#

while here, if there is no approximation, scaling shouldn't be a "problem" (at least you are not limited by your attention calculation)

zealous snow
tropic minnow
neon night
spiral minnow
#

Shouldn't we also have author affiliations?

outer vine
#

is there anyone working on in-context learning examples?

pale nexus
#

i think @rustic rivet does

rustic rivet
#

But this might not be a formal in-context learning example, more like showing off the statefulness of this LM

outer vine
#

i am not quite sure what kind of ICL examples are needed in the paper

rustic rivet
#

Microsoft published a paper on why ICL works, and they believe the attention mechanism inside shifts the attention to mimic a meta-optimizer. By tweaking the attention further, they somehow verified the conjecture

#

But for us there is no attention; however, we can still few-shot, then save the state, as shown in the gist

outer vine
#

is this something we want to put in the paper? I remember @obsidian quest has mentioned this, not sure if any follow-ups

rustic rivet
# rustic rivet Microsoft published a paper on why ICL work and they believe it's the attention ...
#

I am also trying to visualize these inside the state variable, to uncover how timemix and channelmix meta-optimize the model, too. However, this feels more like a follow-up blog post than something for the current Overleaf manuscript, which is meant as a major introduction

#

This is as far as I go now

outer vine
#

this is interesting if you could link these two together, meta optimizer in RWKV

rustic rivet
#

Considering the whole meta thingy came from next token prediction training it's really fascinating

outer vine
#

yes, this paper has been accepted to ACL 2023 Findings

#

and this is also a concurrent work, https://arxiv.org/abs/2212.07677

tough crane
#

By the way, there seems to be a comment showing up in the compiled PDF as blue-colored characters

last mauve
#

I just wanted to get an author list up so people could add themselves and point out issues before we're deadline-constrained

spiral minnow
outer vine
#

What is your opinion about ICL examples? @last mauve @spiral minnow

chilly niche
#

Is there anything I can help out with? Y'all need some SuperGLUE fine-tuning experiments? :p

spiral minnow
outer vine
#

yes, as required here

#

#1103039376184852622 message

#

here

spiral minnow
#

Yes, I do think some examples of the model output would be very beneficial to the paper. It currently has a lot of quantitative analysis but is lacking qualitative analysis

#

Do we have some example outputs from LAMBADA? It looks like the paper is very nearly full at the moment, so maybe we can add a bunch of example outputs in the appendix, but just highlight 1-2 of them in the main paper.
It would be really good to have some example continuations that demonstrate the key qualities of this model: fluent and coherent text continuations that maintain quality over long contexts

outer vine
#

in that case, maybe this could go beyond ICL examples. I will see if i can find something. I would first put it in the appendix.

#

BTW, do we have a RWKV icon now?

pale nexus
#

blinkdl profile picture lol

young sparrow
#

The kitty cat?

burnt gulch
outer vine
#

cool, do you have the original image? maybe we could put it in the showcase

burnt gulch
#

@outer vine

tender karma
last mauve
spiral minnow
#

For me it's University of California, Santa Barbara

last mauve
spiral minnow
young sparrow
spiral minnow
#

It looks like we're using multiple phrases to refer to the attention used in this work (scalar attention and linear attention). I think it would be a good idea to concentrate on only one of those terms to not confuse readers. I'm not sure why it's referred to as scalar attention though, as far as I can tell it's actually a vector?

last mauve
tropic minnow
#

does anyone know the author of the last 2 sentences in 5.7 Context? overleaf username: kinetical

tropic minnow
#

"In the context of LLM applications, injecting the context into the model is equivalent to prompt engineering or p-Tuning (Liu et al., 2022). This feature enables one copy of RWKV to serve multiple domains or purposes with an implementation of state cache, minimizing computation overhead"
essentially these lines
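
A minimal sketch of the state-cache idea in the quoted sentences, under loud assumptions: `ToyRNN` is a hypothetical stand-in for a recurrent LM (its state is just a decayed running sum), and `StateCache` is an illustrative helper, not part of any RWKV codebase. It only shows that a context can be processed once and its state reused for several continuations.

```python
import copy


class ToyRNN:
    """Hypothetical stand-in for a recurrent LM; the state is a decayed running sum."""

    def __init__(self, decay=0.5):
        self.decay = decay

    def run(self, tokens, state=None):
        # Start from a fresh state, or continue from a copy of a saved one.
        state = [0.0] if state is None else copy.deepcopy(state)
        for tok in tokens:
            state = [self.decay * state[0] + float(tok)]
        return state


class StateCache:
    """Illustrative cache: process each named context once, then hand out copies."""

    def __init__(self, model):
        self.model = model
        self._states = {}

    def state_for(self, name, context_tokens):
        if name not in self._states:
            self._states[name] = self.model.run(context_tokens)
        return copy.deepcopy(self._states[name])


model = ToyRNN()
cache = StateCache(model)

# The "domain prompt" [1, 2, 3] is processed once; later requests only pay for new tokens.
resumed = model.run([4], state=cache.state_for("domain_a", [1, 2, 3]))
from_scratch = model.run([1, 2, 3, 4])
```

Here `resumed` and `from_scratch` end in the same state, which is what lets one model copy serve several cached contexts.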

tender karma
#

@tropic minnow I reviewed 5.7 Context, referencing the Appendix for details and clarifying the concept in that sentence

broken moth
#

I don't understand the current Table 2. Actually, there are two tables with the same "tab:model_flop_count" label. Is it just a placeholder for the inference results?

mortal latch
#

I made a pass over the article. It seems that the introduction and the motivation overlap quite a bit. Maybe we should merge them into a more concise section?

neon night
#

I just realized token shift is not exactly a residual connection, but more like the structure of causal convolution in WaveNet 🤯
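
A small sketch of why token shift resembles a causal convolution, written with plain lists so it runs without PyTorch. The one-step shift corresponds to the time offset discussed in this channel; the per-channel mixing weight `mu` is an assumption added for illustration. Together they act like a kernel-size-2 causal filter along time, applied per channel.

```python
def token_shift(x):
    """Shift a non-empty (T, C) sequence one step forward in time, zero-filling step 0."""
    channels = len(x[0])
    return [[0.0] * channels] + [row[:] for row in x[:-1]]


def token_shift_mix(x, mu):
    """Per-channel blend of the current and previous token: mu * x_t + (1 - mu) * x_{t-1}."""
    shifted = token_shift(x)
    return [
        [m * cur + (1.0 - m) * prev for m, cur, prev in zip(mu, row, prev_row)]
        for row, prev_row in zip(x, shifted)
    ]


x = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]  # T=3 time steps, C=2 channels
shifted = token_shift(x)                      # step t now holds the input from step t-1
mixed = token_shift_mix(x, mu=[0.5, 1.0])     # channel 1 ignores the past (mu = 1)
```

Each output step depends only on the current and previous input, never on future ones, which is the causal property the WaveNet comparison points at.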

gusty condor
#

A minor typo: should be "LAMBADA" not "LAMGDA"

neon night
#

A revised section 5.6 is available.

neon night
#

I think my part of work is done. I prefer to use an Eleuther AI affiliation. 😁

fickle hare
#

besides, the caption of Figure 5 lacks information on what kind of test the figure is representing.

tender karma
# neon night A revised section 5.6 is available.

I like it; it is more robust for the paper. Still, I think we can keep some soft statement like "% Intuitively, by assigning each token the dual tasks of (1) aggregating all previous information and (2) predicting the next token, shifted channels can focus on the former task, enhancing information propagation." or so

neon night
tough crane
neon night
tender karma
#

I propose cutting section 8. To make it effective we would need to insert comparison graphs for each task in the experiments, which would not bring significant value in the end. I still like the concept of that section, however.

gusty condor
#

Also, excuse me, but I think that the description of LAMBADA is not accurate enough. AFAIK, there is no "set of candidate words" or anything like that. LAMBADA is an open cloze task where one needs to guess the last word of the target sentence from context, without being given any choices.

tropic minnow
fickle hare
#

Is Section 8 unfinished?

tender karma
# fickle hare Is Section 8 unfinished?

If by section 8 you mean Fundamental Experiments, yes, it is unfinished, as it would take a lot of space to insert graphs comparing to LSTM and GRU without creating significant benefit. I commented it out.

#

Please all, in the Author Contributions use labels and not explicit numbers 🙃

fickle hare
#

I mean this one

tender karma
fickle hare
#

@uneven blade would you mind adding a causal trace for the same example using some transformer model, to provide a comparison against the transformer about the information propagation?

#

And is it LAMBADA or LAMBDA? It's been renamed to LAMBDA everywhere now, even including the file name acc_lambda.png

pale nexus
#

lambada

fickle hare
#

LAMBDA without A occurs in Section 6, Figure 4 caption, Appendix H, and several labels and file names

tough crane
#

Why are section 2 Motivation and section 1 Introduction separated?

tropic minnow
# fickle hare

yes it is. it needs a plot and a reference to an appendix where a table will capture the numbers

neon night
fickle hare
#

The Scaling Laws figure (currently Figure 6) seems lossy. Could someone plot svg/pdf for the three plots?

neon night
#

I think it is better to use the same color scheme as the referenced paper.

fickle hare
#

maybe gather the plotting script and redo all the plots

broken moth
obsidian quest
neon night
#

You can also call it temporal residual connection, I've searched this term and some video AI papers do use this concept.

obsidian quest
# gusty condor

all RWKV models are trained with ctx1024 by default, and then some of them are finetuned to longer ctxlens

Note longer ctxlen usually slightly hurts (!) these benchmark tasks because they only care abt short ctxlens

#

Note long ctx models have seen more tokens (1+ epoch)

model         params (B)  LAMBADA ppl  AVERAGE  LAMBADA acc  PIQA    StoryCloze16  Hellaswag  WinoGrande  arc_challenge  arc_easy  headQA  openbookQA  sciq    triviaQA  ReCoRD  COPA
RWKV-4,ctx1k  3           5.24         57.52%   63.94%       73.72%  70.28%        59.63%     59.43%      31.83%         64.27%    28.74%  37.60%      85.70%  11.07%    80.56%  81.00%
RWKV-4,ctx4k  3           5.25         57.93%   63.96%       74.16%  70.71%        59.89%     59.59%      33.11%         65.19%    28.45%  37.00%      86.50%  11.68%    80.87%  82.00%

model         params (B)  LAMBADA ppl  AVERAGE  LAMBADA acc  PIQA    StoryCloze16  Hellaswag  WinoGrande  arc_challenge  arc_easy  headQA  openbookQA  sciq    triviaQA  ReCoRD  COPA
RWKV-4,ctx1k  14.2        3.81         63.54%   71.05%       77.42%  75.57%        70.24%     62.98%      38.31%         70.71%    32.28%  40.60%      90.10%  24.06%    85.73%  87.00%
RWKV-4,ctx4k  14.2        3.88         63.46%   70.10%       77.64%  75.52%        70.66%     64.17%      38.82%         70.29%    32.35%  40.40%      89.90%  24.42%    85.67%  85.00%
RWKV-4,ctx8k  14.2        3.86         63.71%   70.83%       77.48%  76.06%        70.65%     63.85%      38.99%         70.24%    32.64%  41.80%      90.40%  24.58%    85.67%  85.00%
#

However, the 14B ctx8k model seems noticeably better when interacting with users
Unfortunately, this cannot be shown by any current benchmark task

outer vine
#

hi @obsidian quest , do you have some personally preferred cases/examples to be shown in the paper?

obsidian quest
#

@uneven blade has plenty of cool examples

rustic rivet
#

For some reason RWKV is somehow very good with math, especially marking-down things @obsidian quest

outer vine
#

cool, just put them here and i will add them to the paper appendix

#

i believe examples with long ctx would be more illuminating

outer vine
rustic rivet
tropic minnow
last mauve
neon night
#

However, I do think the total space of Introduction and Motivation needs to be constrained

#

Based on our title "RWKV: Reinventing RNNs for the Transformer Era", the Introduction should immediately address aspects like the first coming of RNNs and the Transformer Era we are now in.

broken moth
#

I can make a copy of the current intro/motivation somewhere at the end and propose the shorter variant of both without duplicated information

neon night
#

You can also add a contribution part, basically anything that is not introduction and motivation goes into contribution

broken moth
#

Who is atsushi.saito.dec17? I currently work on the introduction/motivation, but I see a lot of changes going on. I am not sure that it is a good idea to remove the names of most recognizable LLMs (GPT-3, GPT-4, ChatGPT, LLaMA) if we want this paper to be easily found on Google Scholar

paper dove
#

@rich raptor

rich raptor
young sparrow
#

Wow this paper is coming along really well, y'all're doing great work.

#

I can go through and do an editing pass, leaving comments and suggestions, later today

tough crane
broken moth
#

I didn't mean citations, but model names, Google indexes more by the paper's content.

For the moment, I found that it would be difficult to separate Motivation from Introduction so cleanly. It lengthens the content because it is hard to avoid repeating the information from Introduction in Motivation. I added a paragraph before the contributions to include the most important part of Motivation. It can still be split out separately, but let me know if you need it done.

regal basalt
#

Um, how many output instances will be presented?

young sparrow
tough crane
outer vine
regal basalt
tough crane
tropic minnow
#

all, pls if you see some issue or conflict or have some suggestion, pls use the comment feature (select a text -> right click -> comment) to provide non-urgent feedback before changing if possible

young sparrow
regal basalt
#

I'll fix the formatting in the cases later

outer vine
#

duplicate

mortal latch
#

For Figure 4, could it be converted to .pdf format with a larger font size? Right now it is hard to read the text in the picture. Same for Figures 5, 6 and 8. I don't mind fixing them if anyone can share the plotting script.

obsidian quest
#

In Table 4, we should only compare RWKV 14B with "GPT-level" 14B (which is an interpolation of Pythia and NeoX numbers)

#

In Cases J, show the last 3 samples + a coding sample + a chat sample

#

Mention RWKV-4 tricks to solve exp(k) overflow
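
One way to avoid the exp(k) overflow is a running-maximum (log-sum-exp style) rescaling. The sketch below is an assumption about how such a trick can look, not a copy of the RWKV-4 kernel: the state names `aa`, `bb`, `pp` follow the draft text quoted in this channel, `w >= 0` is the decay (applied with a negative sign, as in the paper's notation), and `u` is a bonus for the current token. It handles a single channel with plain floats.

```python
import math


def wkv_naive(w, u, ks, vs):
    """Direct evaluation of the weighted average; math.exp overflows for keys above ~709."""
    out = []
    for t in range(len(ks)):
        num = math.exp(u + ks[t]) * vs[t]
        den = math.exp(u + ks[t])
        for i in range(t):
            weight = math.exp(-(t - 1 - i) * w + ks[i])
            num += weight * vs[i]
            den += weight
        out.append(num / den)
    return out


def wkv_stable(w, u, ks, vs):
    """Same quantity as a recurrence; aa and bb are stored scaled by exp(-pp)."""
    aa, bb, pp = 0.0, 0.0, -math.inf
    out = []
    for k, v in zip(ks, vs):
        # Output: combine the decayed past (aa, bb) with the current token boosted by u.
        p = max(pp, u + k)
        e1 = math.exp(pp - p)
        e2 = math.exp(u + k - p)
        out.append((e1 * aa + e2 * v) / (e1 * bb + e2))
        # State update: decay the past by w, then add the current token without the bonus.
        p = max(pp - w, k)
        e1 = math.exp(pp - w - p)
        e2 = math.exp(k - p)
        aa = e1 * aa + e2 * v
        bb = e1 * bb + e2
        pp = p
    return out
```

Every exponent passed to `math.exp` in `wkv_stable` is at most zero, so large keys cannot overflow; the common factor `exp(p)` cancels between numerator and denominator.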

#

Figure 6 needs to be vectorized

rustic rivet
obsidian quest
rustic rivet
#

OK let me try to do this in a draft

#

returning the favor, since you once kindly explained the difference between the 100-line version and the 150-line version

obsidian quest
#

How does the RWKV scaling law compare with GPT?

last mauve
#

Ok all of these are complete except for improving high-level coherence and the inference results (@tropic minnow -- Daily check-in here. Can we get these by Friday or Saturday?).

Some new small work items:
1. Figures 4-6 are too late by a page. Can we bring these up closer to their content?
2. Most people haven't included their affiliations in their contributor appendix section (e.g. "Affiliation: EleutherAI"). If you don't have an organization, university, or company that you'd like to link to this work, you're welcome to put EleutherAI. PLEASE GET THIS DONE WITHIN THE NEXT TWO DAYS. Also, if you contributed to this work and haven't put your contribution section, do so within the next two days. If you forget, we won't be able to add your name to the arxiv release until after the EMNLP double-blind deadline.
3. The first-page author block needs affiliations added. If someone could take care of that it'd help
4. The contributions need to be numbered, and the section describing each contribution should be added to the list item.
5. New paragraphs are all indented. Someone needs to go through and add \noindent or something to remove these.
6. Minor nit, but I think Figure 8 is low resolution?

last mauve
obsidian quest
#

Bo PENG: built RWKV and scaled it from 0.1B to 14B.
Affiliation: can I write my github link 😉

tropic minnow
last mauve
last mauve
tough crane
obsidian quest
tough crane
#

Does anyone know the person who made Fig4??

#

BlinkDL probably pasted the Fig 5 spreadsheet in this channel. Where??

obsidian quest
#

We need to give more credit to Attention Free Transformer because it is an inspiration of RWKV.

obsidian quest
rustic rivet
#

@obsidian quest I rewrote your recursion into formula as in equation 27 to 33, can you confirm I didn't miss anything?

obsidian quest
last mauve
mortal latch
last mauve
obsidian quest
#

AFT: introduces the sigmoid gate (called receptance in RWKV) in linear attention

#

and the sum(exp(K) V) / sum(exp(K)) formulation

tropic minnow
tropic minnow
rustic rivet
#

@obsidian quest I re-read your shared png and realized that I got it wrong the first time, here is a correction:

#

the starting point for the recursion is:

obsidian quest
#

yeah now it's correct

rustic rivet
#

compared to your shared new RNN formula, the only difference is that in your PNG the sign for w is positive, and in here it's negative (I followed the notations earlier in the paper)

#

Cool

#

It's in the paper now, in Appendix B right after introducing the RNN cell.

obsidian quest
last mauve
#

@obsidian quest -- Can you do a pass and make sure there are no technical errors in any figures/equations?

rustic rivet
#

I just put the equations I just added into the huggingface link for a code implementation, damn

obsidian quest
#

yeah this will be a cool example

rustic rivet
#

Saved

#

I didn't even have to cherry pick

#

and it just converted latex into torch

grim linden
#

I have a quick newbie question out of curiosity: can RWKV be seen as an instance of a GNN

rustic rivet
#

hmm no. RWKV doesn't featurize "vertices" or "edges" and it doesn't have as strong a locality inductive bias as typical GNNs

tender karma
#

Does anyone want to fix this? If not, I can proceed. "RWKV is a large language model (LLM) architecture that can be both trained in GPT mode \cite{Radford2019LanguageMA_gpt2} and formulated as an RNN for lighter and faster inference."

#

RWKV enables the development of LLMs, it is not an LLM per se

#

In addition, this "GPT" mode is something we understand because it is in our own project vocabulary... do you think it is clear said that way?

tropic minnow
burnt gulch
misty cedar
#

gpt mode also allows for building the state from a set of tokens in one forward pass

gusty condor
#

This response is incomplete and contains little information.

#

Should it be removed?

#

Also, PIQA, which stands for "Physical Interaction: Question Answering", should be totally capitalized, "PIQA" not "PiQA".

gusty condor
#

I doubt whether we need concrete examples to demonstrate this part of the limitation. What do carefully designed prompts look like? How do responses vary by different prompts?

outer vine
outer vine
#

But is this a verified conclusion? The linear attention makes RWKV more sensitive to prompt?

gusty condor
#

For example:
Prompt 1: Please summarize the following paragraph: <paragraph>

Prompt 2: <paragraph>
Summarize the paragraph above.

fickle hare
#

IMO such parallelism significantly improves the scalability of training, and thus the number of model parameters

paper dove
tough crane
tough crane
paper dove
#

sure, the code is based on @rich raptor 's code, with some plot settings modified.

fickle hare
#

thus the time-sequential part is negligible during temporally parallel training (and for WKV it can be even further parallelized, though unnecessary at this point)

tender karma
#

@fickle hare @burnt gulch @tropic minnow please check my draft attempt. It is not finished, but I am seeking approval on the direction (it is in the main as well):

Although RWKV is a general recurrent network, its current implementation focuses on the task of language modeling (RWKV-LM). The current implementation exploits a feature that distinguishes RWKV from other RNNs: the channel-mixing block does not require any information from previous states and can therefore be applied in parallel to all time steps, greatly reducing the total execution time when the sequence is known in advance (benefiting both the training phase and the processing of the sequence before autoregressive generation). The time-mixing block is in this sense also parallelizable for the computation of the \textit{key}, \textit{value}, and \textit{receptance} vectors, but then requires a sequential scan to update the attention scores \textit{wkv}, \textit{aa}, \textit{bb}, \textit{pp}. We call this approach "GPT"-mode as the model's temporal context surpasses the inherently sequential nature of recurrent networks that in theory precludes parallelization.

RWKV equipped with a simple softmax linear projection layer on top allows building large language models (LLMs) that can be both trained in GPT mode \cite{Radford2019LanguageMA_gpt2} and formulated as an RNN for lighter and faster inference.

#

I think that part of what I am proposing here can be effectively moved into 4.2 Transformer-like Parallelization, so as to introduce this "GPT"-mode there

fickle hare
#

Maybe it's worth mentioning that the "sequential" scan is elementwise (thus embarrassingly parallel) in batch samples and channels, thus already exposes sufficient parallelism (though not in the time dimension)

tough crane
fickle hare
#

that may miss the point that most computation along the time-range iteration is in parallel

tough crane
fickle hare
#

yeah that's my worry

obsidian quest
#

@here The paper looks great now

tender karma
#

I would not touch the computational graph @tough crane . Indeed, the most efficient implementation just executes operations without a cg

tough crane
tender karma
fickle hare
#

I'm personally against computation graph on any algorithmic topic since they are just for autograd and performance optimization, unrelated to the model

#

besides, the construction order of computation graph is unrelated to the execution order, and we are talking about execution order in this context

tough crane
#

I see that it's off-topic here.

fickle hare
#

it's like, we want to say our execution order of both forward and backward is defined by the loop order (layer, t), while loop t is mostly parallel

#

layer by layer, then time-parallel

tough crane
fickle hare
#

because decoder only transformer naturally behaves like this

neon night
fickle hare
#

and gpt is the representative brand among them

fickle hare
fickle hare
# tough crane Ummm, branding...

uh, I mean, when you want to say "this new mode is like the decoder-only transformers", the first short name that comes to mind will be gpt...

neon night
pale nexus
fickle hare
#

not the execution order

outer vine
#

update cases template

fickle hare
#

oh I see it mentions a bit about parallelism...

outer vine
#

this is the current template for case study

fickle hare
#

maybe a bit smaller inner margin for code blocks?

outer vine
#

i think the final format should be in line with the whole paper, so the research lead should give a final decision by directly changing the first template in Appendix J, and i will help change the rest.

tender karma
#

@neon night we need to coordinate a bit to guarantee consistency. we are using rnn mode, gpt mode, parallelization..

neon night
#

@fickle hare He majors in parallel computing. You can coordinate with him about these things.

tender karma
#

Fantastic, thanks

fickle hare
#

The current terminology for these things throughout the RWKV community is quite messy...

#

GPT mode: During training and prompt preprocessing in inference, we do time-parallel execution for all matmul (stack along the time axis, thus embeddings (B * T, C) @ weight (C, C)), only leaving time-sequential WKV (yet fused in a custom CUDA kernel), making it more bandwidth-effective
RNN mode: During decoding in inference, we do the timesteps one by one, like in Transformer decoding with KV cache, thus not using the custom CUDA kernel for WKV as well
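
A toy sketch of the two modes described above, under stated assumptions: the decayed running sum in `scan` is a hypothetical stand-in for WKV (not the real kernel), and `matmul` is a plain-list helper so the example runs without any tensor library. It only illustrates that stacking all time steps into one matmul and then scanning gives the same outputs as stepping through time with one small matmul per token.

```python
def matmul(a, b):
    """(N, C) @ (C, D) with plain lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def scan(keys, decay):
    """Time-sequential but elementwise per channel (stand-in for the WKV scan)."""
    state = [0.0] * len(keys[0])
    outputs = []
    for k in keys:
        state = [decay * s + kc for s, kc in zip(state, k)]
        outputs.append(state)
    return outputs


def time_parallel(xs, weight, decay):
    keys = matmul(xs, weight)          # one (T, C) @ (C, C) matmul for all steps
    return scan(keys, decay)           # only the cheap elementwise scan is sequential


def time_sequential(xs, weight, decay):
    state = [0.0] * len(weight[0])
    outputs = []
    for x in xs:
        (k,) = matmul([x], weight)     # one (1, C) @ (C, C) matmul per step
        state = [decay * s + kc for s, kc in zip(state, k)]
        outputs.append(state)
    return outputs


xs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # T=3 embeddings, C=2
weight = [[1.0, 0.0], [1.0, 1.0]]
parallel_out = time_parallel(xs, weight, decay=0.5)
sequential_out = time_sequential(xs, weight, decay=0.5)
```

The results match step for step; the difference is only in how much of the work is batched along the time axis.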

#

Is that clear enough? I don't know if we are to keep the names in the paper though, maybe it's up to @obsidian quest's decision

#

Once we've decided that, we will need to go through the paper to make it consistent

regal basalt
tough crane
obsidian quest
fickle hare
#

Then "time-parallel mode" for training and prompt processing, "time-sequential mode" for decoding?

tender karma
#

Please check if you'd like to keep part of my content here (I'm okay with you throwing it away):

Although RWKV is a general recurrent network, its current implementation focuses on the task of language modeling (RWKV-LM)\footnote{https://github.com/BlinkDL/RWKV-LM}. The current implementation exploits a feature that distinguishes RWKV from other RNNs: the channel-mixing block does not require any information from previous states and can therefore be applied in parallel to all time steps, greatly reducing the total execution time when the sequence is known in advance (benefiting both the training phase and the processing of the sequence before autoregressive generation). The time-mixing block is in this sense also parallelizable for the computation of the \textit{key}, \textit{value}, and \textit{receptance} vectors, but then requires a sequential scan to update the attention scores \textit{wkv}, \textit{aa}, \textit{bb}, \textit{pp}. We call this approach "time-parallel"-mode as the model's temporal context surpasses the inherently sequential nature of a recurrent network that in theory precludes parallelization.

fickle hare
#

I'd say it depends on the space budget

#

It makes things much clearer, but maybe not really necessary as @neon night has pointed out that the LRU paper already mentioned that

tender karma
#

in my opinion (but I am biased) we can move part of that into the appropriate time-parallel mode section (replacing the Transformer-like..)

tender karma
#

seriously, I'm already glad that the point I raised about terminology was also taken up by @obsidian quest. As well as the point that RWKV can be used for LMs but is not an LM per se. As for writing, I know I am long-winded and don't want to force it given the limited space 🙂

fickle hare
#

I don't really see a difference between the parallelization of LRU and RWKV yet; although RWKV started much earlier than LRU publication

tender karma
#

The difference I see is in the underlying motivation that allows that, which is IMO the model's temporal context

#

(w)

#

btw, no question that the practical point is that the non-linear activation in the RNN recurrence equation can be removed to enable parallel training.
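
A minimal illustration of that point, assuming a scalar recurrence h_t = a * h_{t-1} + x_t (the symbols `a`, `h`, `x` are chosen here for the example only). Without a nonlinearity, each h_t has a closed form that depends only on the inputs, so all time steps can be computed independently; with something like tanh wrapped around the update, this unrolling no longer works.

```python
def sequential(a, xs):
    """Step through time: each h_t needs h_{t-1}."""
    h, out = 0.0, []
    for x in xs:
        h = a * h + x
        out.append(h)
    return out


def unrolled(a, xs):
    """Each h_t computed on its own: h_t = sum over i <= t of a**(t - i) * x_i."""
    return [
        sum(a ** (t - i) * xs[i] for i in range(t + 1))
        for t in range(len(xs))
    ]


xs = [1.0, 2.0, 3.0, 4.0]
seq_out = sequential(0.5, xs)
par_out = unrolled(0.5, xs)
```

The entries of `par_out` do not depend on each other, so in principle they could be evaluated in parallel across time.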

fickle hare
#

This seems to lack context? I'll try updating a version to see if it gets better

tough crane
#

@obsidian quest Fixed table4,5 in appendix.

neon night
#

I think the paper is basically finished but I'm also biased

misty cedar
#

that might be another paper though

fickle hare
#

4.2 and 4.3 updated a lot. Please check if it reads well, thx

#

My Grammarly is not working on Overleaf right now and my English is not so good, so alter the text at will

pale nexus
#

"The Attention Free Transformer (AFT) (Zhai et al., 2021) replaces dot-product self-attention with a computationally efficient alternative based on factorized attention coefficients that maintains global interactions between inputs and the context"
Should we rather say here that AFT is in fact a multi-head attention where 1 feature dimension = 1 head?

obsidian quest
fickle hare
#

No I didn't?

pale nexus
fickle hare
obsidian quest
fickle hare
#

As it's already the 4th section, we are expected to talk more about the details I think?

neon night
#

Sorry, I was giving it an unfair prompt. Now it says it depends

fickle hare
#

Yes, I was modifying it to match the subsection name better. But basically, they are talking about the same "advantage"...

neon night
fickle hare
#

(Is this really the case?)

neon night
#

I think the section titles would be better as "Transformer-like Parallelization" and "RNN-like Inference"

fickle hare
#

Maybe "Transformer-like Parallelization in Time" and "RNN-like Sequential Decoding"?

neon night
#

Figure 6 needs to be png or it is loading very slowly 😩

fickle hare
#

Maybe downsample the points?

neon night
neon night
tough crane
#

@rich raptor Could you share the Fig 4 csv file and the plotting script, if you have them? To re-plot with a slightly larger font, per @mortal latch's comment

tender karma
neon night
#

And I think in case the paper is more than the page limit, the section 6 "Scaling Laws" needs to be moved into appendix.

tender karma
paper dove
neon night
paper dove
tender karma
#

That's for the section titles; then I think we agreed to call the two modes "time-parallel" mode and "time-sequential" mode.

#

This makes a lot of sense to me and we are all happy: consistent and robust names and we also make the connection with transformers and GPT "style"

#

I must say that the part "is implemented as a simple offset in the temporal dimension at each block implemented in PyTorch \citep{paszke2019pytorch} library as \texttt{nn.ZeroPad2d((0,0,1,-1))}.", with this PyTorch code reference, respectfully seems to me a bit randomly thrown in there

#

@neon night @fickle hare did we intentionally remove the "Context" section?

fickle hare
#

(I'm not online when it got removed so..

#

(I don't know what happened to that section

tender karma
#

Again intrinsic bias, but I liked it - reason for removal? if too weak, it is a good fit for the RNN-style

fickle hare
neon night
#

I have to help shorten 4.2 and 4.3 because it's getting longer than I expected again 😅

neon night
#

I didn't remove it, but people want to, because I don't have enough time to revise every section. I work way slower

tender karma
last mauve
#

Ok everyone, we're reaching the finish line for the v1 arxiv. A few new temporary rules:
1. No major changes without explicit approval by me or @tropic minnow.
2. If you remove anything, it needs to be commented so that it remains in the latex. No more deleting from the latex outright.
3. No new authors will be accepted for the arxiv version

last mauve
obsidian quest
last mauve
#

Yeah @tropic minnow -- what's the status of inference? I'm targeting a monday morning arxiv submission so that it goes live before the EMNLP anonymity deadline

neon night
young sparrow
#

Heh, I woke up this morning and went “huh, I guess I don’t have any obligations today I could sit down and seriously contribute to the RWKV paper!”

I’ll still do an editing pass and leave my suggestions, and I want to stress that I’m not asking for special treatment. Congrats everyone on the hard work

neon night
#

Although I think that part about cross attention is also not justified. 😩 @obsidian quest Does RWKV have the capacity to do things similar to what cross attention can do?

tender karma
#

my point is just that, working with the state, the state itself containing the information e.g. of the prompt eliminates the need for cross-attention

#

look at the BART NLI task for zero-shot classification; this is a case where RWKV skips cross attention intrinsically

young sparrow
#

I think it might be a good idea to make a list of such claims / intuitions, remove them from the arXiv version, and add it with real experimental evidence to the EMNLP version

neon night
#

Yes. I think cross attention is very powerful, can do multimodal things like text2image, text2audio. The phrase "eliminates the need for cross-attention" is too strong

young sparrow
#

A lot of papers like this overclaim, and the rigor of our analysis and the scale of the models trained is one of the biggest factors in our favor

#

We can easily train a small CLIP model with RWKV to see what happens though

#

(just not by monday)

neon night
#

I'll make the claim softer for now, until further investigation

tropic minnow
obsidian quest
tropic minnow
young sparrow
obsidian quest
tropic minnow
#

(modulus rwkv-169, this is roughly the state @100 toks generation. will repeat with 256 for all)

fickle hare
#

is >= 1k possible? that might expose a huge difference

tropic minnow
regal basalt
#

how to fix the huge space gap

fickle hare
#

On the recently updated 4.2, there are still some issues:
a) 4.1 is still mentioning GPT mode, needs to get fixed
b) 3.1 overlaps with the new 4.2, need to dedup on either side
c) 4.2 is mostly explaining fig 1c, so adding a ref would be better

outer vine
regal basalt
#

alright

fickle hare
#

Also, the current fig 1c does not really demonstrate how channel-mix executes (just a long green box)...

subtle oak
outer vine
#

honestly, i don't understand this figure

#

there is not even the explanation for green color

fickle hare
tropic minnow
tropic minnow
# outer vine

yea i see. i dont see the point of talking so much about rnns (figure and even putting their equations from papers 20yrs ago) when even the formulation of RWKV as an rnn is in the appendix, and our own rnn-like equations are in the appendix. i would look at shortening that section and push some content into appendices. curious to see what others think. we could also do it for EMNLP and have it like this on arxiv

outer vine
#

can't agree more

#

imo, a figure like this one in the AFT paper would illustrate it better

tropic minnow
regal basalt
outer vine
#

and i think the key point we should emphasize would be the wkv formulation and its relation with attention and recurrence. things like token shift, custom cuda kernel, specific implementation like nn.ZeroPad2d((0,0,1,-1)) are like tricks to improve the performance and efficiency. all my personal opinions. curious to see what you think of this
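
For readers unfamiliar with the `nn.ZeroPad2d((0,0,1,-1))` trick mentioned above: on a `(batch, time, channels)` tensor it pads one zero row at the start of the time axis and crops the last row, so every position sees the previous token's features. A minimal NumPy sketch of the same shift follows. The function name `token_shift` is just for illustration, and NumPy stands in for PyTorch here.

```python
import numpy as np

def token_shift(x):
    """Shift a (batch, time, channels) array one step forward in time.

    Position t receives the features of position t-1; position 0 receives zeros.
    This mirrors what ZeroPad2d((0, 0, 1, -1)) does on the last two axes.
    """
    shifted = np.zeros_like(x)
    shifted[:, 1:, :] = x[:, :-1, :]
    return shifted

# Tiny demo: one sequence, three time steps, two channels.
x = np.arange(1, 7, dtype=float).reshape(1, 3, 2)
print(token_shift(x)[0])
```

The shifted copy is then mixed with the unshifted input inside the time-mix and channel-mix blocks, which is why it reads more like an implementation detail than a core idea.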

tropic minnow
# outer vine and i think the key point we should emphasis would be the wkv formulation and it...

pushing the zeroPad and cuda Kernel to 4.7 Additional Optimizations seems reasonable. will do soon. In parallel, what do you think about shortening a bit the QRNN section in 3. background, perhaps keeping it more high level (removing equations or pushing them to an appendix) . i think we could also expand a bit on 2. Background -> Attention Free Models for the Attention-Free transformer given its parallelism with RWKV time-mixing block

fickle hare
#

+1 on shortening QRNN

outer vine
#

agree

#

personal view, i would expect a picture like this to better show RWKV (apologize for the poor quality of this drawing.)

#

(the red line is an equals sign)

last mauve
young sparrow
young sparrow
#

@paper dove @rich raptor the main argument against using RNNs to my knowledge is this plot from Scaling Laws for Neural Language Models (plus convergence issues?). I think we should have the data to replicate it with Pythia + RWKV? Would that be a light lift to add to the Scaling Laws section?

neon night
young sparrow
tough crane
# outer vine can't agree more

When I suggested including AFT in the background section, it was rejected. At first, the template had a section skeleton with the titles RNN (3.1) and Transformers (3.2).

tough crane
neon night
#

@tender karma Your points about state and cross attention can be added as future work. 😌 and AGI safety

tender karma
tough crane
neon night
outer vine
tender karma
#

Alright let me take a look

#

Following the pinned note: shall I just write as a comment and then you see, or directly as text?

neon night
tough crane
neon night
tender karma
#

Perfect, on it

outer vine
#

I think the figure makes its point in the QRNN paper, but personally i don't think this similar one makes much sense in this paper by simply using different colors to differentiate QRNN and RWKV

neon night
#

Don't add anywhere except 4.6, where I made a draft for you. Don't make new sections @tender karma

outer vine
tough crane
#

@obsidian quest Do you want to include AFT's formulation and figure into the background section?

Possible choices are: (1) replacing 3.1(RNN) with AFT, or (2) adding AFT section into background section 3, or (3) not including (current status).

tropic minnow
young sparrow
neon night
tropic minnow
#

(I dont have test loss by sequence position in test data for rwkv)

young sparrow
young sparrow
#

I've made it about halfway through the paper, but my editing has been derailed by needing to go find many citations that should be in the paper but aren't. This paper doesn't currently cite:

  • Pythia
  • the Pile
  • the Eval Harness
  • OPT
  • BLOOM

to name a few. You cannot use or compare to other people's work in your paper without citing it. The entire paper needs to be reread with an explicit goal of identifying missing citations.
tropic minnow
#

will be cited

young sparrow
#

I left a bunch of comments, I hope they’re helpful.

obsidian quest
young sparrow
#

So the elephant in the room is the scaling laws section. This section is wrong as-is because it follows Kaplan et al’s flawed methodology rather than Hoffmann et al’s improved one, and my original plan was to frame this as an initial exploration with more to come. However the more I think about it the less I think these are really the right plots to show anyway.

  1. The exact parameters of the scaling laws are so context-specific that nobody cares what your numbers are in general.
  2. We know that the optimal trade off for tokens to parameters is likely to change (and specifically shift more in favor of tokens) compared to how it currently is but not by how much
  3. “Scaling laws for RNNs” is not a novel or interesting thing, and is in the original scaling laws papers.

Based on these three points, I think that the best thing to do for this paper is probably to do the same analysis again (how long did it take?) using Pythia models and plot them on the same axes. Hopefully this will show no gap, and therefore provide additional evidence of good scaling. If that can’t be done, we can still replicate this plot from the Cerebras GPT paper because we have the Pythia test set loss values

#

To be clear by “replicate this plot” what I mean is take this plot and add Pythia to it

#

But I do think that the explicit scaling laws calculations should be:
a) pushed to the appendix
b) clearly labeled as a Kaplan et al-style experiment that we plan on following up on in the next version of the paper

tropic minnow
tropic minnow
young sparrow
young sparrow
tender karma
#

@obsidian quest @neon night I enjoyed talking about AGI, but here I really let myself go (although reasoned). It is in 4.6. If you use it, well, if you throw it away, I am just fine 🙂


We speculate that exploration of RWKV state-centric designs can enhance AGI safety. The state (or \textit{context}), summarizing past inputs, might offer not only predictability but also an enhancement in interpretability. Its manipulation can guide behavior and enforce safety. Recurrence supported by temporal "awareness" could lead to stable systems, and state-initiated generation may boost computational efficiency\footnote{In language models, initiating generation from the final post-prompt state could obviate prompt reprocessing, thereby bolstering both efficiency and data security.}. Despite challenges in managing high-dimensional states, these promising leads merit further investigation.

obsidian quest
paper dove
paper dove
obsidian quest
#

RWKV 14B ctx1024

paper dove
misty cedar
#

RWKV

#

I suppose

young sparrow
#

For non-embedding param counts vs model label

#

You’ll need to math FLOPs yourself but it’s easy and there’s a calculator pinned in #scaling-laws if you don’t know how
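
For anyone doing the FLOPs arithmetic by hand, here is a rough sketch. It assumes the common rule of thumb C ≈ 6·N·D from the scaling-laws literature (N non-embedding parameters, D training tokens). That formula is an assumption here, not necessarily what the pinned calculator computes, and the example numbers are hypothetical.

```python
def approx_train_flops(non_embedding_params, train_tokens):
    """Rough training compute, assuming ~6 FLOPs per parameter per token."""
    return 6.0 * non_embedding_params * train_tokens

# Hypothetical example: a 1e9-parameter model trained on 1e10 tokens.
flops = approx_train_flops(1e9, 1e10)
print(f"{flops:.2e} FLOPs")
```

The rule of thumb was derived for transformers, so for RWKV it should be treated as an estimate only.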

paper dove
neon night
young sparrow
neon night
#

Does the Author Contribution section appear on the final paper?

outer vine
#

just out of curiosity, why Johan S. Wind is not the co first author? I learn a lot from his blog, and he wrote the cuda kernel for RWKV

young sparrow
tender karma
tough crane
#

@mortal latch Increased font size in fig:4.

neon night
#

A new 3.2 highlighting AFT 😇 3.2 needs a new title

neon night
fickle hare
#

parallel scan is simple but requires an additional sweep over VRAM

#

it's not always helpful

#

I thought about modifying the impl but end up finding that we already have sufficient channels to parallelize

#

if you want to mention that, maybe cite it and say "with longer sequences there is potential ..."

neon night
#

You know this paper right? LRU also uses this technique

fickle hare
#

I don't know this paper but parallel scan is so simple a technique...

neon night
#

I mean parallel scan over the time dimension

fickle hare
#

yeah I don't know this paper in specific before your message but I always knew the linear recurrent/wkv can be parallel over time dimension
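
To make the "parallel over the time dimension" point concrete, here is a minimal sketch of a parallel (Hillis-Steele style) scan for a generic linear recurrence h_t = a_t · h_{t-1} + b_t with h_{-1} = 0. It is a simplified stand-in, not the actual WKV kernel, and the names `sequential_scan` and `parallel_scan` are illustrative. Each round of the loop touches every element once, which is the extra memory sweep mentioned above.

```python
import numpy as np

def sequential_scan(a, b):
    """Reference: h_t = a_t * h_{t-1} + b_t, computed one step at a time."""
    h = np.zeros(len(b))
    prev = 0.0
    for t in range(len(b)):
        prev = a[t] * prev + b[t]
        h[t] = prev
    return h

def parallel_scan(a, b):
    """Same recurrence in about log2(T) rounds; each round is elementwise over t."""
    a = np.asarray(a, dtype=float).copy()
    b = np.asarray(b, dtype=float).copy()
    step = 1
    while step < len(b):
        # Fetch the partial result ending `step` positions earlier
        # (identity map (1, 0) where no such position exists).
        a_prev = np.ones_like(a)
        b_prev = np.zeros_like(b)
        a_prev[step:] = a[:-step]
        b_prev[step:] = b[:-step]
        # Compose the two affine maps: earlier segment first, then the later one.
        b = a * b_prev + b
        a = a * a_prev
        step *= 2
    return b

rng = np.random.default_rng(0)
a_demo = rng.uniform(0.5, 1.0, size=37)
b_demo = rng.normal(size=37)
print(np.allclose(sequential_scan(a_demo, b_demo), parallel_scan(a_demo, b_demo)))
```

When there are already enough independent channels to keep the GPU busy, the sequential-in-time version does less total work, which matches the point that the scan is not always helpful.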

neon night
#

https://arxiv.org/pdf/1709.02755.pdf in contrast, this is the paper that parallelizes over the channel dimension in a linear RNN. I cited this paper. I think we're using precisely this paper's method

fickle hare
#

(always curious why simply applying some well-known implementation technique will produce a paper in the AI research field, since the day of ring-allreduce introduced to distributed training)

neon night
neon night
neon night
fickle hare
#

WKV could be parallelized through time, but we simply didn't do that because there is already sufficient parallelism

#

The current status is the same though.

neon night
#

I have a question, why is the mode called "time-parallel" while it's not really parallel over time thinkies

#

Anyway I wrote this

tropic minnow
#

anyway i didnt write it

#

yessir

#

a100 80gb

obsidian quest
fickle hare
#

there is really little flops in wkv anyway

paper dove
# young sparrow Pile test set loss for Pythia models: 70M -> 2.504 160M -> 2.186 410M -> 1.971 1...

The lowest test loss on RWKV is 1.75, while on Pythia, the lowest loss is 1.582. Cerebras paper states, "Pile test loss is crossentropy in nats/token. We correct all crossentropy results for different vocabularies to be comparable to the GPT-2 vocabulary." Is it because of the difference in vocabulary size? If so, direct comparison may not be appropriate. May I ask if you have the uncorrected loss from Pythia?

obsidian quest
uneven blade
#

Have we got any agreements on naming consistency? For example, using either time mix, Time Mix, Time Mixing, etc. throughout the paper.

tropic minnow
#

whos author of figure 5? could we see it in log scale for y?

tough crane
#

I am not completely sure about the following comment:

"Edward Raff: This needs a call-forward that RWKV will have parallels/relation to QRNN's design, otherwise section 3.1 reads very weirdly."

No need to compare QRNNs and RWKV in the context of parallelizing RNNs ?

tough crane
uneven blade
#

An example multi-round dialogue that could add to the paper.

tropic minnow
tropic minnow
young sparrow
mortal latch
tropic minnow
#

updated to better represent token-shift

obsidian quest
#

Please update the "Tell me about ravens." result because I have never seen such bad responses on https://huggingface.co/spaces/BlinkDL/ChatRWKV-gradio
a better example:

Ravens are large, black birds with a distinctive white head and neck. They are found in most parts of the world, including North America, Europe, Asia, and Australia. Ravens are known for their intelligence and problem-solving abilities. They are also considered to be symbols of death and transformation in many cultures. Ravens are often associated with the afterlife or death because they have been known to eat carrion or even other birds. In some cultures, ravens are also believed to be messengers of the gods or guardians of treasure.
tropic minnow
#

kk done. examples look quite cool now

tough crane
tropic minnow
#

3.2 Transformers and an Attention Free Variant any reason for equation 8 duplication?

spiral minnow
#

The last 2 paragraphs in Section 4.6 (Harnessing Temporal Structure for Sequential Data Processing) seem like they belong much more in a future work section, or possibly in the conclusion, right?

tropic minnow
#

Peng Zhou, Qihang Zhao, Rui-Jie Zhu, Jiaming Kong, Johan S. Wind, Samuel Arcadinho @bronze frost @snow zealot pls add affiliation to authors section

tender karma
spiral minnow
tender karma
#

perfect and thanks

tropic minnow
#

@mortal latch objections to moving figure 5 to appendix? its currently in scaling laws but i think it illustrates more the long-context side of rwkv rather than scaling?

tropic minnow
spiral minnow
zealous snow
#

Should we add a footnote stating that the order of authors other than the cofirst authors is alphabetical by last name?

#

and can anyone help to add author affiliation?

mortal latch
neon night
#

The appendix about gradient is flawed 😅 let me fix it

gusty condor
#

Suggestion: use log2(context_length).
Also, should the x-axis label be 'Context length' instead of 'Token position'?

mortal latch
gusty condor
#

Also, the first sentence in the abstract:
Transformers have "revolutionalized" almost all natural language processing (NLP) tasks but suffer from memory and computational complexity that scales quadratically with sequence length
should be revolutionized, not revolutionalized

gusty condor
gusty condor
#

It's really clear.

#

"Context length" not "content length"

#

Also, in the y-axis, "x 10^0" is not necessary

mortal latch
#

It has been changed to base 2

#

See the latest version

pale nexus
#

still missing some authors affiliation

mortal latch
#

I have added the affiliation info in the main text. However, for authors without affiliations, they are affiliated with EleutherAI for now. This information will be updated later.

gusty condor
#

Thanks! Should I add my contribution?
\paragraph{Ruichong Zhang - Tsinghua University} Proofreading and typo corrections; advice on \ref{fig:ctxlen_rwkv_loss}.

subtle oak
#

Hi all, I added my affiliation institutes (20, 21), but I found that the space between the author list and the abstract is extremely tight and the vspace command cannot fix it. Can anyone help?

tropic minnow
zealous snow
#

ok i fix it

subtle oak
young sparrow
#

Also, in cases where people have the same last name, the standard in English is to alphabetize by first name. So the end of the list should go Jian Zhu, Peng Zhu, Rui-Jie Zhu.

#

@obsidian quest I know people said you can put whatever affiliation you want, but listing an “organization” that doesn’t exist will cause confusion because people will try to look it up.

young sparrow
#

Is there a reason you don’t want to put either “independent researcher” or “EleutherAI”? I was expecting you to put one of those

obsidian quest
obsidian quest
outer vine
zealous snow
young sparrow
paper dove
zealous snow
#

By the way, may I ask what is your timeline for scaling RWKV to 100B?

#

and the 1.7T data version

tough crane
#

IMO, training for 100B params could be after 20B(GPT-NeoX), 30B(OPT, LLaMA), 60B(OPT, LLaMA), 70B(Chinchilla)

#

But BlinkDL might have a more aggressive plan.

young sparrow
#

We need correct scaling laws studies before making decisions about substantially larger models

tough crane
#

it's like a RWKV version of Pythia.

young sparrow
#

I would love to see RWKV Pythia

tropic minnow
plucky crypt
#

Hi. I ran a few experiments to compare RWKV, ChatGPT and GPT-4. The results are not stunning, but I have included all of them at the end of the appendix with a comment. This is only a draft, so if you agree to attach this section to the final version of the article, I will edit it.

tropic minnow
#

author of Figure 9: Effect of small initialization embedding? can we try having it as EPS or PDF format? so quality is preserved under resize

paper dove
paper dove
obsidian quest
obsidian quest
#

@plucky crypt ok you can include [RWKV-4 w/ GPT prompt] & [RWKV-4 w/ optimized prompt] in Table 6

#

And note that P-tuning can be very effective for RWKV because we can directly tune the full state, and we will do this in follow-up papers

obsidian quest
#

I find even 0.1B RWKV-4 "World" can chat in 100 langs

paper dove
obsidian quest
obsidian quest
neon night
tropic minnow
#

Scientific work published at EMNLP 2023 must comply with the \href{https://www.aclweb.org/portal/content/acl-code-ethics}{ACL Ethics Policy}. We encourage all authors to include an explicit ethics statement on the broader impact of the work, or other ethical considerations after the conclusion but before the references. The ethics statement will not count toward the page limit (8 pages for long, 4 pages for short papers).

we can think about this for the EMNLP. monday soft deadline is about arxiv

gusty condor
#

Excuse me, which specific CPU is used in the experiment of Appendix J?

tropic minnow
obsidian quest
#

better RNN cell graph 🙂 pls update

gusty condor
#

How many cores did it use?

tropic minnow
tropic minnow
#

alright i'll make a pass in a few hours to standardize the remaining rough edges. pls make all planned remaining contributions asap.

regal basalt
#

What's the deadline again?

snow zealot
#

Lambda cloud instance with 30 CPU 200 GiB and a A100 with 40gb

snow zealot
tropic minnow
young sparrow
#

It seems like Appendix F is really important, in that it’s part of what allows us to train RWKV models at large scale. If that’s the case, it should be in the main body

tropic minnow
# young sparrow It seems like Appendix F is really important, in that it’s part of what allows u...

but it is of little novelty compared to attention free transformer: https://arxiv.org/abs/2105.14103

young sparrow
obsidian quest
# tropic minnow done

cool pls fix position of [r_t] and color of [sigmoid]. move [sigmoid] slightly rightward
Move [3] and (X) slightly upward

young sparrow
obsidian quest
young sparrow
obsidian quest
#

while in RWKV it has to be a simple exponential decay

neon night
#

I think AFT is also stable (if w is chosen properly), we are comparing gradient stability against RNNs

#

a new Appendix F shows that AFT's KV operation is stable

obsidian quest
#

both are much better than usual RNNs

neon night
#

Yes we're not so novel against AFT but the AFT paper doesn't prove stability like us

obsidian quest
#

it's natural to arrive at AFT when we linearize QKV attention - the main contribution of AFT is they find sigmoid[Q] & exp[K] is a great combination

young sparrow
obsidian quest
#

I think it wont happen in reality when you train an AFT
AFT is stable. It just has less capacity, so the LM performance is not very good

young sparrow
#

Okay, so Eric’s comments about novelty compared to AFT are irrelevant

neon night
#

We can replace 4.5 by Appendix F. Appendix F is more rigorous than 4.5, just a bit scary

tropic minnow
#

so its basically this conclusion: #1103039376184852622 message

obsidian quest
neon night
tropic minnow
tropic minnow
tropic minnow
tropic minnow
#

@bronze frost TODO: proofreading this is a good moment. pls leave latex comments wherever you find something wrong/(that could be improved)/(that needs details)

#

All, we're approaching the soft deadline for monday. Paper is looking very good. Thanks everyone for your contributions. Now it's about improving those rough edges.

Will do a pass later for standardizing affiliations and author contributions to format specified at section start. Will comment the current ones so information is preserved. Pls make sure information is there.

plucky crypt
tropic minnow
#

how many would you need

plucky crypt
#

ok, I will try to find good prompts for the rest of the datasets and run the experiment, for now I will put -

young sparrow
#

I’m trying to run the Pile test set eval from scratch on Pythia but something seems to be very wrong with the runtime. Going to do some debugging and report back

#

Ah I was using a batch size of 1

#

This seems weirdly low? Pythia 70M

|           Task           |Version|    Metric     | Value  |   |Stderr|
|--------------------------|------:|---------------|-------:|---|------|
|json=train:text:test.jsonl|      0|word_perplexity|133.5446|   |      |
|                          |       |byte_perplexity|  2.0859|   |      |
|                          |       |bits_per_byte  |  1.0607|   |      |
#

@paper dove @obsidian quest how did you compute Pile test loss for RWKV?

tender karma
#

batch size?

young sparrow
#

Pythia-410M

|           Task           |Version|    Metric     | Value |   |Stderr|
|--------------------------|------:|---------------|------:|---|------|
|json=train:text:test.jsonl|      0|word_perplexity|39.8875|   |      |
|                          |       |byte_perplexity| 1.7397|   |      |
|                          |       |bits_per_byte  | 0.7988|   |      |
young sparrow
young sparrow
#

I removed a ton of \vspace commands. Using \vspace is a very crude method for arranging figures. It is a) strongly discouraged in general and b) the absolute last thing you should do on a paper. The removal of over 50 \vspace commands appears to have made no visually obvious changes to the paper

obsidian quest
#

@young sparrow do you have raw token loss for pythia models
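
On relating the harness output above to a raw per-token loss: in the two tables, `bits_per_byte` is the base-2 logarithm of `byte_perplexity`. Turning that into nats per token needs the average number of bytes per token for the tokenizer in question, which is not given in this thread, so `bytes_per_token` below is a hypothetical parameter. A minimal sketch:

```python
import math

def bits_per_byte_from_byte_perplexity(byte_ppl):
    """bits_per_byte is the base-2 logarithm of byte_perplexity."""
    return math.log2(byte_ppl)

def nats_per_token(bits_per_byte, bytes_per_token):
    """Convert a per-byte loss in bits to a per-token loss in nats.

    bytes_per_token is tokenizer- and dataset-dependent (hypothetical here).
    """
    return bits_per_byte * math.log(2) * bytes_per_token

# Values taken from the Pythia-70M and Pythia-410M tables above.
print(bits_per_byte_from_byte_perplexity(2.0859))
print(bits_per_byte_from_byte_perplexity(1.7397))
```

Because per-byte numbers do not depend on the vocabulary, they may be a safer basis for comparing RWKV and Pythia than raw per-token losses from different tokenizers.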

tropic minnow
#

@young sparrow here's the script i used to benchmark time and memory consumption, which downloads weights from HF and loads using the rwkv pip package. maybe its helpful

tender karma
#

It is! Thank you @tropic minnow

tropic minnow
#

credits to @snow zealot for the development hahah

tropic minnow
#

Should we discuss what happened to the Scaling Laws section? I acknowledge there have been previous objections ( #1103039376184852622 message ) and they are commented now. Any reason?

young sparrow
#

I’ve spent a lot of time trying to figure out a way to post-hoc correct them and I can’t find one.

#

I think that a) should be disqualifying in and of itself, but even if it’s not then b) and c) seem to refute any alleged usefulness.

#

Let’s do it right and put it in the EMNLP submission. But there’s a basic responsibility to not put incorrect and misleading information in the preprint.

tropic minnow
young sparrow
#

I’m sorry that I didn’t mention it explicitly again when I made the change.

neon night
#

so the conclusion here should change. Also I changed "draw parallelisms" to "draw parallels"

young sparrow
#

We scale the model to 14B params and compare performance with transformers

#

The fact that we don’t derive explicit scaling laws doesn’t mean we don’t showcase scaling

gusty condor
#

The spelling of "behavio(u)r" lacks consistency

young sparrow
gusty condor
gusty condor
#

Table 6 is too small, barely identifiable

neon night
#

the same appendix J ends weirdly, almost like abruptly, and maybe the indentation should change

neat heron
#

Been going down a RWKV deep dive recently while scouting for good base models to work with. Great coincidence that there happens to be so much discussion around it at this same time 🙂

#

I honestly think it's doing a big disservice by referring to itself as just an RNN. I feel like the fact that it's ultimately derived from Apple's attention free transformer is one of the most interesting aspects but seldom talked about e_think

#

Maybe the AFT isn't the most flattering aspect, but I think that it's just very interesting and catches the eye enough to warrant a deeper dive, at least that's what happened for me

young sparrow
#

Have you read it?

neat heron
neat heron
#

Ok I shall read it now, and i'm happy to hear you guys are stressing that part heavily 🙂

neat heron
#

Gotta go to bed soon so mainly doing slow skimming through the paper, but I'd just like to say that I have one of the most common types of color-blindness, and I approve the colors used for the charts 👍 Very easy for me to distinguish the lines 🙂

#

Overall it's a great looking paper and I love that last couple sentences at the conclusion WICKED
and the fact that it significantly beats ChatGPT in MathQA is seriously impressive, and that's not even the RWKV model trained on 1.7 trillion tokens yet. (or is it?)

uneven blade
#

It's not 🙂

neat heron
# uneven blade It's not 🙂

So much potential, I wish the paper great success and I'll do a deeper dive on it tomorrow, can't wait to fine-tune some insane models on RWKV-V12-14B once it's fully trained on almost 2T tokens 🔥

gusty condor
neat heron
#

Might sound a bit out there in terms of paper discussion, but I saw this mentioned somewhere amongst the HF X Raven announcement a few days ago and found it interesting;
RNNs, or at least the way RWKV does things, seem to be more closely mimicking certain aspects of the brain.

#

specifically in terms of the locality vs non-locality aspects (Transformers being more of the latter, while RWKV and the human brain tend to be more of the former)

tough crane
obsidian quest
gusty condor
#

Not necessarily, RWKV does not look back at previous tokens

gusty condor
neon night
#

@obsidian quest I added this because I think time decay (fig. 9) is inductive bias?

obsidian quest
neon night
pale nexus
#

While many alternatives Transformers have been proposed with similar , ours is the first to back up those claims with models
What are the proposed alternatives ?

tropic minnow
neon night
#

added more citations in 4.5 about tackling gradient problem in RNN

gusty condor
#

Almost deadline?

tropic minnow
last mauve
#

I'm doing a final pass and submitting to arxiv over the next hour

outer vine
#

Hi, I left two comments yesterday, but they haven't been resolved

#

may i just split this into blocks? it seems not a consecutive dialog flow

gusty condor
#

No for reproducibility

last mauve
# last mauve Ok everyone, we're reaching the finish line for the v1 arxiv. **A few new tempor...

@everyone -- There have been a few new authors added since this deadline:

  • Bartłomiej Koptyra
  • Bolun Wang
  • Ruichong Zhang
  • Stanisław Woz

If you're on this list, please DM me and prove that you contributed before the deadline and describe what you did. We want the RWKV community to be authors, but we need to guard against people jumping in before the deadline, adding a comma, and claiming authorship.

If I don't hear back you will be removed.

ionic patio
#

Any more room to contribute?

#

Ah guess not

last mauve
outer vine
young sparrow
tropic minnow
#

for the model evaluated, i think its base rwkv-4 most likely but would be nice to know more @obsidian quest (data comes from: #1103039376184852622 message)

obsidian quest
gusty condor
#

I intentionally set topp=0

#

This is good for reproducibility
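
Why `topp=0` helps reproducibility can be seen in a small sketch. It assumes the usual nucleus-sampling definition (keep the smallest set of top tokens whose cumulative probability exceeds `top_p`); this is not necessarily the exact sampling code used for the paper's examples, and `top_p_filter` is an illustrative name. With `top_p = 0` only the single most likely token survives, so decoding becomes greedy and deterministic.

```python
import numpy as np

def top_p_filter(probs, top_p):
    """Zero out everything outside the nucleus and renormalize."""
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]            # token ids, most likely first
    cumulative = np.cumsum(probs[order])
    exceeds = cumulative > top_p
    # First position where the cumulative mass exceeds top_p; keep all if none does.
    cutoff_index = int(np.argmax(exceeds)) if exceeds.any() else len(probs) - 1
    keep = order[:cutoff_index + 1]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

demo_probs = [0.1, 0.6, 0.3]
print(top_p_filter(demo_probs, 0.0))   # only the top token keeps any mass
print(top_p_filter(demo_probs, 0.7))   # the two most likely tokens survive
```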

young sparrow
obsidian quest
young sparrow
regal basalt
# outer vine

Yea sure I was thinking about making this a set of logic questions when I added this lol (but if it’s too cluttered then nah)

obsidian quest
# young sparrow I’m not sure which of my questions this is supposed to answer

Typical method:

  • ctx1k -> 2k [10B tokens] -> 4k [till almost-plateau] for 1B5 / 3B
  • ctx1k -> 2k [10B tokens] -> 4k [10B tokens] -> 6k [10B tokens] -> 8k [till almost-plateau] for 7B / 14B

The zero-shot numbers are almost unchanged after these.
I computed Pythia numbers with full test samples, and I think all of them are less than 1k tokens.
boreal atlas
#

There is no reference to Fig. 6 (\ref{fig:inference_time}). I would suggest Samuel Arcadinho adding it in Sec. 6.

young sparrow
last mauve
#

The paper has been submitted to arxiv.

gusty condor
#

We might be able to see it at 9AM Beijing time tomorrow morning (UTC+8)

obsidian quest
obsidian quest
last mauve
obsidian quest
last mauve
young sparrow
#

@obsidian quest the current paper is set to be announced on arXiv in 8 hours. Do you have a plan regarding a Twitter thread / announcement?

obsidian quest
obsidian quest
young sparrow
obsidian quest
young sparrow
#

@everyone if you are an author of the paper and are on Twitter, please DM me your Twitter username so I can tag you in the thread when it goes live in six-ish hours.

#

Also, does anyone know what the largest RNN ever trained previous to this is?

young sparrow
young sparrow
torpid token
#

100M

#

Iirc

tough crane
#

Small....

#

All models except for the 5.5B model were trained on the 1 Billion Word Benchmark, approximately 800M tokens of news crawl data from WMT 2011.

Ummm.... : 🤨

young sparrow
young sparrow
#

First draft (some image attachments are planned but it's work to appropriately interweave them in discord)

Everyone knows that transformers are synonymous with language modeling at scale… but what if they weren’t? Over the past two years @obsidian quest and team have been hard at work figuring out how to scale RNNs to unprecedented sizes. Today we are officially announcing a preprint detailing RWKV: a reinvention of the RNN for the transformer era.

Note that this paper is a work in progress, and its release is forced on us by anonymity deadlines. We are planning on continuing to improve and update the paper (including explicitly deriving scaling laws!) and you can come to the discord server for the latest https://discord.gg/z9SGyZE6EE

Claiming that you can match a transformer’s performance is nothing new, and plenty of other papers put forth that claim. What makes RWKV special is that we actually train models up to 14 billion parameters and show consistently competitive performance with token-matched transformers! As far as we know, the largest previous RNN is two orders of magnitude smaller.

RNNs struggle to scale because of how they parallelize, but by making the time decay of each channel data-independent, we are able to parallelize RWKV the same way transformers are during training! After training, it can be used like an RNN for inference.

Our design is largely inspired by the “Attention Free Transformer,” which we realized could be written as an RNN if we use circular matrices as "w" in its formula. AFT alone isn’t able to match GPT’s performance, but inspired by it we continued to make progress on “RNNifying” transformers.

RWKV isn’t without its flaws. While we do approximately match the performance of transformers, our anecdotal experience is that it’s more sensitive to prompts and struggles to incorporate very long range information more than traditional transformers do. We are continuing to work to quantify these phenomena.

Our models are available for download on the @huggingface hub (warning: inference appears to be bugged at time of writing) or you can use our library: https://github.com/BlinkDL/RWKV-LM

[a couple tweets of tags and acknowledgements go here]

tough crane
chilly niche
young sparrow
spiral minnow
young sparrow
#

@everyone if you are an author of the paper and are on Twitter, please DM me your Twitter username so I can tag you in the thread. You have half an hour or whenever I get around to it, whichever happens second.

sharp sonnet
#

Can anyone find the preprint on arxiv? I thought it should've been out at 8 PM EDT today but I am unable to find it

young sparrow
last mauve
#

Is that normal? Maybe it's updating?

young sparrow
last mauve
chilly niche
#

arxiv takes a while to update each day

#

if you're impatient you can watch it slowly process in order of arxiv IDs

neat heron
#

it's all planned as part of their program to get certain emotional reactions out of authors to train their new emotional sentiment analysis model they've been working on /s

young sparrow
#

Current list of authors with names replaced with twitter tags if I have it

@BlinkDL_AI @eric_alcaide @QuentinAnthon15

@AlbalakAlon, @SSamDav, Huanqi Cao, Xin Cheng, Michael Chung, @GrellaMatteo, @kranthigv, Xuzheng He, Haowen Hou, Przemysław Kazienko, kocon_jan, Jiaming Kong, Bartłomiej Koptyra, @lazercuber, @SriIpsit, @FerdinandMom, Atsushi Saito, @XiangruTang, Bolun Wang, Johan S. Wind, Stanisław Wózniak, Ruichong Zhang, @ZhangZhenyuan3, Qihang Zhao, @zp_pengzhou, @lukeZhu20, @Rudd80856040
last mauve
#
young sparrow
paper dove
neon night
#

our circulant matrix looks like this

paper dove
neon night
paper dove
#

RWKV universe is coming

obsidian quest
#

Table 5 AFT-simple should be 1.046 1.209
I am training L12-D512 rwkv to check test loss

#

Figure 4 x-axis wrong params scale

gusty condor
outer vine
#

missing Hadamard product here?

tropic minnow
tough crane
tropic minnow
tough crane
# tropic minnow

Thanks!! This comment suggests that MEGA could have two modes.

tropic minnow
#

@subtle oak ^^^

tough crane
#

If I correctly understand.

subtle oak
#

I actually did not add MEGA’s space and time complexity; I added the table with Transformer, Performer, Linear Transformer, Reformer and AFT-full🤣

tough crane
foggy lake
#

Wow what did i wake up to

#

Their readme is so well filled with awesomeness

tough crane
tough crane
subtle oak
subtle oak
subtle oak
tough crane
#

RWKV(GPT-mode): O(d), O(Td)
RWKV(RNN-mode): current table

subtle oak
#

I think if we use the convolution mode instead of RNN mode, its time complexity will become O(T log(T) d) by FFT, and its space complexity will be O(Td)
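
A minimal sketch of the "convolution mode" idea, under a strong simplification: it only computes a per-channel exponentially decayed running sum s_t = e^{-w} · s_{t-1} + x_t, not the full WKV operator with its key weighting, bonus term and normalization. The function names are illustrative. The recurrent form costs O(Td); the FFT form computes the same values as a causal convolution in O(T log(T) d).

```python
import numpy as np

def decay_recurrent(x, w):
    """RNN-style: s_t = exp(-w) * s_{t-1} + x_t, with x of shape (T, d) and w of shape (d,)."""
    state = np.zeros(x.shape[1])
    out = np.empty_like(x)
    for t in range(x.shape[0]):
        state = np.exp(-w) * state + x[t]
        out[t] = state
    return out

def decay_fft(x, w):
    """Same values via a causal convolution with kernel exp(-k * w), computed by FFT."""
    T = x.shape[0]
    kernel = np.exp(-np.arange(T)[:, None] * w[None, :])   # shape (T, d)
    n = 2 * T                                              # zero-pad so the circular wrap cannot alias
    spectrum = np.fft.rfft(x, n=n, axis=0) * np.fft.rfft(kernel, n=n, axis=0)
    return np.fft.irfft(spectrum, n=n, axis=0)[:T]

rng = np.random.default_rng(0)
x_demo = rng.normal(size=(64, 4))
w_demo = rng.uniform(0.1, 1.0, size=4)
print(np.allclose(decay_recurrent(x_demo, w_demo), decay_fft(x_demo, w_demo)))
```

This only works because the decay `w` is the same at every time step; a data-dependent decay would not reduce to a fixed convolution kernel.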

tough crane
#

Oh, I am wrong because of ignoring reducing/merging costs.

subtle oak
#

But now in GPT mode, it still uses the RNN backbone (if you check the CUDA code)

#

So the complexity will become O(Td) and O(Td) I guess...

#

The convolution mode is actually just a theoretical best approach for parallelization

#

So maybe if we mention MEGA's two modes, we also need to state which mode we mean for RWKV
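
A minimal NumPy sketch of the two modes compared above, under simplifying assumptions: it only computes the per-channel decayed running sum a_t = sum_{i<=t} exp(-(t-i) w) x_i, which is the building block of the wkv numerator and denominator, not the full RWKV layer or the CUDA kernel. The function names and toy sizes are made up for illustration.

```python
import numpy as np

def decayed_sum_rnn(x, w):
    """Recurrent scan: O(T*d) time, O(d) running state."""
    T, d = x.shape
    decay = np.exp(-w)
    state = np.zeros(d)
    out = np.empty_like(x)
    for t in range(T):
        state = decay * state + x[t]
        out[t] = state
    return out

def decayed_sum_fft(x, w):
    """Causal convolution with kernel exp(-n*w) via FFT: O(T*log(T)*d) time, O(T*d) space."""
    T, d = x.shape
    kernel = np.exp(-np.arange(T)[:, None] * w[None, :])  # shape (T, d)
    n = 2 * T  # zero-pad so the circular convolution does not wrap around
    fx = np.fft.rfft(x, n=n, axis=0)
    fk = np.fft.rfft(kernel, n=n, axis=0)
    return np.fft.irfft(fx * fk, n=n, axis=0)[:T]

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 8))
w = rng.uniform(0.1, 1.0, size=8)  # positive decay rates, one per channel
assert np.allclose(decayed_sum_rnn(x, w), decayed_sum_fft(x, w))
```

Both functions return the same (T, d) array; the difference is only in how the work is organized, which is the point of the complexity discussion.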

tough crane
misty cedar
#

RNN mode is just a subset of gpt mode where the inference batch size is 1

subtle oak
outer vine
tough crane
#

Eq. 14 is fixed to use "\odot"

misty cedar
outer vine
misty cedar
tough crane
tough crane
obsidian quest
outer vine
tough crane
neon night
tender karma
#

What are the sections to roll back, improve or add for EMNLP? The limit is 8 pages plus appendix, right?

tough crane
neon night
#

the FFT optimization is mentioned in footnote 3 I guess, while it's faster in theory, in practice O(T) is enough (or not) thinkies

neon night
#

can FFT be useful if we calculate according to this matrix? I think this can mitigate some of RWKV's limitations.
namely, using a circular matrix without causal attention mask for processing prompts to achieve "ring topology" rather than caring about the ordering of the prompt.
just my two cents

fickle hare
#

it has been discussed long ago

#

and is preceded by parallel scan

#

FFT is O(T log T) BTW

#

(O(T) operations in parallel isn't real; you cannot really provide parallelism as large as B*T*C, given that would be millions to billions of elements to compute in parallel)

subtle oak
neon night
#

would you be interested in implementing a CNN inference mode?

neon night
#

a FFT implementation by Jianlin Su 🤔

sullen horizon
neon night
obsidian quest
#

@sullen horizon will add Long Range Arena numbers

karmic tree
#

I was talking with Hugging Face a couple months back about writing a HF blog post explainer for RWKV but have been on paternity leave - is anyone doing that? If not, happy to lead and collab on it!

karmic tree
obsidian quest
#

Table 5 @last mauve

AFT-simple should be: train 1.046 // test 1.209 according to AFT paper

L12-D512 RWKV: train 1.010 (w/dropout) // test 1.178
trained with AdamW wd 0.1, dropout 0.1, bsz 16, initial LR 6e-4

fickle hare
#

brute force exp(n*u) won't really work

misty cedar
# neon night a FFT implementation by Jianlin Su 🤔

Reminds me of the wkv power triangle implementation

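import torch

class wkv_power(torch.nn.Module):
    def __init__(self, dims, T):
        super(wkv_power, self).__init__()
        self.T = T
        # The register_parameter call was cut off in the original message. forward()
        # needs time_decay and time_first; the zero initial values are placeholders.
        self.register_parameter("time_decay", torch.nn.Parameter(torch.zeros(dims)))
        self.register_parameter("time_first", torch.nn.Parameter(torch.zeros(dims)))
        # Causal (lower-triangular) mask, shape (T, T, 1)
        self.register_buffer("mask", torch.ones(T, T).tril().unsqueeze(-1).to(torch.bool), persistent=False)
        # tri[j, k] = j + 1 - k for k <= j and 0 otherwise, shape (T, T, 1)
        self.register_buffer("tri", ((torch.arange(T).expand(T, T)+1).t() -
            torch.arange(T)).tril().unsqueeze(-1), persistent=False)

    def forward(self, k, v, r):
        # k, v, r: (T, dims). vx_kx[0] accumulates exp(k)*v, vx_kx[1] accumulates exp(k)
        vx_kx = k.exp().unsqueeze(0).expand(
            2, k.shape[0], k.shape[1]).clone()
        vx_kx[0] *= v
        # Per-channel decay weights for every (target, source) pair of positions
        t = ((self.time_decay.expand(self.T, self.T, -1)*self.tri).exp()*self.mask)
        # vx_kx[0][0] += state[2]
        # vx_kx[1][0] += state[3]
        rza = torch.einsum("rki,jki->rji", vx_kx, t)
        vx_kx *= self.time_first.exp()
        vx_kx += rza
        vx_kx[0] = r*vx_kx[0]
        vx_kx[1] = 1/vx_kx[1]
        wkv = vx_kx.prod(0)
        # state[2] = rza[0][-1]
        # state[3] = rza[1][-1]
        return wkv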
tropic minnow
tropic minnow
#

(i think it might be faster to use more elementary primitives)

obsidian quest
tropic minnow
young sparrow
#

@obsidian quest if I want to train a RWKV model of X parameters for Y tokens, do you know how I should set the rest of the h params? Is there an approximate formula?

obsidian quest
young sparrow
#

If you don’t know the number but do know the amount of FLOP/second you get during training we can reverse engineer it

obsidian quest
young sparrow
#

What about 2048 context?

obsidian quest
#

same training speed regardless of ctxlen

young sparrow
#

What about a model half the size? Does speed increase linearly?

obsidian quest
#

yes

young sparrow
#

Okay, so for every 1B params 1B tokens it takes 34 hours?

#

Does that sound right

#

(It doesn’t to me…)

#

No, that would mean a 1B model trained on the pile would take over a year

#

Did it take 30 days to train the 14B model?

obsidian quest
young sparrow
#

What is “Gt/day”? Giga tokens per day?

obsidian quest
#

here efficiency = (Gt/day) * (B params) / (#A100s)

young sparrow
#

So efficiency = B tokens x B params / A100 / Day

#

That’s exactly the number I was looking for ^_^

obsidian quest
#

probably still have 20% room for optimization

young sparrow
#

So if we want to spend 15 days doing experiments, we have time for 30 (B params) (B tokens) / A100

#

Woah what are you running on 336 A100s right now o.O

obsidian quest
young sparrow
#

So if I assume we can get 64 A100s for scaling laws experiments, we get 2,000 (B tokens) (B params)

young sparrow
#

Okay, can we get all combinations of the following training runs launched @obsidian quest?

Tokens (B): 1, 2, 4, 8, 16, 32
Params (B): 0.025, 0.05, 0.1, 0.2, 0.4, 0.8

#

Should take only 100 A100-days total

#

(Param counts don’t have to be exact, if you give me a list of actual param counts I can adjust the exact token counts to compensate)
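
A quick sketch of the budget arithmetic for this grid, using the efficiency unit defined above ((B tokens) x (B params) per A100 per day). The efficiency value of 2 is only inferred from the "15 days ... 30 (B params) (B tokens) / A100" message, not a measured number, and the function names are made up for illustration.

```python
from itertools import product

tokens_b = [1, 2, 4, 8, 16, 32]
params_b = [0.025, 0.05, 0.1, 0.2, 0.4, 0.8]

def total_work(tokens_b, params_b):
    """Sum of (B tokens) x (B params) over every combination in the grid."""
    return sum(t * p for t, p in product(tokens_b, params_b))

def a100_days(work, efficiency):
    """efficiency is in (B tokens)(B params) per A100 per day."""
    return work / efficiency

work = total_work(tokens_b, params_b)
# Assumption: 30 units per A100 in 15 days, i.e. 2 units per A100-day.
assumed_efficiency = 30 / 15

print(len(tokens_b) * len(params_b), "runs,", round(work, 3), "units of work")
print(round(a100_days(work, assumed_efficiency), 1), "A100-days at the assumed efficiency")
```

The 36 runs add up to about 99 units of work, which is roughly 50 A100-days at the inferred efficiency, so the grid fits inside the 100 A100-day estimate.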

obsidian quest
#

we can train on minipile https://arxiv.org/abs/2304.08442 do we have a 20b-tokenized version

#

can use the method to generate minipiles of different sizes

tropic minnow
#

Xingjian Du, Leon Derczynski, Bolun Wang pls add your contributions to Appendix A: Author Contributions

young sparrow
grim linden
last mauve
last mauve
grim linden
#

yes, he read the list until koala and then stopped

#

sorry for clickbaiting 🫠

spiral minnow
#

Does it bother anybody else that the contributions section isn't in the same order as author list?

fickle hare
#

someone on Zhihu asked
"Why is time complexity of linear transformers said to be O(Td^2)? Do they assume linear transformers use some d^2 kernel functions?"
I don't understand linear transformers so repost here.

burnt gulch
karmic tree
tropic minnow
#

i think the zhihu guy might be right

subtle oak
#

Yeah actually I assume that the kernel complexity is d^2...

#

The formula of the linear transformer can be represented by this

#

And some papers just multiply K and V first without using a kernel, like cosFormer

subtle oak
# subtle oak

I guess they use the same QKV structure, e.g., multiply K and V first and then Q

#

I apologize for oversimplifying the complexity analysis. If we need a precise estimate, the complexity needs to be replaced with O(Tk^2)

#

but some papers use k = d

#

like cosFormer and Spikformer

#

Maybe we should state the complexity more generally, in terms of k?
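
A small NumPy sketch of the "multiply K and V first, then Q" ordering discussed above, for the non-causal case. The elu(x) + 1 feature map is an assumption picked for illustration (papers differ on this choice), and the function names are made up.

```python
import numpy as np

def feature_map(x):
    """elu(x) + 1, a positive feature map (assumed here for illustration)."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def attention_quadratic(q, k, v):
    """(phi(Q) phi(K)^T) V: builds a T x T matrix, quadratic in sequence length."""
    scores = feature_map(q) @ feature_map(k).T            # (T, T)
    return (scores @ v) / scores.sum(axis=1, keepdims=True)

def attention_linear(q, k, v):
    """phi(Q) (phi(K)^T V): builds a k x d summary first, linear in sequence length."""
    fq, fk = feature_map(q), feature_map(k)
    kv = fk.T @ v                                          # (k, d) summary
    z = fk.sum(axis=0)                                     # (k,) normalizer
    return (fq @ kv) / (fq @ z)[:, None]

rng = np.random.default_rng(0)
T, k_dim, d = 32, 4, 6
q = rng.normal(size=(T, k_dim))
k = rng.normal(size=(T, k_dim))
v = rng.normal(size=(T, d))
assert np.allclose(attention_quadratic(q, k, v), attention_linear(q, k, v))
```

With feature size k and value size d, building the summary takes on the order of T*k*d multiply-adds, which becomes T*d^2 when k = d.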

tough crane
#

Could we re-upload a hot-fixed version to Arxiv?

mortal latch
#

no, it's anon period now. We can only update it after emnlp review

tropic minnow
gusty condor
#

Should we prove the Turing completeness of RWKV?

young sparrow
#

And that paper in particular is extra meaningless because the proof hinges on an assumption that’s not actually true of transformers

#

If you use their formal model but change arbitrary precision to finite precision it stops working

last mauve
tropic minnow
last mauve
tough crane
young sparrow
tropic minnow
young sparrow
sharp sonnet
#

Yes, I agree that we should hold this off until the end of review period

last mauve
#

Ok we'll wait

last mauve
#

Ok so our next work item is the EMNLP deadline on June 23. We need to:

  • Condense what we have to 8 pages
  • Tighten up the storyline
  • Resolve the scaling laws issues that @young sparrow reported

My current thought on a core team for this would be @last mauve, @tropic minnow, @spiral minnow, @zealous snow, @tender karma, @rich raptor, @broken moth since all have enough academic writing experience to lead this rewrite (to clarify, anyone can contribute, but these are the rewrite leads). If you want to be added to or removed from this list, DM me. Once the core team is finalized by the end of the week, I'm going to start assigning sections and working on the EMNLP version with a new overleaf project.

last mauve
misty cedar
#

Can someone point me to the part of the paper that references the modified wkv forward function to alleviate overflow errors?
Anyone trying to reproduce from scratch is going to run into that.
The unmodified wkv formula only works in float64

steady ether
#

@misty cedar This part?

Key search terms are avoid overflow
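
For anyone reproducing from scratch, here is a minimal NumPy sketch of the overflow issue and the usual fix: keep a running maximum exponent p and store the numerator and denominator scaled by exp(-p). It assumes the wkv recurrence with a per-channel positive decay rate w and a bonus u for the current token; the names and shapes are illustrative and this is not the CUDA kernel itself.

```python
import numpy as np

def wkv_naive(k, v, w, u):
    """Direct recurrence; exp(k) overflows for large k in low precision."""
    T, d = k.shape
    a = np.zeros(d, dtype=k.dtype)  # running sum of exp(k_i) * v_i with decay
    b = np.zeros(d, dtype=k.dtype)  # running sum of exp(k_i) with decay
    out = np.empty_like(v)
    for t in range(T):
        e_now = np.exp(u + k[t])
        out[t] = (a + e_now * v[t]) / (b + e_now)
        a = np.exp(-w) * a + np.exp(k[t]) * v[t]
        b = np.exp(-w) * b + np.exp(k[t])
    return out

def wkv_stable(k, v, w, u):
    """Same recurrence, with a and b stored relative to a running max exponent p."""
    T, d = k.shape
    a = np.zeros(d, dtype=k.dtype)
    b = np.zeros(d, dtype=k.dtype)
    p = np.full(d, -1e38, dtype=k.dtype)  # "minus infinity" that still fits in float32
    out = np.empty_like(v)
    for t in range(T):
        # Output: rescale the state and the current token to a common exponent q.
        q = np.maximum(p, u + k[t])
        e1, e2 = np.exp(p - q), np.exp(u + k[t] - q)
        out[t] = (e1 * a + e2 * v[t]) / (e1 * b + e2)
        # State update: decay the old state by w, add the current token.
        q = np.maximum(p - w, k[t])
        e1, e2 = np.exp(p - w - q), np.exp(k[t] - q)
        a = e1 * a + e2 * v[t]
        b = e1 * b + e2
        p = q
    return out

rng = np.random.default_rng(0)
k = rng.normal(size=(16, 4))
v = rng.normal(size=(16, 4))
w = rng.uniform(0.1, 1.0, size=4)
u = rng.normal(size=4)
assert np.allclose(wkv_naive(k, v, w, u), wkv_stable(k, v, w, u))
```

Every exponent passed to np.exp in wkv_stable is at most zero, so nothing overflows even when k is large and the arrays are float32.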

misty cedar
#

Thanks:)

void quartz
#

One of my friends in SF wants to do a podcast episode on RWKV, specifically to highlight alternatives to transformers

https://www.latent.space/podcast

This is in part due to the strong positive reception of the paper (and me pestering them about RWKV for weeks)

Anyone interested? They are hosted in SF and prefer to do the podcast in person, but it can be remote.

It is expected to get very technical (time/channel mixing), covering how things differ from transformers and the pros and cons (aka the paper)

Overall, I do believe it is good exposure for RWKV

(I have asked blink prior to opening up the question here, I also know the host well, and he can prepare in advance the topics so you do not end up surprised or uncomfortable in the podcast)

tropic minnow
tropic minnow
#

hey @obsidian quest i could help launching these experiments #1103039376184852622 message on the cluster if you're too busy but i would need the training settings you're using for the other RWKV models

obsidian quest
young sparrow
#

I think that scaling laws would be a big value add to the paper, but we don't currently have the necessary data to do it correctly

tropic minnow
# obsidian quest ok pls list the experiments you'd like to test

probably these:

Tokens (B): 1, 2, 4, 8, 16, 32
Params (B): 0.025, 0.05, 0.1, 0.2, 0.4, 0.8

Should take only 100 A100-days total (Param counts don’t have to be exact, if you give me a list of actual param counts I can adjust the exact token counts to compensate)

as referenced in #1103039376184852622 message by @young sparrow

tropic minnow
obsidian quest
tropic minnow
# obsidian quest My method: const LR_init for 10~20G tokens, then exponential decay to LR_final I...

nice. are config params here https://github.com/BlinkDL/RWKV-LM/blob/main/RWKV-v4/train.py correct for training models similar to current rwkv-4 models on HF?

#

actually @obsidian quest i think it would be way easier for everybody (you as well lol) and way more reliable (consistency etc) if you launched the training runs on eleuther cluster. i can patch it if you're too busy but likelihood of an experimental mistake increases a lot lol

young sparrow
obsidian quest
#

i decay LR when the loss decrease rate is below a threshold

young sparrow
obsidian quest
#

I begin the decaying of LR when the loss decrease rate is less than "3e-4 per 40M tokens" - just a random threshold
This happens when the model is trained for 10~20G tokens (more so for larger models)

young sparrow
#

I'm having trouble following what that means. Can you state it explicitly, like it's an algorithm?

#

Is it something like this?

if |loss(step[current]) - loss(step[current - 40M tokens])| < 3e-4:
    lr is decreased by ???
obsidian quest
#
if smoothed(|loss(step[current]) - loss(step[current - 40M tokens])|) < 3e-4:
    begin the exponential decay of LR
young sparrow
#

okay so that's for starting when the decay happens

obsidian quest
#

yeah and it's simple exponential decay after this
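
A rough sketch of that rule as an algorithm, under several assumptions that are not confirmed in this thread: the smoothing is an exponential moving average, the window is counted in steps (40M tokens divided by tokens per step), and the decay is timed to reach lr_final at the last step of the budget. All names are hypothetical.

```python
from collections import deque

def lr_schedule(losses, lr_init, lr_final, total_steps, window_steps,
                threshold=3e-4, beta=0.9):
    """Constant LR until the smoothed loss change over the window drops below
    the threshold, then exponential decay from lr_init towards lr_final."""
    history = deque(maxlen=window_steps + 1)  # losses of the last window_steps + 1 steps
    smoothed = None
    decay_start = None
    lrs = []
    for step, loss in enumerate(losses):
        history.append(loss)
        if decay_start is None and len(history) == window_steps + 1:
            # |loss(now) - loss(window_steps ago)|, smoothed with an EMA
            diff = abs(history[-1] - history[0])
            smoothed = diff if smoothed is None else beta * smoothed + (1 - beta) * diff
            if smoothed < threshold:
                decay_start = step
        if decay_start is None:
            lrs.append(lr_init)
        else:
            # Assumed horizon: hit lr_final exactly at the final step.
            frac = (step - decay_start) / max(1, total_steps - 1 - decay_start)
            lrs.append(lr_init * (lr_final / lr_init) ** frac)
    return lrs

# Toy loss curve: falls linearly, then plateaus.
toy_losses = [max(1.0, 3.0 - 0.1 * s) for s in range(100)]
toy_lrs = lr_schedule(toy_losses, lr_init=6e-4, lr_final=1e-5,
                      total_steps=100, window_steps=5, beta=0.5)
```

On the toy curve the learning rate stays at lr_init while the loss is still dropping, and the decay only starts some steps into the plateau, once the smoothed difference has fallen below the threshold.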

#

example (gray = green here)

young sparrow
#

And the decay rate aims to reach the target LR after how many tokens? The size of the remaining dataset?

obsidian quest
young sparrow
#

This is done manually right? I see the following comments in the code currently:


# By default we are using exponential LR decay.
# Here are my suggestions for training.
# Let's say you are training a L6-D512 model.
# 1) Set lr_init = lr_final = 8e-4. Let it run for some mini-epochs, until you feel like reducing LR.
# 2) Check epoch_save_frequency and make sure the partially-trained model is saved. Ctrl+C to stop the run.
# 3) Set lr_init = 8e-4, lr_final = 1e-5, betas = (0.9, 0.999).
# 4) Set EPOCH_BEGIN & LOAD_MODEL to load the partially-trained model. Continue the training.
# 
# For L12-D768, set lr_init = 6e-4. For L24-D1024, set lr_init = 4e-4. For L24-D2048, set lr_init = 3e-4.
obsidian quest
#

yes manually

#

however i think this is mostly useful for small batchsz training. cosine decay is fine for large batchsz

young sparrow
#

bsz = batch size?

#

I see final_tokens=n_epoch*len(train_dataset)*ctx_len

#

If I want to train for a pre-specified number of tokens and then stop, how do I determine how to change this? So my dataset will have more tokens than I actually use

obsidian quest
#

The best method will be to work out a formula that can provide good LR schedules for any [ParamSz - DataSz - BatchSz] combination
For example, I believe the best LR schedule for a tiny DataSz is [constant LR]

young sparrow
#

How big is "tiny"

obsidian quest
#

several G tokens

young sparrow
#

Where is the LR decay type actually set? I see the initial and final LRs, but where do you set it to exponential decay

obsidian quest
#

around 10~20G tokens for pile models

young sparrow
#

No, where in the code

obsidian quest
young sparrow
#

You support warm-up right? So if I wanted to make the switch from linear to exponential happen automatically, I can set the warm-up lr to your preferred constant?

last mauve
young sparrow
#

@obsidian quest I've added (extremely hacky) support for automatically switching from constant LR to exponential decay and custom dataset sizing in my fork. Can you see if it runs as anticipated?

https://github.com/StellaAthena/RWKV-LM

obsidian quest
young sparrow
obsidian quest
#

ok i think it can work

tropic minnow
last mauve
young sparrow
void quartz
fickle hare
#

another person on Zhihu commented that the receptance gate is a gate for output instead of for forgetting

#

I agree with his opinion on this; the gate is not even on the time-passing route

uneven blade
#

Agreed.

tough crane
# fickle hare I agree on his opinion toward this, the gate is not even on the time passing rou...

I agree with this. I think that the following paper's method could be regarded as \sigma(R_i) = 1.0 in RWKV. To consider an extreme case, if R_i is either 0 or 1, then RWKV chooses one of the two: "take" or "skip".

https://arxiv.org/abs/2112.05682

tropic minnow
#

I mean nothing stops wkv from being negative but yea the “correct” intuition would be “keeping the negative” then

tropic minnow
tough crane
#

In the context of MLPMixer vs gMLP, does R act like a time-decaying parametrized version of "token mixer"?

tropic minnow
tropic minnow
#

@paper dove do you have the code/settings for the small init embedding test?

young sparrow
#

I’ve gotten feedback from a bunch of people that the current explication is too dense and it’s hard to understand why decisions are being made. The best way to make progress on this would be for someone who is very familiar with the architecture and its design to flesh out the prose, working in tandem with someone who is less familiar but more experienced with writing. I’m not sure who a good candidate for this would be though.

I also think that having Section 4 reorganized and rewritten by one person would be a big boon to accessibility.

#

@obsidian quest have you been able to run my adapted implementation? If it works we can start scaling laws experiments with much less manual work.

#

While doing the aforementioned modifications to the training code I learned several important details that are not described anywhere in the paper currently. I can add them, though I want to note that I’m approaching the level of contribution where I would like to be included as a coauthor (attn: @obsidian quest @tropic minnow @last mauve)

fickle hare
#

BTW, what exactly wasn't mentioned in the paper?

young sparrow
young sparrow
#

The actual trained models also lack the infinite context that the paper claims, per my convo with BlinkDL. If the models don’t have it we shouldn’t claim it even if a “less lazy” (his words, not mine) implementation would have it

fickle hare
#

Oh, I see. Speaking of training stuff, I think there is also some customized data loading order (my_pile_stage, etc.), but I don't think Blink has described that anywhere.

young sparrow
#

There’s also no mention of DeepSpeed or ZeRO in the paper currently

#

Instead there’s a vague “oh this parallelizes easily” assertion

fickle hare
#

As general optimizations on distributed data parallelism, I think just mentioning them when describing the implementation would be okay

#

Also the gradient checkpointing is implemented via DeepSpeed, but I don't know if Blink has been using it in his pretraining

young sparrow
#

The point isn’t that it’s a lot of work, simply that these are important details currently missing

tropic minnow
#

i would say the current manuscript focuses on RWKV as a component used to later build a language model and prove that it is effective for it. if i understand correctly, you want to: add more details/specs about RWKV-LM (learning rate, frameworks, training setup, etc) and unify/harmonize/simplify architecture explanation

tropic minnow
# young sparrow The actual trained models also lack the infinite context that the paper claims, ...

this is not really true? they dont lack infinite context length. they are just not trained with that. Nothing prevents you from getting a RWKV trained model and start generating sequences of 30K tokens. The problem is that it was not trained with such long sequences, so it might not be very useful. But the good thing about RWKV is there isnt a time dependency in the number of parameters, so the same model can be used for very long or very short sequences, just as an RNN.

young sparrow
#

It’s literally a paper about language modeling. That’s the only benchmark used anywhere in the paper and the primary draw

tropic minnow
young sparrow
obsidian quest
#

someone in RWKV discord trained with 100k ctxlen without issues

young sparrow
#

That’s fine, but the point is that the paper doesn’t justify the claims about infinite sequence length. We can include these models, we can add mathematical arguments, we can add scaling tests. We need to add something though

#

You don’t get to appeal to evidence not introduced in the paper to justify claims made in the paper. The fact that evidence exists somewhere doesn’t make the argument correct.

obsidian quest
#

we can train some very long ctxlen models, or improve the cuda to support infinite ctxlen

fickle hare
#

state chaining kernels + temporal gradient checkpoint would work well enough for any long sequence imo, yet we need to do that training

#

(if we want to claim the infinite sequence feature)

#

the simplest thing to do might be weaken "infinite" to "architectural change is not required for extending sequence length", and demonstrate the result from existing models with different supported seqlen

tropic minnow
fickle hare
#

(then the next problem would be what is to be proved)

tough crane
#

IMHO, even if the word "infinite context length" is deleted in this manuscript, linear order complexity in Table 1 is a selling point.

obsidian quest
young sparrow
obsidian quest
young sparrow
obsidian quest
#

each crash = increase about 0.001 loss in early training