#RWKV-papers

young sparrow
#

How many GPUs are you running on? We can probably spare some dedicated 8xA6000s

obsidian quest
#

8 GPU for each run

young sparrow
#

Let’s see how bad the crashes are and I’ll move things around mid week if needed

tender karma
#

I believe we need to clarify this a bit better. We have RWKV, which is a general-purpose RNN that can potentially replace every LSTM-like model in your projects, and which can be described as a neural model (e.g., the RWKV cell) without even mentioning the LM task. Then we say we focus on RWKV-LM, with the LM on top. Treating it as a pure RNN, I started (and then removed) a fundamental experiments section with the basic tasks used to evaluate RNNs, such as addition, copying, etc. It performed great, of course, but the LSTM performed almost equally well on such simple tasks, so there was no point in the end.

fickle hare
#

so with whom should I work on reorganizing section 4?

steady ether
#

@young sparrow I'd be glad to jump in and help with rewriting an initial draft for Section 4. I've run RWKV a few times, but I've always had some questions about the details. I have a knack for breaking down complex concepts which should be useful here.

misty cedar
#

Someone reading iterations of the paper

#

Note where he gets confused/misunderstands stuff

tough crane
misty cedar
#

I'm halfway through, and I am just noting he is very very confused about what token shift is, so it may be worth elaborating on that ig

regal basalt
#

He seems very confused throughout 20-30% of the paper

young sparrow
#

@obsidian quest something seems to be going catastrophically wrong with the runs

obsidian quest
#

but mostly completed now

tender karma
obsidian quest
neon night
#

@bronze frost Hey, do you think this exp in the cuda backward kernel is not numerically stable:

neon night
#

This is not a problem if we accept small exponentials to be inaccurate in the gradient, since both exp(zexp[i]) and exp(k[i] + o) are less than 1.
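
One common way to guarantee that every exponential stays at or below 1 is to subtract a running maximum before calling exp. Whether the kernel does exactly this is an assumption here, and the names `stable_weights` and `exponents` are illustrative rather than taken from the CUDA source. A minimal Python sketch of the idea:

```python
import math

def stable_weights(exponents):
    """Exponentiate after subtracting the maximum, so no result exceeds 1."""
    m = max(exponents)
    return [math.exp(e - m) for e in exponents]

# Calling math.exp(1000.0) directly raises OverflowError; the shifted form cannot
# overflow, and very negative exponents simply underflow toward 0.0.
weights = stable_weights([1000.0, 999.0, 0.0])
total = sum(weights)
normalized = [w / total for w in weights]
```

The cost of this choice is that tiny terms lose precision or vanish, which matches the "small exponentials are inaccurate" trade-off described above.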

steady ether
#

Going to start making some changes to section 4 for clarity. Let me know if you have any suggestions or concerns.

bronze frost
regal basalt
#

Do we need to clarify the definition for token shift?

young sparrow
regal basalt
#

👌👌

fickle hare
serene badge
#

Hello, I would like to extend some help to revise the paper.
Here are some of my immature suggestions. Please correct me if I missed some details already covered in the paper.

  1. Lack of explanation for scalability:
    We mentioned that RWKV can scale to tens of billions of parameters, but it is not clear how this scalability was achieved.
    We could provide more details about how we optimized the model architecture and training process to achieve such scalability.

  2. Insufficient analysis and visualization of attention weights:
    While section 4 provides some insights into the interpretability of RWKV's attention mechanism, it would be helpful to include a more detailed analysis and visualization of the attention weights.
    We could include visualizations that show how attention weights change over time or across different layers in the model.

fickle hare
serene badge
tropic minnow
fickle hare
young sparrow
serene badge
serene badge
#

@fickle hare I made revisions for section 4. Here's the change log.

  1. Fixed some typos in sections 4.4 and 4.5.
  2. Revised sentences in sections 4.2~4.7.
  3. Changed the title of section 4.4 "Software Implementation" to "Model Implementation and Architecture".
    Suggestion for Figure 2: The font is too small. If the author can provide the original design file, I can help to revise it.
steady ether
#

@tender karma @serene badge @fickle hare @neon night @uneven blade

Looks like everyone really loves section 4. For the rewrite, here's a summary of all the points brought up so far + my own thoughts.

  • "Infinite" context clarification (1-2 people)
    • (Main paper) We should show a math proof, a graph, or at least talk about how this is supported
    • (Appendix) If @obsidian quest or anyone has time to finetune a 7B model with a larger context length and compare it with other models such as MPT-7B-StoryWriter-65k+, this would be extremely helpful
  • Moving definitions into appendix (1 person)
    • We are explaining quite a few things that we could move into the Appendix to save space for more important points
  • Design clarifications (1-2 people)
    • (Main paper) Learning rate, hyper-parameters, optimization techniques
    • Expand on the usage of recurrence, time decay, and token shift.
    • (Appendix) Elaborate
  • Editing and coordinating (1 person)
    • Review and edit the final work to ensure it flows well.
    • Fix abrupt transitions into new concepts.
    • Remove repetitive statements.
young sparrow
#

@steady ether you seem to be confused about the_pile_books3. That’s not the MPT training dataset, it’s a small component of it. It’s also a component that is already in our training corpus

serene badge
#

@steady ether, for the definitions and Design clarifications, I'm thinking that we could use a summary table for all the key features we implemented in RWKV. The format could be like this. Then we could move some definitions or explanations to the appendix section.

steady ether
steady ether
young sparrow
#

Please read what you’re linking to before making claims about it

#

It was trained on 1T tokens of text and code that was curated by MosaicML’s data team

steady ether
fickle hare
#

I can work on the code but I'm not sure if I'll have the GPU-hours to fine-tune on that.

fickle hare
# serene badge @steady ether, for the definitions and Design clarifications, I'm thinki...

Some thoughts:

  1. In "Transformer-like Parallelization," we want to mention the following:
    a. In our training process, most of the computation (all the matrix multiplications and token shift, excluding only the WKV recurrent operator) is parallelized along the time axis, similar to Transformers/QRNN/LRU/... but different from GRU/LSTM/...
    b. The WKV operator has the potential to be parallelized as well through parallel scan (If the long context finetune is accomplished later, it will become "have been" instead of "can be")
  2. In "RNN-like Sequential Decoding," maybe more explicitly compare with the KV cache of Transformers? Instead of "Sequential," we may want to highlight more about the constant time & space despite the sequence length in the subsection title.
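
The "constant time & space" point in item 2 can be sketched for a single channel as follows. This assumes the WKV form discussed in this channel (decay `w` applied per step of distance, bonus `u` on the current token) and skips the numerical stabilization a real kernel needs. The function names are hypothetical, not from the RWKV codebase.

```python
import math

def wkv_naive(w, u, ks, vs, t):
    """WKV output at 0-based step t, summing over the whole history (O(t) work)."""
    num = math.exp(u + ks[t]) * vs[t]
    den = math.exp(u + ks[t])
    for i in range(t):
        weight = math.exp(-(t - 1 - i) * w + ks[i])
        num += weight * vs[i]
        den += weight
    return num / den

def wkv_recurrent(w, u, ks, vs):
    """Same outputs, carrying only two running sums (constant work per token)."""
    a = b = 0.0  # decayed sums of exp(k) * v and exp(k) over past tokens
    outs = []
    for k, v in zip(ks, vs):
        cur = math.exp(u + k)
        outs.append((a + cur * v) / (b + cur))
        a = math.exp(-w) * a + math.exp(k) * v
        b = math.exp(-w) * b + math.exp(k)
    return outs

ks = [0.1, -0.3, 0.5, 0.2]
vs = [1.0, 2.0, -1.0, 0.5]
outs = wkv_recurrent(0.7, 0.3, ks, vs)
```

Unlike a Transformer KV cache, which grows with every decoded token, the state here is just the pair `(a, b)` per channel.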
steady ether
fickle hare
#

@obsidian quest Which checkpoint should I start with if I want to replicate the MPT-7B-StoryWriter-65k+ finetune on our 7B? RWKV-4-Pile-7B-20230406-ctx8192-test949.pth?

serene badge
#

@steady ether @fickle hare, I’ve added some citations to section 4.2.

neon night
#

Just to close a topic.. I found the following form useful in future extensions of RWKV & relatively easy to compute:

neon night
#

For the paper, it probably helps to mention the word "cumulative sum"

tough crane
#

Is it called "time span decayed" cumsum ?

neon night
#

Mind-blowing. If RL is similar to WKV, then a whole bunch of RL techniques can be applied... anyway that's another issue, you write the paper you like

tough crane
# neon night

Exactly !! I'm thinking the same formula 🤣 🤣

neon night
tough crane
tough crane
neon night
# neon night

wkv can be seen as maximizing the normalized reward in the direction of the output label

gusty condor
fickle hare
neon night
fickle hare
#

@obsidian quest I'm implementing long context training with time checkpointing now, and I need some hints around the L2Wrap thing. It seems to be manually scaling the largest element in each token's output logits, in which the scaling factor is related to the total token amount B*T. Should I keep scaling according to the total token amount, even if it would be much larger (~100K-1M, compared to the previous 10Ks) than before?
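
A rough sketch of the behaviour described in the question: only the largest logit of each token receives an extra gradient, scaled by a factor that shrinks with the total token count B*T. The function name and the `coeff=1e-4` default are placeholders for illustration, not confirmed values from the training code.

```python
def l2wrap_extra_grad(logits, coeff=1e-4):
    """Extra gradient for logits of shape [B][T][V]: only the largest logit of
    each token gets a nonzero entry, scaled by coeff / (B * T)."""
    B, T = len(logits), len(logits[0])
    factor = coeff / (B * T)
    grad = []
    for batch in logits:
        rows = []
        for token_logits in batch:
            j = max(range(len(token_logits)), key=token_logits.__getitem__)
            row = [0.0] * len(token_logits)
            row[j] = token_logits[j] * factor  # pushes the max logit toward 0
            rows.append(row)
        grad.append(rows)
    return grad

extra = l2wrap_extra_grad([[[1.0, 3.0, 2.0], [0.5, -1.0, 0.0]]])
```

With this form, growing B*T from tens of thousands to 100K-1M makes the per-token push proportionally smaller, which is the trade-off being asked about.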

gusty condor
fervent onyx
fickle hare
#

No, I'm not. Given the limited resource, I decided to do gradient checkpointing for every subsequence and chain them together. 4~8*80GB VRAM won't enable 100K~1M ctxlen I want.

neon night
#

by fixing the time decay factor (at the fine-tuning stage) for example

fickle hare
obsidian quest
#

It's from PaLM paper (section 5)

steady ether
#

@fickle hare I've just revised section 4.4 for clarity.

Could you help clarify these points in section 4.1?

  1. On what basis can we guarantee that linear interpolation will be beneficial in this context?

  2. I noticed the weight output is denoted as Wo. Do you think it would make sense to rename it to Ww for consistency with the RWKV naming?

uneven blade
#

As for token shift, its benefit is a nontrivial one. In the Hungry Hungry Hippos paper https://arxiv.org/pdf/2212.14052.pdf, they design a "shift matrix" that makes "the state x_i to copy from the input u_i, and then pass that information to the next state x_i+1". They do an Induction Head experiment showing their architecture narrows the gap to transformers on this task.

We can do a similar experiment to show this: whether a 2-layer RWKV with/without token shift is able to learn the Induction Head task with 100% accuracy.

fickle hare
tropic minnow
#

it is likely we might have to cite this: https://arxiv.org/abs/2305.19370 (block-parallel transformer, twitter thread came out today) as it is a development on top of memory-efficient attention we already cite (Rabe & Staats 2022...) with applicability to extend context a lot (up to 64k in the paper)

fervent onyx
#

I've done lots of experiments with the token shift. My main takeaway was that it's playing an important role in token mixing, less so in channel mixing. Its effect on model performance is also non-trivial, in a way that's different for r, k and v. I think its effect in v has a clean interpretation, but not so for k since it lives in the exponent... The shift could be considered as a tiny convolution layer with kernel size 2 and a softmax (only valid when the mixing coeff is positive). When extending it to a larger kernel size, I found that it actually made the model more confused than being helpful, I think due to these non-trivial effects. If we are doing more experiments, it'll be good to crosscheck these observations...

obsidian quest
soft gull
#

Popped up on my recommended: https://youtu.be/x8pW19wKfXQ

#gpt4 #rwkv #transformer

We take a look at RWKV, a highly scalable architecture between Transformers and RNNs.

tropic minnow
#

@young sparrow @obsidian quest how are experiments for scaling laws going?

tropic minnow
#

any progress on this?

#

how is this going @sullen horizon do you need/want help?

obsidian quest
#

he has got good LRA numbers and tuning for better

fickle hare
#

@steady ether I just went through 4.1 and left several comments there. I feel that reorganizing this subsection is really necessary: it basically mixes all architectural designs in a number of paragraphs without clear sectioning. It should be split into several parts, including 1. *former overall architecture, 2. token-shift for both time & channel -mix, 3. output gating for both time & channel -mix, 4. WKV.

#

Besides, I somehow feel that the current writing is still not perfect, maybe after another editing pass we need to call for others' help

#

Seems it's time to split section 4 into multiple sections...

young sparrow
fickle hare
#

BTW I also remember people commenting on our arXiv paper about the lack of an ablation study on the different techniques, including token shift, introducing u in WKV, softmax (exponentials) in WKV, etc.

tropic minnow
fickle hare
#

IDK, I'm in no way familiar with ML research drinkies

#

(I major in HPC and never really worked on a ML paper like this)

young sparrow
#

Not essential, but it would be a nice to have

steady ether
tropic minnow
tropic minnow
fickle hare
#

was just mentioning the necessary bits, not in specific order

serene badge
steady ether
#

@fickle hare @serene badge @tropic minnow

Just reworked 4.1. Let me know if it makes more sense now

serene badge
#

Will revise Figure 2 to increase the font size today.

fickle hare
#

BTW the current 4.2 and 4.3 are too fragmented within the whole paper IMO; we should think about putting them elsewhere, e.g. in the (new) WKV operator subsection

steady ether
fickle hare
#

I think it's worth eliminating 4.2 and 4.3 if we can get the overview to contain necessary information😂

#

4.7 also contains some redundant parts I think

#

the two arch figs and numerous arch formulas might also be unnecessary IMO

steady ether
#

Makes sense, I feel like 4.2 and 4.3 only existed to emphasize that RWKV has "the best of both worlds"

tropic minnow
steady ether
#

After digging into section 4.1, I began to realize that the order of content might be confusing for some readers. We delve into intricate details and then seem to revert back to higher-level concepts.

If we're open to renaming the headers "RNN-like" and "Transformer-like" to something else, I think we can consider the structure in the 2nd image.

tropic minnow
#

i think i like the titles from image with 4.1.1 etc better - they are more objective descriptions and less subjective claims about potential applicability/intention

#

ctx: this one

young sparrow
#

What is the point of Figure 1?

young sparrow
#

The thing that strikes me as weird in the current Section 4 is that “Software Implementation” should probably come last

#

It also sorta feels like 4.2 and 4.5 should be combined, or at least consecutive?

#

The section currently is not systematic. It probably doesn’t matter that much what order we go over the material as long as there’s a clear systematic organization

#
  4. RWKV
    4.1 Architecture
      • Keep current content
      • Compress “4.5 Gradient Stability and Layer Stacking” into a single paragraph and stick it here.
    4.2 Transformer-like Training
      • Keep current content
    4.3 RNN-like Inference
      • Combines “4.3 RNN-like Sequential Decoding” and “4.6 Harnessing Temporal Structure for Sequential Data Processing”
    4.4 Additional Optimizations
      • Keep current content
    4.5 Software Implementation
      • Add a couple missing citations, such as to DeepSpeed
#

We also need to add the basic info about how the model is trained that is currently missing, like talking about LR decay and providing the h params. That can maybe go in between Sections 4 and 5 along with the scaling laws stuff?

steady ether
young sparrow
#

It looks like the diagram has an error: there’s an extra layer norm at the very beginning

steady ether
#

@serene badge You mentioned earlier that you were going to update Figure 2, could you include this?

young sparrow
#

It might also be clearer to define $\tilde{x_t}=\mu x_t + (1-\mu)x_{t-1}$ and do Eq 12-16 in terms of $\tilde{x_t}$

young sparrow
#

Well, I guess that $\mu$ is different between $r/k/v$
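
The interpolation above, with a separate $\mu$ per branch, can be sketched in plain Python. This assumes the predecessor of the first token is an all-zero vector and that $\mu$ is a per-channel vector. The function name and the example values for `mu_r`, `mu_k`, `mu_v` are made up for illustration.

```python
def token_shift(xs, mu):
    """Mix each token vector with its predecessor, channel by channel:
    x~_t = mu * x_t + (1 - mu) * x_{t-1}, with the first predecessor set to zeros."""
    prev = [0.0] * len(mu)
    out = []
    for x in xs:
        out.append([m * a + (1.0 - m) * b for m, a, b in zip(mu, x, prev)])
        prev = x
    return out

xs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]            # 3 tokens, 2 channels
mu_r, mu_k, mu_v = [0.5, 0.5], [1.0, 0.0], [0.25, 0.75]
shifted_r = token_shift(xs, mu_r)
shifted_k = token_shift(xs, mu_k)
shifted_v = token_shift(xs, mu_v)
```

Because each of r, k and v sees a differently mixed input, a single shared $\tilde{x_t}$ would not cover Eq 12-16 without an index on $\mu$.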

steady ether
#

@fickle hare I haven't really looked at the base code since March. Would you mind writing a short paragraph on learning rate/hyperparameters/optimizers in section 4.5? Nothing fancy, just how things are set up right now. We will polish it up later.

A few points I remember that could be relevant

  • There were some issues with model divergence when we upped the context length, right?
  • Something about channels decaying at individual rates based on learned weights and activation during inference
  • There were discussions on LAMB being a possibility, but it probably won't be a game-changer. I can't recall the exact reasons though.
serene badge
#

@steady ether I have adjusted the font size and changed it to PDF for Figure 2.

steady ether
serene badge
#

Do you mean the extra layer norm after Input embedding in the right figure?

steady ether
#

I think it's both that, and also the ones in figure 3. We'll have to address both of these.

serene badge
#

OK. I've removed that layer norm in Figure 2. Will revise Figure 3.

serene badge
#

I have revised Figure 3 to remove the extra layer norm.

tough crane
fickle hare
#

Then it could be moved elsewhere?

fickle hare
fickle hare
obsidian quest
fickle hare
#

@obsidian quest would you please provide the hyperparameters for training on pile? I'm adding the learning rate/optimizer paragraph.

obsidian quest
fickle hare
#

What about the lr_init, lr_final, my_pile_edecay and warmup_steps? I see these are deciding the LR schedule through rather complicated logic.

tropic minnow
#

so the first block would have an extra layernorm. correct me if im wrong

#

if im correct, changes to figures should be rolled back. otherwise, at least fig3 needs a fix here:

tropic minnow
#

@steady ether @serene badge @young sparrow ^^

fickle hare
#

is this really the case? I don't think the current training code implements multi-gpu tensor parallel as Megatron did

tropic minnow
#

@last mauve 👀 👀 👀 i think accessing the full history can have many advantages for resolving unexpected changes and tracking progress over time. could we have it? if you dont want to spend money on this, i think @young sparrow had the paid version so we could transfer ownership. also could we add tracking changes to see who's the author of what modifications?

tropic minnow
young sparrow
obsidian quest
young sparrow
tropic minnow
# young sparrow Oh wild

ok im reverting changes to fig2, 3. @serene badge im using a larger font size for fig2 as u did.

young sparrow
obsidian quest
#

i find it's fine

tough crane
serene badge
young sparrow
young sparrow
young sparrow
tough crane
young sparrow
young sparrow
obsidian quest
tough crane
tropic minnow
young sparrow
#

And makes me worried that there are other things that need to be disclosed in the paper that I haven’t caught yet

tropic minnow
# tough crane Do others say to move section 2 to the appendix?

i don't think we should move the whole of section 2 to appendix. the works described there can be very relevant to readers as they share common objectives with ours. perhaps we could simplify it or move the less relevant part. There's also some work of deduplication to be done, for example this sentence (^attached^) which should go in 3.2 at least (just moved).

#

i agree w @young sparrow on moving figure 1 out of its current place (and placing it in the appendix or hiding it completely). It's odd that the first figure of a paper introducing a novel architecture adds so little to what this arch really is. Especially when figure 3, for example, would be much more pleasant to the eye and help a lot more to understand what RWKV is.

young sparrow
young sparrow
#

@obsidian quest So I’m visualizing the data for the scaling laws

#

And I can slice the data by model size

#

But how do I distinguish between runs that ran for different numbers of tokens?

tropic minnow
#

ping @paper dove

steady ether
tropic minnow
last mauve
#

@young sparrow -- Sent you an overleaf invite. Once you accept I can promote you to owner

obsidian quest
young sparrow
last mauve
obsidian quest
young sparrow
#

Perfect

steady ether
#

@fickle hare Is this accurate? Maybe worth using more precise language and also a mention in your paragraph.

I thought we initialized most of the matrices to zero (at least in the March version)

#

I guess zero is a small value, huh? 😄

#

Nevermind, I was looking at the wrong part of the code

spiral minnow
#

Question about equation 11: If we're summing from i=1 to t-1, should the integer in the parenthesis be (t-i)?
If we sum from i=1 to t-1 and use (t-1-i), then the final element of the sum will be (t-1-(t-1))=0, is that on purpose? My understanding is that the final element should attend to the previous token, so it should be (t-(t-1)) = 1

broken moth
#

this part (Appendix C) should probably be corrected, I left a comment

tropic minnow
young sparrow
#

The separate weight for the current token throws me every time I look at the equation

tropic minnow
# young sparrow The separate weight for the current token throws me every time I look at the equ...

but it is actually what happens in the code. theres a time-associated parameter for all positions except for the immediate previous one (see: https://github.com/BlinkDL/RWKV-LM/blob/main/RWKV-v4/src/model.py#L186). we're just describing there. perhaps we could mention it in the line below that w_t gets its own set of parameters?

young sparrow
tropic minnow
# young sparrow Yeah I mean… maybe the current one can be set to 1?

hmmm i dont think so, at least not without affecting performance. the U is there intentionally bc the time-association of W could be too strong of an inductive bias. i think it could be great if we did a small experiment like the SmallInitEmbedding with this difference and see.

#

thats why i keep asking for this #1103039376184852622 message @paper dove

spiral minnow
young sparrow
#

Yeah the current token goes through u I thought

spiral minnow
#

Yeah, that's what we're saying in the text as well, "U attends to the current token" (paraphrased)

tropic minnow
spiral minnow
#

Okay, I get it now. Seems a bit complicated, but maybe that's just how it needs to be for the model to work. Maybe there's a nicer way to write it though, I'll think about it

spiral minnow
#

Wow, the equations are really throwing me off. The current key is weighted by U, the previous key is unweighted, and the key from 2 timesteps in the past is weighted by W. Is that correct? I guess it makes sense but just seems very unusual

#

And by weighted, I really just mean that it gets a bias added to it so that it's actually scaling the value

fervent onyx
#

Yes that's correct, it's equivalent to if you also reweight the previous token - you just multiply numerator and denominator by exp(w) and rewrite u as log(exp(u+w)-1)
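
A quick numerical check of the simplest version of this reweighting, under the assumption that the decay enters as exp(-(distance) * w) with `w` a positive rate. Multiplying numerator and denominator by the same constant leaves WKV unchanged, so shifting every past exponent by one extra step of decay is absorbed by replacing `u` with `u - w`. The exact rewrite quoted above may use a different sign convention, and the helper name `wkv` is illustrative.

```python
import math

def wkv(k_hist, v_hist, k_cur, v_cur, w, u, prev_offset):
    """WKV for one channel. The most recent past token gets exponent
    -prev_offset * w, and each step further back subtracts another w."""
    n = len(k_hist)
    num = math.exp(u + k_cur) * v_cur
    den = math.exp(u + k_cur)
    for j, (k, v) in enumerate(zip(k_hist, v_hist)):
        steps_back = n - 1 - j  # 0 for the most recent past token
        weight = math.exp(-(steps_back + prev_offset) * w + k)
        num += weight * v
        den += weight
    return num / den

k_hist, v_hist = [0.2, -0.1, 0.4], [1.0, -2.0, 0.5]
k_cur, v_cur, w, u = 0.3, 1.5, 0.6, 0.9

as_written = wkv(k_hist, v_hist, k_cur, v_cur, w, u, prev_offset=0)      # (t-1-i) form
reweighted = wkv(k_hist, v_hist, k_cur, v_cur, w, u - w, prev_offset=1)  # (t-i) form
```

So the (t-1-i) and (t-i) indexings describe the same function, differing only in how the learned `u` is parameterized.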

tough crane
steady ether
#

Just made some grammar/spelling fixes. However, the Future Work/Conclusions section might need a rewrite. Also spotted that we're using abbreviations like 'LLM' without defining them upfront.

uneven blade
#

@tropic minnow @fickle hare After reading section 4.1, I feel like it might be better to have the order Token Shift -> WKV -> Time/Channel Mixing and Output Gating, because the r, k and v vectors used in WKV and others are defined in Token Shift and I feel lost when first seeing the WKV using these. Also, this is the logical order of how things are computed...

fickle hare
fickle hare
#

also my_pile_edecay decides when to start decaying
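
To make the moving parts concrete, here is a hypothetical schedule with the same four knobs: linear warmup, a flat phase, then exponential (log-linear) decay from the initial to the final rate. The actual logic in the training code may differ. `decay_start` stands in for whatever `my_pile_edecay` controls, and all numbers in the usage line are illustrative.

```python
import math

def lr_at(step, total_steps, lr_init, lr_final, warmup_steps, decay_start):
    """Hypothetical LR schedule: warmup, hold at lr_init, then decay to lr_final."""
    if step < warmup_steps:
        return lr_init * (step + 1) / warmup_steps
    if step < decay_start:
        return lr_init
    progress = min(1.0, (step - decay_start) / max(1, total_steps - decay_start))
    # Interpolate in log space so the rate falls by a constant factor per step.
    log_lr = math.log(lr_init) + progress * (math.log(lr_final) - math.log(lr_init))
    return math.exp(log_lr)

halfway = lr_at(550, total_steps=1000, lr_init=6e-4, lr_final=1e-5,
                warmup_steps=10, decay_start=100)
```

Halfway through the decay phase this gives the geometric mean of the two rates, which is the usual signature of an exponential schedule.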

tropic minnow
# fickle hare <@469771066399784971> What's ur opinion?

hmm my take was that token shift is a tiny conv we add to increase performance, whereas main RNN-like properties come from (R)WKV, which is the "attention replacement" we implement and what people might be looking for when they read "a replacement to transformers". However @uneven blade 's point about (time-mixing) token-shift preceding the WKV computation is fair. I think we can go that way if others prefer it too, as long as we're systematic in the description of components it should not matter much

fickle hare
#

Personally I think as long as we highlight the WKV as a replacement for self-attention in the overview before we start diving into details, it will be fine

obsidian quest
fickle hare
fickle hare
#

added a paragraph at the end of 4.5 describing details about loss, learning rate, and optimizer.

#

need to summarize the hyperparameters later in the appendix

tropic minnow
young sparrow
#

The paper currently says

It is noteworthy that FLOPs are independent of the context length, unlike regular transformers.
This is false though? Transformer FLOPs is given by 6PD, no term for the context length.

#

Actually all of Appendix B makes little sense. The equations are self-contradictory, we present what I think are supposed to be three different approximations, and we omit the number of data points entirely.

#

If it's the case that RWKV FLOPs are well approximated by 6PD (just like a transformer) we should derive that and just stop.

#

The text I'm primarily referring to is:

The number of parameters for each model is computed using the formula: $#parameters = 2VD + 13D^2L + D(11L+4)$ where $V$ = 50277 is the vocabulary size, $D$ represents the Model Dimension and $L$ corresponds to the number of layers.

FLOPs is for a forward pass for one token. It was calculated as $6(VD + 13D^2L)$, which is the twice (add and multiply) the number of parameters in linear layers. The backwards pass FLOPs can be approximated as twice that of the forward pass. So the total is $6(VD + 13D^2L)$ per token for training (3x fw FLOPs). It is noteworthy that FLOPs are independent of the context length, unlike regular transformers. The FLOP approximations in this paper are in line with the methodology used by Kaplan et al. (2020).
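
The parameter formula quoted above is easy to evaluate directly. The sketch below also includes the 6 * P * tokens estimate of total training compute that this thread uses. The function names are illustrative, and the three (layers, dimension) pairs are the ones printed elsewhere in this channel.

```python
def rwkv_params(n_layer, d_model, vocab=50277):
    """Parameter count from the quoted formula: 2VD + 13 D^2 L + D(11L + 4)."""
    V, D, L = vocab, d_model, n_layer
    return 2 * V * D + 13 * D * D * L + D * (11 * L + 4)

def train_flops(n_params, n_tokens):
    """Common estimate of total training compute: 6 * parameters * tokens."""
    return 6 * n_params * n_tokens

for n_layer, d_model in [(12, 1024), (24, 1024), (24, 1536)]:
    print(n_layer, d_model, rwkv_params(n_layer, d_model) / 1e6)
```

In per-token units the estimate is simply 6P, with no dependence on context length in the formula itself.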

fickle hare
#

I think it's pointing the second term

#

Okay i think it's correct. Counting Transformer FLOPs per token involves computing self-attention against the history KVs, which has a cost linear in the history size

#

Why are you only counting the 6PD from the head?

#

(yet the first 6 should be 2 I think)

young sparrow
#

I believe that 6PD is a good approximation for total training FLOP for both models

fickle hare
#

fine. with not really long context attention flops are negligible

fickle hare
young sparrow
#

Sorry my D is “dataset size”

#

Not “hidden dimension size”

#

So in per-token units this would simply be 6P

fickle hare
#

oh i see

#

Parameters

#

got it wrong

young sparrow
#

Which is what the text (but not equations) of the passage I quoted says

fickle hare
#

then the problem is whether to mention the quadratic yet smaller term in transformer FLOPs

#

for transformer it's 'approximate' since it throws the context-growing term away, but for us it's accurately 6P per token

young sparrow
#

What about the D(11L+4) term? It goes away, and I assumed that’s because of the same kind of reasoning

fickle hare
#

it's the token shift and wkv parameters I think

#

okay it's not calculating the elementwise muls and adds now...

#

but they are all constant for each token

#

it's missing and I'll do some calculation for wkv and add that

young sparrow
#

I really don’t think having it exactly matters

fickle hare
#

yeah it's negligible compared with the linear layers

#

It's just... the omitted term for us is constant while for transformer is linear to context length

young sparrow
#

That’s not a real difference

#

It doesn’t make us look better to point it out, it makes it look like we don’t know what matters.

fickle hare
#

I agree

paper dove
#

I have seen some people questioning the initialization settings in RWKV. “Initialization of parameters in the popular RWKV model is done by setting all parameter matrices to zero. It is claimed that this approach avoids the noise introduced during the initial learning phase. However, this practice is highly unreasonable. Initializing parameters to zero can lead to issues such as symmetry problems, vanishing gradients, lack of diversity, and slow convergence speed. In small models, zero initialization is rarely used. Instead, methods like Glorot initialization and Kaiming initialization are commonly employed.”

young sparrow
steady ether
#

There's also a section under How it works mentioning this.

paper dove
#

It seems that this approach is counterintuitive for many people, and perhaps it requires more explanation or persuasion. @obsidian quest

steady ether
#

I vaguely remember this discussion from a past conversation. I believe the key point was that because sigmoid(0) equals 0.5, the weights are able to be updated
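
A toy illustration of that point, not RWKV's actual layers: for a scalar sigmoid-gated product with both weights at zero, the gate evaluates to sigmoid(0) = 0.5, so the value weight still receives a nonzero gradient. All names here are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gated_output(w_gate, w_val, x):
    """Toy scalar layer: a sigmoid gate multiplying a linear value path."""
    return sigmoid(w_gate * x) * (w_val * x)

def grads_at(w_gate, w_val, x):
    """Analytic gradients of gated_output with respect to (w_gate, w_val)."""
    s = sigmoid(w_gate * x)
    d_w_val = s * x
    d_w_gate = s * (1.0 - s) * x * (w_val * x)
    return d_w_gate, d_w_val

d_gate, d_val = grads_at(0.0, 0.0, x=2.0)
```

Here the value path starts learning immediately, while the gate weight only gets a gradient once the value weight has moved away from zero. A scalar toy like this says nothing about the symmetry concern for full matrices.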

#

But yes, more clarification on this point would certainly be good.

paper dove
fickle hare
#

maybe also ablation study? initial iterations on small models would be sufficient

steady ether
#

@tropic minnow I've added the ethics statement that I mentioned earlier. Feel free to review and tweak as needed.

obsidian quest
obsidian quest
tropic minnow
young sparrow
#

I cannot find the bug in my scaling laws code

#

I run this

l = defaultdict(list)
for d in df.keys():
    x = d.split(" ")
    loss = float(df[d].sort_values('Gtokens').tail(1)['loss'])
    layer = int(x[0][1:])
    dim = int(x[1][1:])
    print(layer, dim)
    print(params(layer, dim))
    print("---")
    tok = float(x[2])
    l['L'].append(layer)
    l['D'].append(dim)
    l['T'].append(tok)
    l['loss'].append(loss)
    l['params'] = params(layer, dim)
    l['compute'] = 6 * params(layer, dim) * tok
df = pd.DataFrame(l)

which prints out the expected thing:

12 1024
266.684416
---
24 1024
430.39744
---
24 1536
890.962944
#

The very next cell does this though

fickle hare
#

params and compute columns wrong?

#

is params a pure function?

#

ah i see

#

instead of

    l['params'] = params(layer, dim)
    l['compute'] = 6 * params(layer, dim) * tok

do

    l['params'].append(params(layer, dim))
    l['compute'].append(6 * params(layer, dim) * tok)
young sparrow
#

oooo

#

Thank you

#

Eyyyy look at that beautiful straight line

#

(minus the one point which I think is an overflow error)

tropic minnow
young sparrow
#

Or, "I will run the math after my 1:30 meeting" since I just noticed the time

tropic minnow
#

@obsidian quest for the Ethics statement, would be good to know exactly which data has been used to train Raven-14B beyond The Pile

#

current statement describes:

  • Open Source Data (the pile), publicly available data (raven?)
  • Open source training codebase and lower inference cost (democratization)
  • Efficiency in training (effort to lower cost, "sustainable")
  • Various sizes released (accessible deployment, study of emergent phenomena)
  • Easier to generate AI text (lower cost Chat assistant, fake news, misinformation)
  • Potential replication of biases/harmful content in data (but transformer mitigation strategies should work here as well)
obsidian quest
#

do we have any missing runs

young sparrow
#

For example, the equation I'm getting is quite different from the ones the original experiments had. This is the original experiment

#

Hmmm I think my code might have a bug.

steady ether
# tropic minnow @obsidian quest for the Ethics statement, would be good to know exactly wh...
#

Also, knowing the exact split/iterations would be helpful

young sparrow
#

Here's the data sorted by amount of compute used, and there are clearly runs that are more optimal (17 and 22 are particularly good for example) but there isn't the necessary data density to really get the tradeoffs optimized

#

This becomes especially obvious when you look at x-axes that aren't "compute"

#

I was looking through the Chinchilla paper and found this, which shows all the configs they trained for their paper

#

What they did was set a total FLOP target and train each model for the number of tokens necessary to reach each target, with 9 targets per model.
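
Under the 6 * P * tokens compute estimate used in this thread, that procedure amounts to solving for the token count per (model, FLOP target) pair. The targets below are made-up placeholders, and the two parameter counts are ones printed elsewhere in this channel.

```python
def tokens_for_budget(flop_target, n_params):
    """Tokens to train on so that 6 * P * tokens equals the FLOP target."""
    return flop_target / (6 * n_params)

flop_targets = [1e18, 1e19, 1e20]         # illustrative targets only
model_sizes = [266_684_416, 430_397_440]  # parameter counts seen in this thread
plan = {(p, c): tokens_for_budget(c, p) for p in model_sizes for c in flop_targets}
```

Each model then contributes one loss measurement per target, which gives the data density along iso-FLOP lines that a handful of fixed-length runs cannot.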

#

By contrast we have 7 different models currently

#

So if we can generate more data that would be A+. Just... more models, more # of tokens

#

There is a lower edge to the compute-loss tradeoff currently that's approximately linear. I'm going to try to extract that now

young sparrow
#

It feels like this is the optimal line with the data we currently have

#

Slope: -0.09467861
Intercept: 1.80843822

young sparrow
#

(or in log_10, that's -0.04111839787, 0.7853947398)
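
The two pairs of numbers differ by a factor of 1/ln(10), i.e. log10(e). The snippet below just reproduces that arithmetic. Which quantity was logged in the fit is not stated here, so treating both coefficients as natural-log quantities is an assumption.

```python
import math

slope_ln, intercept_ln = -0.09467861, 1.80843822

to_log10 = 1.0 / math.log(10)  # same as log10(e), about 0.4343
slope_log10 = slope_ln * to_log10
intercept_log10 = intercept_ln * to_log10
```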

#

@obsidian quest to illustrate why this matters, the original value was -0.053. -0.041 vs -0.053 is a huge change

obsidian quest
#

why are these two charts different #1103039376184852622 message #1103039376184852622 message

#

we still need to check if we can actually find a 10^5 compute datapoint on your line lol

bronze frost
bronze frost
# young sparrow Actually all of Appendix B makes little sense. The equations are self-contradict...

Also, while I'm here: I wrote that section in a very early draft (I think it was among the first additions after the tex file was created by someone else) as a kind of internal data table for making plots like the scaling laws plots (with the intent that we agree on one of the approximations for the flops, etc.) But it kinda just stayed there I guess. Feel free to remove it / scavenge it for scraps for other sections.

young sparrow
young sparrow
young sparrow
young sparrow
bronze frost
#

I posted this code, and then @rich raptor made it pretty

fickle hare
#

Some comments on 4.4:

  1. Shall we merge "Custom kernels" to "4.1.2 WKV Operator"?
  2. Shall we remove/merge "FFN with R gate" since it's now in "4.1.3 Output Gating"?
  3. I'm curious whether using the abbreviation "init" in "small init embedding" instead of spelling it out is intentional.
  4. It seems both "Small init embedding" and "Custom initialization" are talking about parameter initialization, except that smallinit requires some architectural design to cooperate with it. If the former two paragraphs are merged to somewhere else, shall we turn the whole section into sth like "Model Initialization"?
ripe tangle
#

Hey is this paper still taking helpers?

steady ether
tropic minnow
# fickle hare Some comments on 4.4: 1. Shall we merge "Custom kernels" to "4.1.2 WKV Operator"...
  1. Kind of a branding name that has made its way. could rename it for the section title but i'd like to refer to it as SmallInitEmb or SmallInitEmbed throughout the paper for historical reasons (https://github.com/BlinkDL/SmallInitEmb) and bc its a shorter name.

#
  1. yes, but in a sense they're quite orthogonal, as smallInitEmb could be applicable to every transformer and was specifically tested (see experiment) whereas the rest of layers are more specific to RWKV and we dont have as hard as a justification for them, simply trial and error during RWKV evolution
obsidian quest
young sparrow
#

@obsidian quest These are the "good" points

#

Is that what you want?

obsidian quest
young sparrow
#

Lowest loss values

obsidian quest
#

will be great if we can mark L-D-T for each datapoint

young sparrow
#

On the plot? That'd be very hard to read

#

I can send you this as a CSV

obsidian quest
#

make a very large graph 🙂

#

seems we need T64 experiments

young sparrow
#

Here's the CSV with all the points, the color is red if it's on the bottom line I identified

#

Currently sorted lowest to highest loss

#

Tokens are in billions, params in millions

#

compute in units of 10^15 FLOP

#

Oh, here's the points colored red on the scatterplot

#

Oh that's without a log on the y-axis

#

but w/e

#

Gets the point across

#

(note the slope and intercept numbers are different now because these are in log base e while before I was converting to log base 10 since that's what the original work was in.)

obsidian quest
obsidian quest
#

need to smoothen the loss curve before using it @young sparrow

young sparrow
obsidian quest
#

when you download the raw loss curve from wandb, it will be extremely noisy

obsidian quest
young sparrow
#

If I sort all checkpoints from that run by loss I do see that value (and even lower!)

obsidian quest
#

can you plot the loss curve of this run

young sparrow
#

Why is the loss so noisy

obsidian quest
#

because this is the raw loss of each batch

young sparrow
obsidian quest
#

the best method will be to compute a curve fit

young sparrow
#

Here it is on a log-log plot

young sparrow
obsidian quest
#

i am using tiny bsz

#

bsz = 128samples x 1024tokens

young sparrow
#

I'll try subsetting to one in every 10 datapoints then

#

EMA isn't helping, neither is subsampling
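
For reference, the two smoothing attempts mentioned here can be sketched as below. The loss values are made up and `alpha` is an arbitrary choice; nothing here is taken from the actual analysis code.

```python
def ema(values, alpha=0.1):
    """Exponential moving average; smaller alpha means heavier smoothing."""
    smoothed, running = [], None
    for v in values:
        # The first point seeds the average; later points are blended in.
        running = v if running is None else alpha * v + (1 - alpha) * running
        smoothed.append(running)
    return smoothed

def subsample(values, every=10):
    """Keep one point in every `every`. No averaging happens, so the noise
    on each kept point is the same as in the raw curve."""
    return values[::every]

raw = [3.0, 2.0, 4.0, 2.5, 3.5, 2.0, 3.0, 2.2, 2.8, 2.1]  # made-up noisy losses
print(ema(raw, alpha=0.3))
print(subsample(raw, every=5))
```

Subsampling only thins the curve, which may be one reason it did not help with per-batch noise.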

obsidian quest
young sparrow
#

A linear fit on the log log plot doesn't work

#

What else would you like me to try

obsidian quest
#

a linear fit of the last 30 data points

young sparrow
#

Line fitted to the last 30 points

#

Everything except the first 50

#

Yeah this simply isn't working

#

Here I tried fitting the line to 30 points near the end of training and then projecting out the next 100

obsidian quest
young sparrow
#

So you want me to fit this line, project it out to the full training, and use that as my loss instead of the observed loss?

#

And re-do the scaling laws experiments?

obsidian quest
#

yeah

young sparrow
#

@obsidian quest

#

hmm that's kinda misleading as the y axis has changed

#

Hmmm. This looks suspicious

#

Variance going up seems like a bad sign

#

(the outlier is from a run that didn't restart, I had been removing it before)

obsidian quest
#

pls send me the L-D-T csv

young sparrow
#

With the predicted values?

#

Or the real ones

obsidian quest
obsidian quest
#

now doublechecking everything

young sparrow
#

It looks like you just rotated my plot and played with the variance lol.

obsidian quest
young sparrow
#

I’m out right now but can check it out in a couple hours

obsidian quest
#

wandb default = only fetch 500 datapts

young sparrow
#

I tried to fiddle with that config but it seemed like it wasn’t doing anything

#

😦

#

So I gave up and assumed it didn’t work the way I thought

obsidian quest
#

works for me 🙂

#

some runs are very short because they are killed multiple times

young sparrow
#

Oh I was doing it inside the API call

#

Whoops

young sparrow
#

How different is the actual vs predicted numbers

obsidian quest
#

just more noises in "actual"
note one of the runs lasted longer than T which was before i added exit_after_T to training code

#

so now i am predicting the loss @ T instead of x[-1]
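
A minimal sketch of "predict the loss at T" as discussed above, assuming the line is fitted to log(loss) against log(tokens) over the last 30 logged points. Whether the real fit was done in log space is not stated, and `fit_line` and `predict_loss_at` are hypothetical helper names.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def predict_loss_at(tokens, losses, target_tokens, last_n=30):
    """Fit log(loss) vs log(tokens) on the last `last_n` points and
    evaluate the fitted line at `target_tokens`."""
    xs = [math.log(t) for t in tokens[-last_n:]]
    ys = [math.log(l) for l in losses[-last_n:]]
    slope, intercept = fit_line(xs, ys)
    return math.exp(slope * math.log(target_tokens) + intercept)

# Synthetic run that follows an exact power law and stops short of T = 128.
tokens = [float(i) for i in range(1, 101)]
losses = [5.0 * t ** -0.1 for t in tokens]
pred_loss = predict_loss_at(tokens, losses, target_tokens=128.0)
print(pred_loss)
```

Evaluating at T rather than at the last logged point puts runs that stopped slightly early or ran slightly long on the same footing.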

young sparrow
#

Interesting. I went with x[-1] because some runs made it within a rounding error of T but not actually T. I had assumed this was because it wasn’t evenly divisible by the batch size, but I guess it was the sampling

ripe tangle
#

@obsidian quest Hey is this paper still taking helpers?

steady ether
#

@fickle hare I've cited the 5 fine-tuning datasets that I know we used to the ethics statement. Could you double-check to see if I missed anything?

fickle hare
young sparrow
#

Looking a lot better once @obsidian quest showed me how to fix the data lol

#

(blue points are used for the regression line)

#

Note that both axes have a log on them

#

This gives an exponent of -0.0747

obsidian quest
young sparrow
#

I am using your code

#
  • data
#

I'm just picking up the analysis where you left off

#

These plots worry me though

#

The empirically low loss point with a compute value between 12 and 13 is way off of the line for params and tokens too

obsidian quest
#

24-2048-1.0 is missing and you can ignore it

#

for some reason, your chart is different from mine #1103039376184852622 message

#

the results basically tell us that we should train larger models for optimal T=32

young sparrow
#

No mine is the same, I'm just taking the log of the raw data instead of putting it on a log axis. Here's a log axis

#

(it's slightly distorted due to np.log calling log base e)

#

Here is everything in log base 10

#

Oh there's bunching at 0 due to loss of precision (units of billions and then taking a log). Lemme fix that

fickle hare
steady ether
#

Updated. This also made me realize that we didn't mention the multilingual capabilities of RWKV.

obsidian quest
young sparrow
obsidian quest
#

loss is extremely noisy

young sparrow
#

I know, but I don't think that fitting a linear model to it is something one should rely on fundamentally.

obsidian quest
#

yet it is still a vast improvement

young sparrow
#

In what

obsidian quest
#

for example, your red datapoint is noise

young sparrow
#

@obsidian quest Did you launch more runs?

young sparrow
#

The biggest problem is data scarcity. We can hardly call something Pareto optimal if there are no other equi-compute points

obsidian quest
#

yeah could you find blue points using pred_loss so that i can use the info to launch more runs on pareto front 🙂
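
One way to pick the points on the Pareto front, sketched under the assumption that each run is summarized as a (compute, pred_loss) pair. The data below is invented and `pareto_front` is a hypothetical helper, not the code used for the plots.

```python
def pareto_front(points):
    """Lower-left staircase of (compute, loss) points: a point is kept only
    if its loss is strictly lower than that of every point with
    less-or-equal compute."""
    front = []
    best_loss = float("inf")
    # Sort by compute (ties broken by loss), then sweep left to right.
    for compute, loss in sorted(points):
        if loss < best_loss:
            front.append((compute, loss))
            best_loss = loss
    return front

# Invented (compute, pred_loss) pairs for illustration.
runs = [(1, 3.0), (2, 2.5), (2, 2.8), (3, 2.6), (4, 2.0)]
print(pareto_front(runs))
```

With sparse data the front can end up holding a single run per compute level, which is the data-scarcity concern raised above.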

young sparrow
#

Kk

spiral minnow
#

I love the addition of ' to the variables used in channel-mixing!

#

I'm not sure who has access to Figure 1, but I think we should update the variable names in mixing: R', V', K'

#

Happy to do it if somebody can give me the file

young sparrow
serene badge
#

One small comment on Figure 3, we should add legends to the x and y axes. Not sure who’s the author. I can help to update the figure. Also, do we need to add error bars for the accuracy scores?

obsidian quest
#

@young sparrow use these pred_loss data for the most reasonable fit
ignore L6 and T1 results because they are too different from usual runs

young sparrow
#

How do the parameter & dataset curves look?

#

Any less cursed?

tropic minnow
serene badge
#

Cool, I’ll handle it.

serene badge
steady ether
serene badge
serene badge
#

I've updated Figure 3. Added legends and changed to PDF format.

last mauve
#

Ok it's time to buckle down for EMNLP. I'll be doing regular check-ins like we did for arxiv. Here's what currently needs done:
1. The ethics statement (section 11) needs shortened. No longer than a half page. Nevermind we have the space.
2. @young sparrow and @obsidian quest -- What is the status on your scaling laws work? I assume that'll need to be a new figure/paragraph once finished, or will these just replace the current Figure 5 scaling laws plots?
3. We're currently at about 8.5 pages on an 8-page limit. Should we move section 4.5 Additional Optimizations to an appendix? Nevermind we have the space.
4. Figures 4-6 have strange placement, there's some space at the start of Section 7, and Figure 5 is out of order. These figures should instead be split across pages 6 and 7.
5. Sections 8 (Future Work) and 9 (Conclusions) are very long. We should cut or re-word so that a few lines are reduced. Nevermind we have the space.
6. In Figure 6, we should remove the cuda_ prefixes from each legend entry.
7. ~~Result figure captions should be descriptive enough to be self-contained (i.e. easily screenshotted). Figures 3-6 should have their captions updated, but don't make them longer than 2 lines.~~

#

I'm submitting a draft to EMNLP today. Here are the deadlines:
Abstract Deadline: June 16 (Will be submitted today)
Paper Deadline: June 23

last mauve
#

Core author team -- Feel free to add work items to my above list.

spiral minnow
#

Also, I re-wrote the future work into paragraphs rather than bullet points, saving 3 lines

last mauve
spiral minnow
#

👍 I don't think they put a space limit on ethics or limitations sections though

last mauve
#

@everyone -- If you're an author, I need your email for the EMNLP abstract submission if you haven't sent it to me already.

last mauve
spiral minnow
ancient cosmos
karmic tree
#

For me future work has only negative aspects: (1) another valid title for all points under it is "things we didn't do"; (2) it's very rare that things mentioned here are actually done, so they remain as evidence of promises authors made but didn't follow up on. So I always prefer to keep sections like that completely out - big obvious omissions can be mentioned in Limitations

last mauve
#

This sort of format is new to me so you'll have to bear with me.

spiral minnow
spiral minnow
young sparrow
obsidian quest
#

@young sparrow the plot is even better if we only consider non-embedding params

young sparrow
obsidian quest
#

after (sry buggy. see below for update) vs before

young sparrow
#

It looks really good

last mauve
obsidian quest
#

i am running L32-D2560-T16/32/64 (T16 done)

young sparrow
#

I can confirm you didn’t ping everyone

last mauve
tropic minnow
obsidian quest
#

corrected. good fit even for L6 and T1

quaint ingot
#

I have a question. If I understand the paper correctly (and maybe i don't), you have an explicit bias toward more recent tokens, wouldn't that degrade the result for some model tasks that are not necessarily language related? that kind of bias isn't present in transformers.

obsidian quest
tropic minnow
quaint ingot
obsidian quest
outer vine
#

maybe we could use this space for emnlp submission?

tropic minnow
#

fig 6 updated to remove "cuda"🙂

young sparrow
# outer vine

It's hard to see why this is happening. The spacing seems unchanged when I remove the author block

tropic minnow
# outer vine

this wont survive in the camera-ready version. but i dont know if that is allowed

tropic minnow
young sparrow
#

I removed that and it didn't fix it either

tropic minnow
#

ah sorry \maketitle is the one responsible

young sparrow
#

That doesn't have much explanatory power. That's the command that tells LaTeX to display the title block, but could mean anything is to blame.

steady ether
#

I noticed that in 4.6 we referred to the implementation as RWKV-LM, but later, we go straight into RWKV-4. (it might not be clear to some readers). Perhaps we could change RWKV-LM to RWKV-4, or smoothen the transition?

Also, it may be better to change RWKV to RWKV-4 under Appendix G Inference results to be consistent with the other figures.

last mauve
karmic tree
#

A bit of negative vspace around the titlebox is generally OK for ACL subs

young sparrow
#

My vote is for RWKV or RWKV-LM

karmic tree
#

My vote is for RWKV, there isn't a non-LM RWKV and chars take page space

obsidian quest
#

yeah just RWKV

young sparrow
#

Something appears to be overriding our ability to move the top of the text at all. Even using vspace won't move it upwards

tender karma
#

For example can be applied as BiRWKV for sequence labelling, it works amazingly well

karmic tree
young sparrow
#

What is the "language model" that RWKV is "connected" to?

young sparrow
mortal latch
tender karma
# young sparrow ... do we have numbers on this? Can we put it in the paper?

I don’t think is worth it. My focus is (well, was) dependency parsing so I just experimented with a variant of https://aclanthology.org/Q16-1023/ replacing the lstm with rwkv. However, it is not so cool anymore this task and it would be not so effective for this paper. Maybe for a follow up subject to show out of the box improvements in old fashioned tasks

young sparrow
#

I know @obsidian quest has talked about doing something like ViT using RWKV too

last mauve
#

oops

tender karma
#

Got it and agree. I’m running an “ELMo” variant with rwkv just for fun. Same dataset as the original so benchmark is possible.

outer vine
young sparrow
outer vine
#

uncomment \setlength\titlebox{6.8cm} and decrease the number will do ( I tried), and the current workaround with \begin{comment} is also viable. But i am not so sure if these two methods would violate the requirement of formatting

young sparrow
outer vine
#

ok

young sparrow
#

Does anyone know if the RWKV implementation in transformers is reliable yet

fickle hare
#

bf16 inference and training should be all good now, not sure about fp16 inference

#

yet there are reports on the cuda kernel not successfully compiled... not really reliable yet, use carefully

young sparrow
#

It didn’t launch out of the box :/

#

I have runs of BoolQ and MMLU on Pythia / OPT / BLOOM if anyone wants to run the comparison in RWKV

last mauve
#

I've submitted a version along with the abstract for EMNLP.

If you did not receive an email from OpenReview: This means you haven't both:
(1) Created an OpenReview account
(2) Sent either me or this channel the email associated with that OpenReview account

If you didn't receive an email, please do these steps by tomorrow. Once we have more authors on the openreview, we can re-order them alphabetically.

last mauve
#

@young sparrow and @obsidian quest -- Your scaling laws plots are the last outstanding results. What's the status? What needs done and who can help?

young sparrow
obsidian quest
#

i sent an excel file with all datapoints

last mauve
#

I'm actually pretty happy with the writeup and overall storyline. If anyone knows academics who can give us good feedback, it would be good to receive that.

Lead authors -- Do a pass now and update anything you don't like. If you need help updating, message work items here.

young sparrow
#

[we’re talking in DMs]

karmic tree
young sparrow
#

One complaint I’ve heard is that people don’t think that 6 evaluations are enough anymore. If we can run MMLU, BoolQ, Natural Questions, HellaSwag, TriviaQA, and RACE that would give us a lot more comprehensive of a picture, and most of the plots from the LLaMA paper (missing math stuff we can’t run right now and code evaluations)

#

(We’ve also gotten the same feedback about Pythia)

last mauve
#

Just added another batch of authors. Some that I'm still missing:

  • ~~Michael Chung ~~
  • Xuzheng He
  • Przemyslaw Kazienko
  • Jiaming Kong
  • Bartlomiej Koptyra
  • Hayden Lau
  • Atsushi Saito
  • Bolun Wang
  • Ruichong Zhang
  • Qihang Zhao
  • Peng Zhou
  • Haowen Hou
last mauve
karmic tree
# last mauve idk what this means

PPL is an abbreviation of Perplexity, Sasha is lead scientist at Hugging Face and a Harvard prof, usually gives reasonably strong signal

young sparrow
#

There isn’t such a thing as a ppl plot/figure, and saying we should include one doesn’t mean anything.

karmic tree
#

I agree. Let me fish out the tweet

obsidian quest
#

don't we have 13 evaluations : LAMBADA PIQA StoryCloze16 Hellaswag WinoGrande arc_challenge arc_easy headQA openbookQA sciq triviaQA ReCoRD COPA

young sparrow
obsidian quest
#

avoid boolq which is very noisy

last mauve
#

If it's just the number of decent evals that matters then we're fine

young sparrow
#

I think that MMLU is probably important to include

last mauve
# karmic tree

This should be resolved by the new scaling plots I believe

young sparrow
#

But, if someone explains how to run the model through the eval harness (HF is still borked) I can take care of things

last mauve
#

Evals were already done before we started the arxiv. @obsidian quest or @tropic minnow -- Who ran these evals and how can Stella reproduce them?

tropic minnow
#

and the plots for the 6 tasks were done with this: #1103039376184852622 message, maybe @serene badge can comment more on any other mods

obsidian quest
young sparrow
obsidian quest
young sparrow
#

I’ll reproduce it in a bit and let you know

young sparrow
#

Using pretrained=RWKV/rwkv-4-169m-pile raises

  File "/home/mchorse/ollmer/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/mchorse/ollmer/conda/lib/python3.10/site-packages/transformers/models/rwkv/modeling_rwkv.py", line 384, in forward
    attention, state = self.attention(self.ln1(hidden), state=state, use_cache=use_cache)
  File "/home/mchorse/ollmer/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/mchorse/ollmer/conda/lib/python3.10/site-packages/transformers/models/rwkv/modeling_rwkv.py", line 308, in forward
    receptance, key, value, state = self.extract_key_value(hidden, state=state)
  File "/home/mchorse/ollmer/conda/lib/python3.10/site-packages/transformers/models/rwkv/modeling_rwkv.py", line 300, in extract_key_value
    key = self.key(key)
  File "/home/mchorse/ollmer/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/mchorse/ollmer/conda/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`
#

Meanwhile BlinkDL/rwkv-4-pile-169m appears to be misconfigured as it lacks a config.json

#

okay now it's working and I have no idea what changed

#

...

young sparrow
#

@obsidian quest Do you have the smaller models with all these benchmarks too? Or just the biggest RWKV

#

This code doesn't run because it relies on a file called rwkv.csv, can you share that file

obsidian quest
young sparrow
#

Oh I missed that column

#

and that's the .csv I was looking for, isn't it

#

Okay, so we actually have pretty comprehensive evals they're just not fully presented

young sparrow
#

@obsidian quest the .csv you shared says that the context length of the largest model is 8192 and that the 3B model is 4k. Is that correct? What's the context length for the models that don't have a listed context length?

young sparrow
#

Why does the context length change

obsidian quest
#

all are trained with ctx1024 for 1 epoch, and then finetuned to 2k => 4k => 8k

young sparrow
#

Right, but why are we comparing evaluations on models of different context lengths

#

Why isn't it consistent

obsidian quest
#

longer ctxlen => slightly worse zeroshot if everything being equal
because these tasks only care abt short ctxlen

#

it's just that 4k and 8k models are trained longer, so 7B & 14B can gain some advantage from this

young sparrow
#

dude

obsidian quest
#

1.5B & 3B ctx4k are slightly worse than ctx1k for this

young sparrow
#

You can't do this in a paper

obsidian quest
#

you can list all ctx1k numbers

young sparrow
#

Do you have context 1k numbers for all the models? The csv you sent doesn't for 3B or 14B

obsidian quest
#

you can list all ctx1k numbers

RWKV-4    3-ctx1k    5.24     57.52%    63.94%    73.72%    70.28%    59.63%    59.43%    31.83%    64.27%    28.74%    37.60%    85.70%    11.07%    80.56%    81.00%
R14 ctx1k    14.2    3.81     63.54%    71.05%    77.42%    75.57%    70.24%    62.98%    38.31%    70.71%    32.28%    40.60%    90.10%    24.06%    85.73%    87.00%
serene badge
# tropic minnow and the plots for the 6 tasks were done with this: https://discord.com/channels/...

The 13 benchmark results of RWKV-4, Pythia, GPT-J are included in RWKV.csv.
The 6 benchmark results ("lambada", "piqa", "winogrande", "arc_challenge", "arc_easy", "sciq") of OPT, BLOOM come from pythia/result directory of pythia repo.
Seems the json files of OPT, BLOOM do not contain the other 7 benchmarks ("triviaqa","storycloze16","hellaswag","headqa","openbookQA","record","copa").
I think that's why in the script from @rich raptor , we only plot figures for 6 benchmarks.

young sparrow
last mauve
tough crane
gusty condor
#

I just created OpenReview account too. Been so busy with my final exams

obsidian quest
#

L32 D2560 T64 pred_loss 2.047399

fickle hare
#

@obsidian quest I'm trying to add the exact hyperparameters to the Appendix. In #1083107245971226685 message you presented 6 column groups, are they in the order of 14B/7B/.../169M? In each column group, is the last column tokens trained? Also, it seems your adjustment on batch size during training is not directly visible in this table?

obsidian quest
fickle hare
#

(another point influencing the reproducibility)

#

I can try to recover it though. All training is done with ctxlen=1024 right?

fickle hare
# obsidian quest bsz = 128samples x 1024tokens

I'm a bit confused, is the batch size = 128samples for each GPU? Cause in the LR history file it shows 8043 steps for 332 billion tokens, which counts to ~40000 samples of 1024 tokens each step.

#

Also through analyzing the Gtokens I don't observe any batch size change. It goes smoothly all the way down.

#

(all 315 * 128 * 1024, guess you are using 315 GPUs or nodes lol)

obsidian quest
#

i use 128 or 256 as total bsz. or you may say 128x1024 or 256x1024

fickle hare
#

I see

#

uh it's the epoch_steps in your code

obsidian quest
#

real steps per miniepoch = 40320 / bsz
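
The arithmetic in this exchange can be checked directly. The numbers are the ones quoted in the messages above; the variable and function names are made up for this sketch.

```python
total_tokens = 332e9   # tokens in the run, as quoted above
lr_steps = 8043        # steps listed in the LR history file
ctx_len = 1024         # tokens per sample

# Samples consumed per logged step: close to, but not exactly, 40320.
samples_per_lr_step = total_tokens / lr_steps / ctx_len
print(round(samples_per_lr_step))

samples_per_miniepoch = 315 * 128  # 40320, the figure in the formula above

def real_steps_per_miniepoch(bsz: int) -> float:
    """Optimizer steps in one mini-epoch for a total batch size of `bsz` samples."""
    return samples_per_miniepoch / bsz

for bsz in (128, 256):
    print(bsz, real_steps_per_miniepoch(bsz), bsz * ctx_len)
```

So each logged step corresponds to one mini-epoch of about 40320 samples, and the number of real optimizer steps inside it depends on whether the total batch size was 128 or 256.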

fickle hare
#

so we won't be able to report the accurate batch size then i guess?

obsidian quest
#

but there are 2068 runs

#

because all runs are killed multiple times due to server issues

#

apply filter for nlayer & ndim & ctx1024 & datafile = BlinkDL/pile/pile_20B, and check the run around the release date on HF

fickle hare
#

let me put the numbers we have in hand into the paper first

#

if i still have time later but not too late, i'll try dig it out

fickle hare
#

Added Appendix Hyperparameter.

#

related cross-reference is also brought back (previously commented out)

tropic minnow
#

@sullen horizon hows LRA going?

young sparrow
obsidian quest
#

i am using these 12 points

young sparrow
#

Why not 6 512 1.0

#

Why those numbers specifically? Even using them, I'm unable to reproduce your fit and looking at the plot it's not at all clear why those were chosen

young sparrow
#

Is that 2*V*D + 13*D*D*L

obsidian quest
#

simply 13*D*D*L

young sparrow
#

(rerunning, vaguely embarrassed I missed that)

#

So how did you pick these specific points to include in your fit

obsidian quest
#

ok pls use this. the idea is to pick larger models as T grows

#

for example, the optimal T for L12 D768 is likely around 3

obsidian quest
young sparrow
#

Okay, but why not this one?

#

This one actually shows all the compute-optimal values

#

I'm worried about the excessive reliance on heuristics

fickle hare
#

shouldn't this be a simple envelope?

obsidian quest
#

the envelope is simple in my table

#

the second one is non-optimal here

sullen horizon
karmic tree
obsidian quest
young sparrow
obsidian quest
young sparrow
obsidian quest
#

i mean we havent tested them

young sparrow
#

So no, you don’t know that they perform better

#

It’s really important on a scientific level to not make things up like that. If you want to run them great, let’s add them. But you can’t say “oh I know how this experiment we haven’t done will turn out”

tropic minnow
tropic minnow
# obsidian quest like L9-D768, L18-D1024, etc.

yea i think if we don't have a better datapoint in our data, then that point is the optimal we have been able to get so far. imo the methods should be as good as possible, even if they don't account for corrections that we might have intuition on but are unproved so far.

young sparrow
#

This is the best we can get with the current data. In the last plot, we see the slope of the line corresponding to Chinchilla scaling. I do believe that this line is likely much closer to the true value, but we don't have the sampling density to really tell.

#

(click on images to see the equation for the trend line and r^2)

young sparrow
#

There's a bit of missing data still running (will be done by the end of the day today) but I otherwise have the missing plots as well

young sparrow
last mauve
#

ah ok

young sparrow
#

aka "I forgot to save most of the BoolQ results"

karmic tree
young sparrow
young sparrow
#

@obsidian quest I'm noticing that some of the RWKV evaluations are using acc_norm and others are using acc. Do you have all the results for acc?

last mauve
young sparrow
outer vine
#

hello, i think i may have found a little bug in RWKV initialization. In the paper, we said that we initialize all W_{r}, W_{k}, W_{v} to be zeros, but that is not the case in RWKV-4

#

this line of code is never used in the Init function (which i believe is to control the zero initialization)

#

by adding a print debug line here, i also find that the matrix is not initialized to 0

#

does anyone also notice this? initializing all the parameters to be zeros seems a little bit weird

obsidian quest
#

pls fix that part

outer vine
#

In RWKV-v4, this line of code is never called. So, i believe there is no parameter matrix initialized to all zeros

#

I am testing v4neo

#

but i think our paper should correspond to v4?

outer vine
#

sorry, but i can't run v4neo because of this AttributeError: partially initialized module 'charset_normalizer' has no attribute 'md__mypyc'

#

but it looks good

#

so please fix this in the appendix. Thanks

outer vine
fickle hare
#

(I guess it's worth mentioning somewhere in the paper, or just clean up the obsoleted ones in the code base)

#

git always keeps history, so leaving them there unused is unnecessary

young sparrow
#

@fickle hare We should absolutely make a cleaned up codebase that only has the necessary components. The current codebase is pretty unusable to a new person.

fickle hare
#

I've been working on a new Lightning 2.0-based trainer using the new CLI (the most recent improvements are by @void quartz). It's pretty usable now for finetuning, but data preprocessing is still in a preliminary state, and model initialization is missing. Just too busy these days.

young sparrow
#

@obsidian quest which of the models are the ones hosted on the RWKV HF page? How many tokens were they trained on, and did they do sequence length extension?

outer vine
fickle hare
#

The rwkv-pile series are all trained on the 332G Pile. Checkpoints with ctxlen>1024 in the file name come from sequence extension.

young sparrow
young sparrow
#

Blink told me that the official versions were going to be the ones on the RWKV org page

fickle hare
#

i see. then i have no clue which checkpoint did they convert to HF format 😭

void quartz
# outer vine glad to see someone is using Lightning2.0 and its CLI rather than argparser or h...

you can find it here - if you want to read through it as an alternative - the existing v4neo has lots of "experiment flags" and can be hard to read : https://github.com/Blealtan/RWKV-LM-LoRA/tree/dev-infctx

( I am still helping bugfix and test it by using it extensively in my current experiments - will be helping adding the missing model init / preprocessing - cause i need it too 😉 )

fickle hare
#

at some point it should go to a distinct repo and migrate to HF model, after state chained backward is supported by transformers.rwkv

young sparrow
#

The "attention free models" section of the Related Work section was getting a little long so I split out the RNNs into a third subsection.

#

I also added some details about Hyena, as that's simultaneous work where they train a single-digit billion parameter state space model (and compare to us!)

outer vine
#

for anyone who is interested in a clean code base of RWKV for comparison with GPT-series

young sparrow
#

Amazing!

last mauve
#

@young sparrow -- Are you able to get your scaling + eval plots in today?

last mauve
#

Paper's due friday and I want ppl to be able to update writing accordingly in time

spiral minnow
#

Figure 3 (0-shot performance on LM eval): Any intuitions on why Pythia performance drops significantly for the point with highest compute?

young sparrow
young sparrow
#

@obsidian quest The paper currently says:

The number of parameters for each model is computed using the formula: $\text{\# parameters} = 2VD + 13D^2L + D(11L+4)$ where $V$ = 50277 is the vocabulary size, $D$ represents the Model Dimension and $L$ corresponds to the number of layers. FLOPs is for a forward pass for one token. It was calculated as $2(2VD + 13D^2L)$, which is twice (add and multiply) the number of parameters in linear layers. The backwards pass FLOPs can be approximated as twice that of the forward pass, giving a total of $6(2VD + 13D^2L)$ FLOP per token. Notably, this matches the standard formula for FLOP calculations in transformers \citet{kaplan2020scaling} $$\text{FLOP} = 6\cdot [\text{\# tokens}]\cdot [\text{\# parameters}].$$

Can you confirm that this is correct
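As a quick sanity check of the quoted formulas, here is a minimal sketch that evaluates them directly. The function names are made up for illustration, and the D = 768, L = 12 configuration is an assumed example, not something stated in this thread.

```python
# Sketch: evaluate the parameter and FLOP formulas quoted from the paper draft.
V = 50277  # vocabulary size, as given in the quoted text


def rwkv_param_count(d_model: int, n_layers: int, vocab: int = V) -> int:
    """# parameters = 2VD + 13 D^2 L + D(11L + 4)."""
    return (2 * vocab * d_model
            + 13 * d_model ** 2 * n_layers
            + d_model * (11 * n_layers + 4))


def rwkv_forward_flops_per_token(d_model: int, n_layers: int, vocab: int = V) -> int:
    """Forward FLOPs per token = 2 * (parameters in linear layers)."""
    return 2 * (2 * vocab * d_model + 13 * d_model ** 2 * n_layers)


def rwkv_train_flops_per_token(d_model: int, n_layers: int, vocab: int = V) -> int:
    """Forward plus backward, with backward approximated as 2x forward."""
    return 3 * rwkv_forward_flops_per_token(d_model, n_layers, vocab)


if __name__ == "__main__":
    d, n = 768, 12  # assumed example configuration
    print(rwkv_param_count(d, n))
    print(rwkv_train_flops_per_token(d, n))
```

Note that the FLOP formula only counts the linear-layer terms, so it matches "6 x parameters" up to the small $D(11L+4)$ term.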

young sparrow
#

@tough crane were you the person who put together the evaluation tables in the appendix?

tough crane
young sparrow
#

(This is a historical artifact of Blink starting this work before Pythia existed)

#

If you post the code that generates the tables I can update it pretty easily

tough crane
young sparrow
tough crane
young sparrow
tough crane
young sparrow
#

@last mauve @tropic minnow The scaling laws and evaluation sections are largely done. I'm tweaking some of the wording of the context length extension experiments because it's not true that quadratic transformers can't scale to a context length of 8k, but don't currently anticipate major changes to the sections. We were far over the page limit, so I commented out Section 2: Related Work and it seems to fit pretty well now. If people are okay with that not being in the main body, I can move it to the appendix.

#

The current version has the plots in the main body because I find plots much more accessible than tables, but we could flip that and put the tables in the main body (LLAMA does this, for example, but most don't). This would require substantially less space.

last mauve
young sparrow
#

The spacing is janky but I've made that change

#

I generally dislike reporting mean accuracy across tasks, but that's something we can do here

#

I have to run, but I can make the new table tonight or tomorrow morning

#

The more I think about the sequence length stuff the more suspicious of it I am though

#

This shows loss on the Pile batched by the sequence length of the sequences that we are evaluating on.

#

The claim this plot is making is that we perform better at predicting long sequences than short ones. Maybe that’s to be expected (though I don’t think so) but the effect size worries me. That’s a huge drop in loss!

#

The left half of the image is basically meaningless because sequences of a handful of tokens are often noise

#

But the idea that we see a real loss decrease when subsetting our evals to 8k sequences instead of 1k ones seems suspicious to me

spring fulcrum
#

the ALiBi paper has an appendix on “the early token curse” as a cause for ppl decreasing as seqlen increases: https://arxiv.org/abs/2108.12409
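A toy illustration of that effect, as a sketch only: if per-token loss is high for the first few positions (little context) and lower later, then the mean loss over a longer sequence is lower even though the model is no better at any given position. The `toy_loss` curve below is entirely hypothetical and is not fitted to any RWKV or Pile numbers.

```python
# Sketch: why averaging over longer sequences can lower the reported loss.
from typing import Callable


def toy_loss(position: int) -> float:
    """Hypothetical per-token loss that shrinks as more context is available."""
    return 2.0 + 4.0 / (position + 1)


def mean_loss(per_position_loss: Callable[[int], float], length: int) -> float:
    """Average loss over one sequence of the given length."""
    return sum(per_position_loss(t) for t in range(length)) / length


if __name__ == "__main__":
    for length in (128, 1024, 8192):
        print(length, round(mean_loss(toy_loss, length), 4))
```

Comparing loss at a fixed position (or over a fixed window of late positions) is one way to separate this averaging effect from a real long-context gain.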

obsidian quest
tough crane
#

Should we run more experiments at much longer lengths, from 2^13 up to the lengths used by comparable models like Hyena?

tough crane
#

ALiBi's experiments only test up to 6k

young sparrow
fickle hare
#

at least cut the half below 128 tokens which is not really meaningful?

young sparrow
last mauve
#

Ok so we're near the finish line here

young sparrow
#

Here's what the average across all 12 NLP tasks looks like btw

young sparrow
#

@obsidian quest @last mauve @tropic minnow I've done a lot of fiddling with the paper with the primary goal of making sure that all the results in the appendix are actually referenced in the main text, while not going over the page limit. I'm stopping now before I go crazy fiddling over details.

#

(Feel free to disregard if you don't want to update the submitted paper.)

last mauve
#

I'm submitting a final version now. If anyone has any last-minute edits they want reflected before the deadline tonight, ping me here.

obsidian quest
#

hi should remove this arrow

tropic minnow
# obsidian quest hi should remove this arrow

Yes probably. It doesn't represent token shift appropriately… will remove it and update the LaTeX figure, but the EMNLP submission is already done… 🙃 so it'll have to go in the updated version

outer vine
#

hi, may i ask what is the detailed setting in benchmarking rwkv inference in the Figure 7 of the paper? From my side, i couldn't get the same results.

#

and this is the result:

#

on one A100, float32, no compile, batch=1, generate 1024 new tokens

#

BTW, Figure 7 in the paper is never referenced or explained

young sparrow
outer vine
#

emnlp version

tropic minnow
tropic minnow
#

probably it would be better to release scripts in the open for people to reproduce. @snow zealot are you ok?

outer vine
#

thanks so much! I would check this

outer vine
#

i notice that rwkv is tested with the original implementation rather than the HF implementation

#

are there any problems with the HF implementation yet?

#

I simply use this:

young sparrow
outer vine
#

yeah, i tested models from this HF space. But when using the model.generate() method rather than forward() with the torch profiler, the gap is not as large as shown in the paper

tropic minnow
# snow zealot Is it ok for EMNLP?

i mean the RWKV codebase is public... and the preprint as well... so i think as long as we dont promote it it should be... but yea we can wait probably

snow zealot
outer vine
#

I believe it would be a more equitable comparison if we could also pass the KV cache to the Transformer while providing the state to the RWKV. This would ensure a fair assessment of both methods.

tropic minnow
outer vine
#

why would kv cache cause OOM sooner than full computation?

#
#

ok, for 80G a100, batchsize=1, would this be a big problem?

tropic minnow
#

hmm i see... okay we can try that? is it easy to setup in HF?

#

seems all GPTNeoXForCausalLM have the use_cache=True option we could use.

outer vine
#

yeah, all AutoModelForCausalLM in HF have use_cache option

#

but it is only useful when calling model.generate() method

tropic minnow
tropic minnow
outer vine
tropic minnow
outer vine
#

I believe this problem presents a certain level of complexity, as the real-time cost is determined by a combination of factors such as the algorithm (architecture) and hardware (V-RAM, GPU generation). There are numerous options for benchmarking the inference speed by combining these elements, such as using a GPU with small V-RAM, a GPU without tensor cores, or a GPU that does not support bf16, among others.

However, the most straightforward approach, in my opinion, would be the following:

start_time = time.time()  
new_tokens = GPT/RWKV.generate()  
end_time = time.time()  

Even with this method, there are still various possible variations. For instance, if we were to test on a GPU with limited V-RAM, a transformer-based model with kv cache might need to perform frequent exchanges between GPU and CPU memory, which could result in significant latency.
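The wall-clock approach above can be sketched as a small reusable harness. This is an assumption-laden sketch, not the script used for the paper's figure: `time_generation`, `generate`, and `sync` are names invented here, and the caller is expected to wrap the real `model.generate(...)` call in a zero-argument function.

```python
# Sketch: wall-clock timing of a generation call, with warmup and optional
# device synchronization so asynchronous GPU work is included in the timing.
import time
from typing import Callable, Optional


def time_generation(
    generate: Callable[[], object],
    sync: Optional[Callable[[], None]] = None,
    warmup: int = 1,
    repeats: int = 3,
) -> float:
    """Return the best wall-clock time (seconds) over `repeats` runs."""
    for _ in range(warmup):
        generate()  # warm caches / compilation before measuring
    if sync is not None:
        sync()
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        generate()
        if sync is not None:
            sync()  # wait for queued device work to finish
        timings.append(time.perf_counter() - start)
    return min(timings)


if __name__ == "__main__":
    # Stand-in workload; replace with a real generate call.
    print(time_generation(lambda: sum(range(100_000))))
```

With PyTorch on a GPU, passing `sync=torch.cuda.synchronize` avoids measuring only kernel launch time. Both models should be timed with the same prompt, the same number of new tokens, and their respective caches (KV cache for the Transformer, state for RWKV) enabled.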

#

I would like to kindly recommend that, for a model with favorable time and space complexity during inference, it would be beneficial to utilize a product-level GPU such as the K40 for comparisons with other Transformer-based models. It is worth noting that employing an A100 GPU for serving is not a common practice within the industry.

tropic minnow
outer vine
outer vine
#

Hi, @tropic minnow, did you finish the code? I implemented it on my side and the results are the opposite. This is on one A100 (80G) GPU.

#

i can't think of a reason not to use kv cache for a transformer model in inference

outer vine
#

so i would suggest using a product-level GPU / longer context / larger model size to show the superiority of an inference-friendly model like RWKV.

#

the current figure in the paper is misleading

tropic minnow
obsidian quest
outer vine
#

hi, @obsidian quest where is the buggy part of RWKV HF implementation? Maybe i can help fix it. The main point here is that if we use a large enough and fast enough GPU(A100) to benchmark inference speed, the Transformer is also linear. Check this:

obsidian quest
outer vine
#

cumulative time is what we show in the paper..

obsidian quest
#

with the correct implementation, rwkv will be like a const line around 10ms

outer vine
#

ok, i would try rwkv pip

outer vine
#

hi, I tried rwkv package and this is the result

#

but i am not so sure if this is a fair comparison

obsidian quest
#

rwkv pip package is using pytorch for almost everything, except the WKV operator

#

while HF transformers are using MHA operator instead

#

both WKV and MHA are CUDA operators, so i will say it's a fair comparison

outer vine
#

but rwkv pip uses torchscript, right? Actually, transformers could be fast enough with various optimization techniques (e.g. vllm)

#

and i have looked through the HF implementation and found no obvious bugs

#

one possible difference was: HF implementation doesn't use wkv kernel when doing inference

outer vine
#

oh, i find that HF implementation could be significantly boosted by using torch.compile

obsidian quest
#

can you try it for rwkv-pip too?

fickle hare
#

the pip version is using torch 1.x jit, might be less efficient than 2.0 compile

outer vine
#

I am using HF implementation with torch.compile and longer context length

#

this is consistent with the intuition that Transformers are only quadratic when # tokens is big enough

obsidian quest
#

transformers per-token speed = const factor + linear factor

#

accumulated time = linear factor + quadratic factor
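The two relations above can be written out as a toy cost model. This is only a sketch: the coefficients are hypothetical placeholders, not measurements, and the function names are invented here.

```python
# Sketch: per-token cost = const + linear * position (Transformer with KV cache),
# versus a constant per-token cost (RNN-style state).


def transformer_cumulative(n_tokens: int, const: float, linear: float) -> float:
    """Sum of (const + linear * t) for t = 0 .. n_tokens - 1."""
    return sum(const + linear * t for t in range(n_tokens))


def transformer_cumulative_closed_form(n_tokens: int, const: float, linear: float) -> float:
    """Linear term plus quadratic term: const*n + linear*n*(n-1)/2."""
    return const * n_tokens + linear * n_tokens * (n_tokens - 1) / 2


def rnn_cumulative(n_tokens: int, const: float) -> float:
    """Constant cost per token gives a purely linear cumulative time."""
    return const * n_tokens


if __name__ == "__main__":
    const, linear = 10.0, 0.001  # hypothetical ms and ms-per-position
    for n in (1_000, 10_000, 100_000):
        total = transformer_cumulative_closed_form(n, const, linear)
        quadratic_share = (linear * n * (n - 1) / 2) / total
        print(n, round(total), round(quadratic_share, 3))
```

When the linear coefficient is small relative to the constant, the quadratic share of the cumulative time stays small at short contexts, which is why the curve can look linear at 1k tokens and only bend at much longer lengths.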

obsidian quest
outer vine
outer vine
#

From the Longformer paper

obsidian quest
outer vine
#

yes, that is what i mean

#

the current one only evaluates on 1k context, where the Transformer and RWKV are both linear

#

hope this could be reflected on the next version of our paper

tropic minnow
young sparrow
#

I'm also hoping to have MMLU numbers in the next version (though have deprioritized this as we can't update the paper for a couple months still)

outer vine
outer vine
young sparrow
outer vine
#

hhhh, i didn't see it in call for paper. I must have missed something

young sparrow
#

You may not make a non-anonymized version of your paper available online to the general community (for example, via a preprint server) during the anonymity period. Versions of the paper include papers having essentially the same scientific content but possibly differing in minor details (including title and structure) and/or in length.

[...]

You may not update the non-anonymized version during the anonymity period, and we ask you not to advertise it on social media or take other actions that would further compromise double-blind reviewing during the anonymity period.

https://2023.emnlp.org/calls/main_conference_papers/#anonymity-period

outer vine
#

got it. thanks

obsidian quest
young sparrow
young sparrow
#

@here I've made a short survey that I would appreciate people taking a moment to fill out. The primary goal is to get a better understanding of who comprises the members of our community. It should just take a minute and will be very useful 🙏

https://forms.gle/eTEtjGK4U7CfKBWT6

obsidian quest
#

https://twitter.com/BlinkDL_AI/status/1677593798531223552 A tiny RWKV with 2.9M (!) params can solve 18239.715 * 9.728263 or 4.2379 * 564.778 - 1209.01 etc. with CoT, while being 100% RNN (L6-D192) 🤯

A tiny #RWKV with 2.9M (!) params can solve 18239.715 * 9.728263 or 4.2379 * 564.778 - 1209.01 etc. with CoT, while being 100% #RNN (L6-D192) 🤯 The trick: generate lots of data with reversed numbers (denoted by "f" here) to train the model 🚀 Try it now: https://t.co/l7CDb6Rirl

tender karma
young sparrow
#

@obsidian quest Let’s start keeping notes on adding languages to RWKV, in case you want to write another paper. It’ll make it easier to not have to go back and figure out what was done after the fact!

void quartz
#

Not sure what the procedure for paper feedback / corrections is - rwkv is cited here: https://arxiv.org/abs/2307.08621 - as a model without "training parallelization"

(hoping for someone here to know the process)

outer vine
#

the current implementation of RWKV training is indeed recurrent

#

but in theory, i believe it is also parallelizable

#

wkv_{t} actually doesn't depend on wkv_{t-1}
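To make that point concrete, here is a single-channel sketch assuming the RWKV-4 WKV definition from the paper (decay `w`, bonus `u`). The direct form computes each `wkv_t` from the keys and values alone, while the recurrent form carries a running numerator and denominator. The function names are invented, and this version is not numerically stabilized (real kernels track a running maximum exponent).

```python
# Sketch: direct (position-independent) vs. recurrent evaluation of WKV
# for one channel, using plain Python floats.
import math
from typing import List


def wkv_direct(k: List[float], v: List[float], w: float, u: float) -> List[float]:
    """Each output is a weighted average over positions 0..t, computed from scratch."""
    out = []
    for t in range(len(k)):
        num = math.exp(u + k[t]) * v[t]  # current token gets the bonus u
        den = math.exp(u + k[t])
        for i in range(t):
            weight = math.exp(-(t - 1 - i) * w + k[i])  # decayed past token
            num += weight * v[i]
            den += weight
        out.append(num / den)
    return out


def wkv_recurrent(k: List[float], v: List[float], w: float, u: float) -> List[float]:
    """Same outputs, carrying a decayed numerator/denominator as state."""
    a = b = 0.0
    decay = math.exp(-w)
    out = []
    for kt, vt in zip(k, v):
        bonus = math.exp(u + kt)
        out.append((a + bonus * vt) / (b + bonus))
        a = decay * a + math.exp(kt) * vt
        b = decay * b + math.exp(kt)
    return out
```

The direct form costs O(T^2) work per channel but has no sequential dependency across `t`; the recurrent form is O(T) but sequential. Which one counts as "parallelizable training" depends on the definition being used.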

#

this retentive model uses a bunch of tricks to train, while only referring to RWKV as a Transformer with time-mixing..

void quartz
#

ahh so if i understood you right, they are using the stricter definition of training parallelisation? So they ain't wrong - but in practice it is a meaningless distinction, because we can saturate our GPUs either way

outer vine
#

and i am not sure if they are using torch.cfloat, which may cause additional overhead

#

also curious, are there any linearized attention models being scaled up?

void quartz
#

i dun think they changed the RWKV code much - imo, cause there isn't a reason to do so

I guess it boils down to how you define parallelization. This is currently my understanding of how RWKV runs in parallel.

x axis, is tokens, y is the layers somewhat, orange is layer norm, purple is time mix, green is channel mix

#

like strictly speaking everything past the first layer norm does depend on the previous tiles - so if you define parallelization as being able to "compute independently" of other tokens then yes - we are "not parallelizable" in that regard i guess?

even though in practise RWKV is still able to rapidly ramp up, and saturate the GPU across the multiple layers

#

which fits my understanding of "training parallelization" where it is more of "can we split the training process of a single data sample into enough threads to saturate a GPU" haha

outer vine
void quartz
#

i assume it's at least partially implemented in the main repo, with pytorch / JIT / etc. Otherwise we would never be able to saturate the GPU

#

( might need to get blink to confirm / deny how it flows in the main repo )

outer vine
#

IMO, computing green box(channel mix) in parallel would be much faster..

#

i have read the source code in the main repo, it is computed sequentially layer-wise and time-wise

void quartz
obsidian quest
void quartz
#

so i guess next step is to ping the author? not sure if they listed the twitter social media account in the paper (probably not?)

obsidian quest
void quartz
#

Haha. I will gladly run some experiments if you let me know the changes

outer vine
#

their code would be released within one week as said in the github repo

outer vine
#

does anyone know much about complex in torch? wouldn't this cause huge latency compared with fp16 with tensorcore?

void quartz
#

For followup paper ideas, to the RWKV paper - would it be best to post it here, or another thread under publishing-help ?

outer vine
#

are there some promising results for rwkv-5?

void quartz
#

Sort of, though it's not part of rwkv-5 yet.

I think i will outline it here first (let me know if i should repost this separately). As this is a compilation of an ongoing experiment between me and a few members of the RWKV community.


#

RWKV memory experiment v5/wavenet - update 1

While RWKV is able to match transformer performance on a wide variety of tasks, it generally stumbles on tasks with large data inputs or randomised datasets that the model would need to compress and store within its internal state (large document Q&A is a major example) - within the RWKV community, this is considered its "only weakness"

As such, an ongoing effort to quantify and benchmark this memory capacity was started, where we measure the model's performance on receiving randomised English word tokens and replying with those same tokens

Instruction: Repeat this text exactly as it is
Input: <random word tokens>
Output: <output to benchmark>

In general, transformer models trained to handle this task have no issue with the lookback and provide a full response (within their context length)
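A minimal sketch of how such a repeat-the-input benchmark could be set up, assuming a word list to sample from and a simple mismatch count as the score (0 meaning perfect recall). The helper names and the scoring rule are illustrative assumptions, not the benchmark code used in these experiments.

```python
# Sketch: build a random-word "repeat exactly" prompt and score a reply.
import random
from typing import List, Sequence


def sample_words(vocab: Sequence[str], n_words: int, seed: int) -> List[str]:
    """Draw random words with replacement, so no learnable pattern exists."""
    rng = random.Random(seed)
    return [rng.choice(vocab) for _ in range(n_words)]


def build_prompt(words: Sequence[str]) -> str:
    """Format the prompt using the template shown above."""
    return (
        "Instruction: Repeat this text exactly as it is\n"
        "Input: " + " ".join(words) + "\n"
        "Output:"
    )


def recall_errors(expected: Sequence[str], produced: Sequence[str]) -> int:
    """Count position-wise mismatches plus any missing or extra words."""
    mismatches = sum(e != p for e, p in zip(expected, produced))
    return mismatches + abs(len(expected) - len(produced))


if __name__ == "__main__":
    vocab = ["apple", "river", "stone", "cloud", "lamp"]  # placeholder word list
    words = sample_words(vocab, 8, seed=0)
    print(build_prompt(words))
    print(recall_errors(words, words))
```

Sweeping `n_words` upward until the error count stops being zero gives a rough estimate of the worst-case recall length.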

The following is the score for raven / custom rwkv4 models

It is important to stress that this should be considered the worst-case memory capacity, as the raven model has been shown to be able to compress large common concepts into its memory, far exceeding these numbers.

Randomized text was intentionally chosen to represent worst-case numbers, as training cannot help form a pattern for this text

#

Subsequently with a standardised benchmark we have internally, we came up with the means of training the model from scratch, and to replicate the results - without needing to train an entirely new PilePlus+Raven model

This allowed us to run experiments on improving memory capacity. The biggest impact as of now comes from the change to the channel-mix layer, in how token shift is done, into a structure that resembles a WaveNet (this is only a few lines of changes).

We now have a TokenShift 430M model that outperforms the raven 14B model on the memory recall task - furthermore, this is shown to scale upwards, with our TokenShift 1.5B model doubling the 430M performance

#

We are training a slightly larger model (L24-D5120), which we believe will be able to retain more than 1k tokens in memory, bringing this within transformer-level context sizes.

It is believed that these modification to raven 14B, would allow it to have perfect recall of 4k tokens (or higher)

(Currently our experiments are bottlenecked by our GPU capacity)

#

We posit that this heightened perfect recall of token memory, at par with transformers context length, could remove the last obstacles preventing RWKV (or other RNN like) architectures from superseding transformers without any compromise.

As it fixes the last set of tasks that it loses out to transformer models in

#

Notes:

The tokenshift memory models we trained have had very limited general text training, so we do not know as of now whether this process will benefit or hinder model performance on other tasks if trained on the Pile etc. - the assumption is that it will be an overall benefit. Changes were only made to channel mixing, with time-mixing kept the same, which we believe will help it retain existing reasoning capabilities.

Since Blink's upcoming RWKV-5 changes are only made to the time-mix layer, these changes could potentially be merged and used together.

Currently we plan to perform memory training and testing on the time-mix rwkv-5 changes, first without the tokenshift changes and subsequently with them

#

we drafted the following abstract, but the members involved have limited to no experience with papers, and no GPU capacity to take this idea further than memory training (i.e. pile+ and instruct tuning)

here i am 😅

void quartz
#

(wavenet architecture, on how the token information flow through the layers)

misty cedar
#

the change is to swap from this causal convolution structure here, to the dilated wavenet above

#

at least for the first 12 layers
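For readers unfamiliar with the idea, the difference can be sketched on a single channel. Standard token shift mixes each token with the one immediately before it; a dilated variant mixes with a token further back, with the offset growing by layer as in WaveNet. The `2 ** layer` schedule and the function names below are assumptions for illustration only; the actual code change in these experiments may differ.

```python
# Sketch: standard vs. dilated token shift on one channel (list of floats).
from typing import List


def token_shift(x: List[float], offset: int = 1) -> List[float]:
    """Shift the sequence right by `offset`, padding the start with zeros."""
    pad = [0.0] * min(offset, len(x))
    return pad + x[: max(len(x) - offset, 0)]


def shift_mix(x: List[float], mu: float, offset: int = 1) -> List[float]:
    """Interpolate each token with the token `offset` positions earlier."""
    shifted = token_shift(x, offset)
    return [mu * cur + (1.0 - mu) * prev for cur, prev in zip(x, shifted)]


def dilation_for_layer(layer: int, dilated_layers: int = 12) -> int:
    """Hypothetical schedule: doubling offsets for early layers, then offset 1."""
    return 2 ** layer if layer < dilated_layers else 1


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0, 5.0]
    print(shift_mix(x, mu=0.5, offset=dilation_for_layer(0)))
    print(shift_mix(x, mu=0.5, offset=dilation_for_layer(1)))
```

Stacking offsets 1, 2, 4, ... lets a token reach exponentially far back through the shift path alone, which is the property the WaveNet-style structure is meant to provide.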

void quartz
#

( @misty cedar was the one who made the bulk of the code changes, for these advancements 😉 )

hushed flare
#

Does the wave net still have an RNN form?

misty cedar
#

We have rnn inference code already written for it

hushed flare
misty cedar
hushed flare
void quartz
#

all the code and the notebook is currently public 👍

void quartz
hushed flare
void quartz
#

a bit harder to explain than the simplified table shown: the following scores the output for each model - x is the input token size that was tested, y is the score (0 means perfect recall)

the table selects the best score based on their respective criteria (among the various tested prompt length)

misty cedar
#

layers above 12 are normal unaltered rwkv layers

hushed flare
#

You may be interested in this paper, seems similar to what you've proposed: https://arxiv.org/abs/2305.01638

The reason I'm asking all these questions is I'm playing around with it.

void quartz
outer vine
#

wow, this is crazy, a linear attention-based model with 175B parameters, which can be trained in parallel and do generation recurrently
https://arxiv.org/abs/2307.14995

young sparrow
#

The only evaluations they do are of partially trained models with 1B parameters or fewer

outer vine
young sparrow
obsidian quest
#

RWKV-5-World-0.1B-v1-OnlyForTest_37%_trained-20230728-ctx4096.pth uploaded https://huggingface.co/BlinkDL/rwkv-5-world/tree/main
supported in rwkv pip package 0.8.7

0.1B world:
RWKV-5 37% trained = LAMBADA ppl 18.1 acc 42.93%
RWKV-4 100% trained = LAMBADA ppl 25.5 acc 36.29%

Interesting fact: RWKV-5 is great at benchmarks (excellent zeroshot performance), but generates quite worse music (just like GPT models) despite lower loss. (try https://huggingface.co/BlinkDL/rwkv-5-music)

This fits my theory: Dot-product is good for uncreative work, while Channelwise is good for creative work.

young sparrow
obsidian quest
#

lets ask @sullen horizon #1103039376184852622 message

void quartz
young sparrow
void quartz
# void quartz # RWKV memory experiment v5/wavenet - update 1 While RWKV is able to match tran...

RWKV memory experiment v5/wavenet - update 2

(please let me know if i should shift this into a separate thread)

We did a 7.5% / 15% codeparrot dataset training run, on both the baseline rwkv4 code and rwkv4+tokenshift, to see if the changes have a negative impact on the model's capability in other tasks. All 3 are 1.5B param models

From a loss point of view all 3 models converged to similar loss levels, indicating that the token shift changes may not have an adverse impact on other tasks

It is also interesting to note that the codeparrot model itself, had an average loss of 2.06 against its validation dataset - meaning all 3 models despite being trained significantly less - may outperform the codeparrot model

Asking for feedback on how to move these changes / experiments forward - aka what are good evals / tasks to train / validate on which would make good use of the extended memory - ideally without needing to train a full model

obsidian quest
hushed flare
#

This looks so much like your code, no? (the RNN form)

obsidian quest
#

AFT = headsz 1 version of LinearTransformer

RWKV4 = ExponentialDecay + Headsz 1

RetNet = ExponentialDecay + Headsz 256, with xPos too (but I find it can be removed)

RWKV5 = ExponentialDecay + Headsz 64, best performance

Headsz N = N x larger state (more vram, slower, still much better than KV cache), helps memorization
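The "N x larger state" remark can be checked with a rough count. This sketch assumes each head keeps a head_size x head_size state matrix, so a layer holds (D / head_size) * head_size^2 = D * head_size elements, and compares that with a KV cache of 2 * T * D elements per layer. It ignores the token-shift state and any normalizer terms, and the function names are made up.

```python
# Sketch: recurrent state size vs. KV cache size, per layer, in elements.


def recurrent_state_elements(d_model: int, head_size: int) -> int:
    """One head_size x head_size matrix per head; grows linearly in head_size."""
    n_heads = d_model // head_size
    return n_heads * head_size * head_size


def kv_cache_elements(d_model: int, n_tokens: int) -> int:
    """Keys and values for every cached token; grows linearly in context length."""
    return 2 * n_tokens * d_model


if __name__ == "__main__":
    d_model = 1024  # illustrative width
    for head_size in (1, 64, 256):
        print(head_size, recurrent_state_elements(d_model, head_size))
    print("kv cache @ 4096 tokens:", kv_cache_elements(d_model, 4096))
```

Under these assumptions the state is independent of context length, so even head size 256 stays below the KV cache once the context exceeds 128 tokens.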

#

@hushed flare see #1103039376184852622 message

hushed flare
# obsidian quest

I'm surprised there's so much of a difference -- I would have thought if you have head size = 1 then you'd end up with multiple heads learning the same kernel if it was beneficial

obsidian quest
#

0.1 loss difference is not too much for a small model. it's similar to the loss of a 30% larger model

void quartz
young sparrow
void quartz
young sparrow
#

Great work, y’all’re getting noticed 🙂

#

At ICML I mentioned RWKV a couple times in a couple conversations and a bunch of people I was talking to knew about it

snow zealot
# young sparrow Do you have RWKV numbers on the Long Range Arena? I’m interested in comparing RW...

Just out of curiosity I trained RWKV V4 using the code from @sullen horizon, https://github.com/SSamDav/rwkv-long-range-arena/tree/main, on 3 LRA benchmarks (listops, imdb, aan). The results are here: https://wandb.ai/ssamtheboy/lra-benchmark
The listops results are a bit sketchy, because I had a run yesterday that performed much better. I probably need to change the default parameters.
Each run has a note saying which dataset it corresponds to.

void quartz
# void quartz Not sure whats the procedure for paper feedback / corrections is - rwkv is cited...

Got someone from the RetNet side to clarify what they meant about RWKV not having "training parallelization".

It basically boils down to the fact that we need to compute the previous token's state before the next token's state within a sequence. It has nothing to do with GPU usage - which, to be fair, is very true

The full convo is on github here https://github.com/microsoft/unilm/issues/1243

But i hope that helps clear the air on that topic

#

(so put down those pitchforks folks)

spiral minnow
#

Still seems like they are bending the definition. Their method also uses recurrence within blocks, and within those blocks the computation cannot be parallelized (as they define it). So, in my understanding, their method also needs to compute the previous token state in order to calculate the current token state, and thus is also not parallelizable according to that definition

void quartz
# spiral minnow Still seems like they are bending the definition. They're method also uses recur...

😅 yea im still kinda off on the definition.

Cause while it's true that we do not need to have "state" precomputed in transformers.
Dun i also need to compute all the previous tokens in parallel and apply attention, even more so uniquely for every token i generate. Somehow all that additional compute cost is better than having a state between tokens?

(not the expert here, so im gonna take the explanation that its not about throughput as it is)

void quartz
# void quartz # RWKV memory experiment v5/wavenet - update 2 (please let me know if i should s...

RWKV memory experiment v5/wavenet - update 3

Continuing the series of stress testing the v5 (rotary embedding) and v5+wavenet changes - for memory storage and capacity of random words.

We have now officially passed the 1k token mark, with the 1.5B model able to keep up to 1.7k tokens in memory.
Because this is near the current training limit (of 2k), it is possible that the real limit is higher.

The wavenet preview is only 75% trained compared to baseline v5, however it's on track for a similar performance range (currently 1.5k from testing of the preview vs. 1.7k for the baseline)

Tune5 (for both models), which will train with up to 4k inputs/outputs, is estimated to take 48+ more hours

#

However the big thing is, putting the technical progress aside...

RWKV-v5, with or without wavenet, is now officially in transformer territory in terms of being able to look back into its inputs

(and hopefully pay attention to them too! which we believe it should, as these changes show no penalty in enwiki/code loss training compared to v4 - once proven, this brings us much closer to having RWKV being a full replacement to transformers with no compromises)

void quartz
#

This is also strong evidence (v4 vs v5) of rotary embeddings, being able to encode and handle relative positional information

snow zealot
#

The problem with using the rotary embedding is that we then lose the inf context, no?

young sparrow
void quartz
#

there is no absolute positional encoding in it

void quartz
#

( apologies for confusion, blink basically explained to me how v5 changes were not rotary, so it was a misunderstanding on my part when i visualised how the model changes worked - still the numbers are as benchmarked )

obsidian quest
#

yeah my v5 implementation does not have rotary, nor xpos 🙂 it's pos.emb-free

tender karma
obsidian quest
tender karma
#

Thank you! Do you consider it a stable v5, or is it still very much under experimentation?

obsidian quest
#

stable. can still be improved a bit but requires another CUDA kernel

tender karma
indigo crater
#

very late, but: I would be willing to bet this is a LaTeX addon and it should probably be in overleaf somewhere

void quartz
# tender karma Do you have any description of the current implementation v5? I’ve read so many ...

ignore all the wavenet/tokenshift stuff - those are not in v5.

i found this useful for just extracting the delta:
https://github.com/BlinkDL/RWKV-LM/compare/a637aea61c77cedd290054449d819da5e7b19d44...main

spiral minnow
#

EMNLP reviews should be coming out in about 1 week. Based on the community's reaction to RWKV, is there any specific feedback that we're expecting from reviewers? Are there any experiments that we can get a head start on now?

young sparrow
#

A scaling laws study that's better designed to draw inference about the param-to-token ratio would be a good idea. We did a pretty good job with the time we had, but I haven't had the bandwidth to figure out exactly what we want yet.

karmic tree
#

I guess we'll see them closer to end 22nd AOE, rather than daytime on the 22nd 🙂 But hopefully they're not late

sharp sonnet
#

The reviews are out!

last mauve
#

Reviews are looking positive to me

#

I'll put up a rebuttal skeleton and revision work list later

karmic tree
#

R Zd3h's first reason to reject establishes a fine anchoring for the paper, I think; if their complaint is an argument that RWKV is not as impactful as the transformer architecture, then things are going well. Always nice to get an "Excitement: Transformative"

young sparrow
karmic tree
gusty condor
#

I don't understand. Just reject because it's a non-Transformer architecture?

outer vine
#

I kind of agree with reviewer 85wr and it would greatly improve the readability of our paper

#

if we could add RWKV-v1 to this paper, it would make for a smoother transition from AFT to RWKV

#

AFT to RWKV-v1: absolute position score to relative decay score (rnn shows up here)
v1 to v4: single decay score to channel-wise decay (plus a lot more like u vector, init method...)

karmic tree
gusty condor
fickle hare
#

I'm not familiar with AI paper reviewing at all, but the "reasons to reject" section looks like "weakness" in Sys/PL conference reviews to me

#

reviewer HNDB even asked for fp16/bf16 training information in their "reasons to reject" part....

tropic minnow