young sparrow May 28, 2023, 8:27 PM

#

How many GPUs are you running on? We can probably spare some dedicated 8xA6000s

obsidian quest May 28, 2023, 8:28 PM

#

8 GPU for each run

young sparrow May 28, 2023, 8:30 PM

#

Let’s see how bad the crashes are and I’ll move things around mid week if needed

tender karma May 29, 2023, 2:01 AM

#

I believe we need to clarify this a bit better. We have RWKV, a general-purpose RNN that can potentially replace every LSTM-like in your projects, and we describe the neural model (e.g., the RWKV cell) without even mentioning the LM task. Then we say we focus on RWKV-LM, with the LM on top. Treating it as a pure RNN, I started (and then removed) a fundamental experiment section with the basic tasks used to evaluate RNNs, such as addition and copying. It performed great, of course, but the LSTM performed almost equally well on such simple tasks, so there was no point in the end.

fickle hare May 29, 2023, 10:14 AM

#

so with whom should I work on reorganizing Section 4?

steady ether May 29, 2023, 1:53 PM

#

@young sparrow I'd be glad to jump in and help with rewriting an initial draft for Section 4. I've run RWKV a few times, but I've always had some questions about the details. I have a knack for breaking down complex concepts which should be useful here.

misty cedar May 29, 2023, 2:18 PM

#

https://www.youtube.com/live/ZDHE119dFR8?feature=share

YouTube

hu-po

RWKV: RNNs Strike Back

Like 👍. Comment 💬. Subscribe 🟥. 🏘 Discord: https://discord.gg/uyYQTB7a https://arxiv.org/pdf/2305.13048.pdf https://github.com/BlinkDL/RWKV-LM#transformers ...

#

Someone reading through the paper

#

Note where he gets confused/misunderstands stuff

tough crane May 29, 2023, 2:24 PM

#

misty cedar https://www.youtube.com/live/ZDHE119dFR8?feature=share

This is an "Open Review"...

misty cedar May 29, 2023, 2:26 PM

#

I'm halfway through, and I am just noting he is very very confused about what token shift is, so it may be worth elaborating on that ig

regal basalt May 29, 2023, 2:26 PM

#

He seems very confused throughout 20-30% of the paper

young sparrow May 29, 2023, 11:48 PM

#

@obsidian quest something seems to be going catastrophically wrong with the runs

obsidian quest May 30, 2023, 3:47 AM

#

young sparrow @obsidian quest something seems to be going catastrophically wrong with th...

killed multiple times every day

#

but mostly completed now

tender karma May 30, 2023, 8:05 AM

#

fickle hare so with whom should I work on reorganizing Section 4?

Happy to work with you on section 4.

obsidian quest May 30, 2023, 9:22 AM

#

Please download all loss curves in https://wandb.ai/blinkdl/RWKV-v4-Scaling
Use n_embd n_layer my_exit_tokens to identify the run & combine fragments

W&B

blinkdl

Weights & Biases, developer tools for machine learning

neon night May 30, 2023, 11:58 AM

#

@bronze frost Hey, do you think this exp in the cuda backward kernel is not numerically stable:

neon night May 30, 2023, 12:51 PM

#

This is not a problem if we accept small exponentials to be inaccurate in the gradient, since both exp(zexp[i]) and exp(k[i] + o) are less than 1.

steady ether May 30, 2023, 1:55 PM

#

Going to start making some changes to section 4 for clarity. Let me know if you have any suggestions or concerns.

bronze frost May 30, 2023, 2:15 PM

#

neon night This is not a problem if we accept small exponentials to be inaccurate in the gr...

yes, basically if exp(k[ii]+o) is too small for floats to represent, then the outputted gradient gk[ii] isn't going to have any luck representing it either. So then it's fine to underflow to 0.
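
A small illustration of the underflow point above, as a sketch rather than the actual kernel: it rounds values to float32 with the standard library and shows that a term of the form `upstream * exp(k + o)` becomes exactly 0 once `exp(k + o)` falls below the smallest float32 subnormal. The names `to_float32` and `grad_term` and the sample values are made up for the example.

```python
import math
import struct


def to_float32(x: float) -> float:
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def grad_term(upstream: float, k: float, o: float) -> float:
    """Hypothetical gradient contribution upstream * exp(k + o), evaluated in float32."""
    e = to_float32(math.exp(k + o))
    return to_float32(upstream * e)


# o plays the role of a negative running maximum, so k + o <= 0 and exp(k + o) <= 1.
moderate = grad_term(1.0, k=-3.0, o=-2.0)    # exp(-5) is comfortably representable
tiny = grad_term(1.0, k=-60.0, o=-60.0)      # exp(-120) is below float32's smallest subnormal

print(moderate)
print(tiny)
```

If the true value cannot be represented in the output dtype anyway, flushing it to 0 loses nothing that the gradient buffer could have stored.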

regal basalt May 30, 2023, 2:32 PM

#

Do we need to clarify the definition for token shift?

young sparrow May 30, 2023, 2:36 PM

#

regal basalt Do we need to clarify the definition for token shift?

Yes, we never explain what it means and it's a non-standard term AFAIK.

regal basalt May 30, 2023, 2:36 PM

#

👌👌

fickle hare May 30, 2023, 3:44 PM

#

steady ether Going to start making some changes to section 4 for clarity. Let me know if you ...

As mentioned above, Section 4 seems to need a rewrite instead of just linguistic improvement. Maybe we should decide the structure of the section first before moving into the details.

serene badge May 30, 2023, 3:46 PM

#

Hello, I would like to extend some help to revise the paper.
Here are some of my preliminary suggestions. Please correct me if I missed details already covered in the paper.

Lack of explanation for scalability:
- We mention that RWKV can scale to tens of billions of parameters, but it is not clear how this scalability was achieved.
- We could provide more details about how we optimized the model architecture and training process to achieve such scalability.
Insufficient analysis and visualization of attention weights:
- While section 4 provides some insights into the interpretability of RWKV's attention mechanism, it would be helpful to include a more detailed analysis and visualization of the attention weights.
- We could include visualizations that show how attention weights change over time or across different layers in the model.

fickle hare May 30, 2023, 3:49 PM

#

serene badge Hello, I would like to extend some help to revise the paper. Here are some of my...

1. We wanted to demonstrate training scalability similar to transformers in section 4.2, but it seems it's too implicit right now. The plan is to rewrite the whole section, so I'll keep that in mind, thanks.
2. What do you mean by "attention weights"? The decay introduced by Ws?

young sparrow May 30, 2023, 3:50 PM

#

obsidian quest Please download all loss curves in https://wandb.ai/blinkdl/RWKV-v4-Scaling Use ...

On it

serene badge May 30, 2023, 3:55 PM

#

fickle hare 1. We wanted to demonstrate the training scalability similar to transformers in ...

For attention weights, I mean the coefficients assigned to different input elements by an attention mechanism. If I remember correctly, by assigning higher weights to more important elements, the model can selectively attend to the relevant information and ignore the irrelevant parts.

tropic minnow May 30, 2023, 3:57 PM

#

steady ether Going to start making some changes to section 4 for clarity. Let me know if you ...

i think definition of recurrence both as token-shift and as an increasingly longer sum of terms should be mentioned the first time recurrence is described. If anything, i think the WKV is more important for recurrence than token-shift, which is an extra tiny convolution

fickle hare May 30, 2023, 3:59 PM

#

serene badge For attention weights, I mean the coefficients assigned to different input eleme...

So you mean the attention map, consisting of n-to-n numbers per attention head? While we can produce an equivalent plot, it might be less meaningful for variants of linear attention... IDK

young sparrow May 30, 2023, 4:02 PM

#

obsidian quest Please download all loss curves in https://wandb.ai/blinkdl/RWKV-v4-Scaling Use ...

@obsidian quest can you add me to this so I can edit it in app?

serene badge May 30, 2023, 4:03 PM

#

fickle hare So you mean the attention map, consisting of n-to-n numbers per attention head? ...

Kind of heatmaps. Not sure if it is suitable. NVM.

serene badge May 30, 2023, 9:25 PM

#

@fickle hare I made revisions for section 4. Here's the change log.

Fixed some typos in sections 4.4 and 4.5.
Revised sentences in sections 4.2~4.7.
Change the title of section 4.4 "Software Implementation" to "Model Implementation and Architecture".
Suggestion for Figure 2: The font is too small. If the author can provide the original design file, I can help to revise it.

steady ether May 31, 2023, 3:33 AM

#

@tender karma @serene badge @fickle hare @neon night @uneven blade

Looks like everyone really loves section 4. For the rewrite, here's a summary of all the points brought up so far + my own thoughts.

"Infinite" context clarification (1-2 people)
- (Main paper) We should show a math proof, a graph, or at least talk about how this is supported
- (Appendix) If @obsidian quest or anyone else has time to finetune a 7B model with a larger context length and compare it with other models such as MPT-7B-StoryWriter-65k+, this would be extremely helpful
  - MPT finetuning dataset here. They used a filtered fiction subset: https://huggingface.co/datasets/the_pile_books3
Moving definitions into appendix (1 person)
- We are explaining quite a few things that we could move into the Appendix to save space for more important points
Design clarifications (1-2 people)
- (Main paper) Learning rate, hyper-parameters, optimization techniques
- Expand on the usage of recurrence, time decay, and token shift.
- (Appendix) Elaborate
Editing and coordinating (1 person)
- Review and edit the final work to ensure it flows well.
- Fix abrupt transitions into new concepts.
- Remove repetitive statements.

young sparrow May 31, 2023, 4:11 AM

#

@steady ether you seem to be confused about the_pile_books3. That’s not the MPT training dataset, it’s a small component of it. It’s also a component that is already in our training corpus

serene badge May 31, 2023, 4:21 AM

#

@steady ether, for the definitions and Design clarifications, I'm thinking that we could use a summary table for all the key features we implemented in RWKV. The format could be like this. Then we could move some definitions or explanations to the appendix section.

steady ether May 31, 2023, 4:43 AM

#

young sparrow @steady ether you seem to be confused about the_pile_books3. That’s not ...

You're totally right. Looks like MPT was fine-tuned on the subset of the books3 dataset, but the base was trained on Pile v2. Edited the post for clarity

Screenshot_2023-05-31_at_12.40.52_AM.png

steady ether May 31, 2023, 4:44 AM

#

serene badge @steady ether, for the definitions and Design clarifications, I'm thinki...

That's a neat idea. This would certainly help clarify a lot of misunderstandings, especially by people who only kind of understand what's going on. We should aim to make this as easy to understand as possible.

However, it's a big change and we should have buy in from @last mauve, @tender karma and others who have worked on that section

young sparrow May 31, 2023, 4:45 AM

#

steady ether You're totally right. Looks like MPT was fine-tuned on the subset of the books3 ...

No? The base was not trained on Pile V2

#

Please read what you’re linking to before making claims about it

#

It was trained on 1T tokens of text and code that was curated by MosaicML’s data team

steady ether May 31, 2023, 4:54 AM

#

young sparrow > It was trained on 1T tokens of text and code that was curated by MosaicML’s da...

Sorry, yeah you're right. Posting the real data mix here in case I mislead anyone with my previous comment: https://huggingface.co/mosaicml/mpt-7b

Screenshot_2023-05-31_at_12.52.27_AM.png

fickle hare May 31, 2023, 7:30 AM

#

steady ether @tender karma @serene badge @fickle hare <@10420353091...

Long-context finetuning of the 7B model needs some work in the code, e.g. splitting a sequence across multiple GPUs

#

I can work on the code but I'm not sure if I'll have the GPU-hours to fine-tune on that.

fickle hare May 31, 2023, 7:53 AM

#

serene badge @steady ether, for the definitions and Design clarifications, I'm thinki...

Some thoughts:

1. In "Transformer-like Parallelization," we want to mention the following:
   a. In our training process, most of the computation (all the matrix multiplications and token shift, everything except the WKV recurrent operator) is parallelized along the time axis, similar to Transformers/QRNN/LRU/... but different from GRU/LSTM/...
   b. The WKV operator has the potential to be parallelized as well through a parallel scan (if the long-context finetune is accomplished later, this becomes "has been" instead of "can be")
2. In "RNN-like Sequential Decoding," maybe compare more explicitly with the KV cache of Transformers? Instead of "Sequential," we may want the subsection title to highlight the constant time & space regardless of sequence length.
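
The contrast in the points above can be sketched for a single channel. Assuming the WKV form discussed in this thread (a per-channel decay `w`, a bonus `u` for the current token, and keys inside an exponential), the direct sum touches every past token at each step, while the recurrent form carries only two running numbers. This is a plain-Python illustration without the max-subtraction trick used for numerical stability; the function names and inputs are hypothetical.

```python
import math


def wkv_direct(k, v, w, u):
    """Each output re-sums all earlier tokens: O(T) work per step, O(T^2) in total."""
    out = []
    for t in range(len(k)):
        bonus = math.exp(u + k[t])
        num, den = bonus * v[t], bonus
        for i in range(t):
            weight = math.exp(-(t - 1 - i) * w + k[i])  # older tokens decay more
            num += weight * v[i]
            den += weight
        out.append(num / den)
    return out


def wkv_recurrent(k, v, w, u):
    """Same outputs, but only the running pair (a, b) is kept between steps."""
    a, b = 0.0, 0.0
    decay = math.exp(-w)
    out = []
    for kt, vt in zip(k, v):
        bonus = math.exp(u + kt)
        out.append((a + bonus * vt) / (b + bonus))
        a = decay * a + math.exp(kt) * vt   # decayed sum of exp(k_i) * v_i
        b = decay * b + math.exp(kt)        # decayed sum of exp(k_i)
    return out


keys = [0.1, -0.3, 0.7, 0.2]
values = [1.0, 2.0, -1.0, 0.5]
direct = wkv_direct(keys, values, w=0.5, u=0.3)
recurrent = wkv_recurrent(keys, values, w=0.5, u=0.3)
print(direct)
print(recurrent)
```

The recurrent version keeps `a` and `b` at a fixed size no matter how long the sequence gets, which is the constant time and space per step that the subsection title could highlight.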

steady ether May 31, 2023, 12:59 PM

#

fickle hare Some thoughts: 1. In "Transformer-like Parallelization," we want to mention the ...

Agreed, especially with #2. We should elaborate and also add some citations here.

fickle hare May 31, 2023, 2:41 PM

#

@obsidian quest Which checkpoint should I start with if I want to replicate the MPT-7B-StoryWriter-65k+ finetune on our 7B? RWKV-4-Pile-7B-20230406-ctx8192-test949.pth?

obsidian quest May 31, 2023, 2:42 PM

#

fickle hare @obsidian quest Which checkpoint should I start with if I want to replicat...

yes

serene badge Jun 1, 2023, 12:43 AM

#

@steady ether @fickle hare, I’ve added some citations to section 4.2.

neon night Jun 1, 2023, 7:44 AM

#

Just to close a topic.. I found the following form useful in future extensions of RWKV & relatively easy to compute:

neon night Jun 1, 2023, 8:16 AM

#

For the paper, it probably helps to mention the word "cumulative sum"

tough crane Jun 1, 2023, 8:26 AM

#

Is it called "time span decayed" cumsum ?

neon night Jun 1, 2023, 8:34 AM

#

#

Mind-blowing. If RL is similar to WKV, then a whole bunch of RL techniques can be applied... anyway that's another issue, you write the paper you like

tough crane Jun 1, 2023, 8:45 AM

#

neon night

Exactly !! I'm thinking the same formula 🤣 🤣

neon night Jun 1, 2023, 8:46 AM

#

tough crane Is it called "time span decayed" cumsum ?

One way of calling it is "cumulative attention weight / attention value (?)", similar to "cumulative reward" in RL.

tough crane Jun 1, 2023, 8:58 AM

#

neon night One way of calling it is "cumulative attention weight / attention value (?)", si...

umm, perhaps we need an assumption similar to the convergence condition for "RL with infinite horizon" to make "RWKV with infinite context length" work

tough crane Jun 1, 2023, 10:33 AM

#

fickle hare Some thoughts: 1. In "Transformer-like Parallelization," we want to mention the ...

If I understand correctly, I think that it's like dynamic programming (time-memory tradeoff using KV cache) vs constant memory transitions (RNN like)

neon night Jun 1, 2023, 10:58 AM

#

neon night

wkv can be seen as maximizing the normalized reward in the direction of the output label

gusty condor Jun 1, 2023, 11:23 AM

#

neon night Mind-blowing. If RL is similar to WKV, then a whole bunch of RL techniques can b...

I think this is only true when gamma is an unlearnable hyperparameter

fickle hare Jun 1, 2023, 12:46 PM

#

tough crane If I understand correctly, I think that it's like dynamic programming (time-mem...

Only if you take FlashAttention into account... otherwise KV cache is not using any more memory than directly computing the matmuls

neon night Jun 1, 2023, 1:05 PM

#

gusty condor I think this is only true when gamma is an unlearnable hyperparameter

:thinkies: neutrally, wkv could be described as a weighted cumulative sum of latent vectors competing for attention, with a learnable exponential decay factor that favors recent vectors.
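
Written out for one channel, the description above corresponds to a ratio of two decayed cumulative sums. The symbols are assumptions based on how the operator is discussed in this thread: $w$ is the decay, $u$ a separate bonus for the current token, and $k_i$, $v_i$ the keys and values. Treat it as a sketch of the form, not the final notation for the paper.

```latex
wkv_t = \frac{\sum_{i=1}^{t-1} e^{-(t-1-i)w + k_i}\, v_i + e^{u + k_t}\, v_t}
             {\sum_{i=1}^{t-1} e^{-(t-1-i)w + k_i} + e^{u + k_t}}
```

Each past value $v_i$ is weighted by $e^{k_i}$ (the "competition for attention") and additionally discounted by $e^{-(t-1-i)w}$, so with $w > 0$ more recent vectors are favored.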

fickle hare Jun 1, 2023, 1:17 PM

#

@obsidian quest I'm implementing long context training with time checkpointing now, and I need some hints around the L2Wrap thing. It seems to be manually scaling the largest element in each token's output logits, in which the scaling factor is related to the total token amount B*T. Should I keep scaling according to the total token amount, even if it would be much larger (~100K-1M, compared to the previous 10Ks) than before?

gusty condor Jun 1, 2023, 2:49 PM

#

neon night :thinkies: neutrally, wkv could be described as a weighted cumu...

Personally, I don't think that techniques like double network will work for this, because w is changing over time yet target network is fixed for some time, but you can do some experiments

fervent onyx Jun 1, 2023, 2:53 PM

#

fickle hare @obsidian quest I'm implementing long context training with time checkpoin...

I hadn't noticed the L2Wrap before, it looks like it's making the backward pass more numerically stable by down scaling or something? how are you implementing the long context training? are you folding it into batches and distributing across gpus?

fickle hare Jun 1, 2023, 2:56 PM

#

No, I'm not. Given the limited resource, I decided to do gradient checkpointing for every subsequence and chain them together. 4~8*80GB VRAM won't enable 100K~1M ctxlen I want.

neon night Jun 1, 2023, 2:59 PM

#

gusty condor Personally, I don't think that techniques like double network will work for this...

yes that's an issue. but if we can find ways around this, we can produce more papers

#

by fixing the time decay factor (at the fine-tuning stage) for example

fickle hare Jun 1, 2023, 3:18 PM

#

fickle hare @obsidian quest I'm implementing long context training with time checkpoin...

The L2Wrap seems to have not been mentioned in the paper either 🤯

obsidian quest Jun 1, 2023, 3:44 PM

#

fickle hare @obsidian quest I'm implementing long context training with time checkpoin...

it returns a gradient to make max(logits) closer to 0. the gradient is already scaled

#

It's from PaLM paper (section 5)
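
A minimal sketch of what such a gradient could look like, written in plain Python rather than as the actual autograd function: for every token, only the largest logit receives a gradient, proportional to its value and divided by the number of tokens B*T. The constant `1e-4` and the name `l2wrap_grad` are assumptions for illustration, not taken from this thread.

```python
def l2wrap_grad(logits, scale=1e-4):
    """Extra gradient w.r.t. logits of shape [B][T][V] (nested lists).

    Only the max logit of each token gets a nonzero entry, so a gradient
    descent step pulls max(logits) toward 0. `scale` is an assumed constant.
    """
    num_tokens = len(logits) * len(logits[0])   # B * T
    factor = scale / num_tokens                 # already scaled by token count
    grad = []
    for sequence in logits:
        grad_sequence = []
        for token_logits in sequence:
            idx = max(range(len(token_logits)), key=token_logits.__getitem__)
            row = [0.0] * len(token_logits)
            row[idx] = token_logits[idx] * factor
            grad_sequence.append(row)
        grad.append(grad_sequence)
    return grad


example_logits = [[[1.0, 3.0, -2.0], [-1.0, -4.0, -0.5]]]   # B=1, T=2, V=3
example_grad = l2wrap_grad(example_logits)
print(example_grad)
```

Because the factor is divided by B*T, the total pull summed over all tokens stays on the same scale when the number of tokens per step grows.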

steady ether Jun 2, 2023, 5:13 AM

#

@fickle hare I've just revised section 4.4 for clarity.

Could you help clarify these points in section 4.1?

On what basis can we guarantee that linear interpolation will be beneficial in this context?
I noticed the weight output is denoted as Wo. Do you think it would make sense to rename it to Ww for consistency with the RWKV naming?

Screenshot_2023-06-02_at_12.22.39_AM.png

uneven blade Jun 2, 2023, 5:52 AM

#

As for token shift, its benefit is a nontrivial one. In the Hungry Hungry Hippos paper https://arxiv.org/pdf/2212.14052.pdf, they design a "shift matrix" that makes "the state x_i to copy from the input u_i, and then pass that information to the next state x_i+1". They run an Induction Head experiment showing that their architecture narrows the gap with transformers on this task.

We can do a similar experiment to show this: whether a 2-layer RWKV with/without token shift is able to learn the Induction Head task with 100% accuracy.

fickle hare Jun 2, 2023, 7:53 AM

#

steady ether @fickle hare I've just revised section 4.4 for clarity. Could you hel...

If your mentioned "linear interpolation" means the interpolation between $x_t$ and $x_{t-1}$, please refer to the above notes by @uneven blade.
W in RWKV is the decaying parameter in (14), while the $W_o$ here is a weight to linearly project to an output. IMO it should not be renamed to $W_w$.

silent urchinBOT Jun 2, 2023, 7:53 AM

#

Blealtan | Huanqi Cao

tropic minnow Jun 2, 2023, 8:23 AM

#

it is likely we might have to cite this: https://arxiv.org/abs/2305.19370 (block-parallel transformer, twitter thread came out today) as it is a development on top of memory-efficient attention we already cite (Rabe & Staats 2022...) with applicability to extend context a lot (up to 64k in the paper)

Captura_de_Pantalla_2023-06-02_a_las_10.20.01.png

arXiv.org

Blockwise Parallel Transformer for Long Context Large Models

Transformers have emerged as the cornerstone of state-of-the-art natural
language processing models, showcasing exceptional performance across a wide
range of AI applications. However, the memory demands posed by the
self-attention mechanism and the large feedforward network in Transformers
limit their ability to handle long sequences, thereby c...

fervent onyx Jun 2, 2023, 11:51 AM

#

I've done lots of experiments with the token shift. My main takeaway was that it plays an important role in token mixing, less so in channel mixing. Its effect on model performance is also non-trivial, in a way that's different for r, k and v. I think its effect on v has a clean interpretation, but not so for k since it lives in the exponent... The shift could be considered a tiny convolution layer with kernel size 2 and a softmax (only valid when the mixing coeff is positive). When extending it to a larger kernel size, I found that it actually made the model more confused rather than helping, I think due to these non-trivial effects. If we are doing more experiments, it'll be good to cross-check these observations...

obsidian quest Jun 2, 2023, 12:23 PM

#

fervent onyx I've done lots of experiments with the token shift. My main takeaway was that it...

visualize it for K V R and different layers and you will see patterns
larger kernels sz can be useful for byte-level / char-level / audio modeling

soft gull Jun 3, 2023, 3:43 AM

#

Popped up on my recommended: https://youtu.be/x8pW19wKfXQ

YouTube

Yannic Kilcher

RWKV: Reinventing RNNs for the Transformer Era (Paper Explained)

#gpt4 #rwkv #transformer

We take a look at RWKV, a highly scalable architecture between Transformers and RNNs.

Fully Connected (June 7th in SF) Promo Link: https://www.fullyconnected.com/?promo=ynnc

OUTLINE:
0:00 - Introduction
1:50 - Fully Connected In-Person Conference in SF June 7th
3:00 - Transformers vs RNNs
8:00 - RWKV: Best of both wo...

paper dove Jun 3, 2023, 6:13 AM

#

soft gull Popped up on my recommended: https://youtu.be/x8pW19wKfXQ

cool~

tropic minnow Jun 3, 2023, 9:16 AM

#

@young sparrow @obsidian quest how are experiments for scaling laws going?

tropic minnow Jun 3, 2023, 9:39 AM

#

any progress on this?

#

how is this going @sullen horizon do you need/want help?

obsidian quest Jun 3, 2023, 9:41 AM

#

he has got good LRA numbers and is tuning for better ones

fickle hare Jun 3, 2023, 11:12 AM

#

@steady ether I just went through 4.1 and left several comments there. I feel that reorganizing this subsection is really necessary: it basically mixes all architectural designs in a number of paragraphs without clear sectioning. It should be split into several parts, including 1. *former overall architecture, 2. token-shift for both time & channel -mix, 3. output gating for both time & channel -mix, 4. WKV.

#

Besides, I somehow feel that the current writing is still not perfect, maybe after another editing pass we need to call for others' help

#

Seems it's time to split section 4 into multiple sections...

young sparrow Jun 3, 2023, 12:40 PM

#

tropic minnow @young sparrow @obsidian quest how are experiments for scaling laws...

Good, I’ll have plots Monday

fickle hare Jun 3, 2023, 1:08 PM

#

BTW I also remember people commenting on our arXiv paper about lacking ablation studies on the different techniques, including token shift, introducing u in WKV, softmax (exponentials) in WKV, etc.

tropic minnow Jun 3, 2023, 1:26 PM

#

fickle hare BTW I also remember people commenting on our arXiv paper about lacking ablation ...

Do you think doing a series of small experiments like small init would help?

fickle hare Jun 3, 2023, 1:28 PM

#

IDK, I'm in no way familiar with ML research drinkies

#

(I major in HPC and never really worked on a ML paper like this)

young sparrow Jun 3, 2023, 1:48 PM

#

tropic minnow Do you think doing a series of small experiments like small init would help?

Yes

#

Not essential, but it would be a nice to have

steady ether Jun 3, 2023, 2:02 PM

#

fickle hare @steady ether I just went through 4.1 and left several comments there. I...

Absolutely. I do have some major changes saved locally, which I'll clean up and share today or tomorrow.

I've been looking at those 2 youtube review videos to better understand what people are confused about.

tropic minnow Jun 3, 2023, 2:16 PM

#

fickle hare @steady ether I just went through 4.1 and left several comments there. I...

i would put WKV right after overall architecture. then token shift

tropic minnow Jun 3, 2023, 2:16 PM

#

fickle hare Besides, I somehow feel that the current writing is still not perfect, maybe aft...

i can revise it anytime you want

fickle hare Jun 3, 2023, 2:16 PM

#

was just mentioning the necessary bits, not in specific order

serene badge Jun 3, 2023, 3:24 PM

#

fickle hare Besides, I somehow feel that the current writing is still not perfect, maybe aft...

I'd like to help with the revision of the section. Will put effort into it.

steady ether Jun 3, 2023, 3:26 PM

#

@fickle hare @serene badge @tropic minnow

Just reworked 4.1. Let me know if it makes more sense now

serene badge Jun 3, 2023, 3:35 PM

#

Will revise Figure 2 to increase the font size today.

fickle hare Jun 3, 2023, 3:44 PM

#

steady ether @fickle hare @serene badge @tropic minnow Just rework...

It's better, but I think we can do sth more. How about this:
4.1 Architecture Design: Overview of the *former-like structure, residual, time-mix, channel-mix; the end-to-end figure
4.1.1 WKV Operator: describe the attention formula of WKV; mention the existence of recurrent form; insights from AFT, linear RNN, etc.
4.1.2 Token Shift
4.1.3 Output Gating

#

BTW the current 4.2 and 4.3 is too fragmented in the whole paper IMO, should think about putting them elsewhere, e.g. in the (new) WKV operator subsection

steady ether Jun 3, 2023, 3:51 PM

#

fickle hare BTW the current 4.2 and 4.3 is too fragmented in the whole paper IMO, should thi...

Sounds good. I was a bit hesitant about adding new subsections because of the page limit. Do you think it's worth shortening 4.7 Additional Optimizations or moving it to the Appendix?

fickle hare Jun 3, 2023, 3:59 PM

#

I think it's worth eliminating 4.2 and 4.3 if we can get the overview to contain necessary information😂

#

4.7 also contains some redundant parts I think

#

the two arch figs and numerous arch formulas might also be unnecessary IMO

steady ether Jun 3, 2023, 4:05 PM

#

Makes sense, I feel like 4.2 and 4.3 only existed to emphasize that RWKV has "the best of both worlds"

tropic minnow Jun 3, 2023, 4:12 PM

#

steady ether Sounds good. I was a bit hesitant about adding new subsections because of the pa...

i think the CUDA kernel paragraph repeats a sentence from the time-parallel mode part, so these kinds of repeats could be deduped to shorten the text. also, i would first try to put all the information we want there. we can always come back and shorten / make things more concise

steady ether Jun 3, 2023, 7:21 PM

#

After digging into section 4.1, I began to realize that the order of content might be confusing for some readers. We delve into intricate details and then seem to revert back to higher-level concepts.

If we're open to renaming the headers "RNN-like" and "Transformer-like" to something else, I think we can consider the structure in the 2nd image.

tropic minnow Jun 3, 2023, 7:46 PM

#

steady ether After digging into section 4.1, I began to realize that the order of content mig...

would move eqs 16,17 with 12,13,14

#

i think i like the titles from image with 4.1.1 etc better - they are more objective descriptions and less subjective claims about potential applicability/intention

#

ctx: this one

young sparrow Jun 3, 2023, 8:18 PM

#

What is the point of Figure 1?

young sparrow Jun 3, 2023, 8:23 PM

#

tropic minnow i think i like the titles from image with 4.1.1 etc better - they are more objec...

I like calling out “transformer-like” and “RNN-like” explicitly

#

The thing that strikes me as weird in the current Section 4 is that “Software Implementation” should probably come last

#

It also sorta feels like 4.2 and 4.5 should be combined, or at least consecutive?

#

The section currently is not systematic. It probably doesn’t matter that much what order we go over the material as long as there’s a clear systemic organization

#

4. RWKV
4.1 Architecture
- Keep current content
- Compress “4.5 Gradient Stability and Layer Stacking” into a single paragraph and stick it here.
4.2 Transformer-like Training
- Keep current content
4.3 RNN-like Inference
- Combines “4.3 RNN-like Sequential Decoding” and “4.6 Harnessing Temporal Structure for Sequential Data Processing”
4.4 Additional Optimizations
- Keep current content
4.5 Software Implementation
- Add a couple missing citations, such as to DeepSpeed

#

We also need to add the basic info about how the model is trained that is currently missing, like talking about LR decay and providing the h params. That can maybe go in between Sections 4 and 5 along with the scaling laws stuff?

tender karma Jun 3, 2023, 9:05 PM

#

young sparrow 4. RWKV 4.1 Architecture - Keep current content - Compress “4.5 Gradient Stabili...

Fully agree

steady ether Jun 3, 2023, 11:15 PM

#

young sparrow 4. RWKV 4.1 Architecture - Keep current content - Compress “4.5 Gradient Stabili...

Did a high level re-org of those sections and also added citations for DeepSpeed, ZeRO, Megatron-LM. Also addressed comments from @tropic minnow

#

young sparrow Jun 3, 2023, 11:20 PM

#

It looks like the diagram has an error: there’s an extra layer norm at the very beginning

steady ether Jun 3, 2023, 11:23 PM

#

@serene badge You mentioned earlier that you were going to update Figure 2, could you include this?

young sparrow Jun 3, 2023, 11:24 PM

#

It might also be clearer to define $\tilde{x_t}=\mu x_t + (1-\mu)x_{t-1}$ and do Eq 12-16 in terms of $\tilde{x_t}$

silent urchinBOT Jun 3, 2023, 11:24 PM

#

Stella Biderman (she/her)

young sparrow Jun 3, 2023, 11:25 PM

#

Well, I guess that $\mu$ is different between $r/k/v$
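
A short sketch of the definition above with a separate mixing vector per projection, in plain Python: each channel of $\tilde{x}_t$ interpolates between the current token and the previous one. The zero vector used before the first token, the function name `token_shift`, and the sample `mu_r` / `mu_k` values are assumptions for illustration.

```python
def token_shift(xs, mu):
    """Return mu * x_t + (1 - mu) * x_{t-1} per channel, for every position t.

    xs: list of T token vectors (lists of floats).
    mu: per-channel mixing coefficients, one vector per projection (r, k or v).
    Assumption: the token "before" position 0 is the zero vector.
    """
    previous = [0.0] * len(mu)
    mixed = []
    for x in xs:
        mixed.append([m * cur + (1.0 - m) * old
                      for m, cur, old in zip(mu, x, previous)])
        previous = x
    return mixed


xs = [[1.0, 10.0], [2.0, 20.0], [4.0, 40.0]]   # T=3 tokens, 2 channels
mu_r = [1.0, 0.5]    # channel 0 ignores the past, channel 1 averages
mu_k = [0.25, 0.0]   # channel 1 looks only at the previous token
shifted_r = token_shift(xs, mu_r)
shifted_k = token_shift(xs, mu_k)
print(shifted_r)
print(shifted_k)
```

Since the coefficients differ between r, k and v, one option is to write $\tilde{x}^{(r)}_t$, $\tilde{x}^{(k)}_t$, $\tilde{x}^{(v)}_t$ with their own $\mu_r$, $\mu_k$, $\mu_v$ rather than a single shared $\tilde{x}_t$.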

silent urchinBOT Jun 3, 2023, 11:25 PM

#

Stella Biderman (she/her)

serene badge Jun 3, 2023, 11:37 PM

#

steady ether @serene badge You mentioned earlier that you were going to update Figur...

Sure, will update that.

steady ether Jun 3, 2023, 11:51 PM

#

@fickle hare I haven't really looked at the base code since March. would you mind writing a short paragraph on learning rate/hyperparameters/optimizers in section 4.5? Nothing fancy, just how things are set up right now. We will polish it up later.

A few points I remember that could be relevant:

1. There were some issues with model divergence when we upped the context length, right?
2. Something about channels decaying at individual rates based on learned weights and activation during inference
3. There were discussions on LAMB being a possibility, but it probably won't be a game-changer. I can't recall the exact reasons though.

serene badge Jun 4, 2023, 3:11 AM

#

@steady ether I have adjusted the font size and changed it to PDF for Figure 2.

steady ether Jun 4, 2023, 3:18 AM

#

serene badge @steady ether I have adjusted the font size and changed it to PDF for Fi...

Nice! Could we also address Stella's comment on the extra layer norm?

It looks like the diagram has an error: there’s an extra layer norm at the very beginning

serene badge Jun 4, 2023, 3:22 AM

#

Do you mean the extra layer norm after Input embedding in the right figure?

steady ether Jun 4, 2023, 3:29 AM

#

I think it's both that, and also the ones in figure 3. We'll have to address both of these.

serene badge Jun 4, 2023, 3:35 AM

#

OK. I've removed that layer norm in Figure 2. Will revise Figure 3.

serene badge Jun 4, 2023, 4:00 AM

#

I have revised Figure 3 to remove the extra layer norm.

tough crane Jun 4, 2023, 4:51 AM

#

young sparrow What is the point of Figure 1?

To compare parallelized RNNs like RWKV with cell based classical RNNs at the related work section.

fickle hare Jun 4, 2023, 5:19 AM

#

Then could it be moved elsewhere?

fickle hare Jun 4, 2023, 7:15 AM

#

steady ether @fickle hare I haven't really looked at the base code since March. woul...

I'm not really familiar with that either, though I did read the corresponding code. Will need Bo to double check that once I finish the initial draft.

fickle hare Jun 4, 2023, 7:21 AM

#

steady ether @fickle hare I haven't really looked at the base code since March. woul...

addressing your mentioned points:

1. I'm not aware of that, need to ask someone else
2. What do you mean by this? The channels' decaying rates are trained in the WKV operator, it should have been covered in 4.1
3. I don't even know what's LAMB 😭

obsidian quest Jun 4, 2023, 7:28 AM

#

young sparrow It looks like the diagram has an error: there’s an extra layer norm at the very ...

RWKV has an extra layer norm after embedding. it's part of [small init emb] trick

fickle hare Jun 4, 2023, 7:41 AM

#

@obsidian quest would you please provide the hyperparameters for training on pile? I'm adding the learning rate/optimizer paragraph.

obsidian quest Jun 4, 2023, 7:44 AM

#

fickle hare <@870137517020688415> would you please provide the hyperparameters for training ...

adam 0.9 0.99, no weight decay, no dropout, bsz 128

fickle hare Jun 4, 2023, 7:46 AM

#

What about the lr_init, lr_final, my_pile_edecay and warmup_steps? I see these are deciding the LR schedule through rather complicated logic.

tropic minnow Jun 4, 2023, 9:04 AM

#

young sparrow It looks like the diagram has an error: there’s an extra layer norm at the very ...

may i ask where was the extra layernorm in the figure that is not in the code?

#

this applies an extra layernorm at the very beginning: https://github.com/BlinkDL/RWKV-LM/blob/main/RWKV-v4/src/model.py#L307

GitHub

RWKV-LM/model.py at main · BlinkDL/RWKV-LM

RWKV is an RNN with transformer-level LLM performance. It can be directly trained like a GPT (parallelizable). So it's combining the best of RNN and transformer - great performance, fast in...

#

so the first block would have an extra layernorm. correct me if im wrong

Captura_de_Pantalla_2023-06-04_a_las_11.09.19.png

#

if im correct, changes to figures should be rolled back. otherwise, at least fig3 needs a fix here:

Captura_de_Pantalla_2023-06-04_a_las_11.10.14.png

#

the small init embedding (embeddings to 1e-4 and LN afterwards - (then whatever residual blocks) is also described here: https://github.com/BlinkDL/SmallInitEmb#smallinitemb)

GitHub

GitHub - BlinkDL/SmallInitEmb: LayerNorm(SmallInit(Embedding)) in a...

LayerNorm(SmallInit(Embedding)) in a Transformer to improve convergence - GitHub - BlinkDL/SmallInitEmb: LayerNorm(SmallInit(Embedding)) in a Transformer to improve convergence

tropic minnow Jun 4, 2023, 9:31 AM

#

@steady ether @serene badge @young sparrow ^^

fickle hare Jun 4, 2023, 9:34 AM

#

is this really the case? I don't think the current training code implements multi-gpu tensor parallel as Megatron did

tropic minnow Jun 4, 2023, 9:48 AM

#

@last mauve 👀 👀 👀 i think accessing the full history can have many advantages for resolving unexpected changes and tracking progress over time. could we have it? if you dont want to spend money on this, i think @young sparrow had the paid version so we could transfer ownership. also could we add tracking changes to see who's the author of what modifications?

Captura_de_Pantalla_2023-06-04_a_las_11.45.18.png

tropic minnow Jun 4, 2023, 9:51 AM

#

fickle hare is this really the case? I don't think the current training code implements mult...

yea i dont think we use model-parallel

young sparrow Jun 4, 2023, 11:33 AM

#

obsidian quest RWKV has an extra layer norm after embedding. it's part of [small init emb] tri...

Oh wild

young sparrow Jun 4, 2023, 11:33 AM

#

tropic minnow <@367104793292046338> 👀 👀 👀 i think accessing the full history can have many ...

Yeah if you transfer it to me that would work

obsidian quest Jun 4, 2023, 11:49 AM

#

fickle hare What about the `lr_init`, `lr_final`, `my_pile_edecay` and `warmup_steps`? I see...

warmup steps ==> 10 steps, only because i am not saving optimizer state

young sparrow Jun 4, 2023, 11:53 AM

#

obsidian quest warmup steps ==> 10 steps, only because i am not saving optimizer state

Why are you not saving optimizer states

tropic minnow Jun 4, 2023, 11:54 AM

#

young sparrow Oh wild

ok im reverting changes to fig2, 3. @serene badge im using a larger font size for fig2 as u did.

young sparrow Jun 4, 2023, 11:56 AM

#

tough crane To compare parallelized RNNs like RWKV with cell based classical RNNs at the rel...

It’s awkwardly situated then, given that it’s pretty far away in the text from the related work section. I also suspect we’ll need to cut or substantially shrink that section for length considerations in the end.

obsidian quest Jun 4, 2023, 12:49 PM

#

young sparrow Why are you not saving optimizer states

not enough fsx space lol

#

i find it's fine

tough crane Jun 4, 2023, 12:54 PM

#

young sparrow It’s awkwardly situated then, given that it’s pretty far away in the text from t...

fig 1 is also referred to in section 3.1 to compare RWKV with RNNs. Do you want to remove the text comparing with RNNs in related work and section 3.1? Or do you want to remove only fig 1 because of the space it takes up?

serene badge Jun 4, 2023, 1:39 PM

#

tropic minnow ok im reverting changes to fig2, 3. <@1043027351950327808> im using a larger fon...

Thank you for the update. I should check the code before editing. The font size looks better now.

young sparrow Jun 4, 2023, 2:22 PM

#

tropic minnow <@367104793292046338> 👀 👀 👀 i think accessing the full history can have many ...

And yes, if you transfer ownership to me we will get full project history

young sparrow Jun 4, 2023, 2:29 PM

#

tough crane fig 1 is also refered in section 3.1 to compare RWKVs with RNNs. Do you wanna re...

My issues with it are (in decreasing order of importance):

It doesn’t add anything to my comprehension and others have said the same
It’s spatially disconnected from its references, which makes reading the paper harder
I think it makes the aesthetics of the page layout worse.

Separately, I anticipate needing to cut it for space concerns

tough crane Jun 4, 2023, 2:36 PM

#

young sparrow My issues with it are (in decreasing order of importance): 1. It doesn’t add any...

I wonder if you want to remove some paragraphs or sections. I agree with removing fig 1.

young sparrow Jun 4, 2023, 2:39 PM

#

tough crane I wonder if you want to remove some paragraphs or sections. I agree with removin...

I want to move section 2 to the appendix, pretty much. Maybe incorporate some of its contents elsewhere

young sparrow Jun 4, 2023, 2:40 PM

#

obsidian quest i find it's fine

this must be disclosed in the paper

obsidian quest Jun 4, 2023, 2:44 PM

#

young sparrow **this must be disclosed in the paper**

because all runs are killed multiple times due to server issues

tough crane Jun 4, 2023, 2:44 PM

#

young sparrow I want to move section 2 to the appendix, pretty much. Maybe incorporate some of...

Do others say to move section 2 to the appendix?

tropic minnow Jun 4, 2023, 2:54 PM

#

young sparrow And yes, if you transfer ownership to me we will get full project history

yea project is not mine, i think it's @last mauve 's

young sparrow Jun 4, 2023, 3:03 PM

#

obsidian quest because all runs are killed multiple times due to server issues

That’s fine, but this must be disclosed in the paper

#

And makes me worried that there are other things that need to be disclosed in the paper that I haven’t caught yet

tropic minnow Jun 4, 2023, 3:05 PM

#

tough crane Do others say to move section 2 to the appendix?

i don't think we should move the whole of section 2 to appendix. the works described there can be very relevant to readers as they share common objectives with ours. perhaps we could simplify it or move the less relevant part. There's also some work of deduplication to be done, for example this sentence (^attached^) which should go in 3.2 at least (just moved).

Captura_de_Pantalla_2023-06-04_a_las_16.58.19.png

#

i agree w @young sparrow on moving figure 1 out of its current place (and placing it in the appendix or hiding it completely). It's odd that the first figure of a paper introducing a novel architecture adds so little to what this arch really is. Especially when figure 3, for example, would be much more pleasant to the eye and help a lot more to understand what RWKV is.

young sparrow Jun 4, 2023, 3:06 PM

#

tropic minnow i don't think we should move the whole of section 2 to appendix. the works descr...

I think that some of this content should be viewed as essential, but can easily be moved to the introduction or another section. Also, the passage you highlight is already in Section 4

young sparrow Jun 4, 2023, 3:24 PM

#

@obsidian quest So I’m visualizing the data for the scaling laws

#

And I can slice the data by model size

#

But how do I distinguish between runs that ran for different numbers of tokens?

tropic minnow Jun 4, 2023, 3:36 PM

#

young sparrow And makes me worried that there are other things that need to be disclosed in th...

it is likely... lets continue inspecting things carefully and reporting missing information. ideally we would like all experimental results to have reproducibility instructions and potentially open source code.

#

ping @paper dove

steady ether Jun 4, 2023, 3:47 PM

#

tropic minnow it is likely... lets continue inspecting things carefully and reporting missing ...

I think we still need the following:

Mention hardware info for reproducibility
Maybe an Ethics Statement?
- Guidelines (https://2021.emnlp.org/call-for-papers/ethics-faq)
- Mention misuse potential
- I think we fall under "experiments that involve lots of compute time/power"
- An example from another EMNLP paper: https://aclanthology.org/2022.emnlp-main.42.pdf

Ethics FAQ | EMNLP 2021

#

Screenshot_2023-06-04_at_11.51.40_AM.png

tropic minnow Jun 4, 2023, 3:52 PM

#

steady ether I think we still need the following: * Mention hardware info for reproducibility...

thx. yes will be working on a draft for that later today

last mauve Jun 4, 2023, 4:23 PM

#

tropic minnow yea project is not mine, i think it's <@367104793292046338> 's

Sorry for the radio silence. Had another paper deadline.

#

@young sparrow -- Sent you an overleaf invite. Once you accept I can promote you to owner

obsidian quest Jun 4, 2023, 4:29 PM

#

young sparrow But how do I distinguish between runs that ran for different numbers of tokens?

Use n_embd // n_layer // my_exit_tokens to identify the run & combine fragments

young sparrow Jun 4, 2023, 4:34 PM

#

obsidian quest Use n_embd // n_layer // my_exit_tokens to identify the run & combine fragments

What is my_exit_tokens? Is that the target train length?

young sparrow Jun 4, 2023, 4:37 PM

#

last mauve <@193204646687408129> -- Sent you an overleaf invite. Once you accept I can prom...

Done

last mauve Jun 4, 2023, 4:38 PM

#

young sparrow Done

Transferred

obsidian quest Jun 4, 2023, 5:39 PM

#

young sparrow What is my_exit_tokens? Is that the target train length?

target train length. you can filter by it in wandb
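
A minimal sketch of that grouping, assuming each logged fragment is a dict carrying `n_embd`, `n_layer`, `my_exit_tokens` and a list of `(tokens_seen, loss)` points. The fragment layout and the helper name `combine_fragments` are hypothetical and are not the wandb export format.

```python
from collections import defaultdict

def combine_fragments(fragments):
    """Group restarted-run fragments by (n_embd, n_layer, my_exit_tokens)."""
    runs = defaultdict(list)
    for frag in fragments:
        key = (frag["n_embd"], frag["n_layer"], frag["my_exit_tokens"])
        runs[key].extend(frag["points"])
    # Order each combined run by tokens seen so the fragments line up.
    return {key: sorted(points) for key, points in runs.items()}

# Made-up fragments: two pieces of one run plus a run with a longer target.
fragments = [
    {"n_embd": 768, "n_layer": 12, "my_exit_tokens": 3e9, "points": [(2e9, 3.1), (3e9, 3.0)]},
    {"n_embd": 768, "n_layer": 12, "my_exit_tokens": 3e9, "points": [(1e9, 3.4)]},
    {"n_embd": 768, "n_layer": 12, "my_exit_tokens": 6e9, "points": [(1e9, 3.5)]},
]
runs = combine_fragments(fragments)
```

Two runs with the same width and depth but different target lengths stay separate because `my_exit_tokens` is part of the key.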

young sparrow Jun 4, 2023, 5:47 PM

#

Perfect

steady ether Jun 4, 2023, 7:17 PM

#

tropic minnow yea i dont think we use model-parallel

Removed

#

~~@fickle hare Is this accurate? Maybe worth using more precise language and also a mention in your paragraph.~~

~~I thought we initialized most of the matrices to zero (at least in the March version)~~

#

I guess zero is a small value, huh? 😄

#

Nevermind, I was looking at the wrong part of the code

spiral minnow Jun 4, 2023, 8:42 PM

#

Question about equation 11: If we're summing from i=1 to t-1, should the integer in the parenthesis be (t-i)?
If we sum from i=1 to t-1 and use (t-1-i), then the final element of the sum will be (t-1-(t-1))=0, is that on purpose? My understanding is that the final element should attend to the previous token, so it should be (t-(t-1)) = 1

broken moth Jun 4, 2023, 8:49 PM

#

this part (Appendix C) should probably be corrected, I left a comment

tropic minnow Jun 4, 2023, 8:51 PM

#

spiral minnow Question about equation 11: If we're summing from i=1 to t-1, should the integer...

the immediate previous token goes through the u weight, not through the w weight

young sparrow Jun 4, 2023, 8:55 PM

#

tropic minnow the immediate previous token goes through the `u` weight, not through the `w` we...

Is there a way to make that go away notationally? Like, if we set u = w_t does that cause problems elsewhere?

#

The separate weight for the current token throws me every time I look at the equation

tropic minnow Jun 4, 2023, 8:59 PM

#

young sparrow The separate weight for the current token throws me every time I look at the equ...

but it is actually what happens in the code. theres a time-associated parameter for all positions except for the immediate previous one (see: https://github.com/BlinkDL/RWKV-LM/blob/main/RWKV-v4/src/model.py#L186). we're just describing there. perhaps we could mention it in the line below that w_t gets its own set of parameters?

Captura_de_Pantalla_2023-06-04_a_las_22.58.25.png

GitHub

RWKV-LM/model.py at main · BlinkDL/RWKV-LM

RWKV is an RNN with transformer-level LLM performance. It can be directly trained like a GPT (parallelizable). So it's combining the best of RNN and transformer - great performance, fast in...

young sparrow Jun 4, 2023, 9:00 PM

#

tropic minnow but it is actually what happens in the code. theres a time-associated parameter ...

Yeah I mean… maybe the current one can be set to 1?

tropic minnow Jun 4, 2023, 9:01 PM

#

young sparrow Yeah I mean… maybe the current one can be set to 1?

hmmm i dont think so, at least not without affecting performance. the U is there intentionally bc the time-association of W could be too strong of an inductive bias. i think it could be great if we did a small experiment like the SmallInitEmbedding with this difference and see.

#

thats why i keep asking for this #1103039376184852622 message @paper dove

spiral minnow Jun 4, 2023, 9:08 PM

#

tropic minnow the immediate previous token goes through the `u` weight, not through the `w` we...

Somehow that doesn't make sense to me. If the immediately previous token goes through u, then wouldn't that be represented by t-1? But the equation shows that u is added to the current key and multiplied by the current value, not the previous timestep

young sparrow Jun 4, 2023, 9:10 PM

#

Yeah the current token goes through u I thought

spiral minnow Jun 4, 2023, 9:10 PM

#

Yeah, that's what we're saying in the text as well, "U attends to the current token" (paraphrased)

tropic minnow Jun 4, 2023, 9:16 PM

#

spiral minnow Question about equation 11: If we're summing from i=1 to t-1, should the integer...

hmm okay so you're asking the immediate previous instead of the current, sorry. so yes i think it's on purpose? it is described as well from here: https://johanwind.github.io/2023/03/23/rwkv_details.html

Captura_de_Pantalla_2023-06-04_a_las_23.15.36.png

The Good Minima

How the RWKV language model works

I go through and explain a minimal implementation of RWKV in detail.

spiral minnow Jun 4, 2023, 9:16 PM

#

Okay, I get it now. Seems a bit complicated, but maybe that's just how it needs to be for the model to work. Maybe there's a nicer way to write it though, I'll think about it

spiral minnow Jun 4, 2023, 9:33 PM

#

Wow, the equations are really throwing me off. The current key is weighted by U, the previous key is unweighted, and the key from 2 timesteps in the past is weighted by W. Is that correct? I guess it makes sense but just seems very unusual

#

And by weighted, I really just mean that it gets a bias added to it so that it's actually scaling the value
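
The reading above can be checked with a small sketch. It assumes a single channel, a scalar decay `w >= 0` and bonus `u`, and the form of equation 11 as quoted in this thread (sum from i=1 to t-1 with exponent -(t-1-i)w, plus a separate `u` term for position t). The helper name is hypothetical.

```python
def wkv_exponent_biases(t, w, u):
    """Bias added to key k_i (i = 1..t) inside the exp, for output position t."""
    biases = {}
    for i in range(1, t):
        biases[i] = -(t - 1 - i) * w   # past tokens: decay grows with distance
    biases[t] = u                      # current token: separate bonus u
    return biases

biases = wkv_exponent_biases(3, w=0.5, u=0.3)
```

For t = 3 the token two steps back gets bias -w, the immediately previous token gets bias 0, and the current token gets bias u, which matches the description in the message above.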

fervent onyx Jun 4, 2023, 9:35 PM

#

Yes that's correct, it's equivalent to if you also reweight the previous token - you just multiply numerator and denominator by exp(w) and rewrite u as log(exp(u+w)-1)
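
Spelled out as a sketch, with the sign convention as an assumption: the equation under discussion writes the decay as $e^{-(t-1-i)w}$ with $w \ge 0$, whereas the message above appears to follow the code, where the stored decay is already negative. Under that reading its `exp(w)` corresponds to $e^{-w}$ below and its `u+w` to $u - w$.

```latex
\begin{align*}
wkv_t &= \frac{\sum_{i=1}^{t-1} e^{-(t-1-i)w + k_i} v_i + e^{u + k_t} v_t}
              {\sum_{i=1}^{t-1} e^{-(t-1-i)w + k_i} + e^{u + k_t}} \\
      &= \frac{\sum_{i=1}^{t-1} e^{-(t-i)w + k_i} v_i + e^{(u - w) + k_t} v_t}
              {\sum_{i=1}^{t-1} e^{-(t-i)w + k_i} + e^{(u - w) + k_t}}
      && \text{(multiply top and bottom by } e^{-w}\text{)} \\
      &= \frac{\sum_{i=1}^{t} e^{-(t-i)w + k_i} v_i + e^{u' + k_t} v_t}
              {\sum_{i=1}^{t} e^{-(t-i)w + k_i} + e^{u' + k_t}},
      \qquad u' = \log\left(e^{u - w} - 1\right).
\end{align*}
```

The last step folds the current token into the sum with weight $e^{0} = 1$ and keeps the remainder as a new bonus $u'$, which is only real-valued when $u > w$.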

tough crane Jun 5, 2023, 3:39 AM

#

tropic minnow i don't think we should move the whole of section 2 to appendix. the works descr...

Removed Fig 1. and links to this fig. Section 2 still remains.

steady ether Jun 5, 2023, 4:49 AM

#

Just made some grammar/spelling fixes. However, the Future Work/Conclusions section might need a rewrite. Also spotted that we're using abbreviations like 'LLM' without defining them upfront.

Screenshot_2023-06-05_at_12.20.59_AM.png

Screenshot_2023-06-05_at_12.24.13_AM.png

uneven blade Jun 5, 2023, 7:44 AM

#

@tropic minnow @fickle hare After reading section 4.1, I feel like it might be better to have the order Token Shift -> WKV -> Time/Channel Mixing and Output Gating, because the r, k and v vectors used in WKV and others are defined in Token Shift and I feel lost when first seeing the WKV using these. Also, this is the logical order of how things are computed...

fickle hare Jun 5, 2023, 7:45 AM

#

uneven blade <@469771066399784971><@271623916215074816> After reading section 4.1, I feel lik...

@tropic minnow What's ur opinion?

fickle hare Jun 5, 2023, 8:21 AM

#

obsidian quest warmup steps ==> 10 steps, only because i am not saving optimizer state

it's still unclear to me whether the linear schedule or the exponential schedule is used. I see it's decided by whether there's a zero in lr_init/lr_final?

#

also my_pile_edecay decides when to start decaying
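
A rough sketch of the logic as described in this exchange, not the actual trainer code: linear interpolation when either endpoint is zero, exponential (log-space) interpolation otherwise, with decay starting at epoch `my_pile_edecay`. Warmup is omitted, `steps_per_epoch` and `epoch_count` are hypothetical names, and treating `my_pile_edecay` as an epoch index is an assumption.

```python
import math

def lr_at(progress, lr_init, lr_final):
    """Learning rate at a decay progress in [0, 1]."""
    progress = min(1.0, max(0.0, progress))
    if lr_init == 0 or lr_final == 0:
        # A zero endpoint has no logarithm, so interpolate linearly.
        return lr_init + (lr_final - lr_init) * progress
    # Otherwise interpolate in log space (exponential decay).
    return lr_init * math.exp(math.log(lr_final / lr_init) * progress)

def decay_progress(step, steps_per_epoch, epoch_count, my_pile_edecay):
    """Fraction of the decay phase completed; negative before decay starts."""
    start = my_pile_edecay * steps_per_epoch
    total = (epoch_count - my_pile_edecay) * steps_per_epoch
    return (step - start) / total if total > 0 else 0.0
```

Because `lr_at` clamps the progress, steps before the decay start simply return `lr_init`.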

tropic minnow Jun 5, 2023, 9:00 AM

#

fickle hare <@469771066399784971> What's ur opinion?

hmm my take was that token shift is a tiny conv we add to increase performance, whereas main RNN-like properties come from (R)WKV, which is the "attention replacement" we implement and what people might be looking for when they read "a replacement to transformers". However @uneven blade 's point about (time-mixing) token-shift preceding the WKV computation is fair. I think we can go that way if others prefer it too, as long as we're systematic in the description of components it should not matter much

fickle hare Jun 5, 2023, 9:11 AM

#

Personally I think as long as we highlight the WKV as a replacement for self-attention in the overview before we start diving into details, it will be fine

obsidian quest Jun 5, 2023, 9:21 AM

#

fickle hare still unclear to me if the linear schedule or exponential schedule is used, I se...

see LR history for everything: #1083107245971226685 message

fickle hare Jun 5, 2023, 11:45 AM

#

tropic minnow hmm my take was that token shift is a tiny conv we add to increase performance, ...

Would you think describing the token shift as a 1:3 depthwise convolution with kernel size 2 is a good idea?

fickle hare Jun 5, 2023, 1:08 PM

#

added a paragraph at the end of 4.5 describing details about loss, learning rate, and optimizer.

#

need to summarize the hyperparameters later in the appendix

tropic minnow Jun 5, 2023, 1:30 PM

#

fickle hare Would you think describing the token shift as a 1:3 depthwise convolution with k...

hmmm i wouldnt try to push kernel fusion (1:3 for time-mixing and 1:2 for channel mixing) thoughts into the RWKV announcement paper since it is not even implemented like that in the code. Maybe a note saying "intuitively, this can be seen as a small convolution" of kernel size 2 or something. but i think there's already smth like that here:

Captura_de_Pantalla_2023-06-05_a_las_15.29.35.png

young sparrow Jun 5, 2023, 10:58 PM

#

The paper currently says

It is noteworthy that FLOPs are independent of the context length, unlike regular transformers.
This is false though? Transformer FLOPs is given by 6PD, no term for the context length.

#

Actually all of Appendix B makes little sense. The equations are self-contradictory, we present what I think are supposed to be three different approximations, and we omit the number of data points entirely.

#

If it's the case that RWKV FLOPs are well approximated by 6PD (just like a transformer) we should derive that and just stop.

#

The text I'm primarily referring to is:

The number of parameters for each model is computed using the formula: $#parameters = 2VD + 13D^2L + D(11L+4)$ where $V$ = 50277 is the vocabulary size, $D$ represents the Model Dimension and $L$ corresponds to the number of layers.

FLOPs is for a forward pass for one token. It was calculated as $6(VD + 13D^2L)$, which is the twice (add and multiply) the number of parameters in linear layers. The backwards pass FLOPs can be approximated as twice that of the forward pass. So the total is $6(VD + 13D^2L)$ per token for training (3x fw FLOPs). It is noteworthy that FLOPs are independent of the context length, unlike regular transformers. The FLOP approximations in this paper are in line with the methodology used by Kaplan et al. (2020).

silent urchinBOT Jun 5, 2023, 11:06 PM

#

Stella Biderman (she/her)

fickle hare Jun 6, 2023, 12:35 AM

#

I think it's pointing the second term

#

Okay i think it's correct. Counting Transformer FLOPs per token involves computing self-attention against the history KVs, whose cost is linear in the history size

#

Why are you only counting the 6PD from the head?

#

(though I think the first 6 should be 2)

young sparrow Jun 6, 2023, 12:52 AM

#

fickle hare

Read the next paragraph

#

I believe that 6PD is a good approximation for total training FLOP for both models

fickle hare Jun 6, 2023, 12:56 AM

#

fine. with not really long context attention flops are negligible

fickle hare Jun 6, 2023, 12:58 AM

#

young sparrow I believe that 6PD is a good approximation for total training FLOP for both mode...

It's still not the case. 12LD^2 is much larger than PD.

young sparrow Jun 6, 2023, 12:58 AM

#

Sorry my D is “dataset size”

#

Not “hidden dimension size”

#

So in per-token units this would simply be 6P
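
For reference, a small sketch of the two quantities being discussed: the parameter count from the quoted appendix formula and the 6 x parameters x tokens training-compute approximation. The function names are hypothetical, and the 12-layer, 768-wide configuration is only an illustrative input.

```python
V = 50277  # vocabulary size quoted in the appendix text

def params(n_layer, d_model, vocab=V):
    """Parameter count from the quoted formula: 2VD + 13 D^2 L + D(11L + 4)."""
    return 2 * vocab * d_model + 13 * d_model**2 * n_layer + d_model * (11 * n_layer + 4)

def train_flops(n_layer, d_model, tokens):
    """6 * P * tokens approximation; per token this is simply 6 * P."""
    return 6 * params(n_layer, d_model) * tokens

print(params(12, 768))  # 169342464
```

Note that `d_model` here is the hidden size, while the D in "6PD" is the dataset size in tokens, which is the ambiguity that caused the confusion above.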

fickle hare Jun 6, 2023, 12:59 AM

#

oh i see

#

Parameters

#

got it wrong

young sparrow Jun 6, 2023, 12:59 AM

#

Which is what the text (but not equations) of the passage I quoted says

fickle hare Jun 6, 2023, 1:03 AM

#

then the problem is whether to mention the square yet smaller term in transformers flops

#

for transformer it's 'approximate' since it throws the context-growing term away, but for us it's accurately 6P per token

young sparrow Jun 6, 2023, 1:05 AM

#

What about the D(11L+4) term? It goes away, and I assumed that’s because of the same kind of reasoning

fickle hare Jun 6, 2023, 1:06 AM

#

it's the token shift and wkv parameters I think

#

okay it's not calculating the elementwise muls and adds now...

#

but they are all constant for each token

#

it's missing and I'll do some calculation for wkv and add that

young sparrow Jun 6, 2023, 1:14 AM

#

I really don’t think having it exactly matters

fickle hare Jun 6, 2023, 1:19 AM

#

yeah it's negligible compared with the linear layers

#

It's just... the omitted term for us is constant while for transformer is linear to context length

young sparrow Jun 6, 2023, 1:37 AM

#

That’s not a real difference

#

It doesn’t make us look better to point it out, it makes it look like we don’t know what matters.

fickle hare Jun 6, 2023, 1:37 AM

#

I agree

paper dove Jun 6, 2023, 3:00 AM

#

I have seen some people questioning the initialization settings in RWKV. “Initialization of parameters in the popular RWKV model is done by setting all parameter matrices to zero. It is claimed that this approach avoids the noise introduced during the initial learning phase. However, this practice is highly unreasonable. Initializing parameters to zero can lead to issues such as symmetry problems, vanishing gradients, lack of diversity, and slow convergence speed. In small models, zero initialization is rarely used. Instead, methods like Glorot initialization and Kaiming initialization are commonly employed.”

young sparrow Jun 6, 2023, 3:20 AM

#

paper dove I have seen some people questioning the initialization settings in RWKV. “Initia...

I think that the first thing to do is examine Blink’s code and see if it actually initializes everything to 0.

steady ether Jun 6, 2023, 3:30 AM

#

Screenshot_2023-06-05_at_11.29.31_PM.png

#

There's also a section under How it works mentioning this.

Screenshot_2023-06-05_at_11.30.58_PM.png

paper dove Jun 6, 2023, 3:41 AM

#

It seems that this approach is counterintuitive for many people, and perhaps it requires more explanation or persuasion. @obsidian quest

steady ether Jun 6, 2023, 3:45 AM

#

I vaguely remember this discussion from a past conversation. I believe the key point was that because sigmoid(0) equals 0.5, the weights are able to be updated

#

But yes, more clarification on this point would certainly be good.

paper dove Jun 6, 2023, 3:58 AM

#

steady ether I vaguely remember this discussion from a past conversation. I believe the key p...

if it is due to sigmoid, maybe this initialization is not general, it highly depends on RWKV design

fickle hare Jun 6, 2023, 5:15 AM

#

maybe also ablation study? initial iterations on small models would be sufficient

steady ether Jun 6, 2023, 6:45 AM

#

@tropic minnow I've added the ethics statement that I mentioned earlier. Feel free to review and tweak as needed.

obsidian quest Jun 6, 2023, 9:46 AM

#

paper dove It seems that this approach is counterintuitive for many people, and perhaps it ...

I initialize some matrices to zero (not all of them)
only these are initialized to zero: att.key att.receptance att.output ffn.value ffn.receptance
For each timemix/channelmix block we just need randomness in one matrix
namely: att.value ffn.key and this is enough to provide gradients

Note e^0 = 1, sigmoid(0) = 0.5 so the design is related to RWKV

obsidian quest Jun 6, 2023, 9:47 AM

#

paper dove I have seen some people questioning the initialization settings in RWKV. “Initia...

he thought i initialize everything to zero

tropic minnow Jun 6, 2023, 11:49 AM

#

paper dove I have seen some people questioning the initialization settings in RWKV. “Initia...

Initializing residual tracks to 0 makes sense so model starts at identity. Only weights that need to be different than 0 due to symmetry otherwise should be nonzero. Starting at identity (see clean path reference in the text) makes learning faster and better.
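
A toy illustration of the "starts at identity" point, assuming a simplified residual branch with one random input matrix and a zero-initialized output matrix. It is a stand-in for the idea only: the real blocks also have receptance gating and token shift, which are left out here.

```python
def matvec(W, x):
    """Plain matrix-vector product on nested lists."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def residual_block(x, W_in, W_out):
    """x + W_out @ relu(W_in @ x)^2, a toy stand-in for a residual branch."""
    hidden = [max(0.0, h) ** 2 for h in matvec(W_in, x)]
    return [x_i + b_i for x_i, b_i in zip(x, matvec(W_out, hidden))]

x = [0.5, -1.0]
W_in = [[0.3, -0.2], [0.1, 0.4]]   # non-zero: breaks symmetry, carries signal
W_out = [[0.0, 0.0], [0.0, 0.0]]   # zero-initialized output projection
y = residual_block(x, W_in, W_out)
```

With `W_out` at zero the block returns its input unchanged, while the hidden activations are still non-zero, so the gradient with respect to `W_out` does not vanish.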

young sparrow Jun 6, 2023, 4:50 PM

#

I cannot find the bug in my scaling laws code

#

I run this

l = defaultdict(list)
for d in df.keys():
    x = d.split(" ")
    loss = float(df[d].sort_values('Gtokens').tail(1)['loss'])
    layer = int(x[0][1:])
    dim = int(x[1][1:])
    print(layer, dim)
    print(params(layer, dim))
    print("---")
    tok = float(x[2])
    l['L'].append(layer)
    l['D'].append(dim)
    l['T'].append(tok)
    l['loss'].append(loss)
    l['params'] = params(layer, dim)
    l['compute'] = 6 * params(layer, dim) * tok
df = pd.DataFrame(l)

which prints out the expected thing:

#

The very next cell does this though

Screen_Shot_2023-06-06_at_12.51.42_PM.png

fickle hare Jun 6, 2023, 4:53 PM

#

params and compute columns wrong?

#

is params a pure function?

#

ah i see

#

instead of

    l['params'] = params(layer, dim)
    l['compute'] = 6 * params(layer, dim) * tok

do

    l['params'].append(params(layer, dim))
    l['compute'].append(6 * params(layer, dim) * tok)
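
One way to sidestep this class of bug is to build one record per run instead of parallel column lists. The sketch below assumes run names of the form "L12 D768 2.0" (inferred from the parsing code above), takes the final loss per run as a plain dict, and receives the `params` function as an argument. The helper name and the toy inputs are hypothetical.

```python
def build_rows(final_loss, params):
    """One dict per run, so every column is filled per run by construction."""
    rows = []
    for name, loss in final_loss.items():
        layer_s, dim_s, tok_s = name.split(" ")
        layer, dim, tok = int(layer_s[1:]), int(dim_s[1:]), float(tok_s)
        p = params(layer, dim)
        rows.append({"L": layer, "D": dim, "T": tok, "loss": loss,
                     "params": p, "compute": 6 * p * tok})
    return rows

# Toy stand-ins: made-up losses and a placeholder parameter formula.
rows = build_rows({"L6 D512 1.0": 3.2, "L12 D768 2.0": 2.9},
                  params=lambda layer, dim: 12 * layer * dim * dim)
```

Passing `rows` to `pd.DataFrame` then gives one row per run. In the original loop the plain assignment replaced the list with a single number, so the last run's value ended up in every row.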

young sparrow Jun 6, 2023, 5:04 PM

#

oooo

#

Thank you

#

Eyyyy look at that beautiful straight line

Screen_Shot_2023-06-06_at_1.08.30_PM.png

#

(minus the one point which I think is an overflow error)

tropic minnow Jun 6, 2023, 5:26 PM

#

young sparrow Eyyyy look at that beautiful straight line

how does that compare to pythia?

young sparrow Jun 6, 2023, 5:28 PM

#

tropic minnow how does that compare to pythia?

Running the math now

#

Or, "I will run the math after my 1:30 meeting" since I just noticed the time

tropic minnow Jun 6, 2023, 5:34 PM

#

@obsidian quest for the Ethics statement, would be good to know exactly which data has been used to train Raven-14B beyond The Pile

#

current statement describes:

Open Source Data (the pile), publicly available data (raven?)
Open source training codebase and lower inference cost (democratization)
Efficiency in training (effort to lower cost, "sustainable")
Various sizes released (accessible deployment, study of emergent phenomena)
Easier to generate AI text (lower cost Chat assistant, fake news, misinformation)
Potential replication of biases/harmful content in data (but transformer mitigation strategies should work here as well)

obsidian quest Jun 6, 2023, 6:17 PM

#

young sparrow Eyyyy look at that beautiful straight line

can plot all runs (not just the final losses) in the graph

#

do we have any missing runs

young sparrow Jun 6, 2023, 6:55 PM

#

obsidian quest can plot all runs (not just the final losses) in the graph

Unfortunately you can't... Not if you want to get the correct results.

#

For example, the equation I'm getting is quite different from the ones the original experiments had. This is the original experiment

Screen_Shot_2023-06-06_at_3.03.40_PM.png

#

Hmmm I think my code might have a bug.

steady ether Jun 6, 2023, 7:11 PM

#

tropic minnow <@870137517020688415> for the Ethics statement, would be good to know exactly wh...

Agreed. This could significantly enhance the open-source/reproducibility aspects of our project.

I believe we used these resources, but it would be great if @obsidian quest could confirm:

https://github.com/tatsu-lab/stanford_alpaca
https://huggingface.co/datasets/sahil2801/CodeAlpaca-20k
https://huggingface.co/datasets/JosephusCheung/GuanacoDataset
https://huggingface.co/datasets/nomic-ai/gpt4all_prompt_generations
https://github.com/lm-sys/FastChat/issues/90#issuecomment-1493250773

#

Also, knowing the exact split/iterations would be helpful

young sparrow Jun 6, 2023, 7:46 PM

#

obsidian quest do we have any missing runs

I think we just need more density of runs with slightly different values

#

Here's the data sorted by amount of compute used, and there are clearly runs that are more optimal (17 and 22 are particularly good for example) but there isn't the necessary data density to really get the tradeoffs optimized

Screen_Shot_2023-06-06_at_3.46.47_PM.png

#

This becomes especially obvious when you look at x-axis that aren't "compute"

Screen_Shot_2023-06-06_at_3.50.26_PM.png

Screen_Shot_2023-06-06_at_3.50.55_PM.png

#

I was looking through the Chinchilla paper and found this, which shows all the configs they trained for their paper

Screen_Shot_2023-06-06_at_3.54.38_PM.png

#

What they did was set a total FLOP target and train each model for the number of tokens necessary to reach each target, with 9 targets per model.
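
As a sketch of that procedure, assuming the 6 x parameters x tokens approximation used earlier in this thread: for each model size and each FLOP target, the token count follows directly. The budgets and model sizes below are placeholders, not proposed configurations.

```python
def tokens_for_target(flop_target, n_params):
    """Training tokens needed to hit a FLOP budget under C ~= 6 * P * T."""
    return flop_target / (6 * n_params)

def isoflop_grid(flop_targets, param_counts):
    """(params, target) -> tokens, for every model size at every budget."""
    return {(p, c): tokens_for_target(c, p) for p in param_counts for c in flop_targets}

# Placeholder budgets and sizes, only to show the shape of the grid.
grid = isoflop_grid(flop_targets=[1e18, 1e19], param_counts=[1e8, 4e8])
```

Each model size then gets one run per target, which is what produces the dense grid of points in the screenshot.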

#

By contrast we have 7 different models currently

#

So if we can generate more data that would be A+. Just... more models, more # of tokens

#

There is a lower edge to the compute-loss tradeoff currently that's approximately linear. I'm going to try to extract that now

Screen_Shot_2023-06-06_at_4.02.01_PM.png
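
One possible way to extract such a lower edge, given as a sketch and not necessarily how the line in the screenshots was obtained: keep only runs that beat every cheaper run, then fit a least-squares line in log-log space. Both helper names are hypothetical.

```python
import math

def lower_edge(points):
    """Keep runs whose loss beats every run that used less compute."""
    edge, best = [], float("inf")
    for compute, loss in sorted(points):
        if loss < best:
            edge.append((compute, loss))
            best = loss
    return edge

def fit_loglog(points):
    """Least-squares line log(loss) = slope * log(compute) + intercept."""
    xs = [math.log(c) for c, _ in points]
    ys = [math.log(l) for _, l in points]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx
```

Calling `fit_loglog(lower_edge(points))` on `(compute, loss)` pairs gives a slope and intercept for the frontier only, so suboptimal runs do not pull the line upward.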

young sparrow Jun 6, 2023, 8:50 PM

#

It feels like this is the optimal line with the data we currently have

Screen_Shot_2023-06-06_at_4.49.17_PM.png

#

Slope: -0.09467861
Intercept: 1.80843822

young sparrow Jun 6, 2023, 9:21 PM

#

(or in log_10, that's -0.04111839787, 0.7853947398)
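
The two log_10 figures are the natural-log values divided by ln 10, which can be checked directly. That this is the intended conversion (rescaling the fitted quantity from ln to log_10 with the x-axis left as is) is an assumption based on the numbers alone.

```python
import math

slope_ln, intercept_ln = -0.09467861, 1.80843822  # values quoted above

# Dividing by ln(10) rescales a natural-log quantity to log base 10.
slope_log10 = slope_ln / math.log(10)
intercept_log10 = intercept_ln / math.log(10)
```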

#

@obsidian quest to illustrate why this matters, the original value was -0.053. -0.041 vs -0.053 is a huge change

obsidian quest Jun 6, 2023, 9:34 PM

#

why are these two charts different #1103039376184852622 message #1103039376184852622 message

#

we still need to check if we can actually find a 10^5 compute datapoint on your line lol

bronze frost Jun 6, 2023, 9:39 PM

#

young sparrow <@870137517020688415> to illustrate why this matters, the original value was -0....

I was about to comment that you are using flops = parameters * 6 while the previous plot used non-embedding parameters * 6 (like the scaling laws paper, since embedding is no flops). However, I just reran the old plot with flops = parameters * 6 and still get -0.053, so you are right that -0.041 is a huge change.

bronze frost Jun 6, 2023, 9:45 PM

#

young sparrow Actually all of Appendix B makes little sense. The equations are self-contradict...

Also, while I'm here: I wrote that section in a very early draft (I think it was among the first additions after the tex file was created by someone else) as a kind of internal data table for making plots like the scaling laws plots (with the intent that we agree on one of the approximations for the flops, etc.) But it kinda just stayed there I guess. Feel free to remove it / scavenge it for scraps for other sections.

young sparrow Jun 6, 2023, 10:25 PM

#

bronze frost Also, while I'm here: I wrote that section in a very early draft (I think it was...

I’ll probably scrap the appendix section and incorporate parts of it into the scaling laws section.

young sparrow Jun 6, 2023, 10:26 PM

#

obsidian quest why are these two charts different https://discord.com/channels/7297417691927675...

This is due to the aforementioned bug in my code. Happy to upload the notebook for people to inspect but I think it’s currently correct.

young sparrow Jun 6, 2023, 10:27 PM

#

obsidian quest we still need to check if we can actually find a 10^5 compute datapoint on your ...

More model / data combos would help with this substantially. Right now I am struggling to tell you how much data and params to use.

young sparrow Jun 6, 2023, 10:28 PM

#

bronze frost I was about to comment that you are using flops = parameters * 6 while the previ...

Have you posted this code? It’s probably worth looking at to make sure we’re doing things the same way. Also I want to steal some of your visual formatting 🙂

bronze frost Jun 6, 2023, 10:32 PM

#

I posted this code, and then @rich raptor made it pretty

#

I helped him find a bug in his code, so I have an old version. Maybe he has a newer one

📎 hack_rwkv_scaling_plots.py

fickle hare Jun 7, 2023, 6:43 AM

#

young sparrow The text I'm primarily referring to is: The number of parameters for each model...

In such cases, what do you think about the current figure 1? It's basically talking about the same thing but in terms of "time complexity".

#

Some comments on 4.4:

Shall we merge "Custom kernels" to "4.1.2 WKV Operator"?
Shall we remove/merge "FFN with R gate" since it's now in "4.1.3 Output Gating"?
I'm curious whether using the abbreviation "init" in "small init embedding" instead of spelling it out is intentional.
It seems both "Small init embedding" and "Custom initialization" are talking about parameter initialization, except that smallinit requires some architectural design to cooperate with it. If the former two paragraphs are merged to somewhere else, shall we turn the whole section into sth like "Model Initialization"?

ripe tangle Jun 7, 2023, 6:55 AM

#

Hey is this paper still taking helpers?

steady ether Jun 7, 2023, 7:08 AM

#

fickle hare Some comments on 4.4: 1. Shall we merge "Custom kernels" to "4.1.2 WKV Operator"...

No strong feelings, but it might hurt readability since they cover distinct aspects.
Seems reasonable as it improves clarity

tropic minnow Jun 7, 2023, 8:55 AM

#

fickle hare Some comments on 4.4: 1. Shall we merge "Custom kernels" to "4.1.2 WKV Operator"...

Kind of a branding name that has made its way. could rename it for the section title but i'd like to refer to it as SmallInitEmb or SmallInitEmbed throughout the paper for historical reasons (https://github.com/BlinkDL/SmallInitEmb) and bc it's a shorter name.

GitHub

GitHub - BlinkDL/SmallInitEmb: LayerNorm(SmallInit(Embedding)) in a...

LayerNorm(SmallInit(Embedding)) in a Transformer to improve convergence - GitHub - BlinkDL/SmallInitEmb: LayerNorm(SmallInit(Embedding)) in a Transformer to improve convergence

#

yes, but in a sense they're quite orthogonal, as smallInitEmb could be applicable to every transformer and was specifically tested (see experiment) whereas the rest of the layers are more specific to RWKV and we don't have as hard a justification for them, simply trial and error during RWKV evolution

obsidian quest Jun 7, 2023, 11:36 AM

#

young sparrow More model / data combos would help with this substantially. Right now I am stru...

could you generate a large graph with N_LAYER-N_EMB-N_TOKENS info for each datapoint 🙂

young sparrow Jun 7, 2023, 11:40 AM

#

@obsidian quest These are the "good" points

Screen_Shot_2023-06-07_at_7.40.30_AM.png

#

Is that what you want?

obsidian quest Jun 7, 2023, 11:43 AM

#

young sparrow It feels like this is the optimal line with the data we currently have

which one has loss ~2.23 (the lowest loss in the graph)

young sparrow Jun 7, 2023, 11:44 AM

#

Lowest loss values

Screen_Shot_2023-06-07_at_7.44.41_AM.png

obsidian quest Jun 7, 2023, 11:45 AM

#

will be great if we can mark L-D-T for each datapoint

young sparrow Jun 7, 2023, 11:45 AM

#

On the plot? That'd be very hard to read

#

I can send you this as a CSV

obsidian quest Jun 7, 2023, 11:45 AM

#

make a very large graph 🙂

#

seems we need T64 experiments

young sparrow Jun 7, 2023, 11:48 AM

#

Here's the CSV with all the points, the color is red if it's on the bottom line I identified

📎 scaling_rwkv.csv

#

Currently sorted lowest to highest loss

#

Tokens are in billions, params in millions

#

compute in units of 10^15 FLOP

#

Oh, here's the points colored red on the scatterplot

#

Oh that's without a log on the y-axis

#

but w/e

#

Gets the point across

#

Screen_Shot_2023-06-07_at_7.55.34_AM.png

#

(note the slope and intercept numbers are different now because these are in log base e while before I was converting to log base 10 since that's what the original work was in.)

obsidian quest Jun 7, 2023, 3:27 PM

#

young sparrow Lowest loss values

loss of 24-1024-16 should be around 2.46 - different from the number in your table

young sparrow Jun 7, 2023, 3:28 PM

#

obsidian quest loss of 24-1024-16 should be around 2.46 - different from the number in your tab...

Cool! let's do it

obsidian quest Jun 7, 2023, 3:30 PM

#

need to smoothen the loss curve before using it @young sparrow

young sparrow Jun 7, 2023, 3:32 PM

#

obsidian quest need to smoothen the loss curve before using it @young sparrow

What do you mean by that?

obsidian quest Jun 7, 2023, 3:34 PM

#

when you download the raw loss curve from wandb, it will be extremely noisy

young sparrow Jun 7, 2023, 3:34 PM

#

obsidian quest loss of 24-1024-16 should be around 2.46 - different from the number in your tab...

How did you estimate this

obsidian quest Jun 7, 2023, 3:34 PM

#

young sparrow Jun 7, 2023, 3:37 PM

#

If I sort all checkpoints from that run by loss I do see that value (and even lower!)

Screen_Shot_2023-06-07_at_11.37.04_AM.png

obsidian quest Jun 7, 2023, 3:38 PM

#

can you plot the loss curve of this run

young sparrow Jun 7, 2023, 3:44 PM

#

#

Why is the loss so noisy

obsidian quest Jun 7, 2023, 3:46 PM

#

because this is the raw loss of each batch

young sparrow Jun 7, 2023, 3:46 PM

#

obsidian quest Jun 7, 2023, 3:46 PM

#

the best method will be to compute a curve fit

young sparrow Jun 7, 2023, 3:46 PM

#

Here it is on a log-log plot

young sparrow Jun 7, 2023, 3:46 PM

#

obsidian quest because this is the raw loss of each batch

transformers don't tend to be this noisy though

obsidian quest Jun 7, 2023, 3:47 PM

#

i am using tiny bsz

#

bsz = 128samples x 1024tokens

young sparrow Jun 7, 2023, 3:48 PM

#

I'll try subsetting to one in every 10 datapoints then

#

EMA isn't helping, neither is subsampling
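
For reference, a minimal sketch of the two smoothers mentioned here. The smoothing factor and stride are illustrative defaults, not the values actually tried:

```python
def ema(values, alpha=0.1):
    """Exponential moving average; smaller alpha means heavier smoothing."""
    smoothed = []
    avg = None
    for v in values:
        # The first point seeds the average; later points are blended in.
        avg = v if avg is None else alpha * v + (1 - alpha) * avg
        smoothed.append(avg)
    return smoothed


def subsample(values, every=10):
    """Keep one point in every `every`, starting with the first."""
    return values[::every]
```

Subsampling only thins the curve and leaves each remaining point as noisy as before, which fits with it not helping on per-batch losses.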

obsidian quest Jun 7, 2023, 4:00 PM

#

young sparrow EMA isn't helping, neither is subsampling

curve fit

young sparrow Jun 7, 2023, 4:01 PM

#

A linear fit on the log log plot doesn't work

#

What else would you like me to try

obsidian quest Jun 7, 2023, 4:02 PM

#

a linear fit of the last 30 data points

young sparrow Jun 7, 2023, 4:07 PM

#

Line fitted to the last 30 points

#

Everything except the first 50

#

Yeah this simply isn't working

#

Here I tried fitting the line to 30 points near the end of training and then projecting out the next 100

#

obsidian quest Jun 7, 2023, 4:27 PM

#

young sparrow Everything except the first 50

everything except first 50 looks good

young sparrow Jun 7, 2023, 4:30 PM

#

obsidian quest everything except first 50 looks good

Okay zooming in to the last 500 it does actually look better than I had thought


#

So you want me to fit this line, project it out to the full training, and use that as my loss instead of the observed loss?

#

And re-do the scaling laws experiments?

obsidian quest Jun 7, 2023, 4:30 PM

#

yeah
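
A rough sketch of that procedure: fit a straight line to log(loss) against log(step) after dropping the earliest points, then read the fitted value at the target step. The log-log form, the function names, and the 50-point cutoff as a default are assumptions for illustration; the actual notebook may do this differently:

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def predict_loss_at(steps, losses, target_step, skip_first=50):
    """Fit log(loss) vs log(step), ignoring the first `skip_first` points,
    and return the fitted loss at `target_step`."""
    xs = [math.log(s) for s in steps[skip_first:]]
    ys = [math.log(l) for l in losses[skip_first:]]
    slope, intercept = fit_line(xs, ys)
    return math.exp(slope * math.log(target_step) + intercept)
```

The fitted value replaces the single noisy observation at the end of training, so the result depends on how well a power law describes the tail of the curve.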

young sparrow Jun 7, 2023, 4:45 PM

#

@obsidian quest

#

hmm that's kinda misleading as the y axis has changed

#

Hmmm. This looks suspicious

#

#

Variance going up seems like a bad sign

#

(the outlier is from a run that didn't restart, I had been removing it before)

obsidian quest Jun 7, 2023, 5:10 PM

#

pls send me the L-D-T csv

young sparrow Jun 7, 2023, 5:17 PM

#

With the predicted values?

#

Or the real ones

#

Here's the one with the predicted values

📎 predicted_rwkv.csv

obsidian quest Jun 7, 2023, 6:00 PM

#

young sparrow Here's the one with the predicted values

ok seems your code is buggy
for example L6-D512-T32 should be around log(3.01)
and L6-D512-T32 should be around log(3.05)

young sparrow Jun 7, 2023, 6:43 PM

#

obsidian quest ok seems your code is buggy for example L6-D512-T32 should be around log(3.01) a...

You're welcome to take a look

📎 RWKV_Scaling.ipynb

obsidian quest Jun 7, 2023, 8:54 PM

#

#

now doublechecking everything

young sparrow Jun 7, 2023, 9:04 PM

#

It looks like you just rotated my plot and played with the variance lol.

obsidian quest Jun 7, 2023, 9:48 PM

#

fixed version

📎 RWKV_Scaling.ipynb

young sparrow Jun 7, 2023, 9:58 PM

#

young sparrow

That looks a lot like this?

#

I’m out right now but can check it out in a couple hours

obsidian quest Jun 7, 2023, 10:00 PM

#

young sparrow That looks a lot like this?

the numbers are far more aligned with wandb webpage charts because i download up to 50000 datapoints for each run

#

wandb default = only fetch 500 datapts
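
A small sketch of raising that limit through the wandb public API, where `Run.history` takes a `samples` argument that defaults to 500. The helper name, the `"loss"` key, and the run path are placeholders:

```python
def fetch_loss_history(run, key="loss", samples=50000):
    """Return up to `samples` logged rows for `key` from a wandb public-API run.

    `run` is expected to be the object returned by wandb.Api().run(...).
    """
    return run.history(samples=samples, keys=[key], pandas=False)


# Usage (needs the wandb package and network access; the path is a placeholder):
#   import wandb
#   run = wandb.Api().run("entity/project/run_id")
#   rows = fetch_loss_history(run)   # list of dicts, one per sampled step
```

`history` still returns a sampled view when a run has more rows than `samples`; `run.scan_history()` iterates over every logged row if the full curve is needed.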

young sparrow Jun 7, 2023, 10:03 PM

#

I tried to fiddle with that config but it seemed like it wasn’t doing anything

#

😦

#

So I gave up and assumed it didn’t work the way I thought

obsidian quest Jun 7, 2023, 10:03 PM

#

works for me 🙂

#

some runs are very short because they are killed multiple times

young sparrow Jun 7, 2023, 10:07 PM

#

Oh I was doing it inside the API call

#

Whoops

young sparrow Jun 7, 2023, 10:07 PM

#

obsidian quest fixed version

So is this what my code gives now, with all the data?

#

How different are the actual vs predicted numbers

obsidian quest Jun 7, 2023, 10:14 PM

#

#

just more noises in "actual"
note one of the runs lasted longer than T which was before i added exit_after_T to training code

#

so now i am predicting the loss @ T instead of x[-1]

young sparrow Jun 7, 2023, 10:21 PM

#

Interesting. I went with x[-1] because some runs made it within a rounding error of T but not actually T. I had assumed this was because it wasn’t evenly divisible by the batch size, but I guess it was the sampling

ripe tangle Jun 8, 2023, 1:23 AM

#

@obsidian quest Hey is this paper still taking helpers?

young sparrow Jun 8, 2023, 2:46 AM

#

ripe tangle @obsidian quest Hey is this paper still taking helpers?

No

steady ether Jun 8, 2023, 3:27 AM

#

@fickle hare I've cited the 5 fine-tuning datasets that I know we used to the ethics statement. Could you double-check to see if I missed anything?

fickle hare Jun 8, 2023, 3:40 AM

#

steady ether @fickle hare I've cited the 5 fine-tuning datasets that I know we used ...

I'm out now, will check ~7hrs later

steady ether Jun 8, 2023, 3:59 AM

#

fickle hare Some comments on 4.4: 1. Shall we merge "Custom kernels" to "4.1.2 WKV Operator"...

Commented out FFN with R Gate

young sparrow Jun 8, 2023, 2:16 PM

#

#

Looking a lot better once @obsidian quest showed me how to fix the data lol

#

(blue points are used for the regression line)

#

Note that both axes have a log on them

#

This gives an exponent of -0.0747

obsidian quest Jun 8, 2023, 2:28 PM

#

young sparrow Looking a lot better once @obsidian quest showed me how to fix the data lo...

or you can simply use my datapoints 🙂

📎 message.txt

young sparrow Jun 8, 2023, 2:29 PM

#

I am using your code

#

data

#

I'm just picking up the analysis where you left off

#

These plots worry me though

Screen_Shot_2023-06-08_at_10.31.05_AM.png

Screen_Shot_2023-06-08_at_10.31.15_AM.png

#

The empirically low loss point with a compute value between 12 and 13 is way off of the line for params and tokens too

obsidian quest Jun 8, 2023, 2:37 PM

#

24-2048-1.0 is missing and you can ignore it

#

for some reason, your chart is different from mine #1103039376184852622 message

#

the results basically tell us that we should train larger models for optimal T=32

young sparrow Jun 8, 2023, 2:42 PM

#

No mine is the same, I'm just taking the log of the raw data instead of putting it on a log axis. Here's a log axis

#

(it's slightly distorted due to np.log calling log base e)

#

Here is everything in log base 10

#

Screen_Shot_2023-06-08_at_10.45.25_AM.png

#

Screen_Shot_2023-06-08_at_10.46.32_AM.png

#

Screen_Shot_2023-06-08_at_10.47.20_AM.png

#

Oh there's bunching at 0 due to loss of precision (units of billions and then taking a log). Lemme fix that

fickle hare Jun 8, 2023, 3:09 PM

#

steady ether @fickle hare I've cited the 5 fine-tuning datasets that I know we used ...

I don't know any more english instruction dataset used; firefly, belle, and some other are used for chinese. Need @obsidian quest to double (triple? lol) check though.

steady ether Jun 8, 2023, 3:59 PM

#

Updated. This also made me realize that we didn't mention the multilingual capabilities of RWKV.

obsidian quest Jun 8, 2023, 4:25 PM

#

young sparrow No mine is the same, I'm just taking the log of the raw data instead of putting ...

are you using pred_loss? seems you are using loss (noisy)

young sparrow Jun 8, 2023, 4:33 PM

#

obsidian quest are you using pred_loss? seems you are using loss (noisy)

I am more comfortable using loss than pred_loss, though I'm planning on looking at averaging the loss across several steps next

obsidian quest Jun 8, 2023, 4:34 PM

#

loss is extremely noisy

young sparrow Jun 8, 2023, 4:34 PM

#

I know, but I don't think that fitting a linear model to it is something one should rely on fundamentally.

obsidian quest Jun 8, 2023, 4:35 PM

#

yet it is still a vast improvement

young sparrow Jun 8, 2023, 4:35 PM

#

In what

obsidian quest Jun 8, 2023, 4:35 PM

#

for example, your red datapoint is noise

young sparrow Jun 8, 2023, 5:25 PM

#

@obsidian quest Did you launch more runs?

young sparrow Jun 8, 2023, 5:41 PM

#

The biggest problem is data scarcity. We can hardly call something Pareto optimal if there are no other equi-compute points

obsidian quest Jun 8, 2023, 6:43 PM

#

yeah could you find blue points using pred_loss so that i can use the info to launch more runs on pareto front 🙂
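
One way to pick out such a front from (compute, loss) pairs, as a sketch. The tuple layout is an assumption and says nothing about how the blue points in the plots were actually selected:

```python
def pareto_front(points):
    """Return the (compute, loss) points that no other point beats on both axes.

    A point is kept only if its loss is strictly lower than the loss of
    every point with less or equal compute.
    """
    front = []
    best_loss = float("inf")
    for compute, loss in sorted(points):  # ascending compute, then loss
        if loss < best_loss:
            front.append((compute, loss))
            best_loss = loss
    return front
```

With few runs per compute level, nearly every point ends up on the front, which is the data-scarcity concern raised above.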

young sparrow Jun 8, 2023, 6:46 PM

#

Kk

spiral minnow Jun 8, 2023, 8:48 PM

#

I love the addition of ' to the variables used in channel-mixing!

#

I'm not sure who has access to Figure 1, but I think we should update the variable names in mixing: R', V', K'

#

Happy to do it if somebody can give me the file

young sparrow Jun 8, 2023, 8:59 PM

#

obsidian quest yeah could you find blue points using pred_loss so that i can use the info to la...

Params (B): 0.0625, 0.125, 0.25, 0.5, 1.0

tropic minnow Jun 8, 2023, 9:53 PM

#

spiral minnow I'm not sure who has access to Figure 1, but I think we should update the variab...

ok done🙂

serene badge Jun 9, 2023, 5:10 PM

#

One small comment on Figure 3, we should add labels to the x and y axes. Not sure who’s the author. I can help to update the figure. Also, do we need to add error bars for the accuracy scores?

tropic minnow Jun 9, 2023, 7:00 PM

#

serene badge One small comment on Figure 3, we should add legend to x and y axises. Not sure...

@rich raptor

obsidian quest Jun 9, 2023, 8:40 PM

#

@young sparrow use these pred_loss data for the most reasonable fit
ignore L6 and T1 results because they are too different from usual runs

young sparrow Jun 9, 2023, 11:03 PM

#

obsidian quest @young sparrow use these pred_loss data for the most reasonable fit ignor...

Hot damn

#

How do the parameter & dataset curves look?

#

Any less cursed?

tropic minnow Jun 10, 2023, 4:21 PM

#

serene badge One small comment on Figure 3, we should add legend to x and y axises. Not sure...

@serene badge can you take a look at this? #1103039376184852622 message

serene badge Jun 10, 2023, 4:26 PM

#

Cool, I’ll handle it.

serene badge Jun 11, 2023, 4:08 AM

#

tropic minnow @serene badge can you take a look at this? https://discord.com/channels...

Figured out how to modify the script to update the figures. In the script, we need to get the data of Pythia and RWKV from the ./RWKV.csv. Where can I find this file?

steady ether Jun 11, 2023, 4:39 AM

#

serene badge Figured out how to modify the script to update the figures. In the script, we ne...

#1103039376184852622 message

Not sure if there's a more updated one

serene badge Jun 11, 2023, 4:49 AM

#

steady ether https://discord.com/channels/729741769192767510/1103039376184852622/110611976486...

Great! Thank you so much! I'll check the newly generated figures to ensure they match the original ones.

serene badge Jun 11, 2023, 5:36 AM

#

I've updated Figure 3. Added legends and changed to PDF format.

last mauve Jun 11, 2023, 8:23 PM

#

Ok it's time to buckle down for EMNLP. I'll be doing regular check-ins like we did for arxiv. Here's what currently needs done:
1. ~~The ethics statement (section 11) needs shortened. No longer than a half page.~~ Nevermind we have the space.
2. @young sparrow and @obsidian quest -- What is the status on your scaling laws work? I assume that'll need to be a new figure/paragraph once finished, or will these just replace the current Figure 5 scaling laws plots?
3. ~~We're currently at about 8.5 pages on an 8-page limit. Should we move section 4.5 Additional Optimizations to an appendix?~~ Nevermind we have the space.
4. Figures 4-6 have strange placement, there's some space at the start of Section 7, and Figure 5 is out of order. These figures should instead be split across pages 6 and 7.
5. ~~Sections 8 (Future Work) and 9 (Conclusions) are very long. We should cut or re-word so that a few lines are reduced.~~ Nevermind we have the space.
6. ~~In Figure 6, we should remove the cuda_ prefixes from each legend entry.~~
7. ~~Result figure captions should be descriptive enough to be self-contained (i.e. easily screenshotted). Figures 3-6 should have their captions updated, but don't make them longer than 2 lines.~~

#

I'm submitting a draft to EMNLP today. Here are the deadlines:
Abstract Deadline: June 16 (Will be submitted today)
Paper Deadline: June 23

last mauve Jun 11, 2023, 8:24 PM

#

last mauve I'm submitting a draft to EMNLP today. Here are the deadlines: **Abstract Deadli...

last mauve Jun 11, 2023, 8:25 PM

#

last mauve Ok it's time to buckle down for EMNLP. I'll be doing regular check-ins like we d...

#

Core author team -- Feel free to add work items to my above list.

spiral minnow Jun 11, 2023, 8:33 PM

#

last mauve Ok it's time to buckle down for EMNLP. I'll be doing regular check-ins like we d...

I was going to offer to re-write the ethics statement, but it seems it's currently 1/2 page. Do you still want it shorter?

#

Also, I re-wrote the future work into paragraphs rather than bullet points, saving 3 lines

last mauve Jun 11, 2023, 8:38 PM

#

spiral minnow I was going to offer to re-write the ethics statement, but it seems it's current...

Ah oops. I meant halved from its current length. It looks way too long to me

spiral minnow Jun 11, 2023, 8:39 PM

#

👍 I don't think they put a space limit on ethics or limitations sections though

last mauve Jun 11, 2023, 8:40 PM

#

@everyone -- If you're an author, I need your email for the EMNLP abstract submission if you haven't sent it to me already.

last mauve Jun 11, 2023, 8:41 PM

#

spiral minnow 👍 I don't think they put a space limit on eithics or limitations sections thou...

No, but we're over the 8-page limit and I'd rather remove from the ethics/conclusion/limitations sections rather than content sections. I'm open to suggestions on where to cut though

spiral minnow Jun 11, 2023, 8:41 PM

#

last mauve No, but we're over the 8-page limit and I'd rather remove from the ethics/conclu...

Ethics and limitations don't count towards the page limit. Let me just get the exact reference information for that

#

Limitations doesn't count towards page limit: https://2023.emnlp.org/calls/main_conference_papers/#mandatory-discussion-of-limitations

EMNLP 2023

Call for Main Conference Papers

Official website for the 2023 Conference on Empirical Methods in Natural Language Processing

#

"Authors will be allowed extra space after the 8th page (4th for short papers) for an optional broader impact statement or other discussion of ethics": https://2023.emnlp.org/calls/main_conference_papers/#ethics-policy

EMNLP 2023

Call for Main Conference Papers

Official website for the 2023 Conference on Empirical Methods in Natural Language Processing

ancient cosmos Jun 11, 2023, 8:43 PM

#

last mauve @everyone -- If you're an author, I need your email for the EMNLP abstract submi...

Ser you pinged everyone in the server

karmic tree Jun 11, 2023, 8:43 PM

#

For me future work has only negative aspects: (1) another valid title for all points under it is "things we didn't do"; (2) it's very rare that things mentioned here are actually done, so they remain as evidence of promises authors made but didn't follow up on. So I always prefer to keep sections like that completely out - big obvious omissions can be mentioned in Limitations

last mauve Jun 11, 2023, 8:47 PM

#

spiral minnow Ethics and limitations don't count towards the page limit. Let me just get the e...

I see. So are we supposed to have a distinct page after Conclusions and before references?

#

This sort of format is new to me so you'll have to bear with me.

spiral minnow Jun 11, 2023, 8:48 PM

#

last mauve I see. So are we supposed to have a distinct page after Conclusions and before r...

Yeah, you can put a \newpage after conclusions. But it's generally a good idea to make sure that the conclusion goes until the very last line of page 8 anyway

spiral minnow Jun 11, 2023, 8:48 PM

#

last mauve This sort of format is new to me so you'll have to bear with me.

No worries. *ACL conferences are really different from the rest of ML

young sparrow Jun 11, 2023, 8:49 PM

#

spiral minnow Yeah, you can put a \newpage after conclusions. But it's generally a good idea t...

I would highly recommend a \newpage for consistent formatting of the subsequent text (otherwise it could bump around plots in the appendix)

obsidian quest Jun 11, 2023, 8:49 PM

#

@young sparrow the plot is even better if we only consider non-embedding params

young sparrow Jun 11, 2023, 8:49 PM

#

obsidian quest @young sparrow the plot is even better if we only consider non-embedding ...

Yeah I ran that locally and didn’t post it yet 🙂

obsidian quest Jun 11, 2023, 8:49 PM

#

after (sry buggy. see below for update) vs before

young sparrow Jun 11, 2023, 8:49 PM

#

It looks really good

last mauve Jun 11, 2023, 8:50 PM

#

ancient cosmos Ser you pinged everyone in the server

wait fr? I thought forum channels were like a thread where only current participants get pinged. I feel like we'd be brigaded by now if I truly pinged everyone.

obsidian quest Jun 11, 2023, 8:50 PM

#

i am running L32-D2560-T16/32/64 (T16 done)

young sparrow Jun 11, 2023, 8:51 PM

#

last mauve wait fr? I thought forum channels were like a thread where only current particip...

In a thread you wouldn’t have pinged everyone, and I think forums work like threads

#

I can confirm you didn’t ping everyone

last mauve Jun 11, 2023, 8:53 PM

#

spiral minnow Yeah, you can put a \newpage after conclusions. But it's generally a good idea t...

Awesome. So ignore all of my space concerns then. Updating my work items to reflect this. (Done)

tropic minnow Jun 11, 2023, 8:57 PM

#

last mauve Ok it's time to buckle down for EMNLP. I'll be doing regular check-ins like we d...

points 2 and 4 will be conditioned on the scaling laws work v likely

obsidian quest Jun 11, 2023, 8:59 PM

#

corrected. good fit even for L6 and T1

quaint ingot Jun 11, 2023, 9:00 PM

#

I have a question. If I understand the paper correctly (and maybe I don't), you have an explicit bias toward more recent tokens. Wouldn't that degrade the result for some modeling tasks that are not necessarily language related? That kind of bias isn't present in transformers.

obsidian quest Jun 11, 2023, 9:04 PM

#

quaint ingot I have a question. If I understand the paper correctly (and maybe i don't), you ...

at least it is present in ALiBi transformers
i think it's okay to introduce some locality bias because that fits most data we care - text, image (if we have 2D RWKV), music, time series, etc.

tropic minnow Jun 11, 2023, 9:08 PM

#

last mauve Ok it's time to buckle down for EMNLP. I'll be doing regular check-ins like we d...

for #7: updated captions for 4, 6. Hopefully its better now. feel free to rephrase. Will do #6 in a few hours (8 for sleeping)

Captura_de_Pantalla_2023-06-11_a_las_23.04.42.png

quaint ingot Jun 11, 2023, 9:09 PM

#

obsidian quest at least presented in alibi transformers i think it's okay to introduce some loc...

I guess that's a fair compromise when your token length is practically infinite. though it'd be interesting if we could balance that bias in different ways

obsidian quest Jun 11, 2023, 9:11 PM

#

quaint ingot I guess that's a fair compromise when your token length is practically infinite....

or use RWKV-5 🙂 complex-valued decay = rotate

outer vine Jun 12, 2023, 7:21 AM

#

maybe we could use this space for emnlp submission?

#

tough crane Jun 12, 2023, 8:23 AM

#

last mauve No, but we're over the 8-page limit and I'd rather remove from the ethics/conclu...

My name : Atsushi Saito, email: [email protected]

tropic minnow Jun 12, 2023, 1:51 PM

#

fig 6 updated to remove "cuda"🙂

young sparrow Jun 12, 2023, 1:52 PM

#

outer vine

It's hard to see why this is happening. The spacing seems unchanged when I remove the author block

tropic minnow Jun 12, 2023, 1:52 PM

#

outer vine

this wont survive in the camera-ready version. but i dont know if that is allowed

tropic minnow Jun 12, 2023, 1:54 PM

#

young sparrow It's hard to see why this is happening. The spacing seems unchanged when I remov...

i think this: \titlebox{6.8cm} is the offender

young sparrow Jun 12, 2023, 1:54 PM

#

I removed that and it didn't fix it either

tropic minnow Jun 12, 2023, 1:57 PM

#

ah sorry \maketitle is the responsible

young sparrow Jun 12, 2023, 2:03 PM

#

That doesn't have much explanatory power. That's the command that tells LaTeX to display the title block, but could mean anything is to blame.

steady ether Jun 12, 2023, 4:21 PM

#

I noticed that in 4.6 we referred to the implementation as RWKV-LM, but later, we go straight into RWKV-4. (it might not be clear to some readers). Perhaps we could change RWKV-LM to RWKV-4, or smoothen the transition?

Also, it may be better to change RWKV to RWKV-4 under Appendix G Inference results to be consistent with the other figures.

Screenshot_2023-06-12_at_12.15.19_PM.png

last mauve Jun 12, 2023, 4:34 PM

#

steady ether I noticed that in 4.6 we referred to the implementation as RWKV-LM, but later, w...

Yeah we never give reasoning for the "4" either. It's weird to claim we're on RWKV-4 for a paper introducing RWKV.

I propose either changing all of these instances to RWKV/RWKV-LM or explaining what RWKV-4 is. Whichever @obsidian quest prefers.

karmic tree Jun 12, 2023, 4:36 PM

#

A bit of negative vspace around the titlebox is generally OK for ACL subs

young sparrow Jun 12, 2023, 4:36 PM

#

My vote is for RWKV or RWKV-LM

karmic tree Jun 12, 2023, 4:37 PM

#

My vote is for RWKV, there isn't a non-LM RWKV and chars take page space

obsidian quest Jun 12, 2023, 4:52 PM

#

yeah just RWKV

young sparrow Jun 12, 2023, 6:46 PM

#

Something appears to be overriding our ability to move the top of the text at all. Even using \vspace won't move it upwards

Screen_Shot_2023-06-12_at_2.46.17_PM.png

tender karma Jun 12, 2023, 8:02 PM

#

karmic tree My vote is for RWKV, there isn't a non-LM RWKV and chars take page space

A bit sad, the network represented best with the Cell in my opinion, is “just” a recurrent network not necessarily connected to a LM

#

For example can be applied as BiRWKV for sequence labelling, it works amazingly well

karmic tree Jun 12, 2023, 8:24 PM

#

tender karma A bit sad, the network represented best with the Cell in my opinion, is “just” a...

Yeah, completely agree the architecture is generalisable. Maybe when that's actually done, the name can change meaning - just like with attention and transformers, which were both designed for & presented as NMT approaches, then outgrew that task

young sparrow Jun 12, 2023, 8:25 PM

#

tender karma A bit sad, the network represented best with the Cell in my opinion, is “just” a...

What does this mean

#

What is the "language model" that RWKV is "connected" to?

young sparrow Jun 12, 2023, 8:26 PM

#

tender karma For example can be applied as BiRWKV for sequence labelling, it works amazingly ...

... do we have numbers on this? Can we put it in the paper?

mortal latch Jun 12, 2023, 8:32 PM

#

young sparrow Something appears to be overriding our ability to move the top of the text at al...

I have fixed this now by copying the original emnlp2023.sty from official website. Our emnlp2023.sty has been modified to fit many authors into the title section.

tender karma Jun 12, 2023, 8:37 PM

#

young sparrow ... do we have numbers on this? Can we put it in the paper?

I don’t think is worth it. My focus is (well, was) dependency parsing so I just experimented with a variant of https://aclanthology.org/Q16-1023/ replacing the lstm with rwkv. However, it is not so cool anymore this task and it would be not so effective for this paper. Maybe for a follow up subject to show out of the box improvements in old fashioned tasks

ACL Anthology

Simple and Accurate Dependency Parsing Using Bidirectional LSTM Fea...

Eliyahu Kiperwasser, Yoav Goldberg. Transactions of the Association for Computational Linguistics, Volume 4. 2016.

young sparrow Jun 12, 2023, 8:44 PM

#

tender karma I don’t think is worth it. My focus is (well, was) dependency parsing so I just ...

Yeah we could train a suite of BERT-like models and finetune them on standard BERT-applications

#

I know @obsidian quest has talked about doing something like ViT using RWKV too

young sparrow Jun 12, 2023, 8:44 PM

#

mortal latch I have fixed this now by copying the original `emnlp2023.sty` from official webs...

Oh interesting.

last mauve Jun 12, 2023, 8:45 PM

#

mortal latch I have fixed this now by copying the original `emnlp2023.sty` from official webs...

Oh shoot I think I did that for the arxiv submission

#

oops

tender karma Jun 12, 2023, 8:46 PM

#

Got it and agree. I’ve been running an “ELMo” variant with rwkv just for fun. Same dataset as the original so a benchmark is possible.

outer vine Jun 13, 2023, 1:09 AM

#

young sparrow Something appears to be overriding our ability to move the top of the text at al...

decrease the number here will do \titlebox{6.8cm}

young sparrow Jun 13, 2023, 1:11 AM

#

outer vine decrease the number here will do \titlebox{6.8cm}

No it doesn’t.

outer vine Jun 13, 2023, 5:07 AM

#

Uncommenting \setlength\titlebox{6.8cm} and decreasing the number will do it (I tried), and the current workaround with \begin{comment} is also viable. But I am not sure whether these two methods would violate the formatting requirements.

young sparrow Jun 13, 2023, 5:08 AM

#

outer vine uncomment \setlength\titlebox{6.8cm} and decrease the number will do ( I tried),...

We have already fixed this problem, please stop touching it

outer vine Jun 13, 2023, 5:09 AM

#

ok

young sparrow Jun 13, 2023, 3:29 PM

#

Does anyone know if the RWKV implementation in transformers is reliable yet

fickle hare Jun 13, 2023, 3:38 PM

#

bf16 inference and training should be all good now, not sure about fp16 inference

#

yet there are reports on the cuda kernel not successfully compiled... not really reliable yet, use carefully

young sparrow Jun 13, 2023, 7:09 PM

#

It didn’t launch out of the box :/

#

I have runs of BoolQ and MMLU on Pythia / OPT / BLOOM if anyone wants to run the comparison in RWKV

last mauve Jun 15, 2023, 4:48 PM

#

I've submitted a version along with the abstract for EMNLP.

If you did not receive an email from OpenReview: This means you haven't both:
(1) Created an OpenReview account
(2) Sent either me or this channel the email associated with that OpenReview account

If you didn't receive an email, please do these steps by tomorrow. Once we have more authors on the openreview, we can re-order them alphabetically.

last mauve Jun 15, 2023, 5:12 PM

#

@young sparrow and @obsidian quest -- Your scaling laws plots are the last outstanding results. What's the status? What needs done and who can help?

young sparrow Jun 15, 2023, 5:12 PM

#

last mauve @young sparrow and @obsidian quest -- Your scaling laws plots are t...

The code Blink sent me with some changes doesn’t work for me, I’m trying to debug it.

obsidian quest Jun 15, 2023, 5:13 PM

#

i sent an excel file with all datapoints

last mauve Jun 15, 2023, 5:15 PM

#

I'm actually pretty happy with the writeup and overall storyline. If anyone knows academics who can give us good feedback, it would be good to receive that.

Lead authors -- Do a pass now and update anything you don't like. If you need help updating, message work items here.

young sparrow Jun 15, 2023, 5:15 PM

#

[we’re talking in DMs]

fickle hare Jun 15, 2023, 6:01 PM

#

last mauve I've submitted a version along with the abstract for EMNLP. **If you did not re...

Just created an account, my email is [email protected]

young sparrow Jun 15, 2023, 6:02 PM

#

last mauve I've submitted a version along with the abstract for EMNLP. **If you did not re...

[email protected]

karmic tree Jun 15, 2023, 6:03 PM

#

last mauve I'm actually pretty happy with the writeup and overall storyline. If anyone know...

Sasha Rush mentioned he was missing some ppl plots/figures from the arXiv draft - not convinced these make sense for cross-architecture comparisons but, that was the feedback

young sparrow Jun 15, 2023, 6:07 PM

#

One complaint I’ve heard is that people don’t think that 6 evaluations are enough anymore. If we can run MMLU, BoolQ, Natural Questions, HellaSwag, TriviaQA, and RACE that would give us a lot more comprehensive of a picture, and most of the plots from the LLaMA paper (missing math stuff we can’t run right now and code evaluations)

#

(We’ve also gotten the same feedback about Pythia)

last mauve Jun 15, 2023, 8:26 PM

#

Just added another batch of authors. Some that I'm still missing:

~~Michael Chung ~~
~~Xuzheng He~~
~~Przemyslaw Kazienko~~
~~Jiaming Kong~~
~~Bartlomiej Koptyra~~
~~Hayden Lau~~
~~Atsushi Saito~~
Bolun Wang
Ruichong Zhang
Qihang Zhao
Peng Zhou
~~Haowen Hou~~

last mauve Jun 15, 2023, 8:27 PM

#

karmic tree Sasha Rush mentioned he was missing some ppl plots/figures from the arXiv draft ...

idk what this means

last mauve Jun 15, 2023, 8:27 PM

#

young sparrow One complaint I’ve heard is that people don’t think that 6 evaluations are enoug...

Is anyone able to pick this up? We'd need these results by 6/23. Need two volunteers

karmic tree Jun 15, 2023, 8:29 PM

#

last mauve idk what this means

PPL is an abbreviation of Perplexity, Sasha is lead scientist at Hugging Face and a Harvard prof, usually gives reasonably strong signal

young sparrow Jun 15, 2023, 8:30 PM

#

karmic tree PPL is an abbreviation of Perplexity, Sasha is lead scientist at Hugging Face an...

Quentin knows all of those things. What he doesn’t know is what Sasha actually wanted us to include.

#

There isn’t such a thing as a ppl plot/figure, and saying we should include one doesn’t mean anything.

karmic tree Jun 15, 2023, 8:53 PM

#

I agree. Let me fish out the tweet

#

Screenshot_2023-06-15-13-55-45-53_e4424258c8b8649f6e67d283a50a2cbc.jpg

obsidian quest Jun 15, 2023, 10:05 PM

#

don't we have 13 evaluations : LAMBADA PIQA StoryCloze16 Hellaswag WinoGrande arc_challenge arc_easy headQA openbookQA sciq triviaQA ReCoRD COPA

young sparrow Jun 15, 2023, 10:08 PM

#

obsidian quest don't we have 13 evaluations : LAMBADA PIQA StoryCloze16 Hellaswag W...

There are six in the main body, are these in the appendix?

obsidian quest Jun 15, 2023, 10:09 PM

#

https://arxiv.org/pdf/2305.13048.pdf table 3 & 4

#

avoid boolq which is very noisy

last mauve Jun 16, 2023, 12:04 AM

#

obsidian quest don't we have 13 evaluations : LAMBADA PIQA StoryCloze16 Hellaswag W...

We do. I just thought Stella's specific evals are the current "llm scoring meta"

#

If it's just the number of decent evals that matters then we're fine

young sparrow Jun 16, 2023, 12:05 AM

#

I think that MMLU is probably important to include

last mauve Jun 16, 2023, 12:06 AM

#

karmic tree

This should be resolved by the new scaling plots I believe

young sparrow Jun 16, 2023, 12:06 AM

#

But, if someone explains how to run the model through the eval harness (HF is still borked) I can take care of things

last mauve Jun 16, 2023, 12:14 AM

#

Evals were already done before we started the arxiv. @obsidian quest or @tropic minnow -- Who ran these evals and how can Stella reproduce them?

tropic minnow Jun 16, 2023, 5:57 AM

#

last mauve Evals were already done before we started the arxiv. @obsidian quest or @...

@obsidian quest did here #1103039376184852622 message

#

and the plots for the 6 tasks were done with this: #1103039376184852622 message, maybe @serene badge can comment more on any other mods

obsidian quest Jun 16, 2023, 9:31 AM

#

last mauve Evals were already done before we started the arxiv. @obsidian quest or @...

i ran them and you can try reproducing them using HF package (to check the correctness of HF too)

young sparrow Jun 16, 2023, 12:20 PM

#

obsidian quest i ran them and you can try reproducing them using HF package (to check the corre...

The HF package doesn’t run currently

obsidian quest Jun 16, 2023, 12:23 PM

#

young sparrow The HF package doesn’t run currently

whats the error? can tell them

young sparrow Jun 16, 2023, 12:24 PM

#

I’ll reproduce it in a bit and let you know

young sparrow Jun 16, 2023, 1:16 PM

#

Using pretrained=RWKV/rwkv-4-169m-pile raises

  File "/home/mchorse/ollmer/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/mchorse/ollmer/conda/lib/python3.10/site-packages/transformers/models/rwkv/modeling_rwkv.py", line 384, in forward
    attention, state = self.attention(self.ln1(hidden), state=state, use_cache=use_cache)
  File "/home/mchorse/ollmer/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/mchorse/ollmer/conda/lib/python3.10/site-packages/transformers/models/rwkv/modeling_rwkv.py", line 308, in forward
    receptance, key, value, state = self.extract_key_value(hidden, state=state)
  File "/home/mchorse/ollmer/conda/lib/python3.10/site-packages/transformers/models/rwkv/modeling_rwkv.py", line 300, in extract_key_value
    key = self.key(key)
  File "/home/mchorse/ollmer/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/mchorse/ollmer/conda/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`

#

Meanwhile BlinkDL/rwkv-4-pile-169m appears to be misconfigured as it lacks a config.json

#

okay now it's working and I have no idea what changed

#

...

young sparrow Jun 16, 2023, 1:43 PM

#

@obsidian quest Do you have the smaller models with all these benchmarks too? Or just the biggest RWKV

#

This code doesn't run because it relies on a file called rwkv.csv, can you share that file

obsidian quest Jun 16, 2023, 1:45 PM

#

young sparrow @obsidian quest Do you have the smaller models with all these benchmarks t...

all models from 0.1B to 14B are there

young sparrow Jun 16, 2023, 1:46 PM

#

Oh I missed that column

#

and that's the .csv I was looking for, isn't it

#

Okay, so we actually have pretty comprehensive evals they're just not fully presented

young sparrow Jun 16, 2023, 2:12 PM

#

@obsidian quest the .csv you shared says that the context length of the largest model is 8192 and that the 3B model is 4k. Is that correct? What's the context length for the models that don't have a listed context length?

obsidian quest Jun 16, 2023, 2:22 PM

#

young sparrow @obsidian quest the .csv you shared says that the context length of the la...

not listed -> ctxlen 1024

young sparrow Jun 16, 2023, 2:23 PM

#

Why does the context length change

obsidian quest Jun 16, 2023, 2:23 PM

#

all are trained with ctx1024 for 1 epoch, and then finetuned to 2k => 4k => 8k

young sparrow Jun 16, 2023, 2:24 PM

#

Right, but why are we comparing evaluations on models of different context lengths

#

Why isn't it consistent

obsidian quest Jun 16, 2023, 2:24 PM

#

longer ctxlen => slightly worse zeroshot if everything being equal
because these tasks only care abt short ctxlen

#

it's just that 4k and 8k models are trained longer, so 7B & 14B can gain some advantage from this

young sparrow Jun 16, 2023, 2:25 PM

#

dude

obsidian quest Jun 16, 2023, 2:25 PM

#

1.5B & 3B ctx4k are slightly worse than ctx1k for this

young sparrow Jun 16, 2023, 2:26 PM

#

You can't do this in a paper

obsidian quest Jun 16, 2023, 2:26 PM

#

you can list all ctx1k numbers

young sparrow Jun 16, 2023, 2:27 PM

#

Do you have context 1k numbers for all the models? The csv you sent doesn't for 3B or 14B

obsidian quest Jun 16, 2023, 2:27 PM

#

you can list all ctx1k numbers

RWKV-4    3-ctx1k    5.24     57.52%    63.94%    73.72%    70.28%    59.63%    59.43%    31.83%    64.27%    28.74%    37.60%    85.70%    11.07%    80.56%    81.00%
R14 ctx1k    14.2    3.81     63.54%    71.05%    77.42%    75.57%    70.24%    62.98%    38.31%    70.71%    32.28%    40.60%    90.10%    24.06%    85.73%    87.00%

paper dove Jun 16, 2023, 3:00 PM

#

last mauve I've submitted a version along with the abstract for EMNLP. **If you did not re...

just create an account, my email is [email protected]

serene badge Jun 16, 2023, 3:08 PM

#

tropic minnow and the plots for the 6 tasks were done with this: https://discord.com/channels/...

The 13 benchmark results of RWKV-4, Pythia, and GPT-J are included in RWKV.csv.
The 6 benchmark results ("lambada", "piqa", "winogrande", "arc_challenge", "arc_easy", "sciq") of OPT and BLOOM come from the pythia/result directory of the pythia repo.
It seems the json files of OPT and BLOOM do not contain the other 7 benchmarks ("triviaqa", "storycloze16", "hellaswag", "headqa", "openbookQA", "record", "copa").
I think that's why the script from @rich raptor only plots figures for 6 benchmarks.

young sparrow Jun 16, 2023, 3:17 PM

#

serene badge The 13 benchmark results of RWKV-4, Pythia, GPT-J are included in RWKV.csv. The ...

The right way to solve this problem is to run those models on the additional benchmarks. I’m currently doing this

last mauve Jun 17, 2023, 4:16 AM

#

young sparrow okay now it's working and I have no idea what changed

this was a bad node

tough crane Jun 17, 2023, 7:33 AM

#

last mauve I've submitted a version along with the abstract for EMNLP. **If you did not re...

Just created an OpenReview account for the email that I've already sent to you

gusty condor Jun 17, 2023, 10:36 AM

#

I just created OpenReview account too. Been so busy with my final exams

obsidian quest Jun 17, 2023, 4:27 PM

#

L32 D2560 T64 pred_loss 2.047399

fickle hare Jun 18, 2023, 11:19 AM

#

@obsidian quest I'm trying to add the exact hyperparameters to the Appendix. In #1083107245971226685 message you presented 6 column groups, are they in the order of 14B/7B/.../169M? In each column group, is the last column tokens trained? Also, it seems your adjustment on batch size during training is not directly visible in this table?

obsidian quest Jun 18, 2023, 11:24 AM

#

fickle hare @obsidian quest I'm trying to add the exact hyperparameters to the Appendi...

14/7/3/1.5/0.43/0.169B
yeah tokens trained (ends at 332G tokens)
yeah invisible

fickle hare Jun 18, 2023, 11:26 AM

#

(another point influencing the reproducibility)

#

I can try to recover it though. All training is done with ctxlen=1024 right?

fickle hare Jun 18, 2023, 1:14 PM

#

obsidian quest bsz = 128samples x 1024tokens

I'm a bit confused, is the batch size = 128samples for each GPU? Cause in the LR history file it shows 8043 steps for 332 billion tokens, which counts to ~40000 samples of 1024 tokens each step.

#

Also through analyzing the Gtokens I don't observe any batch size change. It goes smoothly all the way down.

#

(all 315 * 128 * 1024, guess you are using 315 GPUs or nodes lol)

obsidian quest Jun 18, 2023, 1:36 PM

#

fickle hare I'm a bit confused, is the batch size = 128samples for each GPU? Cause in the LR...

8043 "miniepoch". lots of steps in a miniepoch

#

i use 128 or 256 as total bsz. or you may say 128x1024 or 256x1024

fickle hare Jun 18, 2023, 1:38 PM

#

I see

#

uh it's the epoch_steps in your code

obsidian quest Jun 18, 2023, 1:38 PM

#

real steps per miniepoch = 40320 / bsz
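
The accounting in this exchange can be checked with a few lines of Python. This is only a sketch of the arithmetic quoted above (8043 miniepochs, 40320 samples per miniepoch, 1024 tokens per sample); the constant and function names are invented for illustration and are not taken from the RWKV training code.

```python
# Numbers quoted in the thread; the names are hypothetical.
SAMPLES_PER_MINIEPOCH = 40320   # = 315 * 128
CTX_LEN = 1024                  # tokens per sample
MINIEPOCHS = 8043               # length of the LR history


def steps_per_miniepoch(total_batch_size: int) -> float:
    """Real optimizer steps in one miniepoch: 40320 / bsz."""
    return SAMPLES_PER_MINIEPOCH / total_batch_size


def tokens_trained(miniepochs: int = MINIEPOCHS) -> int:
    """Total tokens seen after the given number of miniepochs."""
    return miniepochs * SAMPLES_PER_MINIEPOCH * CTX_LEN


print(steps_per_miniepoch(128))   # 315.0
print(steps_per_miniepoch(256))   # 157.5
print(tokens_trained())           # 332076810240, i.e. about 332G tokens
```

Because the token count per miniepoch is fixed, a change between bsz 128 and 256 would not show up in the Gtokens column, which is consistent with the observation that it "goes smoothly all the way down".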

fickle hare Jun 18, 2023, 1:39 PM

#

so we won't be able to report the accurate batch size then i guess?

obsidian quest Jun 18, 2023, 1:44 PM

#

you can, from https://wandb.ai/blinkdl/RWKV-v4-Pile histories

W&B

blinkdl

Weights & Biases, developer tools for machine learning

#

but there are 2068 runs

#

because all runs are killed multiple times due to server issues

#

apply filter for nlayer & ndim & ctx1024 & datafile = BlinkDL/pile/pile_20B, and check the run around the release date on HF

fickle hare Jun 18, 2023, 1:46 PM

#

let me put the numbers we have in hand into the paper first

#

if i still have time later but not too late, i'll try dig it out

fickle hare Jun 18, 2023, 2:09 PM

#

Added Appendix Hyperparameter.

#

related cross-reference is also brought back (previously commented out)

tropic minnow Jun 18, 2023, 9:45 PM

#

@sullen horizon hows LRA going?

young sparrow Jun 18, 2023, 11:43 PM

#

obsidian quest L32 D2560 T64 pred_loss 2.047399

Are you discarding the ten points with the least compute here?

obsidian quest Jun 19, 2023, 12:08 AM

#

i am using these 12 points

young sparrow Jun 19, 2023, 12:12 AM

#

Why not 6 512 1.0

#

Why those numbers specifically? Even using them, I'm unable to reproduce your fit and looking at the plot it's not at all clear why those were chosen

f4wdOxbVqlVDt27dcPjwYfj4OglbiIqOUEsykxNIiLSShAErF27Ft26dTN1KERkAqxIEREREemIiRQRERGRjrj8ARFRCXB2BFHZxooUERERkY6YSBERERHpiIkUERERkY6YSBERERHpiIkUERERkY6YSBERERHpiIkUERERkY6YSBERERHpiIkUERERkY7D4xgj8wJFKKAAAAAAElFTkSuQmCC.png

Screen_Shot_2023-06-18_at_8.25.22_PM.png

obsidian quest Jun 19, 2023, 12:32 AM

#

young sparrow Why those numbers specifically? Even using them, I'm unable to reproduce your fi...

use non-embedding params

young sparrow Jun 19, 2023, 12:32 AM

#

Is that 2*V*D + 13*D*D*L

obsidian quest Jun 19, 2023, 12:33 AM

#

simply 13*D*D*L

young sparrow Jun 19, 2023, 12:34 AM

#

(rerunning, vaguely embarrassed I missed that)

#

So how did you pick these specific points to include in your fit

obsidian quest Jun 19, 2023, 12:43 AM

#

ok pls use this. the idea is to pick larger models as T grows

#

for example, the optimal T for L12 D768 is likely around 3

obsidian quest Jun 19, 2023, 12:48 AM

#

young sparrow Why not `6 512 1.0`

very small & very early results are outliers

young sparrow Jun 19, 2023, 12:52 AM

#

Okay, but why not this one?

#

This one actually shows all the compute-optimal values

#

I'm worried about the excessive reliance on heuristics

fickle hare Jun 19, 2023, 6:31 AM

#

shouldn't this be a simple envelope?

obsidian quest Jun 19, 2023, 7:49 AM

#

the envelope is simple in my table

#

the second one is non-optimal here

sullen horizon Jun 19, 2023, 12:21 PM

#

tropic minnow @sullen horizon hows LRA going?

RWKV@LRA code is here: https://github.com/diggerdu/rwkv-long-range-arena (based on s4, I will update the readme later)

GitHub

GitHub - diggerdu/rwkv-long-range-arena: LRA Benchmark RWKV

LRA Benchmark RWKV. Contribute to diggerdu/rwkv-long-range-arena development by creating an account on GitHub.

karmic tree Jun 19, 2023, 3:54 PM

#

obsidian quest the second one is non-optimal here

I don't see any other points that dominate it
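
One way to make "dominate" concrete, without relying on intuition about untested models, is to compute the lower envelope directly from the (compute, loss) points that exist. The sketch below is illustrative only: `pareto_frontier` is a made-up helper, and the example points are invented, not values from the RWKV runs.

```python
def pareto_frontier(points):
    """Return the (compute, loss) points not dominated by any other point.

    A point is dominated if some other point has compute <= and loss <=,
    with at least one of the two strictly smaller.
    """
    frontier = []
    best_loss = float("inf")
    # Sort by compute, then loss, so the best point at each compute comes first.
    for compute, loss in sorted(points):
        if loss < best_loss:          # strictly better than everything cheaper
            frontier.append((compute, loss))
            best_loss = loss
    return frontier


# Hypothetical data, for illustration only.
runs = [(1.0, 3.0), (2.0, 2.5), (2.0, 2.7), (3.0, 2.6), (4.0, 2.2)]
print(pareto_frontier(runs))  # [(1.0, 3.0), (2.0, 2.5), (4.0, 2.2)]
```

Under this definition a point stays on the envelope until a measured run beats it, which matches the position that the fit should use the best data in hand.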

obsidian quest Jun 19, 2023, 4:01 PM

#

karmic tree I don't see any other points that dominate it

because we are still missing some intermediate models here

young sparrow Jun 19, 2023, 4:12 PM

#

obsidian quest because we are still missing some intermediate models here

What do you mean “intermediate models”? Do you mean partially trained ones?

obsidian quest Jun 19, 2023, 4:14 PM

#

young sparrow What do you mean “intermediate models”? Do you mean partially trained ones?

like L9-D768, L18-D1024, etc.

young sparrow Jun 19, 2023, 4:27 PM

#

obsidian quest like L9-D768, L18-D1024, etc.

Do you have data for those models? They don't seem to be on WandB

obsidian quest Jun 19, 2023, 4:29 PM

#

i mean we havent tested them

young sparrow Jun 19, 2023, 4:31 PM

#

So no, you don’t know that they perform better

#

It’s really important on a scientific level to not make things up like that. If you want to run them great, let’s add them. But you can’t say “oh I know how this experiment we haven’t done will turn out”

tropic minnow Jun 19, 2023, 6:07 PM

#

sullen horizon RWKV@LRA code is hear https://github.com/diggerdu/rwkv-long-range-arena (based...

nice! do you need/want anything in order to run it?

tropic minnow Jun 19, 2023, 6:15 PM

#

obsidian quest like L9-D768, L18-D1024, etc.

yea i think if we don't have a better datapoint in our data, then that point is the optimal one we have been able to get so far. imo the methods should be as good as possible, even if they don't account for corrections that we might have intuition on but are unproven so far.

young sparrow Jun 19, 2023, 8:00 PM

#

This is the best we can get with the current data. In the last plot, we see the slope of the line corresponding to Chinchilla scaling. I do believe that this line is likely much closer to the true value, but we don't have the sampling density to really tell.

#

Screen_Shot_2023-06-19_at_4.00.53_PM.png

Screen_Shot_2023-06-19_at_4.01.10_PM.png

Screen_Shot_2023-06-19_at_4.01.18_PM.png

Screen_Shot_2023-06-19_at_4.01.27_PM.png

#

(click on images to see the equation for the trend line and r^2)
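
For readers without access to the screenshots, a trend line and r^2 of this kind can be reproduced with an ordinary least-squares fit in log-log space. The sketch below assumes a pure power law, loss = a * compute^b; that functional form, the function name, and the synthetic data are assumptions for illustration, not the exact procedure behind the plots above.

```python
import math


def fit_power_law(xs, ys):
    """Fit y = a * x**b by least squares on (log x, log y).

    Returns (a, b, r_squared), where r_squared is computed in log space.
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((u - mean_x) ** 2 for u in lx)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    b = sxy / sxx                       # slope of the trend line
    log_a = mean_y - b * mean_x         # intercept of the trend line
    ss_res = sum((v - (log_a + b * u)) ** 2 for u, v in zip(lx, ly))
    ss_tot = sum((v - mean_y) ** 2 for v in ly)
    return math.exp(log_a), b, 1.0 - ss_res / ss_tot


# Synthetic points lying exactly on loss = 10 * compute**-0.05.
compute = [1e15, 1e16, 1e17, 1e18]
loss = [10.0 * c ** -0.05 for c in compute]
a, b, r2 = fit_power_law(compute, loss)
print(round(a, 6), round(b, 6), round(r2, 6))
```

With only a handful of points on the envelope, the fitted slope is sensitive to which points are included, which is the sampling-density concern raised above.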

young sparrow Jun 19, 2023, 9:47 PM

#

There's a bit of missing data still running (will be done by the end of the day today) but I otherwise have the missing plots as well

last mauve Jun 19, 2023, 10:00 PM

#

young sparrow There's a bit of missing data still running (will be done by the end of the day ...

what's going on with boolq

young sparrow Jun 19, 2023, 10:00 PM

#

last mauve what's going on with boolq

The zeros are placeholders for missing values, to keep the script from crashing

last mauve Jun 19, 2023, 10:01 PM

#

ah ok

young sparrow Jun 19, 2023, 10:01 PM

#

aka "I forgot to save most of the BoolQ results"
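
Zero placeholders keep a script running but read as real scores on a plot or in an average. A minimal alternative, sketched here with hypothetical names and made-up numbers, is to store missing results as NaN (or None) and skip them explicitly.

```python
import math


def mean_ignoring_missing(scores):
    """Average the scores that exist; missing ones are None or NaN."""
    present = [s for s in scores if s is not None and not math.isnan(s)]
    return sum(present) / len(present) if present else math.nan


# Hypothetical accuracies for one model; the middle task has not finished.
with_nan = [0.5, math.nan, 0.7]
with_zero = [0.5, 0.0, 0.7]

print(mean_ignoring_missing(with_nan))    # about 0.6
print(sum(with_zero) / len(with_zero))    # about 0.4, dragged down by the placeholder
```

Plotting libraries generally leave a gap for NaN values rather than drawing a point at zero, so an unfinished run is less likely to be mistaken for a performance drop.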

karmic tree Jun 19, 2023, 11:06 PM

#

young sparrow There's a bit of missing data still running (will be done by the end of the day ...

Very cool. OPT is weird in COPA, interesting finding. Could I ask for bigger symbols? These red/green tones are a pain to distinguish and the up/down triangles aren't so distinct this size. Happy to edit the graphing code

young sparrow Jun 19, 2023, 11:09 PM

#

karmic tree Very cool. OPT is weird in COPA, interesting finding. Could I ask for bigger sym...

Yeah I've put no effort into the data viz, planning on doing that in a bit

young sparrow Jun 19, 2023, 11:52 PM

#

@obsidian quest I'm noticing that some of the RWKV evaluations are using acc_norm and others are using acc. Do you have all the results for acc?
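
The difference matters because the two metrics can pick different answers on the same multiple-choice item. The sketch below illustrates the general idea: acc takes the choice with the highest log-likelihood, while acc_norm first divides each log-likelihood by the length of the choice text. Normalizing by character count is an assumption about how the eval harness behaves, and `pick_choice` and the numbers are invented for illustration.

```python
def pick_choice(loglikelihoods, choices, normalize=False):
    """Index of the predicted choice.

    normalize=False mimics acc; normalize=True mimics acc_norm by dividing
    each log-likelihood by the character length of its choice (assumed).
    """
    scores = [
        ll / len(choice) if normalize else ll
        for ll, choice in zip(loglikelihoods, choices)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


choices = ["yes", "a much longer answer"]
lls = [-3.0, -6.0]  # hypothetical summed log-likelihoods

print(pick_choice(lls, choices))                  # 0: raw score favors the short choice
print(pick_choice(lls, choices, normalize=True))  # 1: per-character score favors the long one
```

Mixing the two across models or tasks in one table therefore compares slightly different decision rules.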

last mauve Jun 20, 2023, 2:33 AM

#

last mauve Just added another batch of authors. Some that I'm still missing: - ~~Michael Ch...

@everyone -- Many of these still need done. If you don't create and send me these OpenReview accounts by the paper deadline you will not be an author.

young sparrow Jun 20, 2023, 3:59 AM

#

karmic tree Very cool. OPT is weird in COPA, interesting finding. Could I ask for bigger sym...

Plots now look like this (see paper for all of them), but if you'd like to fiddle with it more you're welcome to

Screen_Shot_2023-06-19_at_11.58.37_PM.png

outer vine Jun 20, 2023, 9:28 AM

#

hello, i think i may have found a small bug in the RWKV initialization. In the paper, we say that we initialize all of W_{r}, W_{k}, W_{v} to zeros, but that is not the case in RWKV-4

#

https://github.com/BlinkDL/RWKV-LM/blob/cca1b5e8e597cf40675882bb10b46287c844e35c/RWKV-v4/src/model.py#L148

GitHub

RWKV-LM/RWKV-v4/src/model.py at cca1b5e8e597cf40675882bb10b46287c84...

RWKV is an RNN with transformer-level LLM performance. It can be directly trained like a GPT (parallelizable). So it's combining the best of RNN and transformer - great performance, fast in...

#

this line of code is never used in the Init function (which i believe is to control the zero initialization)

#

by adding a print debug line here, i also find that the matrix is not initialized to 0

#

does anyone also notice this? initializing all the parameters to be zeros seems a little bit weird

void quartz Jun 20, 2023, 1:02 PM

#

outer vine hello, i think i may find a little bug about RWKV initialization. In the paper, ...

please check against v4neo (instead of v4)

#

https://github.com/BlinkDL/RWKV-LM/blob/cca1b5e8e597cf40675882bb10b46287c844e35c/RWKV-v4neo/src/model.py#L550

obsidian quest Jun 20, 2023, 1:36 PM

#

outer vine hello, i think i may find a little bug about RWKV initialization. In the paper, ...

see #1103039376184852622 message

#

pls fix that part

outer vine Jun 20, 2023, 2:04 PM

#

In RWKV-v4, this line of code is never called. So I believe no parameter matrix is initialized to all zeros

#

I am testing v4neo

#

but i think our paper should correspond to v4?

outer vine Jun 20, 2023, 2:22 PM

#

sorry, but i can't run v4neo because of this AttributeError: partially initialized module 'charset_normalizer' has no attribute 'md__mypyc'

#

but it looks good

#

so please fix this in the appendix. Thanks

outer vine Jun 20, 2023, 2:25 PM

#

void quartz please check against v4neo (instead of v4)

thanks for response

fickle hare Jun 20, 2023, 2:42 PM

#

outer vine but i think our paper should correspond to v4?

no it's v4neo

#

(I guess it's worth mentioning somewhere in the paper, or we could just clean up the obsolete versions in the code base)

#

git always keeps history, so leaving them there unused is unnecessary

young sparrow Jun 20, 2023, 2:47 PM

#

@fickle hare We should absolutely make a cleaned up codebase that only has the necessary components. The current codebase is pretty unusable to a new person.

fickle hare Jun 20, 2023, 2:53 PM

#

I've been working on a new Lightning 2.0-based trainer using the new CLI (the most recent improvements are by @void quartz). It's pretty usable now for finetuning, but data preprocessing is still in a preliminary state, and model initialization is missing. Just too busy these days.

young sparrow Jun 20, 2023, 2:55 PM

#

@obsidian quest which of the models are the ones hosted on the RWKV HF page? How many tokens were they trained, did they do sequence length extension?

outer vine Jun 20, 2023, 2:58 PM

#

fickle hare I've been working on a new Lightning 2.0-based trainer using the new CLI (the mo...

glad to see someone is using Lightning2.0 and its CLI rather than argparser or hydra🙌

fickle hare Jun 20, 2023, 2:58 PM

#

The rwkv-pile series are all trained on the 332G Pile. Checkpoints with ctxlen>1024 in the file name come from sequence extension.

young sparrow Jun 20, 2023, 3:00 PM

#

fickle hare The `rwkv-pile` series are all trained on the 332G Pile. Checkpoints with ctxlen...

None of the models on HF have that file name, so those models haven’t been uploaded to HF yet?

fickle hare Jun 20, 2023, 3:01 PM

#

ah

#

let me see

#

https://huggingface.co/BlinkDL/rwkv-4-pile-14b/blob/main/RWKV-4-Pile-14B-20230313-ctx8192-test1050.pth

RWKV-4-Pile-14B-20230313-ctx8192-test1050.pth · BlinkDL/rwkv-4-pile...

#

such as this one

#

However I now wonder if the 8043 epoch checkpoint is anywhere... https://huggingface.co/BlinkDL/rwkv-4-pile-14b/blob/main/RWKV-4-Pile-14B-20230213-8019.pth for example 8019 is the latest I could find

RWKV-4-Pile-14B-20230213-8019.pth · BlinkDL/rwkv-4-pile-14b at main

young sparrow Jun 20, 2023, 3:06 PM

#

Blink told me that the official versions were going to be the ones on the RWKV org page

fickle hare Jun 20, 2023, 3:07 PM

#

i see. then i have no clue which checkpoint did they convert to HF format 😭

void quartz Jun 20, 2023, 3:32 PM

#

outer vine glad to see someone is using Lightning2.0 and its CLI rather than argparser or h...

you can find it here - if you want to read through it as an alternative - the existing v4neo has lots of "experiment flags" and can be hard to read : https://github.com/Blealtan/RWKV-LM-LoRA/tree/dev-infctx

( I am still helping bugfix and test it by using it extensively in my current experiments - will be helping adding the missing model init / preprocessing - cause i need it too 😉 )

GitHub

GitHub - Blealtan/RWKV-LM-LoRA at dev-infctx

RWKV is a RNN with transformer-level LLM performance. It can be directly trained like a GPT (parallelizable). So it's combining the best of RNN and transformer - great performance, fast inf...

fickle hare Jun 20, 2023, 3:41 PM

#

at some point it should go to a distinct repo and migrate to HF model, after state chained backward is supported by transformers.rwkv

young sparrow Jun 20, 2023, 4:30 PM

#

The "attention free models" section of the Related Work section was getting a little long so I split out the RNNs into a third subsection.

#

I also added some details about Hyena, as that's simultaneous work where they train a single-digit billion parameter state space model (and compare to us!)

neon night Jun 21, 2023, 7:04 AM

#

last mauve Just added another batch of authors. Some that I'm still missing: - ~~Michael Ch...

[email protected] @last mauve

outer vine Jun 21, 2023, 2:25 PM

#

for anyone who is interested in a clean code base of RWKV for comparison with GPT-series

#

https://github.com/Hannibal046/nanoRWKV

GitHub

GitHub - Hannibal046/nanoRWKV: minimal implementation of RWKV langu...

minimal implementation of RWKV language model following nanoGPT - GitHub - Hannibal046/nanoRWKV: minimal implementation of RWKV language model following nanoGPT

young sparrow Jun 21, 2023, 2:38 PM

#

Amazing!

last mauve Jun 21, 2023, 6:10 PM

#

@young sparrow -- Are you able to get your scaling + eval plots in today?

young sparrow Jun 21, 2023, 6:11 PM

#

last mauve @young sparrow -- Are you able to get your scaling + eval plots in today?

I'm doing it now

last mauve Jun 21, 2023, 6:11 PM

#

Paper's due friday and I want ppl to be able to update writing accordingly in time

spiral minnow Jun 21, 2023, 6:15 PM

#

Figure 3 (0-shot performance on LM eval): Any intuitions on why Pythia performance drops significantly for the point with highest compute?

young sparrow Jun 21, 2023, 6:23 PM

#

spiral minnow Figure 3 (0-shot performance on LM eval): Any intuitions on why Pythia performan...

I filled in a couple missing data points with 0s. They're currently running

spiral minnow Jun 21, 2023, 6:24 PM

#

young sparrow I filled in a couple missing data points with 0s. They're currently running

Gotcha!

young sparrow Jun 21, 2023, 6:38 PM

#

@obsidian quest The paper currently says:

The number of parameters for each model is computed using the formula $\text{\# parameters} = 2VD + 13D^2L + D(11L+4)$, where $V = 50277$ is the vocabulary size, $D$ is the model dimension, and $L$ is the number of layers. FLOPs are reported for a forward pass on one token, calculated as $2(2VD + 13D^2L)$, which is twice (one add and one multiply) the number of parameters in the linear layers. The backward pass FLOPs can be approximated as twice that of the forward pass, giving a total of $6(2VD + 13D^2L)$ FLOPs per token. Notably, this matches the standard formula for FLOP calculations in transformers \citep{kaplan2020scaling}: $$\text{FLOP} = 6\cdot [\text{\# tokens}]\cdot [\text{\# parameters}].$$

Can you confirm that this is correct
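
For anyone checking the arithmetic, the formulas quoted in this message can be sketched in Python. This is an illustration of the paper text only; the function names are made up here and are not from the RWKV codebase, and the L12 D768 example is the configuration mentioned earlier in this thread.

```python
VOCAB_SIZE = 50277  # V, as quoted in the paper text


def param_count(n_layer: int, d_model: int, vocab: int = VOCAB_SIZE) -> int:
    """Total parameters: 2VD + 13 D^2 L + D(11L + 4)."""
    return (2 * vocab * d_model
            + 13 * d_model ** 2 * n_layer
            + d_model * (11 * n_layer + 4))


def non_embedding_params(n_layer: int, d_model: int) -> int:
    """The 13 D^2 L term only, as used for the scaling-law fit."""
    return 13 * d_model ** 2 * n_layer


def train_flops_per_token(n_layer: int, d_model: int, vocab: int = VOCAB_SIZE) -> int:
    """Forward FLOPs 2(2VD + 13 D^2 L), times 3 to include the backward pass."""
    forward = 2 * (2 * vocab * d_model + 13 * d_model ** 2 * n_layer)
    return 3 * forward


print(param_count(12, 768))            # 169342464
print(non_embedding_params(12, 768))   # 92012544
print(train_flops_per_token(12, 768))  # 1015428096
```

The FLOP formula leaves out the small D(11L+4) term, so it equals 6 times the linear-layer parameter count rather than 6 times the full count; the difference is about 0.06% for this configuration.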

young sparrow Jun 21, 2023, 6:44 PM

#

@tough crane were you the person who put together the evaluation tables in the appendix?

tough crane Jun 21, 2023, 6:51 PM

#

young sparrow @tough crane were you the person who put together the evaluation tables...

Yes, in part, but I did not compute the metric values (another person pasted the raw metric values). Are there any wrong or incorrect values?

young sparrow Jun 21, 2023, 6:56 PM

#

tough crane yes in part but I did not compute its metric vals(other person pasted a raw metr...

The models we compare to in the tables are inconsistent with the plots. I think it would be a good idea to update the tables to show the numerical scores from the plots.

#

(This is a historical artifact of Blink starting this work before Pythia existed)

#

If you post the code that generates the tables I can update it pretty easily

tough crane Jun 21, 2023, 7:07 PM

#

young sparrow If you post the code that generates the tables I can update it pretty easily

If you post the code

Is this a latex code or code for model inferences?

young sparrow Jun 21, 2023, 7:07 PM

#

tough crane > If you post the code Is this a latex code or code for model inferences?

The code for generating the plots from the evals

tough crane Jun 21, 2023, 7:23 PM

#

young sparrow The code for generating the plots from the evals

As you pointed out, our script/notebook loads Blink's older experiment results from RWKV.csv

📎 accuracy.ipynb

#

Blink's experiment accs

📎 RWKV.csv

young sparrow Jun 21, 2023, 7:24 PM

#

tough crane As you pointed out, ours script/notebook is to load Blink's older experiments `...

So, did you just manually make the table in the paper?

tough crane Jun 21, 2023, 8:17 PM

#

young sparrow So, did you just manually make the table in the paper?

I could not find the exact converter that produced the Overleaf source at the moment, but this one is nearly the same. (I added the captions, and someone else probably modified the source afterwards.)

📎 conv2latex.ipynb

young sparrow Jun 21, 2023, 8:57 PM

#

@last mauve @tropic minnow The scaling laws and evaluation sections are largely done. I'm tweaking some of the wording of the context length extension experiments because it's not true that quadratic transformers can't scale to a context length of 8k, but don't currently anticipate major changes to the sections. We were far over the page limit, so I commented out Section 2: Related Work and it seems to fit pretty well now. If people are okay with that not being in the main body, I can move it to the appendix.

#

The current version has the plots in the main body because I find plots much more accessible than tables, but we could flip that and put the tables in the main body (LLAMA does this, for example, but most don't). This would require substantially less space.

last mauve Jun 21, 2023, 9:17 PM

#

young sparrow <@367104793292046338> <@469771066399784971> The scaling laws and evaluation sect...

Got it. For related work, I'm of the opinion that we shouldn't include all of these eval figures in the main body. I think we should just have 3-6 of the most influential evals, then put the rest in the appendix along with tables. The related work section for this paper is especially important since few are familiar with attention-free transformers.

young sparrow Jun 21, 2023, 9:24 PM

#

The spacing is janky but I've made that change

#

I generally dislike reporting mean accuracy across tasks, but that's something we can do here

#

I have to run, but I can make the new table tonight or tomorrow morning

#

The more I think about the sequence length stuff the more suspicious of it I am though

#

This shows loss on the Pile batched by the sequence length of the sequences that we are evaluating on.

#

The claim this plot is making is that we perform better at predicting long sequences than short ones. Maybe that’s to be expected (though I don’t think so) but the effect size worries me. That’s a huge drop in loss!

#

The left half of the image is basically meaningless because sequences of a handful of tokens are often noise

#

But the idea that we see a real loss decrease when subsetting our evals to 8k sequences instead of 1k ones seems suspicious to me

spring fulcrum Jun 21, 2023, 11:02 PM

#

the ALiBi paper has an appendix on “the early token curse” as a cause for ppl decreasing as seqlen increases: https://arxiv.org/abs/2108.12409

arXiv.org

Train Short, Test Long: Attention with Linear Biases Enables Input ...

Since the introduction of the transformer model by Vaswani et al. (2017), a
fundamental question has yet to be answered: how does a model achieve
extrapolation at inference time for sequences that are longer than it saw
during training? We first show that extrapolation can be enabled by simply
changing the position representation method, though ...

obsidian quest Jun 22, 2023, 12:11 AM

#

young sparrow The claim this plot is making is that we *perform better* at predicting long seq...

the loss difference between 2^7 = 128 and 2^12 = 4096 is not much

tough crane Jun 22, 2023, 7:12 AM

#

Should we run more experiments at much longer lengths, from 2^13 up to the lengths used by comparable models such as Hyena?

tough crane Jun 22, 2023, 7:36 AM

#

ALiBi's experiments test up to 6k

young sparrow Jun 22, 2023, 12:17 PM

#

tough crane Should we have more experiments for much longer lengths from 2^13 to the length ...

I think it would be better, but that there isn’t time

fickle hare Jun 22, 2023, 1:02 PM

#

at least cut the half below 128 tokens which is not really meaningful?

young sparrow Jun 22, 2023, 1:15 PM

#

fickle hare at least cut the half below 128 tokens which is not really meaningful?

Yeah, we should absolutely cut the points below 2^5

last mauve Jun 22, 2023, 6:58 PM

#

Ok so we're near the finish line here

young sparrow Jun 23, 2023, 3:50 AM

#

Here's what the average across all 12 NLP tasks looks like btw

Screen_Shot_2023-06-22_at_11.50.04_PM.png

young sparrow Jun 23, 2023, 8:00 PM

#

@obsidian quest @last mauve @tropic minnow I've done a lot of fiddling with the paper with the primary goal of making sure that all the results in the appendix are actually referenced in the main text, while not going over the page limit. I'm stopping now before I go crazy fiddling over details.

#

(Feel free to disregard if you don't want to update the submitted paper.)

last mauve Jun 23, 2023, 9:14 PM

#

young sparrow <@870137517020688415> <@367104793292046338> <@469771066399784971> I've done a lo...

You fixed every issue I had on my TODO list before submitting 🙂

#

I'm submitting a final version now. If anyone has any last-minute edits they want reflected before the deadline tonight, ping me here.

obsidian quest Jun 27, 2023, 4:37 AM

#

hi should remove this arrow

tropic minnow Jun 27, 2023, 7:39 AM

#

obsidian quest hi should remove this arrow

Yes probably. It doesn't represent tokenshift appropriately… will remove it and update the latex figure, but emnlp submission is already done…🙃so it’ll have to go in the updated version

outer vine Jun 27, 2023, 5:15 PM

#

hi, may i ask what is the detailed setting in benchmarking rwkv inference in the Figure 7 of the paper? From my side, i couldn't get the same results.

#

this is the code i use: https://gist.github.com/Hannibal046/b57f44779484b466f3d33f537c87443d

#

and this is the result:

#

on one A100, float32, no compile, batch=1, generate 1024 new tokens

#

BTW, the Figure 7 in paper is never referred or explained

young sparrow Jun 27, 2023, 5:42 PM

#

outer vine BTW, the Figure 7 in paper is never referred or explained

Are you looking at the arXiv copy or the new copy

outer vine Jun 27, 2023, 5:42 PM

#

emnlp version

tropic minnow Jun 27, 2023, 6:15 PM

#

outer vine and this is the result:

~~how are you using RWKV?~~

tropic minnow Jun 27, 2023, 6:17 PM

#

outer vine this is the code i use: https://gist.github.com/Hannibal046/b57f44779484b466f3d3...

these were the scripts used

📎 inference_time_rwkv_1.py 📎 inference_time.py

#

probably it would be better to release scripts in the open for people to reproduce. @snow zealot are you ok?

outer vine Jun 27, 2023, 6:19 PM

#

thanks so much! I would check this

snow zealot Jun 27, 2023, 6:22 PM

#

tropic minnow probably it would be better to release scripts in the open for people to reprodu...

Is it ok for EMNLP?

outer vine Jun 27, 2023, 6:23 PM

#

i notice that rwkv is tested with the original implementation rather than the HF implementation

#

are there any problems with the HF implementation yet?

#

I simply use this:

young sparrow Jun 27, 2023, 6:49 PM

#

outer vine i notice that rwkv is tested with original implementation rather than HF impleme...

The NLP evaluations in the paper use the models found here: https://huggingface.co/RWKV

RWKV (RWKV)

outer vine Jun 27, 2023, 6:53 PM

#

yeah, i tested models from this HF space. But when using the model.generate() method rather than forward() with the torch profiler, the gap is not as large as shown in the paper

tropic minnow Jun 27, 2023, 7:43 PM

#

snow zealot Is it ok for EMNLP?

i mean the RWKV codebase is public... and the preprint as well... so i think as long as we dont promote it it should be... but yea we can wait probably

snow zealot Jun 27, 2023, 7:51 PM

#

outer vine yeah, i test models from this HF space. But from using model.generate() method r...

when we ran the tests the HF implementation had some bugs
#1076516707201466388 message

outer vine Jun 28, 2023, 7:27 AM

#

snow zealot when we run the tests the HF implementation had some bugs https://discord.com/ch...

It seems the benchmark scripts don't use the kv cache for the transformer-based models?

#

I believe it would be a more equitable comparison if we could also pass the KV cache to the Transformer while providing the state to the RWKV. This would ensure a fair assessment of both methods.

#

tropic minnow Jun 28, 2023, 8:59 AM

#

outer vine I believe it would be a more equitable comparison if we could also pass the KV c...

kind of, but then you'll OOM faster wont you?

outer vine Jun 28, 2023, 9:01 AM

#

why would the kv cache cause a faster OOM compared with full computation?

#

https://discuss.huggingface.co/t/generate-using-k-v-cache-is-faster-but-no-difference-to-memory-usage/31272/2

Hugging Face Forums

Generate: using k-v cache is faster but no difference to memory usage

Nice write-up! I think the decoder sequence length and the hidden states of the model might be too small to see a difference here in VRAM. The reason VRAM should be higher when caching the k,v states is because we cache the projected k,v states of every layer. This means that our cache is of size: 2 * (hidden_size) * (num_layers) * (decoder_l...

#

ok, for 80G a100, batchsize=1, would this be a big problem?
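
For a rough sense of scale, the cache size can be estimated directly. This is a back-of-the-envelope sketch: the function name is hypothetical, the layer count and hidden size in the usage line are illustrative placeholders rather than the configuration of any model benchmarked in this thread, and it assumes one key vector and one value vector of size `hidden_size` per layer per cached token.

```python
def kv_cache_bytes(num_layers: int, hidden_size: int, seq_len: int,
                   batch_size: int = 1, bytes_per_value: int = 4) -> int:
    """Estimate KV cache size: keys + values, for every layer and every cached token."""
    return 2 * num_layers * hidden_size * seq_len * batch_size * bytes_per_value


if __name__ == "__main__":
    # Illustrative values (assumption): 32 layers, hidden size 2560,
    # 1024 cached tokens, batch size 1, float32 (4 bytes per value).
    size = kv_cache_bytes(32, 2560, 1024)
    print(size / 2 ** 30, "GiB")  # 0.625 GiB
```

Under these assumptions the cache is well under 1 GiB, which is small next to 80 GB of memory; it grows linearly with sequence length and batch size.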

tropic minnow Jun 28, 2023, 9:08 AM

#

hmm i see... okay we can try that? is it easy to setup in HF?

#

seems all GPTNeoXForCausalLM have the use_cache=True option we could use.

outer vine Jun 28, 2023, 9:12 AM

#

yeah, all AutoModelForCausalLM in HF have use_cache option

#

but it is only useful when calling model.generate() method

tropic minnow Jun 28, 2023, 9:16 AM

#

outer vine but it is only useful when calling `model.generate()` method

doesn't the forward method work? it seems from the docs it returns KV as well, which you can pass to next call

tropic minnow Jun 28, 2023, 9:17 AM

#

obsidian quest hi should remove this arrow

this is done

outer vine Jun 28, 2023, 9:18 AM

#

outer vine but it is only useful when calling `model.generate()` method

by "useful", i mean it is only used when the model needs to generate something. It certainly works for forward(), since model.generate() consists of a bunch of forward() calls

tropic minnow Jun 28, 2023, 9:20 AM

#

outer vine here "useful", i mean it was only used in when model needs to generate something...

ok. ill modify the script and will re-run the experiment soon (tmrw?). with that to see if there's any diff.

outer vine Jun 28, 2023, 9:24 AM

#

I believe this problem presents a certain level of complexity, as the real-time cost is determined by a combination of factors such as the algorithm (architecture) and hardware (V-RAM, GPU generation). There are numerous options for benchmarking the inference speed by combining these elements, such as using a GPU with small V-RAM, a GPU without tensor cores, or a GPU that does not support bf16, among others.

However, the most straightforward approach, in my opinion, would be the following:

start_time = time.time()  
new_tokens = GPT/RWKV.generate()  
end_time = time.time()

Even with this method, there are still various possible variations. For instance, if we were to test on a GPU with limited V-RAM, a transformer-based model with kv cache might need to perform frequent exchanges between GPU and CPU memory, which could result in significant latency.
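
The wall-clock approach above can be sketched as a small reusable helper. This is a minimal sketch, not the script used for the paper: `time_call` and `fake_generate` are hypothetical names, and `fake_generate` only stands in for a real `generate()` call. When timing GPU work, the device has to be synchronized before reading the clock (for example with `torch.cuda.synchronize()`), otherwise only the kernel launch time is measured.

```python
import time


def time_call(fn, *args, warmup: int = 1, repeats: int = 3, **kwargs) -> float:
    """Return the best wall-clock time (in seconds) of fn(*args, **kwargs)."""
    # Warm-up runs absorb one-off costs such as compilation or cache allocation.
    for _ in range(warmup):
        fn(*args, **kwargs)
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args, **kwargs)
        # For GPU models, synchronize the device here before stopping the clock.
        timings.append(time.perf_counter() - start)
    return min(timings)


def fake_generate(num_new_tokens: int) -> list:
    """Stand-in for model.generate(); does a little CPU work per 'token'."""
    return [sum(range(1000)) for _ in range(num_new_tokens)]


if __name__ == "__main__":
    print(f"{time_call(fake_generate, 1024):.6f} s")
```

Taking the minimum over a few repeats after a warm-up is one common convention; reporting the mean or median is an equally reasonable choice.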

#

I would like to kindly recommend that, for a model with favorable time and space complexity during inference, it would be beneficial to utilize a product-level GPU such as the K40 for comparisons with other Transformer-based models. It is worth noting that employing an A100 GPU for serving is not a common practice within the industry.

tropic minnow Jun 28, 2023, 11:22 AM

#

outer vine I believe this problem presents a certain level of complexity, as the real-time ...

we run the tests on an a100 80gb gpu which is the best as it can get atm imo

outer vine Jun 28, 2023, 12:19 PM

#

tropic minnow we run the tests on an a100 80gb gpu which is the best ~~as~~ it can get atm imo

why is it the best experimental setting?

outer vine Jul 4, 2023, 6:17 PM

#

Hi, @tropic minnow , did you finish the code? I implemented it on my side and the results are the opposite. This is on one A100(80G) gpu.

#

#

with this gist of code: https://gist.github.com/Hannibal046/b57f44779484b466f3d33f537c87443d

#

i can't think of a reason not to use the kv cache for a transformer model at inference

outer vine Jul 4, 2023, 6:19 PM

#

tropic minnow these were the scripts used

and the memory usage of this code is also wrong. It keeps gradients.

#

so i would suggest using a product-level GPU / longer context / larger model size to show the superiority of an inference-friendly model like RWKV.

#

the current figure in the paper is misleading

#

tropic minnow Jul 4, 2023, 7:03 PM

#

outer vine Hi, <@469771066399784971> , do you finish the code? I implement it on my side an...

sorry been busy

tropic minnow Jul 4, 2023, 7:04 PM

#

outer vine and the memory usage of this code is also wrong. It keeps gradient.

hmm well we can try detaching

obsidian quest Jul 5, 2023, 3:12 AM

#

outer vine

RWKV HF package is still buggy. pls use rwkv pip package with
os.environ['RWKV_JIT_ON'] = '1'
os.environ["RWKV_CUDA_ON"] = '1'
example code: https://github.com/BlinkDL/ChatRWKV/blob/main/API_DEMO.py

GitHub

ChatRWKV/API_DEMO.py at main · BlinkDL/ChatRWKV

ChatRWKV is like ChatGPT but powered by RWKV (100% RNN) language model, and open source. - ChatRWKV/API_DEMO.py at main · BlinkDL/ChatRWKV

outer vine Jul 5, 2023, 4:55 AM

#

hi, @obsidian quest where is the buggy part of RWKV HF implementation? Maybe i can help fix it. The main point here is that if we use a large enough and fast enough GPU(A100) to benchmark inference speed, the Transformer is also linear. Check this:

#

#

I would recommend a detailed derivation here: https://kexue.fm/archives/8610.

obsidian quest Jul 5, 2023, 5:01 AM

#

outer vine hi, <@870137517020688415> where is the buggy part of RWKV HF implementation? May...

you should show "time to generate the # token" (the derivative) instead of cumulative time

obsidian quest Jul 5, 2023, 5:01 AM

#

outer vine hi, <@870137517020688415> where is the buggy part of RWKV HF implementation? May...

try rwkv pip package first

outer vine Jul 5, 2023, 5:02 AM

#

cumulative time is what we show in the paper..

#

obsidian quest Jul 5, 2023, 5:05 AM

#

with the correct implementation, rwkv will be like a const line around 10ms

outer vine Jul 5, 2023, 5:05 AM

#

ok, i would try rwkv pip

outer vine Jul 5, 2023, 9:25 AM

#

hi, I tried rwkv package and this is the result

#

#

but i am not so sure if this is a fair comparison

obsidian quest Jul 5, 2023, 12:01 PM

#

rwkv pip package is using pytorch for almost everything, except the WKV operator

#

while HF transformers are using MHA operator instead

#

both WKV and MHA are CUDA operators, so i will say it's a fair comparison

outer vine Jul 5, 2023, 1:15 PM

#

but rwkv pip uses torchscript, right? Actually, transformers could be fast enough with various optimization techniques (e.g. vllm)

#

#

and i have looked through the HF implementation and found no obvious bugs

#

one possible difference was: HF implementation doesn't use wkv kernel when doing inference

#

outer vine Jul 5, 2023, 2:16 PM

#

oh, i find that HF implementation could be significantly boosted by using torch.compile

#

obsidian quest Jul 5, 2023, 2:39 PM

#

can you try it for rwkv-pip too?

fickle hare Jul 5, 2023, 3:39 PM

#

the pip version is using torch 1.x jit, might be less efficient than 2.0 compile

outer vine Jul 6, 2023, 1:15 AM

#

obsidian quest can you try it for rwkv-pip too?

hi, I tried. But jit in rwkv-pip is not compatible with torch.compile

#

#

I am using HF implementation with torch.compile and longer context length

#

https://gist.github.com/Hannibal046/b57f44779484b466f3d33f537c87443d

#

this is consistent with the reasoning that Transformers are only quadratic when # tokens is big enough

obsidian quest Jul 6, 2023, 5:05 AM

#

transformers per-token speed = const factor + linear factor

#

accumulated time = linear factor + quadratic factor
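
The relationship between the two views can be made concrete. This is a minimal sketch with hypothetical function names and made-up cost constants; it only assumes the per-token model stated above (a constant plus a term linear in the position), so the cumulative time is $\sum_{t=1}^{n} (c + s\,t) = c\,n + s\,n(n+1)/2$.

```python
def per_token_time(t: int, const: float, slope: float) -> float:
    """Cost of generating token t with a KV cache: constant + linear in position."""
    return const + slope * t


def cumulative_time(n: int, const: float, slope: float) -> float:
    """Total time to generate n tokens, summed token by token."""
    return sum(per_token_time(t, const, slope) for t in range(1, n + 1))


def cumulative_closed_form(n: int, const: float, slope: float) -> float:
    """Same total in closed form: a linear term plus a quadratic term."""
    return const * n + slope * n * (n + 1) / 2


def per_token_from_cumulative(cumulative: list) -> list:
    """Recover per-token times (the discrete derivative) from a cumulative curve."""
    return [cumulative[0]] + [b - a for a, b in zip(cumulative, cumulative[1:])]


if __name__ == "__main__":
    # Made-up constants purely for illustration.
    totals = [cumulative_time(n, 10, 2) for n in range(1, 5)]
    print(totals)                             # [12, 26, 42, 60]
    print(per_token_from_cumulative(totals))  # [12, 14, 16, 18]
```

When `slope` is small relative to `const`, the quadratic term only becomes visible at large `n`, which is why a cumulative curve over a short context can look linear for both architectures.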

obsidian quest Jul 6, 2023, 5:06 AM

#

outer vine hi, I tried. But jit in rwkv-pip is not compatible with torch.compile

can you show how to do torch.compile for HF rwkv thank you

outer vine Jul 6, 2023, 7:45 AM

#

obsidian quest can you show how to do torch.compile for HF rwkv thank you

Check the gist of code above

outer vine Jul 6, 2023, 7:48 AM

#

obsidian quest accumulated time = linear factor + quadratic factor

Agree, but the figure in the current paper compares transformers without kv cache with rnn-rwkv

#

#

From the Longformer paper

obsidian quest Jul 6, 2023, 11:28 AM

#

outer vine Agree, but the figure in the current paper compares transformers without kv cac...

we can compare torch.compile(HF implementation) of all models (and with kv cache)
accumulated & per-token

outer vine Jul 6, 2023, 11:28 AM

#

yes, that is what i mean

#

the current one only evaluates on 1k context, where the Transformer and RWKV are both linear

#

#

hope this could be reflected on the next version of our paper

tropic minnow Jul 6, 2023, 2:05 PM

#

outer vine hope this could be reflected on the next version of our paper

yes it can be! if we could get all figures ready (this one for 3b), but also about 8192 tokens generation for different sizes that would be great

young sparrow Jul 6, 2023, 2:07 PM

#

I'm also hoping to have MMLU numbers in the next version (though have deprioritized this as we can't update the paper for a couple months still)

outer vine Jul 6, 2023, 3:07 PM

#

tropic minnow yes it can be! if we could get all figures ready (this one for 3b), but also abo...

one click run. The commands are attached below. And you can simply switch to any HF model. https://github.com/Hannibal046/nanoRWKV/blob/main/benchmark_inference_time.py

outer vine Jul 6, 2023, 3:09 PM

#

young sparrow I'm also hoping to have MMLU numbers in the next version (though have deprioriti...

why can't we update the arxiv version now? EMNLP seems fine for this?
do you have the MMLU numbers for now? happy to see

young sparrow Jul 6, 2023, 3:09 PM

#

outer vine why can't we update the arxiv version now? EMNLP seems fine for this? do you ha...

EMNLP is very explicitly not fine with this. Doing so will get our paper rejected.

outer vine Jul 6, 2023, 3:10 PM

#

hhhh, i didn't see it in call for paper. I must have missed something

young sparrow Jul 6, 2023, 3:11 PM

#

You may not make a non-anonymized version of your paper available online to the general community (for example, via a preprint server) during the anonymity period. Versions of the paper include papers having essentially the same scientific content but possibly differing in minor details (including title and structure) and/or in length.

[...]

You may not update the non-anonymized version during the anonymity period, and we ask you not to advertise it on social media or take other actions that would further compromise double-blind reviewing during the anonymity period.

https://2023.emnlp.org/calls/main_conference_papers/#anonymity-period

EMNLP 2023

Call for Main Conference Papers

Official website for the 2023 Conference on Empirical Methods in Natural Language Processing

outer vine Jul 6, 2023, 3:12 PM

#

got it. thanks

obsidian quest Jul 6, 2023, 5:11 PM

#

young sparrow I'm also hoping to have MMLU numbers in the next version (though have deprioriti...

it's bad at multiple choices. requires more such training data lol

young sparrow Jul 6, 2023, 5:16 PM

#

obsidian quest it's bad at multiple choices. requires more such training data lol

Pythia is trained on the same data

young sparrow Jul 7, 2023, 7:21 PM

#

@here I've made a short survey that I would appreciate people taking a moment to fill out. The primary goal is to get a better understanding of who comprises the members of our community. It should just take a minute and will be very useful 🙏

https://forms.gle/eTEtjGK4U7CfKBWT6

Google Docs

EleutherAI Community Survey

The purpose of this survey is to get a better sense of who comprises the EleutherAI community

obsidian quest Jul 8, 2023, 8:25 AM

#

https://twitter.com/BlinkDL_AI/status/1677593798531223552 A tiny RWKV with 2.9M (!) params can solve 18239.715 * 9.728263 or 4.2379 * 564.778 - 1209.01 etc. with CoT, while being 100% RNN (L6-D192) 🤯

BlinkDL (@BlinkDL_AI)

A tiny #RWKV with 2.9M (!) params can solve 18239.715*9.728263 or 4.2379*564.778-1209.01 etc. with CoT, while being 100% #RNN (L6-D192) 🤯 The trick: generate lots of data with reversed numbers (denoted by "f" here) to train the model 🚀 Try it now: https://t.co/l7CDb6Rirl

tender karma Jul 13, 2023, 3:05 PM

#

Hey all, I've just ported RWKV to Fortran! 🚀 Please take a quick look here: https://github.com/FortAI-Hub/rwkv.f90. Would love to hear your thoughts!

young sparrow Jul 14, 2023, 1:42 PM

#

@obsidian quest Let’s start keeping notes on adding languages to RWKV, in case you want to write another paper. It’ll make it easier to not have to go back and figure out what was done after the fact!

void quartz Jul 18, 2023, 4:26 AM

#

Not sure what the procedure for paper feedback / corrections is - rwkv is cited here : https://arxiv.org/abs/2307.08621 - as a model without "training parallelization"

(hoping for someone here to know the process)

arXiv.org

Retentive Network: A Successor to Transformer for Large Language Mo...

In this work, we propose Retentive Network (RetNet) as a foundation
architecture for large language models, simultaneously achieving training
parallelism, low-cost inference, and good performance. We theoretically derive
the connection between recurrence and attention. Then we propose the retention
mechanism for sequence modeling, which supports...

outer vine Jul 18, 2023, 4:46 AM

#

the current implementation of RWKV training is indeed recurrent

#

but in theory, i believe it is also parallelizable

#

#

wkv_{t} actually doesn't depend on wkv_{t-1}

#

this retentive model uses a bunch of tricks to train while only referring to RWKV as Transformer with Time-mixing..

#

#

void quartz Jul 18, 2023, 8:16 AM

#

ahh so if i understood you right, they are using the stricter definition of training parallelisation? So they ain't wrong - but in practice it is a meaningless distinction, because we can saturate our GPUs either way

outer vine Jul 18, 2023, 8:40 AM

#

and i am not sure if they are using torch.complextfloat, which may cause additional overhead

#

also curious, are there any Linearized Attention models being scaled up?

void quartz Jul 18, 2023, 9:10 AM

#

i dun think they changed the RWKV code much - imo, cause there isn't a reason to do so

I guess it boils to the definition of how you define parallelization. This is currently my understanding on how RWKV runs in parallel.

x axis, is tokens, y is the layers somewhat, orange is layer norm, purple is time mix, green is channel mix

#

like strictly speaking everything past the first layer norm does depend on the previous tiles - so if you define parallelization as being able to "compute independently" of other tokens then yes - we are "not parallelizable" in that regard i guess?

even though in practice RWKV is still able to rapidly ramp up, and saturate the GPU across the multiple layers

#

which fits my understanding of "training parallelization" where it is more of "can we split the training process of a single data sample into enough threads to saturate a GPU" haha

outer vine Jul 18, 2023, 10:29 AM

#

void quartz i dun think they changed the RWKV code much - imo, cause there isn't a reason to...

please tell which repository implement this computing flow of RWKV

void quartz Jul 18, 2023, 10:40 AM

#

outer vine please tell which repository implement this computing flow of RWKV

if you're referring to the diagram? - i dun know if it's fully implemented, or partially implemented

its a visualization i have on what is potentially the optimal flow for RWKV from my understanding of the architecture

#

i assume it's at least partially implemented in the main repo, with pytorch / JIT / etc. Otherwise we would never be able to saturate the GPU

#

( might need to get blink to confirm / deny how it flows in the main repo )

outer vine Jul 18, 2023, 10:42 AM

#

IMO, computing green box(channel mix) in parallel would be much faster..

#

i have read the source code in the main repo, it is computed sequentially layer-wise and time-wise

void quartz Jul 18, 2023, 11:04 AM

#

outer vine i have read the source code in the main repo, it is computed sequentially layer-...

yea, but pytorch if i understood correctly, builds a computational graph, and automatically split up the work to run in parallel ?

the question is more of does it actually do it the way we understand it to be haha

obsidian quest Jul 18, 2023, 2:30 PM

#

outer vine but in theory, i believe it is also parallelizable

if they consider large convolutions to be "parallelizable", then RWKV is certainly parallelizable

void quartz Jul 18, 2023, 3:31 PM

#

so i guess next step is to ping the author? not sure if they listed the twitter social media account in the paper (probably not?)

obsidian quest Jul 18, 2023, 4:40 PM

#

void quartz Not sure whats the procedure for paper feedback / corrections is - rwkv is cited...

ok it's basically linear transformer + xPos + exponential decay
so most of the tricks are parallel to rwkv, and i can add them too
now coding it to test

void quartz Jul 18, 2023, 4:41 PM

#

Haha. I will gladly run some experiments if you let me know the changes

outer vine Jul 18, 2023, 4:42 PM

#

obsidian quest ok it's basically linear transformer + xPos + exponential decay so most of the t...

agreed. Another kind of linearized attention

#

their code would be released within one week as said in the github repo

steady ether Jul 20, 2023, 12:37 AM

#

outer vine their code would be released within one week as said in the github repo

Someone made an unofficial implementation: https://github.com/Jamie-Stirling/RetNet

outer vine Jul 20, 2023, 11:51 AM

#

steady ether Someone made an unofficial implementation: https://github.com/Jamie-Stirling/Ret...

wow, amazing!

#

does anyone know much about complex in torch? wouldn't this cause huge latency compared with fp16 with tensorcore?

obsidian quest Jul 23, 2023, 4:09 PM

#

RWKV-5 preview with trainable time_decay
add --my_testing "r" to use it
https://github.com/BlinkDL/RWKV-LM/commit/9143748f8079e7d3c726c2b98a83681242da30f7

now with trainable time_first too:
https://github.com/BlinkDL/RWKV-LM/commit/686c962008676809f17cf2424c193d9dc217c0e4

GitHub

RWKV5 preview · BlinkDL/RWKV-LM@9143748

GitHub

rwkv5 with time_first · BlinkDL/RWKV-LM@686c962

#

void quartz Jul 24, 2023, 6:22 AM

#

For followup paper ideas, to the RWKV paper - would it be best to post it here, or another thread under publishing-help ?

outer vine Jul 24, 2023, 8:03 AM

#

are there some promising results for rwkv-5?

void quartz Jul 24, 2023, 8:47 AM

#

Sort of, though its not part of rwkv-5 yet.

I think i will outline it here first (let me know if i should repost this separately). As this is a compilation of an ongoing experiment between me and a few members of the RWKV community.

#

RWKV memory experiment v5/wavenet - update 1

While RWKV is able to match transformer performance on a wide variety of tasks, it generally stumbles on tasks with large data inputs or randomised datasets that need to be compressed and stored within its internal state by the model (large document Q&A is a major example). Within the RWKV community, this is considered its "only weakness".

As such, an ongoing effort to quantify and benchmark this memory capacity was started, where we measure the model's performance on receiving randomised English word tokens and replying with said tokens

Instruction: Repeat this text exactly as it is
Input: <random word tokens>
Output: <output to benchmark>

In general, transformer models trained to handle this task have no issue with the lookback and providing a full response (within their context length)

The following is the score for raven / custom rwkv4 models

It is important to stress that this should be considered the worst-case scenario for memory capacity, as the raven model has been shown to be able to compress large common concepts into its memory, far exceeding these numbers.

Randomized text was intentionally chosen to represent worst-case numbers, as training cannot help form a pattern for such text
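
A minimal sketch of how such a recall prompt could be generated and scored. Everything here is hypothetical: the word list, the function names, and the word-level mismatch count are illustrative stand-ins, not the benchmark code used for these experiments. The score is defined so that 0 means perfect recall.

```python
import random

# Hypothetical word list; a real run would sample from a much larger vocabulary.
WORDS = ["apple", "river", "quartz", "lantern", "velvet", "orbit", "maple", "signal"]


def make_prompt(n_words: int, rng: random.Random):
    """Build the recall prompt from n_words randomly sampled words."""
    words = [rng.choice(WORDS) for _ in range(n_words)]
    prompt = (
        "Instruction: Repeat this text exactly as it is\n"
        f"Input: {' '.join(words)}\n"
        "Output:"
    )
    return prompt, words


def mismatch_count(expected: list, produced: str) -> int:
    """Count wrong, missing, or extra words; 0 means perfect recall."""
    produced_words = produced.split()
    wrong = sum(1 for e, p in zip(expected, produced_words) if e != p)
    return wrong + abs(len(expected) - len(produced_words))


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the prompt is reproducible
    prompt, words = make_prompt(5, rng)
    print(prompt)
    print(mismatch_count(words, " ".join(words)))  # a perfect echo scores 0
```

Because the words are sampled independently, there is no pattern for the model to exploit, which is what makes this a worst-case measurement.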

#

Subsequently with a standardised benchmark we have internally, we came up with the means of training the model from scratch, and to replicate the results - without needing to train an entirely new PilePlus+Raven model

This allowed us to perform experiments on improving memory capacity. The biggest impact as of now is the change to the channel-shift layer, in how tokenshift is done, into a structure that resembles a wavenet. (this is only a few lines of changes)

We now have a TokenShift 430M model that outperforms the raven 14B model on the memory recall task - furthermore, this is shown to scale upwards, with our TokenShift 1.5B model doubling the 430M performance

#

#

We are training a slightly larger model (L24-D5120), which we believe will be able to retain more than 1k tokens in memory, bringing this within transformer-level context sizes.

It is believed that these modifications to raven 14B would allow it to have perfect recall of 4k tokens (or higher)

(Currently our experiments are bottlenecked by our GPU capacity)

#

We posit that this heightened perfect recall of token memory, at par with transformers context length, could remove the last obstacles preventing RWKV (or other RNN like) architectures from superseding transformers without any compromise.

As it fixes the last set of tasks that it loses out to transformer models in

#

Notes:

The tokenshift memory models we trained have had very limited general text training, so we do not know as of now whether this process will benefit or hinder model performance on other tasks once trained on the pile + etc - the assumption is that it will be an overall benefit. Changes were only made to channel mixing, with time-mixing kept the same, which we believe will help it retain existing reasoning capabilities.

Since Blink's upcoming RWKV-5 changes are only made to the time-mix layer, these changes could potentially be merged and used together.

Currently we do plan to perform memory training and testing on the time-mix rwkv-5 changes, without the tokenshift changes - and subsequently with

#

we drafted the following abstract, and since the members involved have limited to no experience with papers - nor the GPU capacity to take this idea further than memory training (ie. pile+, and instruct tuning)

here i am 😅

void quartz Jul 24, 2023, 9:38 AM

#

(wavenet architecture, on how the token information flow through the layers)

misty cedar Jul 24, 2023, 9:40 AM

#

the change is to swap from this causal convolution structure here, to the dilated wavenet above

#

at least for the first 12 layers

void quartz Jul 24, 2023, 10:04 AM

#

( @misty cedar was the one who made the bulk of the code changes, for these advancements 😉 )

hushed flare Jul 24, 2023, 11:59 AM

#

Does the wave net still have an RNN form?

misty cedar Jul 24, 2023, 12:03 PM

#

hushed flare Does the wave net still have an RNN form?

Yes

#

We have rnn inference code already written for it

hushed flare Jul 24, 2023, 12:05 PM

#

misty cedar We have rnn inference code allready written for it

Is it public or are you waiting to publish before sharing it? I'm curious how you turned a CNN structure into a single state

misty cedar Jul 24, 2023, 12:10 PM

#

hushed flare Is it public or are you waiting to publish before sharing it? I'm curious how yo...

It's public, you can find details in the rwkv discord.
Basically, you just have an array for each layer of shape ( 2**layerID, dims ). You swap out your state object with the last item, and then do a roll on the array
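
A minimal sketch of the per-layer rolling buffer described above, assuming NumPy. The class name `DilatedShiftState` and its `step` method are hypothetical names for illustration; this is not the inference code referred to in the message, only one way the "take the last item, write the new one, roll" idea could look.

```python
import numpy as np

class DilatedShiftState:
    """Hypothetical per-layer buffer of shape (2**layer_id, dims)."""

    def __init__(self, layer_id, dims):
        self.buf = np.zeros((2 ** layer_id, dims), dtype=np.float32)

    def step(self, x):
        # The last row holds the input seen 2**layer_id steps ago.
        shifted = self.buf[-1].copy()
        # Overwrite that oldest row with the current input ...
        self.buf[-1] = x
        # ... then roll so the current input becomes the newest row (index 0).
        self.buf = np.roll(self.buf, 1, axis=0)
        return shifted

# Layer 1 has a buffer of 2 rows, so each output is the input from 2 steps back.
state = DilatedShiftState(layer_id=1, dims=2)
for t in range(1, 6):
    out = state.step(np.full(2, t, dtype=np.float32))
    print(t, out)
```

With `layer_id=0` the buffer has a single row, so under these assumptions the sketch reduces to an ordinary one-step token shift.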

hushed flare Jul 24, 2023, 12:23 PM

#

void quartz Subsequently with a standardised benchmark we have internally, we came up with t...

During your training and task, are you using a random offset and window size before asking it to recall? If the window and offset are fixed, it's easy to learn to do this even with a 1L long conv, and it doesn't generalize.

void quartz Jul 24, 2023, 12:25 PM

#

all the code and the notebook is currently public 👍

void quartz Jul 24, 2023, 12:25 PM

#

hushed flare During your training and task, are you using random offset and window size befor...

we train and test the recall task from 2 tokens all the way to 1000 tokens

hushed flare Jul 24, 2023, 12:25 PM

#

void quartz all the code and the notebook is currently public 👍

Is there a link?

void quartz Jul 24, 2023, 12:27 PM

#

For the tokenshift variant all the notebooks for the runs can be found here : https://github.com/PicoCreator/RWKV-LM-LoRA/tree/rwkv5x-tokenshift-exp-A/notebook/experiment/tokenshift-exp

For tokenshift model C itself it's here: https://github.com/PicoCreator/RWKV-LM-LoRA/blob/rwkv5x-tokenshift-exp-A/notebook/experiment/tokenshift-exp/TokenShift-C/TokenShift-C-mem-finetune.ipynb

(apologies if the experiments are mislabeled here and there due to copy and pasting)

#

For all the benchmark data, including charting, it can be found here : https://github.com/PicoCreator/RWKV-LM-LoRA/blob/picocreator-memory-experiment/notebook/experiment/memory-bench/Charting-benchmark.ipynb

#

a bit harder to explain than the simplified table shown : The following scores the output for each model - x is the input token size that was tested, y is the score (0 means perfect recall)

the table selects the best score based on their respective criteria (among the various tested prompt length)

misty cedar Jul 24, 2023, 12:35 PM

#

layers above 12 are normal unaltered rwkv layers

hushed flare Jul 24, 2023, 12:38 PM

#

You may be interested in this paper, seems similar to what you've proposed: https://arxiv.org/abs/2305.01638

The reason I'm asking all these questions is I'm playing around with it.

arXiv.org

Sequence Modeling with Multiresolution Convolutional Memory

Efficiently capturing the long-range patterns in sequential data sources
salient to a given task -- such as classification and generative modeling --
poses a fundamental challenge. Popular approaches in the space tradeoff between
the memory burden of brute-force enumeration and comparison, as in
transformers, the computational burden of complica...

void quartz Jul 24, 2023, 12:47 PM

#

hushed flare You may be interested in this paper, seems similar to what you've proposed: http...

Nice to see multi-channel proven here for audio (an experiment we planned to try next in pipeline) 😆

outer vine Jul 28, 2023, 8:24 AM

#

wow, this is crazy, a linear attention-based model with 175B parameters, which could be trained in parallel and do generation recurrently
https://arxiv.org/abs/2307.14995

arXiv.org

Scaling TransNormer to 175 Billion Parameters

We present TransNormerLLM, the first linear attention-based Large Language
Model (LLM) that outperforms conventional softmax attention-based models in
terms of both accuracy and efficiency. TransNormerLLM evolves from the previous
linear attention architecture TransNormer by making advanced modifications that
include positional embedding, linear...

young sparrow Jul 28, 2023, 5:52 PM

#

outer vine wow, this is crazy, a linear attention-based model with 175B parameters, which ...

Unfortunately there’s no evidence in this paper that it actually works

#

The only evaluations they do are of partially trained models with 1B parameters or fewer

outer vine Jul 28, 2023, 7:28 PM

#

young sparrow The only evaluations they do are of partially trained models with 1B parameters ...

agreed. The evaluation setting is weak for now. But I heard from some insider that the larger model is under training now.

young sparrow Jul 28, 2023, 7:35 PM

#

outer vine agreed. The evaluation setting is weak for now. But I heard from some insider t...

Call me crazy, but I would train a big model and test its performance before telling the world I made a breakthrough

obsidian quest Jul 29, 2023, 3:36 AM

#

RWKV-5-World-0.1B-v1-OnlyForTest_37%_trained-20230728-ctx4096.pth uploaded https://huggingface.co/BlinkDL/rwkv-5-world/tree/main
supported in rwkv pip package 0.8.7

0.1B world:
RWKV-5 37% trained = LAMBADA ppl 18.1 acc 42.93%
RWKV-4 100% trained = LAMBADA ppl 25.5 acc 36.29%

Interesting fact: RWKV-5 is great at benchmarks (excellent zeroshot performance), but generates quite worse music (just like GPT models) despite lower loss. (try https://huggingface.co/BlinkDL/rwkv-5-music)

This fits my theory: Dot-product is good for uncreative work, while Channelwise is good for creative work.

BlinkDL/rwkv-5-world at main

young sparrow Jul 29, 2023, 4:08 AM

#

obsidian quest RWKV-5-World-0.1B-v1-OnlyForTest_37%_trained-20230728-ctx4096.pth uploaded https...

Do you have RWKV numbers on the Long Range Arena? I’m interested in comparing RWKV to the Hrrformer (code, paper)

obsidian quest Jul 29, 2023, 4:15 AM

#

lets ask @sullen horizon #1103039376184852622 message

void quartz Jul 29, 2023, 5:13 PM

#

young sparrow Call me crazy, but I would train a big model and test its performance *before* t...

didnt realise they only trained 1B models so far

isn't this the same as setting up a ~200B RWKV model, and training for 1 step, and projecting the rest of the loss line 🙈

young sparrow Jul 29, 2023, 5:14 PM

#

void quartz didnt realise they only trained 1B models so far isn't this the same as setting...

It’s a little less egregious, but still obviously bad

void quartz Jul 31, 2023, 9:41 AM

#

void quartz # RWKV memory experiment v5/wavenet - update 1 While RWKV is able to match tran...

RWKV memory experiment v5/wavenet - update 2

(please let me know if i should shift this into a separate thread)

We did a 7.5 / 15% codeparrot dataset train, on both baseline rwkv4 code and rwkv4+tokenshift, to see if the changes have a negative impact on the model's capability in other tasks. All 3 are the same 1.5B param models

From a loss point of view all 3 models converged to similar loss levels, indicating that the token shift changes may not have an adverse impact on other tasks

It is also interesting to note that the codeparrot model itself, had an average loss of 2.06 against its validation dataset - meaning all 3 models despite being trained significantly less - may outperform the codeparrot model

Asking for feedback on how to move these changes / experiments forward - aka what are good evals / tasks to train / validate on which would make good use of the extended memory - ideally without needing to train a full model

obsidian quest Jul 31, 2023, 10:01 PM

#

and you can use https://huggingface.co/datasets/bigcode/starcoderdata @void quartz

bigcode/starcoderdata · Datasets at Hugging Face

hushed flare Aug 1, 2023, 12:21 AM

#

@obsidian quest https://github.com/microsoft/torchscale/commit/bf65397b26469ac9c24d83a9b779b285c1ec640b

GitHub

RetNet · microsoft/torchscale@bf65397

#

This looks so much like your code, no? (the RNN form)

obsidian quest Aug 1, 2023, 12:40 AM

#

AFT = headsz 1 version of LinearTransformer

RWKV4 = ExponentialDecay + Headsz 1

RetNet = ExponentialDecay + Headsz 256, with xPos too (but I find it can be removed)

RWKV5 = ExponentialDecay + Headsz 64, best performance

Headsz N = N x larger state (more vram, slower, still much better than KV cache), helps memorization
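
A rough sketch of the "Headsz N = N x larger state" arithmetic, under stated assumptions: each head keeps a `head_size x head_size` matrix state per layer, and only that state is counted (token-shift and normalisation terms are ignored). The function names and the example sizes are made up for illustration.

```python
def kv_state_elems(dims, head_size):
    """Per-layer recurrent state size, assuming one (head_size x head_size) matrix per head."""
    heads = dims // head_size
    return heads * head_size * head_size  # equals dims * head_size

def kv_cache_elems(dims, seq_len):
    """Per-layer transformer KV cache size: one key and one value vector per token."""
    return 2 * seq_len * dims

dims = 2048
for head_size in (1, 64, 256):
    print(head_size, kv_state_elems(dims, head_size))

# The state is constant in sequence length, while the KV cache grows with it.
print(kv_cache_elems(dims, seq_len=4096))
```

Under these assumptions head size 64 gives a state 64x larger than head size 1, yet it stays fixed as the context grows, which is the sense in which it is "still much better than KV cache".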

#

#

@hushed flare see #1103039376184852622 message

hushed flare Aug 1, 2023, 12:50 AM

#

obsidian quest

I'm surprised there's so much of a difference -- I would have thought if you have head size = 1 then you'd end up with multiple heads learning the same kernel if it was beneficial

obsidian quest Aug 1, 2023, 12:54 AM

#

0.1 loss difference is not too much for a small model. it's similar to the loss of a 30% larger model

void quartz Aug 1, 2023, 2:08 AM

#

obsidian quest and you can use https://huggingface.co/datasets/bigcode/starcoderdata <@64442830...

yea i think having a fully trained pure code model based on RWKV might be of interest, especially if we can validate that it does better than existing models

(the original codeparrot was pure python, easier to test with limited training)

young sparrow Aug 1, 2023, 2:13 AM

#

void quartz yea i think having a fully trained pure code model based on RWKV might be of i...

We’re currently making an improved reprocessing of the StarCoder dataset that can be used for this. But TBH I think code + language is probably better for most things than mono code

void quartz Aug 1, 2023, 2:36 AM

#

young sparrow We’re currently making an improved reprocessing of the StarCoder dataset that ca...

true, considering prompt to code is a large use case

if this is a use case in which RWKV can shine above existing models, it would be a huge step forward for the architecture's recognition among the wider audience
(i think this issue is now more of a GPU compute issue to train such a model)

young sparrow Aug 2, 2023, 2:10 AM

#

Great work, y’all’re getting noticed 🙂

#

At ICML I mentioned RWKV a couple times in a couple conversations and a bunch of people I was talking to knew about it

snow zealot Aug 5, 2023, 7:23 PM

#

young sparrow Do you have RWKV numbers on the Long Range Arena? I’m interested in comparing RW...

Just out of curiosity I trained RWKV V4 on the code (https://github.com/SSamDav/rwkv-long-range-arena/tree/main) from @sullen horizon on 3 LRA benchmarks (listops, imdb, aan). The results are here: https://wandb.ai/ssamtheboy/lra-benchmark
The listops results are a bit sketchy, because I had a run yesterday that performed much better. I probably need to change the default parameters.
Each run has a note saying which dataset it corresponds to.

W&B

ssamtheboy

Weights & Biases, developer tools for machine learning

void quartz Aug 9, 2023, 3:54 PM

#

void quartz Not sure whats the procedure for paper feedback / corrections is - rwkv is cited...

Got someone from the RetNet side to clarify what they meant about RWKV not having "training parallelization".

It basically boils down to the fact that, within a data sample, we need to compute the previous token's state before the next token's state. It has nothing to do with GPU usage - which, to be fair, is very true
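
To make the distinction concrete, here is a toy sketch assuming NumPy and a single scalar decay `w`. It is not RWKV's or RetNet's actual kernel (no normalisation, no per-channel decay); `recurrent_form` and `parallel_form` are hypothetical names. The first function has the token-by-token state dependency described above, the second computes the same values for all positions at once.

```python
import numpy as np

def recurrent_form(kv, w):
    """state_t = w * state_{t-1} + kv_t, computed one token at a time."""
    state = np.zeros_like(kv[0])
    out = []
    for x in kv:
        state = w * state + x  # each step needs the previous state
        out.append(state)
    return np.stack(out)

def parallel_form(kv, w):
    """Same values via state_t = sum_{i<=t} w**(t-i) * kv_i, with no loop over time."""
    T = kv.shape[0]
    t = np.arange(T)
    diff = t[:, None] - t[None, :]                      # t - i for every pair
    decay = np.where(diff >= 0, w ** np.maximum(diff, 0), 0.0)
    return decay @ kv                                   # (T, T) @ (T, d)

rng = np.random.default_rng(0)
kv = rng.normal(size=(16, 4))
print(np.allclose(recurrent_form(kv, 0.9), parallel_form(kv, 0.9)))
```

The parallel form trades the sequential dependency for a `T x T` decay matrix, so the two differ in how the work is scheduled rather than in what is computed.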

The full convo is on github here https://github.com/microsoft/unilm/issues/1243

But i hope that helps clear the air on that topic

Screenshot_2023-08-09_at_11.51.08_PM.png

#

(so put down those pitchforks folks)

spiral minnow Aug 9, 2023, 5:53 PM

#

Still seems like they are bending the definition. Their method also uses recurrence within blocks and within those blocks, the computation cannot be parallelized (as they define it). So, in my understanding, their method also needs to compute the previous token state in order to calculate the current token state, and thus is also not parallelizable according to that definition

obsidian quest Aug 9, 2023, 5:56 PM

#

spiral minnow Still seems like they are bending the definition. Their method also uses recur...

exactly

void quartz Aug 10, 2023, 12:34 AM

#

spiral minnow Still seems like they are bending the definition. Their method also uses recur...

😅 yea im still kinda off on the definition.

Cause while it's true that we do not need to have "state" precomputed in transformers,
don't i also need to compute all the previous tokens in parallel and apply attention, even more so uniquely for every token i generate? Somehow all that additional compute cost is better than having a state between tokens?

(not the expert here, so im gonna take the explanation that it's not about throughput as it is)

void quartz Aug 10, 2023, 4:27 AM

#

void quartz # RWKV memory experiment v5/wavenet - update 2 (please let me know if i should s...

RWKV memory experiment v5/wavenet - update 3

Continuing the series of stress testing the v5 (rotary embedding) and v5+wavenet changes - for memory storage and capacity of random words.

We have now officially passed the 1k token mark, with the 1.5B model able to keep up to 1.7k tokens in memory.
Because this is near the current training limit (of 2k), it is possible that the real limit is higher.

Wavenet preview is only 75% trained compared to baseline v5, however it's on track for a similar performance range (currently 1.5k from testing of the preview - vs - baseline 1.7k)

Tune5 (for both models), which will train it with up to 4k inputs/outputs, is estimated for 48+ more hours

#

However the big thing is, putting the technical progress aside...

RWKV-v5, with or without wavenet, is now officially in transformer territory in terms of being able to look back into its inputs

(and hopefully pay attention to them too! which we believe it should, as these changes show no penalty in enwiki/code loss training compared to v4 - once proven, this brings us much closer to having RWKV being a full replacement to transformers with no compromises)

void quartz Aug 10, 2023, 4:44 AM

#

This is also strong evidence (v4 vs v5) of rotary embeddings, being able to encode and handle relative positional information

snow zealot Aug 10, 2023, 8:10 AM

#

The problem of using the rotary embedding is then we lose the inf context no?

young sparrow Aug 10, 2023, 11:05 AM

#

void quartz This is also strong evidence (v4 vs v5) of rotary embeddings, being able to enco...

I would hope so! That's what they're designed to do 🙂

void quartz Aug 10, 2023, 11:59 AM

#

snow zealot The problem of using the rotary embedding is then we lose the inf context no?

not sure how to fully explain Blink's changes, it's rotary + timemix still, so infctx still works

#

there is no absolute positional encoding in it

void quartz Aug 11, 2023, 2:42 PM

#

( apologies for confusion, blink basically explained to me how v5 changes were not rotary, so it was a misunderstanding on my part when i visualised how the model changes worked - still the numbers are as benchmarked )

obsidian quest Aug 12, 2023, 9:29 AM

#

yeah my v5 implementation does not have rotary, nor xpos 🙂 it's pos.emb-free

tender karma Aug 12, 2023, 3:13 PM

#

obsidian quest yeah my v5 implementation does not have rotary, nor xpos 🙂 it's pos.emb-free

Do you have any description of the current implementation v5? I’ve read so many things that I’m a bit confused. If you confirm the link to the actual code I can also figure it out by myself 🙂

obsidian quest Aug 12, 2023, 3:20 PM

#

everything here #1103039376184852622 message
standalone implementation https://github.com/BlinkDL/ChatRWKV/blob/main/RWKV_v5_demo.py
(compare with https://github.com/BlinkDL/ChatRWKV/blob/main/RWKV_in_150_lines.py)

tender karma Aug 12, 2023, 3:41 PM

#

Thank you! Do you consider it a stable v5 or is it still very much under experimentation?

obsidian quest Aug 12, 2023, 6:17 PM

#

stable. can still be improved a bit but requires another CUDA kernel

tender karma Aug 12, 2023, 6:49 PM

#

obsidian quest stable. can still be improved a bit but requires another CUDA kernel

Perfect, it means that I can implement v5 in rwkv.f90 (the port in Fortran)!

indigo crater Aug 13, 2023, 4:39 PM

#

very late, but: I would be willing to bet this is a LaTeX addon and it should probably be in overleaf somewhere

void quartz Aug 14, 2023, 4:46 AM

#

tender karma Do you have any description of the current implementation v5? I’ve read so many ...

ignore all the wavenet/tokenshift stuff - those are not in v5.

i found this useful for just extracting the delta:
https://github.com/BlinkDL/RWKV-LM/compare/a637aea61c77cedd290054449d819da5e7b19d44...main

GitHub

Comparing a637aea61c77cedd290054449d819da5e7b19d44...main · BlinkDL...

RWKV is an RNN with transformer-level LLM performance. It can be directly trained like a GPT (parallelizable). So it's combining the best of RNN and transformer - great performance, fast in...

spiral minnow Aug 15, 2023, 3:45 PM

#

EMNLP reviews should be coming out in about 1 week. Based on the community's reaction to RWKV, is there any specific feedback that we're expecting from reviewers? Are there any experiments that we can get a head start on now?

young sparrow Aug 15, 2023, 4:00 PM

#

A scaling laws study that's better designed to draw inference about the param-to-token ratio would be a good idea. We did a pretty good job with the time we had, but I haven't had the bandwidth to figure out exactly what we want yet.

karmic tree Aug 22, 2023, 8:26 PM

#

I guess we'll see them closer to the end of the 22nd AoE, rather than daytime on the 22nd 🙂 But hopefully they're not late

sharp sonnet Aug 22, 2023, 8:44 PM

#

The reviews are out!

last mauve Aug 22, 2023, 8:44 PM

#

Reviews are looking positive to me

#

I'll put up a rebuttal skeleton and revision work list later

karmic tree Aug 22, 2023, 10:59 PM

#

R Zd3h's first reason to reject establishes a fine anchoring for the paper, I think; if their complaint is an argument that RWKV is not as impactful as the transformer architecture, then things are going well. Always nice to get an "Excitement: Transformative"

young sparrow Aug 22, 2023, 11:41 PM

#

karmic tree R Zd3h's first reason to reject establishes a fine anchoring for the paper, I th...

We admit that this paper is not likely to be as impactful as the most important NLP paper of the past five years, but we hope the reviewers will not hold it against us too harshly. If in five years' time we have 10% of the citations of Vaswani et al. (2017), we will content ourselves with merely 9,000 citations.

karmic tree Aug 23, 2023, 12:15 AM

#

young sparrow > We admit that this paper is not likely to be as impactful as the most importan...

And this is without stating where the transformer is weaker than RWKV! I'm not sure the seq2seq transformer would have an easy time through review had RWKV appeared first: data inefficient, fixed context window, very memory hungry,...

gusty condor Aug 23, 2023, 4:40 AM

#

I don't understand. Just reject because it's a non-Transformer architecture?

outer vine Aug 23, 2023, 5:01 AM

#

I kind of agree with reviewer 85wr and it would greatly improve the readability of our paper

#

if we could add RWKV-v1 to this paper, it would be a smoother transition from AFT to RWKV

#

AFT to RWKV-v1: absolute position score to relative decay score (rnn shows up here)
v1 to v4: single decay score to channel-wise decay (plus a lot more like u vector, init method...)
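
For reference, the transition sketched above can be written out as follows. This uses the notation of the RWKV paper as best recalled; the per-version labels follow the message above rather than a checked description of v1, and the extra bonus term $u$ for the current token is omitted for brevity.

```latex
% AFT (causal form): a learned pairwise position score w_{t,i}
\mathrm{Attn}^{+}(W,K,V)_t
  = \frac{\sum_{i=1}^{t} e^{w_{t,i}+k_i}\odot v_i}{\sum_{i=1}^{t} e^{w_{t,i}+k_i}}

% RWKV: the score depends only on the relative distance t - i
w_{t,i} = -(t-i)\,w, \qquad w \ge 0
% single shared scalar w (the "single decay score" step),
% or w \in (\mathbb{R}_{\ge 0})^{d} (channel-wise decay)

% With this choice the numerator a_t satisfies a recurrence (the RNN form):
a_t = \sum_{i=1}^{t} e^{-(t-i)w + k_i}\odot v_i
    = e^{-w}\odot a_{t-1} + e^{k_t}\odot v_t
```

The denominator follows the same recurrence with $v_i$ replaced by 1, which is why replacing the absolute score with a relative decay is the point where the RNN form shows up.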

karmic tree Aug 23, 2023, 5:16 AM

#

gusty condor I don't understand. Just reject because it's a non-Transformer architecture?

It's just one opinion, I don't think it would sink the submission

gusty condor Aug 23, 2023, 5:43 AM

#

karmic tree R Zd3h's first reason to reject establishes a fine anchoring for the paper, I th...

"Transformative" yet not a "Transformer"🤔

fickle hare Aug 23, 2023, 7:41 AM

#

I'm not familiar with AI paper reviewing at all, but the "reasons to reject" section looks like "weakness" in Sys/PL conference reviews to me

#

reviewer HNDB even asked for fp16/bf16 training information in their "reasons to reject" part....

tropic minnow Aug 23, 2023, 10:23 AM

#

outer vine I kind of agree with reviewer 85wr and it would greatly improve the readability ...

Review from 85wr is a bit harsh... will require some work

#RWKV-papers
