#RWKV-papers
8 GPU for each run
Let’s see how bad the crashes are and I’ll move things around mid week if needed
I believe we need to clarify this a bit better. We have RWKV, which is a general-purpose RNN that can potentially replace every LSTM-alike in your projects, describing the neural model (e.g., the rwkv cell) without even mentioning the LM task. Then, we say we focus on RWKV-LM, with the LM on top. As a pure RNN, I started (and then removed) a fundamental experiment section with the basic stuff to evaluate RNNs such as addition, copying tasks, etc. It performed great of course, but the LSTM performed almost equally well on such simple tasks, so there was no point in the end.
so with whom should I work on reorganizing Section 4?
@young sparrow I'd be glad to jump in and help with rewriting an initial draft for Section 4. I've run RWKV a few times, but I've always had some questions about the details. I have a knack for breaking down complex concepts which should be useful here.
Like 👍. Comment 💬. Subscribe 🟥. 🏘 Discord: https://discord.gg/uyYQTB7a https://arxiv.org/pdf/2305.13048.pdf https://github.com/BlinkDL/RWKV-LM#transformers ...
Someone reading through the paper
Note where he gets confused/misunderstands stuff
This is an "Open Review"...
I'm halfway through, and I am just noting he is very very confused about what token shift is, so it may be worth elaborating on that ig
He seems very confused throughout 20-30% of the paper
@obsidian quest something seems to be going catastrophically wrong with the runs
killed multiple times every day
but mostly completed now
Happy to work with you on section 4.
Please download all loss curves in https://wandb.ai/blinkdl/RWKV-v4-Scaling
Use n_embd n_layer my_exit_tokens to identify the run & combine fragments
@bronze frost Hey, do you think this exp in the cuda backward kernel is not numerically stable:
This is not a problem if we accept small exponentials to be inaccurate in the gradient, since both exp(zexp[i]) and exp(k[i] + o) are less than 1.
Going to start making some changes to section 4 for clarity. Let me know if you have any suggestions or concerns.
yes, basically if exp(k[ii]+o) is too small for floats to represent, then the outputted gradient gk[ii] isn't going to have any luck representing it either. So then it's fine to underflow to 0.
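A tiny illustration of the underflow argument above, using plain Python floats. These are float64, while the CUDA kernel presumably works in float32 and would hit the threshold much sooner. The values of `k`, `o` and `upstream_grad` are made up for the example.

```python
import math

# Hypothetical values: k + o is very negative, as for a strongly decayed term.
k, o = -700.0, -100.0

term = math.exp(k + o)      # too small for a float64, so it flushes to 0.0
upstream_grad = 3.0         # any finite upstream gradient (assumed)
gk = upstream_grad * term   # the gradient w.r.t. k inherits the same underflow

print(term, gk)
```

The point is only that the gradient is a multiple of the same exponential, so when the forward term is not representable the gradient term is not representable either.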
Do we need to clarify the definition for token shift?
Yes, we never explain what it means and it's a non-standard term AFAIK.
👌👌
As mentioned above, Section 4 seems to need a rewrite instead of just linguistic improvement. Maybe we should decide the structure of the section first before moving into the details.
Hello, I would like to extend some help to revise the paper.
Here are some of my immature suggestions. Please correct me if I missed some details already covered in the paper.
- Lack of explanation for scalability: We mentioned that RWKV can scale to tens of billions of parameters, but it is not clear how this scalability was achieved. We could provide more details about how we optimized the model architecture and training process to achieve such scalability.
- Insufficient analysis and visualization of attention weights: While section 4 provides some insights into the interpretability of RWKV's attention mechanism, it would be helpful to include a more detailed analysis and visualization of the attention weights. We could include visualizations that show how attention weights change over time or across different layers in the model.
- We wanted to demonstrate the training scalability similar to transformers in section 4.2, but it seems it's too implicit right now. It's planned to rewrite the whole section so I'll remember that, thanks.
- What do you mean by "attention weights"? The decay introduced by Ws?
On it
For attention weights, I mean the coefficients assigned to different input elements by an attention mechanism. If I remember correctly, by assigning higher weights to more important elements, the model can selectively attend to the relevant information and ignore the irrelevant parts.
i think definition of recurrence both as token-shift and as an increasingly longer sum of terms should be mentioned the first time recurrence is described. If anything, i think the WKV is more important for recurrence than token-shift, which is an extra tiny convolution
So you mean the attention map, consisting of n-to-n numbers per attention head? While we can produce an equivalent plot, it might be less meaningful for variants of linear attention... IDK
@obsidian quest can you add me to this so I can edit it in app?
Kind of heatmaps. Not sure if it is suitable. NVM.
@fickle hare I made revisions for section 4. Here's the change log.
- Fixed some typos in sections 4.4 and 4.5.
- Revised sentences in sections 4.2~4.7.
- Change the title of section 4.4 "Software Implementation" to "Model Implementation and Architecture".
Suggestion for Figure 2: The font is too small. If the author can provide the original design file, I can help to revise it.
@tender karma @serene badge @fickle hare @neon night @uneven blade
Looks like everyone really loves section 4. For the rewrite, here's a summary of all the points brought up so far + my own thoughts.
- "Infinite" context clarification (1-2 people)
- (Main paper) We should show a math proof, a graph, or at least talk about how this is supported
- (Appendix) If @obsidian quest or anyone has time to finetune a 7B model with larger context length and compare it with other models such as MPT-7B-StoryWriter-65k+, this would be extremely helpful
- MPT finetuning dataset here. They used a filtered fiction subset: https://huggingface.co/datasets/the_pile_books3
- Moving definitions into appendix (1 person)
- We are explaining quite a few things that we could move into the Appendix to save space for more important points
- Design clarifications (1-2 people)
- (Main paper) Learning rate, hyper-parameters, optimization techniques
- Expand on the usage of recurrence, time decay, and token shift.
- (Appendix) Elaborate
- Editing and coordinating (1 person)
- Review and edit the final work to ensure it flows well.
- Fix abrupt transitions into new concepts.
- Remove repetitive statements.
@steady ether you seem to be confused about the_pile_books3. That’s not the MPT training dataset, it’s a small component of it. It’s also a component that is already in our training corpus
@steady ether, for the definitions and Design clarifications, I'm thinking that we could use a summary table for all the key features we implemented in RWKV. The format could be like this. Then we could move some definitions or explanations to the appendix section.
You're totally right. Looks like MPT was fine-tuned on the subset of the books3 dataset, but the base was trained on Pile v2. Edited the post for clarity
That's a neat idea. This would certainly help clarify a lot of misunderstandings, especially by people who only kind of understand what's going on. We should aim to make this as easy to understand as possible.
However, it's a big change and we should have buy in from @last mauve, @tender karma and others who have worked on that section
No? The base was not trained on Pile V2
Please read what you’re linking to before making claims about it
It was trained on 1T tokens of text and code that was curated by MosaicML’s data team
Sorry, yeah you're right. Posting the real data mix here in case I mislead anyone with my previous comment: https://huggingface.co/mosaicml/mpt-7b
7B model long context finetuning needs some work in the code, e.g. splitting the sequence across multiple GPUs
I can work on the code but I'm not sure if I'll have the GPU-hours to fine-tune on that.
Some thoughts:
- In "Transformer-like Parallelization," we want to mention the following:
a. In our training process, most of the computation (which includes all the matrix multiplication and token shift, only excludes the WKV recurrent operator) is parallelized in the time-axis, similar to Transformers/QRNN/LRU/... but different from GRU/LSTM/...
b. The WKV operator has the potential to be parallelized as well through parallel scan (If the long context finetune is accomplished later, it will become "have been" instead of "can be")
- In "RNN-like Sequential Decoding," maybe more explicitly compare with the KV cache of Transformers? Instead of "Sequential," we may want to highlight more about the constant time & space despite the sequence length in the subsection title.
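On the parallel scan point in (b), a toy sketch of why a decayed recurrence admits one: each step is an affine map `s -> a*s + b`, and composing affine maps is associative, so neighbouring steps can be combined pairwise in a tree. This is a scalar illustration with made-up names (`combine`, `tree_reduce`), not the WKV kernel, and it only reduces to the final state, whereas a full scan would also keep the prefixes.

```python
def combine(first, second):
    """Compose two affine steps s -> a*s + b, applying `first` and then `second`."""
    a1, b1 = first
    a2, b2 = second
    return (a1 * a2, a2 * b1 + b2)

def sequential(steps, s0=0.0):
    """Reference: run the recurrence one step at a time."""
    s = s0
    for a, b in steps:
        s = a * s + b
    return s

def tree_reduce(steps):
    """Combine neighbours pairwise, level by level, as a parallel scan would."""
    while len(steps) > 1:
        paired = [combine(steps[i], steps[i + 1]) for i in range(0, len(steps) - 1, 2)]
        if len(steps) % 2:
            paired.append(steps[-1])   # odd element carries over to the next level
        steps = paired
    return steps[0]

decay = 0.5                            # assumed constant decay per step
inputs = [1.0, 2.0, 3.0, 4.0, 5.0]
steps = [(decay, x) for x in inputs]

a, b = tree_reduce(steps)
# Starting from s0 = 0, the final state is just b.
assert abs(b - sequential(steps)) < 1e-12
```

Each level of the tree touches independent pairs, which is where the time-axis parallelism would come from.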
Agreed, especially with #2. We should elaborate and also add some citations here.
@obsidian quest Which checkpoint should I start with if I want to replicate the MPT-7B-StoryWriter-65k+ finetune on our 7B? RWKV-4-Pile-7B-20230406-ctx8192-test949.pth?
yes
@steady ether @fickle hare, I’ve added some citations to section 4.2.
Just to close a topic.. I found the following form useful in future extensions of RWKV & relatively easy to compute:
For the paper, it probably helps to mention the word "cumulative sum"
Is it called "time span decayed" cumsum ?
Mind-blowing. If RL is similar to WKV, then a whole bunch of RL techniques can be applied... anyway that's another issue, you write the paper you like
Exactly !! I'm thinking the same formula 🤣 🤣
One way of calling it is "cumulative attention weight / attention value (?)", similar to "cumulative reward" in RL.
umm, perhaps, we have to have an assumption similar to "RL with infinite horizon" convergence for working "RWKV with infinite context length"
If I understand correctly, I think that it's like dynamic programming (time-memory tradeoff using KV cache) vs constant memory transitions (RNN like)
wkv can be seen as maximizing the normalized reward in the direction of the output label
I think this is only true when gamma is an unlearnable hyperparameter
Only if you take FlashAttention into account... otherwise KV cache is not using any more memory than directly computing the matmuls
neutrally, wkv could be described as a weighted cumulative sum of latent vectors competing for attention, with a learnable exponential decay factor that favors recent vectors.
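To make the "time-decayed cumulative sum" wording concrete, here is a scalar sketch: a running numerator and denominator, each decayed by a fixed factor per step, giving a normalized average that favors recent items. It deliberately leaves out the separate bonus for the current token and the per-channel vectors, and the names `keys`, `values` and `decay` are illustrative only.

```python
import math

def decayed_weighted_average(keys, values, decay):
    """Average of values weighted by exp(key), with older terms decayed by exp(-decay) per step."""
    num, den, out = 0.0, 0.0, []
    for k, v in zip(keys, values):
        num = math.exp(-decay) * num + math.exp(k) * v   # decayed cumulative sum of weighted values
        den = math.exp(-decay) * den + math.exp(k)       # decayed cumulative sum of the weights
        out.append(num / den)
    return out

recent_biased = decayed_weighted_average([0.0, 0.0, 0.0], [1.0, 2.0, 3.0], decay=1.0)
plain_mean = decayed_weighted_average([0.0, 0.0, 0.0], [1.0, 2.0, 3.0], decay=0.0)
```

With the decay held fixed, the numerator has the same shape as a discounted sum with `gamma = exp(-decay)`, which is the resemblance to cumulative reward noted above. With `decay=0.0` and equal keys it reduces to a plain running mean.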
@obsidian quest I'm implementing long context training with time checkpointing now, and I need some hints around the L2Wrap thing. It seems to be manually scaling the largest element in each token's output logits, in which the scaling factor is related to the total token amount B*T. Should I keep scaling according to the total token amount, even if it would be much larger (~100K-1M, compared to the previous 10Ks) than before?
Personally, I don't think that techniques like double network will work for this, because w is changing over time yet target network is fixed for some time, but you can do some experiments
I hadn't noticed the L2Wrap before, it looks like it's making the backward pass more numerically stable by down scaling or something? how are you implementing the long context training? are you folding it into batches and distributing across gpus?
No, I'm not. Given the limited resource, I decided to do gradient checkpointing for every subsequence and chain them together. 4~8*80GB VRAM won't enable 100K~1M ctxlen I want.
yes that's an issue. but if we can find ways around this, we can produce more papers
by fixing the time decay factor (at the fine-tuning stage) for example
The L2Wrap seems to have not been mentioned in the paper either 🤯
it returns a gradient to make max(logits) closer to 0. the gradient is already scaled
It's from PaLM paper (section 5)
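For anyone who has not looked at it, a rough pure-Python sketch of the behaviour described above: the extra gradient is zero everywhere except at each token's largest logit, where it is proportional to that logit divided by the token count B*T. The `1e-4` coefficient and the function name `l2wrap_extra_grad` are assumptions for illustration and are not taken from this thread; the real thing lives in the training code and operates on tensors.

```python
def l2wrap_extra_grad(logits, coeff=1e-4):
    """logits: nested list shaped [B][T][V]. Returns an extra gradient of the same shape."""
    B, T = len(logits), len(logits[0])
    factor = coeff / (B * T)                      # scaled by the total token count
    grads = []
    for seq in logits:
        seq_grads = []
        for token_logits in seq:
            g = [0.0] * len(token_logits)
            top = max(range(len(token_logits)), key=token_logits.__getitem__)
            g[top] = token_logits[top] * factor   # nudges the max logit toward 0
            seq_grads.append(g)
        grads.append(seq_grads)
    return grads

# One sequence (B=1) of two tokens (T=2) over a three-word vocabulary.
g = l2wrap_extra_grad([[[0.5, 4.0, -1.0], [2.0, 0.0, 1.0]]])
```

In this form the per-token nudge shrinks as B*T grows, which is the dependence the question about 100K to 1M tokens is asking about.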
@fickle hare I've just revised section 4.4 for clarity.
Could you help clarify these points in section 4.1?
- On what basis can we guarantee that linear interpolation will be beneficial in this context?
- I noticed the weight output is denoted as $W_o$. Do you think it would make sense to rename it to $W_w$ for consistency with the RWKV naming?
As for token shift, its benefit is a nontrivial one. In the Hungry Hungry Hippos paper https://arxiv.org/pdf/2212.14052.pdf, they design a "shift matrix" that makes "the state x_i to copy from the input u_i, and then pass that information to the next state x_i+1". They do an Induction Head experiment showing their architecture narrows the gap to transformers on this task.
We can do a similar experiment to show this: whether a 2-layer RWKV with/without token shift is able to learn the Induction Head task with 100% accuracy.
- If your mentioned "linear interpolation" means the interpolation between $x_t$ and $x_{t-1}$, please refer to the above notes by @uneven blade.
- W in RWKV is the decaying parameter in (14), while the $W_o$ here is a weight to linearly project to an output. IMO it should not be renamed to $W_w$.
Blealtan | Huanqi Cao
it is likely we might have to cite this: https://arxiv.org/abs/2305.19370 (block-parallel transformer, twitter thread came out today) as it is a development on top of memory-efficient attention we already cite (Rabe & Staats 2022...) with applicability to extend context a lot (up to 64k in the paper)
I've done lots of experiments with the token shift. My main takeaway was that it's playing an important role in token mixing, less so in channel mixing. Its effect on model performance is also non-trivial, in a way that's different for r, k and v. I think its effect on v has a clean interpretation, but not so for k since it lives in the exponent... The shift could be considered as a tiny convolution layer with kernel size 2 and a softmax (only valid when the mixing coeff is positive). When extending it to a larger kernel size, I found that it actually made the model more confused rather than helping, I think due to these non-trivial effects. If we are doing more experiments, it'll be good to crosscheck these observations...
visualize it for K V R and different layers and you will see patterns
larger kernels sz can be useful for byte-level / char-level / audio modeling
Popped up on my recommended: https://youtu.be/x8pW19wKfXQ
cool~
@young sparrow @obsidian quest how are experiments for scaling laws going?
he has got good LRA numbers and tuning for better
@steady ether I just went through 4.1 and left several comments there. I feel that reorganizing this subsection is really necessary: it basically mixes all architectural designs in a number of paragraphs without clear sectioning. It should be split into several parts, including 1. *former overall architecture, 2. token-shift for both time & channel -mix, 3. output gating for both time & channel -mix, 4. WKV.
Besides, I somehow feel that the current writing is still not perfect, maybe after another editing pass we need to call for others' help
Seems it's time to split section 4 into multiple sections...
Good, I’ll have plots Monday
BTW I also remember people commenting on our arXiv paper about the lack of an ablation study on the different techniques, including token shift, introducing u in WKV, softmax (exponentials) in WKV, etc.
Do you think doing a series of small experiments like small init would help?
IDK, I'm in no way familiar with ML research 
(I major in HPC and never really worked on a ML paper like this)
Yes
Not essential, but it would be a nice to have
Absolutely. I do have some major changes saved locally which I'll cleanup and share an update today or tomorrow.
I've been looking at those 2 youtube review videos to better understand what people are confused about.
i would put WKV right after overall architecture. then token shift
i can revise it anytime you want
was just mentioning the necessary bits, not in specific order
I'd like to help with the revision of the section. Will put effort into it.
@fickle hare @serene badge @tropic minnow
Just reworked 4.1. Let me know if it makes more sense now
Will revise Figure 2 to increase the font size today.
It's better, but I think we can do sth more. How about this:
4.1 Architecture Design: Overview of the *former-like structure, residual, time-mix, channel-mix; the end-to-end figure
4.1.1 WKV Operator: describe the attention formula of WKV; mention the existence of recurrent form; insights from AFT, linear RNN, etc.
4.1.2 Token Shift
4.1.3 Output Gating
BTW the current 4.2 and 4.3 is too fragmented in the whole paper IMO, should think about put them elsewhere, e.g. in the (new) WKV operator subsection
Sounds good. I was a bit hesitant about adding new subsections because of the page limit. Do you think it's worth shortening 4.7 Additional Optimizations or moving it to the Appendix?
I think it's worth eliminating 4.2 and 4.3 if we can get the overview to contain necessary information😂
4.7 also contains some redundant parts I think
the two arch figs and numerous arch formulas might also be unnecessary IMO
Makes sense, I feel like 4.2 and 4.3 only existed to emphasize that RWKV has "the best of both worlds"
i think the CUDA kernel paragraph repeats a sentence from the time-parallel mode part. so these kinds of repeats could be deduped to shorten the text. also, i would first try to put all the information we want there. we can always come back and shorten / make things more concise
After digging into section 4.1, I began to realize that the order of content might be confusing for some readers. We delve into intricate details and then seem to revert back to higher-level concepts.
If we're open to renaming the headers "RNN-like" and "Transformer-like" to something else, I think we can consider the structure in the 2nd image.
would move eqs 16,17 with 12,13,14
i think i like the titles from image with 4.1.1 etc better - they are more objective descriptions and less subjective claims about potential applicability/intention
ctx: this one
What is the point of Figure 1?
I like calling out “transformer-like” and “RNN-like” explicitly
The thing that strikes me as weird in the current Section 4 is that “Software Implementation” should probably come last
It also sorta feels like 4.2 and 4.5 should be combined, or at least consecutive?
The section currently is not systematic. It probably doesn’t matter that much what order we go over the material as long as there’s a clear systemic organization
- RWKV
4.1 Architecture
- Keep current content
- Compress “4.5 Gradient Stability and Layer Stacking” into a single paragraph and stick it here.
4.2 Transformer-like Training
- Keep current content
4.3 RNN-like Inference
- Combines “4.3 RNN-like Sequential Decoding” and “4.6 Harnessing Temporal Structure for Sequential Data Processing”
4.4 Additional Optimizations
- Keep current content
4.5 Software Implementation
- Add a couple missing citations, such as to DeepSpeed
We also need to add the basic info about how the model is trained that is currently missing, like talking about LR decay and providing the h params. That can maybe go in between Sections 4 and 5 along with the scaling laws stuff?
Fully agree
Did a high level re-org of those sections and also added citations for DeepSpeed, ZeRO, Megatron-LM. Also addressed comments from @tropic minnow
It looks like the diagram has an error: there’s an extra layer norm at the very beginning
@serene badge You mentioned earlier that you were going to update Figure 2, could you include this?
It might also be clearer to define $\tilde{x_t}=\mu x_t + (1-\mu)x_{t-1}$ and do Eq 12-16 in terms of $\tilde{x_t}$
Stella Biderman (she/her)
Well, I guess that $\mu$ is different between $r/k/v$
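A small sketch of the $\tilde{x_t}=\mu x_t + (1-\mu)x_{t-1}$ definition with a separate $\mu$ per projection, as suggested above. The zero vector used for the step before the first token and the specific `mu_r`, `mu_k`, `mu_v` values are assumptions for illustration.

```python
def token_shift(xs, mu):
    """xs: list of T vectors (lists of floats); mu: per-channel mixing coefficients.
    Returns mu * x_t + (1 - mu) * x_{t-1} per channel, taking x before the first token as zeros."""
    prev = [0.0] * len(mu)
    out = []
    for x in xs:
        out.append([m * cur + (1.0 - m) * old for m, cur, old in zip(mu, x, prev)])
        prev = x
    return out

xs = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]                # T=3 tokens, 2 channels
mu_r, mu_k, mu_v = [1.0, 0.5], [0.5, 0.5], [0.0, 0.25]      # made-up values
x_r, x_k, x_v = (token_shift(xs, mu) for mu in (mu_r, mu_k, mu_v))
```

A channel with `mu = 1.0` sees only the current token and a channel with `mu = 0.0` sees only the previous one, so r, k and v can each blend the two differently. Writing the later equations in terms of three shifted inputs like `x_r`, `x_k`, `x_v` would keep that distinction visible.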
Stella Biderman (she/her)
Sure, will update that.
@fickle hare I haven't really looked at the base code since March. Would you mind writing a short paragraph on learning rate/hyperparameters/optimizers in section 4.5? Nothing fancy, just how things are set up right now. We will polish it up later.
A few points I remember that could be relevant
- There were some issues with model divergence when we upped the context length, right?
- Something about channels decaying at individual rates based on learned weights and activation during inference
- There were discussions on LAMB being a possibility, but it probably won't be a game-changer. I can't recall the exact reasons though.
@steady ether I have adjusted the font size and changed it to PDF for Figure 2.
Nice! Could we also address Stella's comment on the extra layer norm?
It looks like the diagram has an error: there’s an extra layer norm at the very beginning
Do you mean the extra layer norm after Input embedding in the right figure?
I think it's both that, and also the ones in figure 3. We'll have to address both of these.
OK. I've removed that layer norm in Figure 2. Will revise Figure 3.
I have revised Figure 3 to remove the extra layer norm.
To compare parallelized RNNs like RWKV with cell based classical RNNs at the related work section.
Then it could be moved elsewhere?
I'm not really familiar with that either, though I did read the corresponding code. Will need Bo to double check that once I finish the initial draft.
addressing your mentioned points:
- I'm not aware of that, need to ask someone else
- What do you mean by this? The channels' decaying rates are trained in the WKV operator, it should have been covered in 4.1
- I don't even know what's LAMB 😭
RWKV has an extra layer norm after embedding. it's part of [small init emb] trick
@obsidian quest would you please provide the hyperparameters for training on pile? I'm adding the learning rate/optimizer paragraph.
adam 0.9 0.99, no weight decay, no dropout, bsz 128
What about the lr_init, lr_final, my_pile_edecay and warmup_steps? I see these are deciding the LR schedule through rather complicated logic.
may i ask where was the extra layernorm in the figure that is not in the code?
this applies an extra layernorm at the very beginning: https://github.com/BlinkDL/RWKV-LM/blob/main/RWKV-v4/src/model.py#L307
so the first block would have an extra layernorm. correct me if im wrong
if im correct, changes to figures should be rolled back. otherwise, at least fig3 needs a fix here:
the small init embedding (embeddings initialized to 1e-4 with an LN afterwards, then whatever residual blocks) is also described here: https://github.com/BlinkDL/SmallInitEmb#smallinitemb
@steady ether @serene badge @young sparrow ^^
is this really the case? I don't think the current training code implements multi-gpu tensor parallel as Megatron did
@last mauve 👀 👀 👀 i think accessing the full history can have many advantages for resolving unexpected changes and tracking progress over time. could we have it? if you dont want to spend money on this, i think @young sparrow had the paid version so we could transfer ownership. also could we add tracking changes to see who's the author of what modifications?
yea i dont think we use model-parallel
Oh wild
Yeah if you transfer it to me that would work
warmup steps ==> 10 steps, only because i am not saving optimizer state
Why are you not saving optimizer states
ok im reverting changes to fig2, 3. @serene badge im using a larger font size for fig2 as u did.
It’s awkwardly situated then, given that it’s pretty far away in the text from the related work section. I also suspect we’ll need to cut or substantially decrease that section for length considerations in the end.
not enough fsx space lol
i find it's fine
fig 1 is also referred to in section 3.1 to compare RWKVs with RNNs. Do you wanna remove the text about comparisons with RNNs in related works and section 3.1? Do you wanna remove only fig 1 because of its space consumption?
Thank you for the update. I should check the code before editing. The font size looks better now.
And yes, if you transfer ownership to me we will get full project history
My issues with it are (in decreasing order of importance):
- It doesn’t add anything to my comprehension and others have said the same
- It’s spatially disconnected from its references which makes reading the paper harder
- I think it makes the aesthetics of the page layout worse.
Separately, I anticipate needing to cut it for space concerns
I wonder if you want to remove some paragraphs or sections. I agree with removing fig 1.
I want to move section 2 to the appendix, pretty much. Maybe incorporate some of its contents elsewhere
this must be disclosed in the paper
because all runs are killed multiple times due to server issues
Do others say to move section 2 to the appendix?
yea project is not mine, i think it's @last mauve 's
That’s fine, but this must be disclosed in the paper
And makes me worried that there are other things that need to be disclosed in the paper that I haven’t caught yet
i don't think we should move the whole of section 2 to appendix. the works described there can be very relevant to readers as they share common objectives with ours. perhaps we could simplify it or move the less relevant part. There's also some work of deduplication to be done, for example this sentence (^attached^) which should go in 3.2 at least (just moved).
i agree w @young sparrow on moving figure 1 out of the current place (and placing in appendix or hiding completely,). It's odd the first figure of a paper introducing a novel architecture adds so little to what this arch really is. Especially when figure 3 for example would be much more pleasant to the eye and help a lot more to understand whats RWKV.
I think that some of this content should be viewed as essential, but can easily be moved to the introduction or another section. Also, the passage you highlight is already in Section 4
@obsidian quest So I’m visualizing the data for the scaling laws
And I can slice the data by model size
But how do I distinguish between runs that ran for different numbers of tokens?
it is likely... lets continue inspecting things carefully and reporting missing information. ideally we would like all experimental results to have reproducibility instructions and potentially open source code.
ping @paper dove
I think we still need the following:
- Mention hardware info for reproducibility
- Maybe an Ethics Statement?
- Guidelines (https://2021.emnlp.org/call-for-papers/ethics-faq)
- Mention misuse potential
- I think we fall under "experiments that involve lots of compute time/power"
- An example from another EMNLP paper: https://aclanthology.org/2022.emnlp-main.42.pdf
thx. yes will be working on a draft for that later today
Sorry for the radio silence. Had another paper deadline.
@young sparrow -- Sent you an overleaf invite. Once you accept I can promote you to owner
Use n_embd // n_layer // my_exit_tokens to identify the run & combine fragments
What is my_exit_tokens? Is that the target train length?
Done
Transferred
target train length. you can filter by it in wandb
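A sketch of the "identify the run & combine fragments" step, assuming each downloaded fragment can be turned into a record carrying `n_embd`, `n_layer`, `my_exit_tokens` and its own step/loss points. The record layout below is hypothetical and the real wandb export may look different.

```python
from collections import defaultdict

# Hypothetical fragment records; runs that were killed and resumed appear as several fragments.
fragments = [
    {"n_embd": 768, "n_layer": 12, "my_exit_tokens": 10_000, "steps": [2, 3], "loss": [3.5, 3.2]},
    {"n_embd": 768, "n_layer": 12, "my_exit_tokens": 10_000, "steps": [0, 1], "loss": [5.0, 4.0]},
    {"n_embd": 1024, "n_layer": 24, "my_exit_tokens": 10_000, "steps": [0, 1], "loss": [4.8, 3.9]},
]

def combine_fragments(fragments):
    """Group fragments by (n_embd, n_layer, my_exit_tokens) and merge their points in step order."""
    runs = defaultdict(list)
    for frag in fragments:
        key = (frag["n_embd"], frag["n_layer"], frag["my_exit_tokens"])
        runs[key].extend(zip(frag["steps"], frag["loss"]))
    # Sort each run's points by step so resumed fragments line up in order.
    return {key: sorted(points) for key, points in runs.items()}

curves = combine_fragments(fragments)
```

Because `my_exit_tokens` is part of the key, two runs of the same model size trained for different token budgets stay separate.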
Perfect
Removed
@fickle hare Is this accurate? Maybe worth using more precise language and also a mention in your paragraph.
I thought we initialized most of the matrices to zero (at least in the March version)
I guess zero is a small value, huh? 😄
Nevermind, I was looking at the wrong part of the code
Question about equation 11: If we're summing from i=1 to t-1, should the integer in the parenthesis be (t-i)?
If we sum from i=1 to t-1 and use (t-1-i), then the final element of the sum will be (t-1-(t-1))=0, is that on purpose? My understanding is that the final element should attend to the previous token, so it should be (t-(t-1)) = 1
this part (Appendix C) should probably be corrected, I left a comment
the immediate previous token goes through the u weight, not through the w weight
Is there a way to make that go away notationally? Like, if we set u = w_t does that cause problems elsewhere?
The separate weight for the current token throws me every time I look at the equation
but it is actually what happens in the code. theres a time-associated parameter for all positions except for the immediate previous one (see: https://github.com/BlinkDL/RWKV-LM/blob/main/RWKV-v4/src/model.py#L186). we're just describing there. perhaps we could mention it in the line below that w_t gets its own set of parameters?
Yeah I mean… maybe the current one can be set to 1?
hmmm i dont think so, at least not without affecting performance. the U is there intentionally bc the time-association of W could be too strong of an inductive bias. i think it could be great if we did a small experiment like the SmallInitEmbedding with this difference and see.
thats why i keep asking for this #1103039376184852622 message @paper dove
Somehow that doesn't make sense to me. If the immediately previous token goes through u, then wouldn't that be represented by t-1? But the equation shows that u is added to the current key and multiplied by the current value, not the previous timestep
Yeah the current token goes through u I thought
Yeah, that's what we're saying in the text as well, "U attends to the current token" (paraphrased)
hmm okay so you're asking the immediate previous instead of the current, sorry. so yes i think it's on purpose? it is described as well from here: https://johanwind.github.io/2023/03/23/rwkv_details.html
Okay, I get it now. Seems a bit complicated, but maybe that's just how it needs to be for the model to work. Maybe there's a nicer way to write it though, I'll think about it
Wow, the equations are really throwing me off. The current key is weighted by U, the previous key is unweighted, and the key from 2 timesteps in the past is weighted by W. Is that correct? I guess it makes sense but just seems very unusual
And by weighted, I really just mean that it gets a bias added to it so that it's actually scaling the value
Yes that's correct, it's equivalent to if you also reweight the previous token - you just multiply numerator and denominator by exp(w) and rewrite u as log(exp(u+w)-1)
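Since the indexing keeps tripping people up, here is a scalar sketch of the form being discussed: past tokens `i < t` get the exponent `-(t-1-i)*w + k_i` (so the immediately previous token gets no decay), and the current token gets `u + k_t`. It also checks that the running-state form gives the same numbers. Assumptions: a single channel, `w` taken as a positive decay rate, and made-up inputs; the repo code parameterizes things differently.

```python
import math

def wkv_direct(ks, vs, w, u):
    """Direct sum with 0-based t: past token i is weighted by exp(-(t-1-i)*w + k_i), current by exp(u + k_t)."""
    out = []
    for t in range(len(ks)):
        num = math.exp(u + ks[t]) * vs[t]
        den = math.exp(u + ks[t])
        for i in range(t):
            weight = math.exp(-(t - 1 - i) * w + ks[i])
            num += weight * vs[i]
            den += weight
        out.append(num / den)
    return out

def wkv_recurrent(ks, vs, w, u):
    """Same quantity with a constant-size state: running numerator a and denominator b over past tokens."""
    a, b, out = 0.0, 0.0, []
    for k, v in zip(ks, vs):
        out.append((a + math.exp(u + k) * v) / (b + math.exp(u + k)))
        a = math.exp(-w) * a + math.exp(k) * v   # decay the past once, then add this token undecayed
        b = math.exp(-w) * b + math.exp(k)
    return out

ks, vs = [0.1, -0.3, 0.7, 0.2], [1.0, 2.0, 3.0, 4.0]
direct = wkv_direct(ks, vs, w=0.5, u=0.3)
recurrent = wkv_recurrent(ks, vs, w=0.5, u=0.3)
```

In the recurrent form the reason for `(t-1-i)` is visible: a token enters the state undecayed and is first decayed one step later, while `u` only ever touches the token currently being read.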
Removed Fig 1. and links to this fig. Section 2 still remains.
Just made some grammar/spelling fixes. However, the Future Work/Conclusions section might need a rewrite. Also spotted that we're using abbreviations like 'LLM' without defining them upfront.
@tropic minnow @fickle hare After reading section 4.1, I feel like it might be better to have the order Token Shift -> WKV -> Time/Channel Mixing and Output Gating, because the r, k and v vectors used in WKV and others are defined in Token Shift and I feel lost when first seeing the WKV using these. Also, this is the logical order of how things are computed...
@tropic minnow What's ur opinion?
still unclear to me if the linear schedule or exponential schedule is used, I see it's deciding according to if there's a zero in lr_init/final?
also my_pile_edecay decides when to start decaying
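One reading of the schedule question, written as a sketch rather than a copy of the trainer: interpolate linearly when either endpoint is zero (the log-space form is undefined there), otherwise interpolate exponentially. The function name `lr_at` and the `progress` argument are hypothetical, and how `my_pile_edecay` and warmup map onto `progress` is left out.

```python
import math

def lr_at(progress, lr_init, lr_final):
    """progress in [0, 1] across the decay phase (after any warmup or delayed start)."""
    progress = min(1.0, max(0.0, progress))
    if lr_init == 0 or lr_final == 0:
        # Linear interpolation, since log(lr_final / lr_init) does not exist at zero.
        return lr_init + (lr_final - lr_init) * progress
    # Exponential (geometric) interpolation between the two endpoints.
    return lr_init * math.exp(math.log(lr_final / lr_init) * progress)

halfway_linear = lr_at(0.5, lr_init=1e-3, lr_final=0.0)    # made-up endpoints
halfway_exp = lr_at(0.5, lr_init=1e-4, lr_final=1e-6)      # made-up endpoints
```

Halfway through, the linear branch sits at the arithmetic midpoint and the exponential branch at the geometric one, so which branch was used matters for what the paper reports.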
hmm my take was that token shift is a tiny conv we add to increase performance, whereas main RNN-like properties come from (R)WKV, which is the "attention replacement" we implement and what people might be looking for when they read "a replacement to transformers". However @uneven blade 's point about (time-mixing) token-shift preceding the WKV computation is fair. I think we can go that way if others prefer it too, as long as we're systematic in the description of components it should not matter much
Personally I think as long as we highlight the WKV as a replacement for self-attention in the overview before we start diving into details, it will be fine
see LR history for everything: #1083107245971226685 message
Would you think describing the token shift as a 1:3 depthwise convolution with kernel size 2 is a good idea?
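For reference, a minimal sketch of what token shift computes, which is why it reads like a kernel-size-2 depthwise convolution: each channel mixes the current token with the previous one using a per-channel coefficient. The names are illustrative, and the zero vector used as the "previous token" at the first position is an assumption.

```python
def token_shift(xs, mu):
    """Per-channel mix of each token with its predecessor.

    xs: list of token vectors (lists of floats), mu: per-channel mixing weights.
    Output at step t is mu * x_t + (1 - mu) * x_{t-1}.
    """
    out = []
    prev = [0.0] * len(mu)  # assumption: zeros stand in for x_0's predecessor
    for x in xs:
        out.append([m * cur + (1.0 - m) * old for m, cur, old in zip(mu, x, prev)])
        prev = x
    return out

def time_mix_inputs(xs, mu_r, mu_k, mu_v):
    """The "1:3" view: one input sequence, three shifted outputs (for r, k, v)."""
    return token_shift(xs, mu_r), token_shift(xs, mu_k), token_shift(xs, mu_v)
```

In the model each of the three outputs would then go through its own linear layer; that part is left out here.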
added a paragraph at the end of 4.5 describing details about loss, learning rate, and optimizer.
need to summarize the hyperparameters later in the appendix
hmmm i wouldnt try to push kernel fusion (1:3 for time-mixing and 1:2 for channel mixing) thoughts into the RWKV announcement paper since it is not even implemented like that in the code. Maybe a note saying "intuitively, this can be seen as a small convolution" of kernel size 2 or something. but i think there's already smth like that here:
The paper currently says
It is noteworthy that FLOPs are independent of the context length, unlike regular transformers.
This is false though? Transformer FLOPs is given by 6PD, no term for the context length.
Actually all of Appendix B makes little sense. The equations are self-contradictory, we present what I think are supposed to be three different approximations, and an omission of the number of data points entirely.
If it's the case that RWKV FLOPs are well approximated by 6PD (just like a transformer) we should derive that and just stop.
The text I'm primarily referring to is:
The number of parameters for each model is computed using the formula: $#parameters = 2VD + 13D^2L + D(11L+4)$ where $V$ = 50277 is the vocabulary size, $D$ represents the Model Dimension and $L$ corresponds to the number of layers.
FLOPs is for a forward pass for one token. It was calculated as $6(VD + 13D^2L)$, which is the twice (add and multiply) the number of parameters in linear layers. The backwards pass FLOPs can be approximated as twice that of the forward pass. So the total is $6(VD + 13D^2L)$ per token for training (3x fw FLOPs). It is noteworthy that FLOPs are independent of the context length, unlike regular transformers. The FLOP approximations in this paper are in line with the methodology used by Kaplan et al. (2020).
Stella Biderman (she/her)
I think it's pointing the second term
Okay i think it's correct. Counting Transformer flops per token involves computing self attention against history KVs, which has a FLOPS linear to the history size
Why are you only counting the 6PD from the head?
(yet the first 6 should be 2 I think)
Read the next paragraph
I believe that 6PD is a good approximation for total training FLOP for both models
fine. with not really long context attention flops are negligible
It's still not the case. 12LD^2 is much larger than PD.
Sorry my D is “dataset size”
Not “hidden dimension size”
So in per-token units this would simply be 6P
Which is what the text (but not equations) of the passage I quoted says
then the problem is whether to mention the square yet smaller term in transformers flops
for transformer it's 'approximate' since it throws the context-growing term away, but for us it's accurately 6P per token
What about the D(11L+4) term? It goes away, and I assumed that’s because of the same kind of reasoning
it's the token shift and wkv parameters I think
okay it's not calculating the elementwise muls and adds now...
but they are all constant for each token
it's missing and I'll do some calculation for wkv and add that
I really don’t think having it exactly matters
yeah it's negligible compared with the linear layers
It's just... the omitted term for us is constant while for transformer is linear to context length
That’s not a real difference
It doesn’t make us look better to point it out, it makes it look like we don’t know what matters.
I agree
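A small sketch of the arithmetic being discussed, using the parameter formula quoted from the appendix and the "6P per token" training approximation the thread converges on. Function names are illustrative; the 6P figure is the approximation under discussion, not an exact count (it ignores the elementwise token shift and WKV operations).

```python
VOCAB_SIZE = 50277  # V in the quoted formula

def params(n_layer, n_embd, vocab=VOCAB_SIZE):
    """Parameter count: 2VD + 13 D^2 L + D(11L + 4)."""
    return (2 * vocab * n_embd
            + 13 * n_embd ** 2 * n_layer
            + n_embd * (11 * n_layer + 4))

def train_flops_per_token(n_layer, n_embd, vocab=VOCAB_SIZE):
    """Approximate training FLOPs per token as 6 * P (forward + backward)."""
    return 6 * params(n_layer, n_embd, vocab)

def train_flops_total(n_layer, n_embd, n_tokens, vocab=VOCAB_SIZE):
    """Total training compute under the same approximation: 6 * P * tokens."""
    return train_flops_per_token(n_layer, n_embd, vocab) * n_tokens

print(params(12, 1024))  # 266684416
```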
I have seen some people questioning the initialization settings in RWKV. “Initialization of parameters in the popular RWKV model is done by setting all parameter matrices to zero. It is claimed that this approach avoids the noise introduced during the initial learning phase. However, this practice is highly unreasonable. Initializing parameters to zero can lead to issues such as symmetry problems, vanishing gradients, lack of diversity, and slow convergence speed. In small models, zero initialization is rarely used. Instead, methods like Glorot initialization and Kaiming initialization are commonly employed.”
I think that the first thing to do is examine Blink’s code and see if it actually initializes everything to 0.
It seems that this approach is counterintuitive for many people, and perhaps it requires more explanation or persuasion. @obsidian quest
I vaguely remember this discussion from a past conversation. I believe the key point was that because sigmoid(0) equals 0.5, the weights are able to be updated
But yes, more clarification on this point would certainly be good.
if it is due to sigmoid, maybe this initialization is not general, it highly depends on RWKV design
maybe also ablation study? initial iterations on small models would be sufficient
@tropic minnow I've added the ethics statement that I mentioned earlier. Feel free to review and tweak as needed.
I initialize some matrices to zero (not all of them)
only these are initialized to zero:att.key att.receptance att.output ffn.value ffn.receptance
For each timemix/channelmix block we just need randomness in one matrix
namely: att.value ffn.key and this is enough to provide gradients
Note e^0 = 1, sigmoid(0) = 0.5 so the design is related to RWKV
he thought i initialize everything to zero
Initializing residual tracks to 0 makes sense so model starts at identity. Only weights that need to be different than 0 due to symmetry otherwise should be nonzero. Starting at identity (see clean path reference in the text) makes learning faster and better.
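To spell out which matrices are meant, here is a tiny sketch that classifies a parameter by name. The list of zero-initialized matrices is taken from the messages above; the "blocks.<i>.<name>.weight" naming pattern is an assumption made for the example, not necessarily how the training code names things.

```python
import math

# Matrices listed above as starting at zero.
ZERO_INIT = ("att.key", "att.receptance", "att.output", "ffn.value", "ffn.receptance")

def starts_at_zero(param_name):
    """True if a parameter such as 'blocks.0.att.key.weight' is zero-initialized."""
    stem = param_name
    if stem.endswith(".weight"):
        stem = stem[: -len(".weight")]
    return stem.endswith(ZERO_INIT)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Why zeros are not a dead end here: a zero receptance still gates at 0.5,
# and a zero key still contributes exp(0) = 1 to the WKV weights.
gate_at_init = sigmoid(0.0)
key_weight_at_init = math.exp(0.0)
```

Randomness then only has to come from att.value and ffn.key, as noted above.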
I cannot find the bug in my scaling laws code
I run this
l = defaultdict(list)
for d in df.keys():
    x = d.split(" ")
    loss = float(df[d].sort_values('Gtokens').tail(1)['loss'])
    layer = int(x[0][1:])
    dim = int(x[1][1:])
    print(layer, dim)
    print(params(layer, dim))
    print("---")
    tok = float(x[2])
    l['L'].append(layer)
    l['D'].append(dim)
    l['T'].append(tok)
    l['loss'].append(loss)
    l['params'] = params(layer, dim)
    l['compute'] = 6 * params(layer, dim) * tok
df = pd.DataFrame(l)
which prints out the expected thing:
12 1024
266.684416
---
24 1024
430.39744
---
24 1536
890.962944
The very next cell does this though
params and compute columns wrong?
is params a pure function?
ah i see
instead of
l['params'] = params(layer, dim)
l['compute'] = 6 * params(layer, dim) * tok
do
l['params'].append(params(layer, dim))
l['compute'].append(6 * params(layer, dim) * tok)
oooo
Thank you
Eyyyy look at that beautiful straight line
(minus the one point which I think is an overflow error)
how does that compare to pythia?
Running the math now
Or, "I will run the math after my 1:30 meeting" since I just noticed the time
@obsidian quest for the Ethics statement, would be good to know exactly which data has been used to train Raven-14B beyond The Pile
current statement describes:
- Open Source Data (the pile), publicly available data (raven?)
- Open source training codebase and lower inference cost (democratization)
- Efficiency in training (effort to lower cost, "sustainable")
- Various sizes released (accessible deployment, study of emergent phenomena)
- Easier to generate AI text (lower cost Chat assistant, fake news, misinformation)
- Potential replication of biases/harmful content in data (but transformer mitigation strategies should work here as well)
can plot all runs (not just the final losses) in the graph
do we have any missing runs
Unfortunately you can't... Not if you want to get the correct results.
For example, the equation I'm getting is quite different from the ones the original experiments had. This is the original experiment
Hmmm I think my code might have a bug.
Agreed. This could significantly enhance the open-source/reproducibility aspects of our project.
I believe we used these resources, but it would be great if @obsidian quest could confirm:
https://github.com/tatsu-lab/stanford_alpaca
https://huggingface.co/datasets/sahil2801/CodeAlpaca-20k
https://huggingface.co/datasets/JosephusCheung/GuanacoDataset
https://huggingface.co/datasets/nomic-ai/gpt4all_prompt_generations
https://github.com/lm-sys/FastChat/issues/90#issuecomment-1493250773
Also, knowing the exact split/iterations would be helpful
I think we just need more density of runs with slightly different values
Here's the data sorted by amount of compute used, and there are clearly runs that are more optimal (17 and 22 are particularly good for example) but there isn't the necessary data density to really get the tradeoffs optimized
This becomes especially obvious when you look at x-axis that aren't "compute"
I was looking through the Chinchilla paper and found this, which shows all the configs they trained for their paper
What they did was set a total FLOP target and train each model for the number of tokens necessary to reach each target, with 9 targets per model.
By contrast we have 7 different models currently
So if we can generate more data that would be A+. Just... more models, more # of tokens
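A sketch of how an iso-FLOP grid like that could be laid out, assuming the same 6 * params * tokens approximation for training compute. The function names and the example numbers are made up for illustration.

```python
def tokens_for_budget(flop_budget, n_params):
    """Tokens needed so that 6 * n_params * tokens hits the FLOP target."""
    return flop_budget / (6 * n_params)

def isoflop_grid(param_counts, flop_targets):
    """One (params, target, tokens) row per model size and FLOP target."""
    return [
        (p, c, tokens_for_budget(c, p))
        for c in flop_targets
        for p in param_counts
    ]
```

Every row sharing a FLOP target is then an equi-compute comparison, which is the kind of data density the current set of runs lacks.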
There is a lower edge to the compute-loss tradeoff currently that's approximately linear. I'm going to try to extract that now
It feels like this is the optimal line with the data we currently have
Slope: -0.09467861
Intercept: 1.80843822
(or in log_10, that's -0.04111839787, 0.7853947398)
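The two pairs of numbers above differ by a factor of log10(e), which a couple of lines can confirm. Note the assumption baked in: scaling both slope and intercept this way corresponds to re-expressing the fitted quantity in log base 10 while leaving the x variable as it was; the thread does not spell out how the fit was set up.

```python
import math

LOG10_E = math.log10(math.e)  # about 0.4343

def ln_to_log10(value):
    """Convert a quantity expressed in natural-log units to log-base-10 units."""
    return value * LOG10_E

slope_ln, intercept_ln = -0.09467861, 1.80843822
slope_10, intercept_10 = ln_to_log10(slope_ln), ln_to_log10(intercept_ln)
```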
@obsidian quest to illustrate why this matters, the original value was -0.053. -0.041 vs -0.053 is a huge change
why are these two charts different #1103039376184852622 message #1103039376184852622 message
we still need to check if we can actually find a 10^5 compute datapoint on your line lol
I was about to comment that you are using flops = parameters * 6 while the previous plot used non-embedding parameters * 6 (like the scaling laws paper, since embedding is no flops). However, I just reran the old plot with flops = parameters * 6 and still get -0.053, so you are right that -0.041 is a huge change.
Also, while I'm here: I wrote that section in a very early draft (I think it was among the first additions after the tex file was created by someone else) as a kind of internal data table for making plots like the scaling laws plots (with the intent that we agree on one of the approximations for the flops, etc.) But it kinda just stayed there I guess. Feel free to remove it / scavenge it for scraps for other sections.
I’ll probably scrap the appendix section and incorporate parts of it into the scaling laws section.
This is due to the aforementioned bug in my code. Happy to upload the notebook for people to inspect, but I think it’s currently correct.
More model / data combos would help with this substantially. Right now I am struggling to tell you how much data and params to use.
Have you posted this code? It’s probably worth looking at to make sure we’re doing things the same way. Also I want to steal some of your visual formatting 🙂
I posted this code, and then @rich raptor made it pretty
I helped him find a bug in his code, so I have an old version. Maybe he has a newer one
In such cases, what do you think about the current figure 1? It's basically talking about the same thing but in terms of "time complexity".
Some comments on 4.4:
- Shall we merge "Custom kernels" to "4.1.2 WKV Operator"?
- Shall we remove/merge "FFN with R gate" since it's now in "4.1.3 Output Gating"?
- I'm curious whether using the abbreviation "init" in "small init embedding" instead of spelling it out is intentional.
- It seems both "Small init embedding" and "Custom initialization" are talking about parameter initialization, except that smallinit requires some architectural design to cooperate with it. If the former two paragraphs are merged to somewhere else, shall we turn the whole section into sth like "Model Initialization"?
Hey is this paper still taking helpers?
- No strong feelings, but it might hurt readability since they cover distinct aspects.
- Seems reasonable as it improves clarity
- Kind of a branding name that has made its way. could rename it for the section title but i'd like to refer it as SmallInitEmb or SmallInitEmbed throughout the paper for historical reasons (https://github.com/BlinkDL/SmallInitEmb) and bc its a shorter name.
- yes, but in a sense they're quite orthogonal, as smallInitEmb could be applicable to every transformer and was specifically tested (see experiment) whereas the rest of layers are more specific to RWKV and we dont have as hard as a justification for them, simply trial and error during RWKV evolution
could you generate a large graph with N_LAYER-N_EMB-N_TOKENS info for each datapoint 🙂
which one has loss ~2.23 (the lowest loss in the graph)
Lowest loss values
will be great if we can mark L-D-T for each datapoint
Here's the CSV with all the points, the color is red if it's on the bottom line I identified
Currently sorted lowest to highest loss
Tokens are in billions, params in millions
compute in units of 10^15 FLOP
Oh, here's the points colored red on the scatterplot
Oh that's without a log on the y-axis
but w/e
Gets the point across
(note the slope and intercept numbers are different now because these are in log base e while before I was converting to log base 10 since that's what the original work was in.)
loss of 24-1024-16 should be around 2.46 - different from the number in your table
Cool! let's do it
need to smoothen the loss curve before using it @young sparrow
What do you mean by that?
when you download the raw loss curve from wandb, it will be extremely noisy
How did you estimate this
If I sort all checkpoints from that run by loss I do see that value (and even lower!)
can you plot the loss curve of this run
because this is the raw loss of each batch
the best method will be to compute a curve fit
Here it is on a log-log plot
transformers don't tend to be this noisy though
I'll try subsetting to one in every 10 datapoints then
EMA isn't helping, neither is subsampling
curve fit
a linear fit of the last 30 data points
Line fitted to the last 30 points
Everything except the first 50
Yeah this simply isn't working
Here I tried fitting the line to 30 points near the end of training and then projecting out the next 100
everything except first 50 looks good
Okay zooming in to the last 500 it does actually look better than I had thought
So you want me to fit this line, project it out to the full training, and use that as my loss instead of the observed loss?
And re-do the scaling laws experiments?
yeah
@obsidian quest
hmm that's kinda misleading as the y axis has changed
Hmmm. This looks suspicious
Variance going up seems like a bad sign
(the outlier is from a run that didn't restart, I had been removing it before)
pls send me the L-D-T csv
With the predicted values?
Or the real ones
Here's the one with the predicted values
ok seems your code is buggy
for example L6-D512-T32 should be around log(3.01)
and L6-D512-T32 should be around log(3.05)
You're welcome to take a look
It looks like you just rotated my plot and played with the variance lol.
fixed version
That looks a lot like this?
I’m out right now but can check it out in a couple hours
the numbers are far more aligned with wandb webpage charts because i download up to 50000 datapoints for each run
wandb default = only fetch 500 datapts
I tried to fiddle with that config but it seemed like it wasn’t doing anything
😦
So I gave up and assumed it didn’t work the way I thought
So is this what my code gives now, with all the data?
How different is the actual vs predicted numbers
just more noises in "actual"
note one of the runs lasted longer than T which was before i added exit_after_T to training code
so now i am predicting the loss @ T instead of x[-1]
Interesting. I went with x[-1] because some runs made it within a rounding error of T but not actually T. I had assumed this was because it wasn’t evenly divisible by the batch size, but I guess it was the sampling
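A minimal sketch of the "fit a line to the tail and read off the loss at T" idea, using plain least squares. The names are hypothetical, and fitting in raw (not log) space over the last 30 points is an assumption; the thread does not say which space the actual pred_loss fit uses.

```python
def fit_line(xs, ys):
    """Ordinary least-squares line through (xs, ys); returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def predicted_loss_at(tokens, losses, target_tokens, last_n=30):
    """Fit the last `last_n` points and evaluate the line at `target_tokens`.

    Evaluating at the target T rather than at the last logged point avoids
    depending on where logging happened to stop.
    """
    slope, intercept = fit_line(tokens[-last_n:], losses[-last_n:])
    return slope * target_tokens + intercept
```

Evaluating at T also handles runs that stopped slightly short of T, or ran past it, in the same way.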
@obsidian quest Hey is this paper still taking helpers?
@fickle hare I've cited the 5 fine-tuning datasets that I know we used to the ethics statement. Could you double-check to see if I missed anything?
I'm out now, will check ~7hrs later
Commented out FFN with R Gate
Looking a lot better once @obsidian quest showed me how to fix the data lol
(blue points are used for the regression line)
Note that both axes have a log on them
This gives an exponent of -0.0747
or you can simply use my datapoints 🙂
I am using your code
- data
I'm just picking up the analysis where you left off
These plots worry me though
The empirically low loss point with a compute value between 12 and 13 is way off of the line for params and tokens too
24-2048-1.0 is missing and you can ignore it
for some reason, your chart is different from mine #1103039376184852622 message
the results basically tell us that we should train larger models for optimal T=32
No mine is the same, I'm just taking the log of the raw data instead of putting it on a log axis. Here's a log axis
(it's slightly distorted due to np.log calling log base e)
Here is everything in log base 10
Oh there's bunching at 0 due to loss of precision (units of billions and then taking a log). Lemme fix that
I don't know any more english instruction dataset used; firefly, belle, and some other are used for chinese. Need @obsidian quest to double (triple? lol) check though.
Updated. This also made me realize that we didn't mention the multilingual capabilities of RWKV.
are you using pred_loss? seems you are using loss (noisy)
I am more comfortable using loss than pred_loss, though I'm planning on looking at averaging the loss across several steps next
loss is extremely noisy
I know, but I don't think that fitting a linear model to it is something one should rely on fundamentally.
yet it is still a vast improvement
In what
for example, your red datapoint is noise
@obsidian quest Did you launch more runs?
The biggest problem is data scarcity. We can hardly call something Pareto optimal if there are no other equi-compute points
yeah could you find blue points using pred_loss so that i can use the info to launch more runs on pareto front 🙂
Kk
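One way to pick the "blue points" mechanically is sketched below, assuming a point counts as on the front if no cheaper-or-equal run reaches a lower loss. This is an illustration with made-up names, not the selection code actually used for the plots.

```python
def pareto_front(points):
    """Return the (compute, loss) points not beaten by any cheaper-or-equal run.

    Sort by compute, then sweep left to right, keeping a point only when it
    improves on the best loss seen so far.
    """
    front = []
    best_loss = float("inf")
    for compute, loss in sorted(points):
        if loss < best_loss:
            front.append((compute, loss))
            best_loss = loss
    return front
```

The regression line would then be fit to only these points in log-log space, with pred_loss or raw loss as the y value depending on which is trusted.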
I love the addition of ' to the variables used in channel-mixing!
I'm not sure who has access to Figure 1, but I think we should update the variable names in mixing: R', V', K'
Happy to do it if somebody can give me the file
Params (B): 0.0625, 0.125, 0.25, 0.5, 1.0
ok done🙂
One small comment on Figure 3: we should add labels to the x and y axes. Not sure who’s the author. I can help to update the figure. Also, do we need to add error bars for the accuracy scores?
@rich raptor
@young sparrow use these pred_loss data for the most reasonable fit
ignore L6 and T1 results because they are too different from usual runs
Hot damn
How do the parameter & dataset curves look?
Any less cursed?
@serene badge can you take a look at this? #1103039376184852622 message
Cool, I’ll handle it.
Figured out how to modify the script to update the figures. In the script, we need to get the data of Pythia and RWKV from the ./RWKV.csv. Where can I find this file?
#1103039376184852622 message
Not sure if there's a more updated one
Great! Thank you so much! I'll check the newly generated figures to ensure they match the original ones.
I've updated Figure 3. Added legends and changed to PDF format.
Ok it's time to buckle down for EMNLP. I'll be doing regular check-ins like we did for arxiv. Here's what currently needs done:
1. The ethics statement (section 11) needs shortened. No longer than a half page. Nevermind we have the space.
2. @young sparrow and @obsidian quest -- What is the status on your scaling laws work? I assume that'll need to be a new figure/paragraph once finished, or will these just replace the current Figure 5 scaling laws plots?
3. We're currently at about 8.5 pages on an 8-page limit. Should we move section 4.5 Additional Optimizations to an appendix? Nevermind we have the space.
4. Figures 4-6 have strange placement, there's some space at the start of Section 7, and Figure 5 is out of order. These figures should instead be split across pages 6 and 7.
5. Sections 8 (Future Work) and 9 (Conclusions) are very long. We should cut or re-word so that a few lines are reduced. Nevermind we have the space.
6. In Figure 6, we should remove the cuda_ prefixes from each legend entry.
7. ~~Result figure captions should be descriptive enough to be self-contained (i.e. easily screenshotted). Figures 3-6 should have their captions updated, but don't make them longer than 2 lines.~~
I'm submitting a draft to EMNLP today. Here are the deadlines:
Abstract Deadline: June 16 (Will be submitted today)
Paper Deadline: June 23
Core author team -- Feel free to add work items to my above list.
I was going to offer to re-write the ethics statement, but it seems it's currently 1/2 page. Do you still want it shorter?
Also, I re-wrote the future work into paragraphs rather than bullet points, saving 3 lines
Ah oops. I meant halved from its current length. It looks way too long to me
👍 I don't think they put a space limit on ethics or limitations sections though
@everyone -- If you're an author, I need your email for the EMNLP abstract submission if you haven't sent it to me already.
No, but we're over the 8-page limit and I'd rather remove from the ethics/conclusion/limitations sections rather than content sections. I'm open to suggestions on where to cut though
Ethics and limitations don't count towards the page limit. Let me just get the exact reference information for that
Limitations doesn't count towards page limit: https://2023.emnlp.org/calls/main_conference_papers/#mandatory-discussion-of-limitations
Official website for the 2023 Conference on Empirical Methods in Natural Language Processing
"Authors will be allowed extra space after the 8th page (4th for short papers) for an optional broader impact statement or other discussion of ethics": https://2023.emnlp.org/calls/main_conference_papers/#ethics-policy
Ser you pinged everyone in the server
For me future work has only negative aspects: (1) another valid title for all points under it is "things we didn't do"; (2) it's very rare that things mentioned here are actually done, so they remain as evidence of promises authors made but didn't follow up on. So I always prefer to keep sections like that completely out - big obvious omissions can be mentioned in Limitations
I see. So are we supposed to have a distinct page after Conclusions and before references?
This sort of format is new to me so you'll have to bear with me.
Yeah, you can put a \newpage after conclusions. But it's generally a good idea to make sure that the conclusion goes until the very last line of page 8 anyway
No worries. *ACL conferences are really different from the rest of ML
I would highly recommend a \newpage for consistent formatting of the subsequent text (otherwise it could bump around plots in the appendix)
@young sparrow the plot is even better if we only consider non-embedding params
Yeah I ran that locally and didn’t post it yet 🙂
after (sry buggy. see below for update) vs before
It looks really good
wait fr? I thought forum channels were like a thread where only current participants get pinged. I feel like we'd be brigaded by now if I truly pinged everyone.
i am running L32-D2560-T16/32/64 (T16 done)
In a thread you wouldn’t have pinged everyone, and I think forums work like threads
I can confirm you didn’t ping everyone
Awesome. So ignore all of my space concerns then. Updating my work items to reflect this. (Done)
points 2 and 4 will be conditioned on scaling laws work v likely
corrected. good fit even for L6 and T1
I have a question. If I understand the paper correctly (and maybe I don't), you have an explicit bias toward more recent tokens. Wouldn't that degrade the results for some modeling tasks that are not necessarily language related? That kind of bias isn't present in transformers.
at least presented in alibi transformers
i think it's okay to introduce some locality bias because that fits most data we care - text, image (if we have 2D RWKV), music, time series, etc.
for #7: updated captions for 4, 6. Hopefully its better now. feel free to rephrase. Will do #6 in a few hours (8 for sleeping)
I guess that's a fair compromise when your token length is practically infinite. though it'd be interesting if we could balance that bias in different ways
or use RWKV-5 🙂 complex-valued decay = rotate
My name : Atsushi Saito, email: [email protected]
fig 6 updated to remove "cuda"🙂
It's hard to see why this is happening. The spacing seems unchanged when I remove the author block
this wont survive in the camera-ready version. but i dont know if that is allowed
i think this: \titlebox{6.8cm} is the offender
I removed that and it didn't fix it either
ah sorry \maketitle is the responsible
That doesn't have much explanatory power. That's the command that tells LaTeX to display the title block, but could mean anything is to blame.
I noticed that in 4.6 we referred to the implementation as RWKV-LM, but later, we go straight into RWKV-4. (it might not be clear to some readers). Perhaps we could change RWKV-LM to RWKV-4, or smoothen the transition?
Also, it may be better to change RWKV to RWKV-4 under Appendix G Inference results to be consistent with the other figures.
Yeah we never give reasoning for the "4" either. It's weird to claim we're on RWKV-4 for a paper introducing RWKV.
I propose either changing all of these instances to RWKV/RWKV-LM or explaining what RWKV-4 is. Whichever @obsidian quest prefers.
A bit of negative vspace around the titlebox is generally OK for ACL subs
My vote is for RWKV or RWKV-LM
My vote is for RWKV, there isn't a non-LM RWKV and chars take page space
yeah just RWKV
Something appears to be overriding our ability to move the top of the text at all. Even using \vspace won't move it upwards
A bit sad, the network represented best with the Cell in my opinion, is “just” a recurrent network not necessarily connected to a LM
For example can be applied as BiRWKV for sequence labelling, it works amazingly well
Yeah, completely agree the architecture is generalisable. Maybe when that's actually done, the name can change meaning - just like with attention and transformers, which were both designed for & presented as NMT approaches, then outgrew that task
What does this mean
What is the "language model" that RWKV is "connected" to?
... do we have numbers on this? Can we put it in the paper?
I have fixed this now by copying the original emnlp2023.sty from official website. Our emnlp2023.sty has been modified to fit many authors into the title section.
I don’t think is worth it. My focus is (well, was) dependency parsing so I just experimented with a variant of https://aclanthology.org/Q16-1023/ replacing the lstm with rwkv. However, it is not so cool anymore this task and it would be not so effective for this paper. Maybe for a follow up subject to show out of the box improvements in old fashioned tasks
Yeah we could train a suite of BERT-like models and finetune them on standard BERT-applications
I know @obsidian quest has talked about doing something like ViT using RWKV too
Oh interesting.
Oh shoot I think I did that for the arxiv submission
oops
Got it and agree. I’ve been running an “ELMo” variant with rwkv just for fun. Same dataset as the original so benchmarking is possible.
decrease the number here will do \titlebox{6.8cm}
No it doesn’t.
uncomment \setlength\titlebox{6.8cm} and decrease the number will do ( I tried), and the current workaround with \begin{comment} is also viable. But i am not so sure if these two methods would violate the requirement of formatting
We have already fixed this problem, please stop touching it
ok
Does anyone know if the RWKV implementation in transformers is reliable yet
bf16 inference and training should be all good now, not sure about fp16 inference
yet there are reports on the cuda kernel not successfully compiled... not really reliable yet, use carefully
It didn’t launch out of the box :/
I have runs of BoolQ and MMLU on Pythia / OPT / BLOOM if anyone wants to run the comparison in RWKV
I've submitted a version along with the abstract for EMNLP.
If you did not receive an email from OpenReview: This means you haven't both:
(1) Created an OpenReview account
(2) Sent either me or this channel the email associated with that OpenReview account
If you didn't receive an email, please do these steps by tomorrow. Once we have more authors on the openreview, we can re-order them alphabetically.
@young sparrow and @obsidian quest -- Your scaling laws plots are the last outstanding results. What's the status? What needs done and who can help?
The code Blink sent me with some changes doesn’t work for me, I’m trying to debug it.
i sent an excel file with all datapoints
I'm actually pretty happy with the writeup and overall storyline. If anyone knows academics who can give us good feedback, it would be good to receive that.
Lead authors -- Do a pass now and update anything you don't like. If you need help updating, message work items here.
[we’re talking in DMs]
Just created an account, my email is [email protected]
Sasha Rush mentioned he was missing some ppl plots/figures from the arXiv draft - not convinced these make sense for cross-architecture comparisons but, that was the feedback
One complaint I’ve heard is that people don’t think that 6 evaluations are enough anymore. If we can run MMLU, BoolQ, Natural Questions, HellaSwag, TriviaQA, and RACE that would give us a lot more comprehensive of a picture, and most of the plots from the LLaMA paper (missing math stuff we can’t run right now and code evaluations)
(We’ve also gotten the same feedback about Pythia)
Just added another batch of authors. Some that I'm still missing:
- ~~Michael Chung ~~
- Xuzheng He
- Przemyslaw Kazienko
- Jiaming Kong
- Bartlomiej Koptyra
- Hayden Lau
- Atsushi Saito
- Bolun Wang
- Ruichong Zhang
- Qihang Zhao
- Peng Zhou
Haowen Hou
idk what this means
Is anyone able to pick this up? We'd need these results by 6/23. Need two volunteers
PPL is an abbreviation of Perplexity, Sasha is lead scientist at Hugging Face and a Harvard prof, usually gives reasonably strong signal
Quentin knows all of those things. What he doesn’t know is what Sasha actually wanted us to include.
There isn’t such a thing as a ppl plot/figure, and saying we should include one doesn’t mean anything.
don't we have 13 evaluations : LAMBADA PIQA StoryCloze16 Hellaswag WinoGrande arc_challenge arc_easy headQA openbookQA sciq triviaQA ReCoRD COPA
There are six in the main body, are these in the appendix?
We do. I just thought Stella's specific evals are the current "llm scoring meta"
If it's just the number of decent evals that matters then we're fine
I think that MMLU is probably important to include
This should be resolved by the new scaling plots I believe
But, if someone explains how to run the model through the eval harness (HF is still borked) I can take care of things
Evals were already done before we started the arxiv. @obsidian quest or @tropic minnow -- Who ran these evals and how can Stella reproduce them?
@obsidian quest did here #1103039376184852622 message
and the plots for the 6 tasks were done with this: #1103039376184852622 message, maybe @serene badge can comment more on any other mods
i ran them and you can try reproducing them using HF package (to check the correctness of HF too)
The HF package doesn’t run currently
whats the error? can tell them
I’ll reproduce it in a bit and let you know
Using pretrained=RWKV/rwkv-4-169m-pile raises
File "/home/mchorse/ollmer/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/mchorse/ollmer/conda/lib/python3.10/site-packages/transformers/models/rwkv/modeling_rwkv.py", line 384, in forward
attention, state = self.attention(self.ln1(hidden), state=state, use_cache=use_cache)
File "/home/mchorse/ollmer/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/mchorse/ollmer/conda/lib/python3.10/site-packages/transformers/models/rwkv/modeling_rwkv.py", line 308, in forward
receptance, key, value, state = self.extract_key_value(hidden, state=state)
File "/home/mchorse/ollmer/conda/lib/python3.10/site-packages/transformers/models/rwkv/modeling_rwkv.py", line 300, in extract_key_value
key = self.key(key)
File "/home/mchorse/ollmer/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/mchorse/ollmer/conda/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 114, in forward
return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`
Meanwhile BlinkDL/rwkv-4-pile-169m appears to be misconfigured as it lacks a config.json
okay now it's working and I have no idea what changed
...
@obsidian quest Do you have the smaller models with all these benchmarks too? Or just the biggest RWKV
This code doesn't run because it relies on a file called rwkv.csv, can you share that file
all models from 0.1B to 14B are there
Oh I missed that column
and that's the .csv I was looking for, isn't it
Okay, so we actually have pretty comprehensive evals they're just not fully presented
@obsidian quest the .csv you shared says that the context length of the largest model is 8192 and that the 3B model is 4k. Is that correct? What's the context length for the models that don't have a listed context length?
not listed -> ctxlen 1024
Why does the context length change
all are trained with ctx1024 for 1 epoch, and then finetuned to 2k => 4k => 8k
Right, but why are we comparing evaluations on models of different context lengths
Why isn't it consistent
longer ctxlen => slightly worse zeroshot if everything being equal
because these tasks only care abt short ctxlen
it's just that 4k and 8k models are trained longer, so 7B & 14B can gain some advantage from this
dude
1.5B & 3B ctx4k are slightly worse than ctx1k for this
You can't do this in a paper
you can list all ctx1k numbers
Do you have context 1k numbers for all the models? The csv you sent doesn't for 3B or 14B
you can list all ctx1k numbers
RWKV-4 3-ctx1k 5.24 57.52% 63.94% 73.72% 70.28% 59.63% 59.43% 31.83% 64.27% 28.74% 37.60% 85.70% 11.07% 80.56% 81.00%
R14 ctx1k 14.2 3.81 63.54% 71.05% 77.42% 75.57% 70.24% 62.98% 38.31% 70.71% 32.28% 40.60% 90.10% 24.06% 85.73% 87.00%
just create an account, my email is [email protected]
The 13 benchmark results of RWKV-4, Pythia, GPT-J are included in RWKV.csv.
The 6 benchmark results ("lambada", "piqa", "winogrande", "arc_challenge", "arc_easy", "sciq") of OPT, BLOOM come from the pythia/result directory of the pythia repo.
Seems the json files of OPT, BLOOM do not contain the other 7 benchmarks ("triviaqa","storycloze16","hellaswag","headqa","openbookQA","record","copa").
I think that's why in the script from @rich raptor , we only plot figures for 6 benchmarks.
The right way to solve this problem is to run those models on the additional benchmarks. I’m currently doing this
this was a bad node
Just created an OpenReview account for the email that I've already sent to you
I just created OpenReview account too. Been so busy with my final exams
L32 D2560 T64 pred_loss 2.047399
@obsidian quest I'm trying to add the exact hyperparameters to the Appendix. In #1083107245971226685 message you presented 6 column groups, are they in the order of 14B/7B/.../169M? In each column group, is the last column tokens trained? Also, it seems your adjustment on batch size during training is not directly visible in this table?
14/7/3/1.5/0.43/0.169B
yeah tokens trained (ends at 332G tokens)
yeah invisible
(another point influencing the reproducibility)
I can try to recover it though. All training is done with ctxlen=1024 right?
I'm a bit confused, is the batch size = 128 samples for each GPU? Cause in the LR history file it shows 8043 steps for 332 billion tokens, which comes to ~40000 samples of 1024 tokens each step.
Also through analyzing the Gtokens I don't observe any batch size change. It goes smoothly all the way down.
(all 315 * 128 * 1024, guess you are using 315 GPUs or nodes lol)
8043 "miniepoch". lots of steps in a miniepoch
i use 128 or 256 as total bsz. or you may say 128x1024 or 256x1024
real steps per miniepoch = 40320 / bsz
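For anyone checking the miniepoch arithmetic above, here is a minimal sketch. The constants come straight from the messages above (40320 samples per miniepoch, ctxlen 1024, 8043 miniepochs); the function names are made up for illustration.

```python
# Back-of-envelope check of the miniepoch numbers quoted in the chat.
SAMPLES_PER_MINIEPOCH = 40320  # "real steps per miniepoch = 40320 / bsz"
CTX_LEN = 1024                 # tokens per sample (ctxlen=1024)
MINIEPOCHS = 8043              # entries in the LR history file


def steps_per_miniepoch(total_bsz: int) -> float:
    """Optimizer steps in one miniepoch for a total batch size given in samples."""
    return SAMPLES_PER_MINIEPOCH / total_bsz


def total_tokens(miniepochs: int = MINIEPOCHS) -> int:
    """Tokens seen after `miniepochs` miniepochs; this does not depend on batch size."""
    return miniepochs * SAMPLES_PER_MINIEPOCH * CTX_LEN


print(steps_per_miniepoch(128))   # steps per miniepoch at bsz 128
print(steps_per_miniepoch(256))   # steps per miniepoch at bsz 256
print(total_tokens() / 1e9)       # total tokens in billions
```

The token total comes out at roughly 332G regardless of whether bsz is 128 or 256, which is why the batch size change is invisible in the Gtokens column.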
so we won't be able to report the accurate batch size then i guess?
you can, from https://wandb.ai/blinkdl/RWKV-v4-Pile histories
but there are 2068 runs
because all runs are killed multiple times due to server issues
apply filter for nlayer & ndim & ctx1024 & datafile = BlinkDL/pile/pile_20B, and check the run around the release date on HF
let me put the numbers we have in hand into the paper first
if i still have time later but not too late, i'll try dig it out
Added Appendix Hyperparameter.
related cross-reference is also brought back (previously commented out)
@sullen horizon hows LRA going?
Are you discarding the ten points with the least compute here?
i am using these 12 points
Why not 6 512 1.0
Why those numbers specifically? Even using them, I'm unable to reproduce your fit and looking at the plot it's not at all clear why those were chosen
use non-embedding params
Is that 2*V*D + 13*D*D*L
simply 13*D*D*L
(rerunning, vaguely embarrassed I missed that)
So how did you pick these specific points to include in your fit
ok pls use this. the idea is to pick larger models as T grows
for example, the optimal T for L12 D768 is likely around 3
very small & very early results are outliers
Okay, but why not this one?
This one actually shows all the compute-optimal values
I'm worried about the excessive reliance on heuristics
shouldn't this be a simple envelope?
RWKV@LRA code is here https://github.com/diggerdu/rwkv-long-range-arena (based on s4, I will update readme later)
I don't see any other points that dominate it
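The "simple envelope" idea can be sketched as below: keep only the (compute, loss) points that no cheaper-or-equal point beats, with no hand-picking. This is an illustrative sketch, not the script used for the paper's fit; the function name and the toy data are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (compute, loss)


def lower_envelope(points: List[Point]) -> List[Point]:
    """Return the points not dominated by any point with <= compute and <= loss."""
    frontier: List[Point] = []
    best_loss = float("inf")
    # Sorting by (compute, loss) puts the best point at each compute level first.
    for compute, loss in sorted(points):
        if loss < best_loss:
            frontier.append((compute, loss))
            best_loss = loss
    return frontier


# Toy data only, not real runs.
toy = [(1.0, 3.0), (2.0, 2.5), (3.0, 2.7), (4.0, 2.0), (4.0, 2.2)]
print(lower_envelope(toy))
```

A power-law fit would then be run on the frontier points only, so the selection depends on the data we have rather than on intuition about untested models.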
because we are still missing some intermediate models here
What do you mean “intermediate models”? Do you mean partially trained ones?
like L9-D768, L18-D1024, etc.
Do you have data for those models? They don't seem to be on WandB
i mean we havent tested them
So no, you don’t know that they perform better
It’s really important on a scientific level to not make things up like that. If you want to run them great, let’s add them. But you can’t say “oh I know how this experiment we haven’t done will turn out”
nice! do you need/want anything in order to run it?
yea i think if we don't have a better datapoint in our data, then that point is the optimum we have been able to get so far. imo the methods should be as good as possible, even if they don't account for corrections that we might have intuition on but are unproved so far.
This is the best we can get with the current data. In the last plot, we see the slope of the line corresponding to Chinchilla scaling. I do believe that this line is likely much closer to the true value, but we don't have the sampling density to really tell.
(click on images to see the equation for the trend line and r^2)
There's a bit of missing data still running (will be done by the end of the day today) but I otherwise have the missing plots as well
what's going on with boolq
Placeholder zeros stand in for missing values to keep the script from crashing
ah ok
aka "I forgot to save most of the BoolQ results"
Very cool. OPT is weird in COPA, interesting finding. Could I ask for bigger symbols? These red/green tones are a pain to distinguish and the up/down triangles aren't so distinct this size. Happy to edit the graphing code
Yeah I've put no effort into the data viz, planning on doing that in a bit
@obsidian quest I'm noticing that some of the RWKV evaluations are using acc_norm and others are using acc. Do you have all the results for acc?
@everyone -- Many of these still need done. If you don't create and send me these OpenReview accounts by the paper deadline you will not be an author.
Plots now look like this (see paper for all of them), but if you'd like to fiddle with it more you're welcome to
hello, i think i may find a little bug about RWKV initialization. In the paper, we said that we initialize all W_{r}, W_{k}, W_{v} to be zeros, but it is not the case in the RWKV-4
this line of code is never used in the Init function (which i believe is to control the zero initialization)
by adding a print debug line here, i also find that the matrix is not initialized to 0
does anyone also notice this? initializing all the parameters to be zeros seems a little bit weird
please check against v4neo (instead of v4)
see #1103039376184852622 message
pls fix that part
In RWKV-v4, this line of code is never called. So, i believe there is no parameter matrix initialized to all zeros
I am testing v4neo
but i think our paper should correspond to v4?
sorry, but i can't run v4neo because of this AttributeError: partially initialized module 'charset_normalizer' has no attribute 'md__mypyc'
but it looks good
so please fix this in the appendix. Thanks
thanks for response
no it's v4neo
(I guess it's worth mentioning somewhere in the paper, or just clean up the obsolete ones in the code base)
git always keeps history, so leaving them there unused is unnecessary
@fickle hare We should absolutely make a cleaned up codebase that only has the necessary components. The current codebase is pretty unusable to a new person.
I've been working on a new Lightning 2.0-based trainer using the new CLI (the most recent improvements are by @void quartz). It's pretty usable now for finetuning, but data preprocessing is still in a preliminary state, and model initialization is missing. Just too busy these days.
@obsidian quest which of the models are the ones hosted on the RWKV HF page? How many tokens were they trained, did they do sequence length extension?
glad to see someone is using Lightning2.0 and its CLI rather than argparser or hydra🙌
The rwkv-pile series are all trained on the 332G Pile. Checkpoints with ctxlen>1024 in the file name come from sequence extension.
None of the models on HF have that file name, so those models haven’t been uploaded to HF yet?
ah
let me see
such as this one
However I now wonder if the 8043 epoch checkpoint is anywhere... https://huggingface.co/BlinkDL/rwkv-4-pile-14b/blob/main/RWKV-4-Pile-14B-20230213-8019.pth for example 8019 is the latest I could find
Blink told me that the official versions were going to be the ones on the RWKV org page
i see. then i have no clue which checkpoint did they convert to HF format 😭
you can find it here - if you want to read through it as an alternative - the existing v4neo has lots of "experiment flags" and can be hard to read : https://github.com/Blealtan/RWKV-LM-LoRA/tree/dev-infctx
( I am still helping bugfix and test it by using it extensively in my current experiments - will be helping adding the missing model init / preprocessing - cause i need it too 😉 )
at some point it should go to a distinct repo and migrate to HF model, after state chained backward is supported by transformers.rwkv
The "attention free models" section of the Related Work section was getting a little long so I split out the RNNs into a third subsection.
I also added some details about Hyena, as that's simultaneous work where they train a single-digit billion parameter state space model (and compare to us!)
[email protected] @last mauve
for anyone who is interested in a clean code base of RWKV for comparison with GPT-series
Amazing!
@young sparrow -- Are you able to get your scaling + eval plots in today?
I'm doing it now
Paper's due friday and I want ppl to be able to update writing accordingly in time
Figure 3 (0-shot performance on LM eval): Any intuitions on why Pythia performance drops significantly for the point with highest compute?
I filled in a couple missing data points with 0s. They're currently running
Gotcha!
@obsidian quest The paper currently says:
The number of parameters for each model is computed using the formula: $\text{\# parameters} = 2VD + 13D^2L + D(11L+4)$ where $V$ = 50277 is the vocabulary size, $D$ represents the Model Dimension and $L$ corresponds to the number of layers. FLOPs is for a forward pass for one token. It was calculated as $2(2VD + 13D^2L)$, which is twice (add and multiply) the number of parameters in linear layers. The backwards pass FLOPs can be approximated as twice that of the forward pass, giving a total of $6(2VD + 13D^2L)$ FLOP per token. Notably, this matches the standard formula for FLOP calculations in transformers \citet{kaplan2020scaling} $$\text{FLOP} = 6\cdot [\text{\# tokens}]\cdot [\text{\# parameters}].$$
Can you confirm that this is correct
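A quick way to sanity-check the quoted formulas is to plug in a config. This is a minimal sketch: the formulas are copied from the paragraph above and from the "13*D*D*L" non-embedding count mentioned earlier in the thread, while the function names are made up for illustration.

```python
VOCAB_SIZE = 50277  # V in the quoted formula


def linear_params(d_model: int, n_layer: int, vocab: int = VOCAB_SIZE) -> int:
    """Parameters in linear layers: 2VD + 13 D^2 L."""
    return 2 * vocab * d_model + 13 * d_model**2 * n_layer


def total_params(d_model: int, n_layer: int, vocab: int = VOCAB_SIZE) -> int:
    """Full count from the paper text: 2VD + 13 D^2 L + D(11L + 4)."""
    return linear_params(d_model, n_layer, vocab) + d_model * (11 * n_layer + 4)


def non_embedding_params(d_model: int, n_layer: int) -> int:
    """Non-embedding count used for the scaling fit: 13 D^2 L."""
    return 13 * d_model**2 * n_layer


def forward_flops_per_token(d_model: int, n_layer: int) -> int:
    """One multiply and one add per linear-layer parameter."""
    return 2 * linear_params(d_model, n_layer)


def train_flops_per_token(d_model: int, n_layer: int) -> int:
    """Forward plus backward, with backward approximated as 2x forward."""
    return 6 * linear_params(d_model, n_layer)


# L12 D768, a config mentioned earlier in the thread.
print(total_params(768, 12))
print(train_flops_per_token(768, 12))
```

Note that the FLOP count uses only the linear-layer parameters, so it matches 6 * tokens * parameters only up to the small D(11L+4) term.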
@tough crane were you the person who put together the evaluation tables in the appendix?
yes, in part, but I did not compute the metric values (someone else pasted in the raw metric values). Are there some wrong or incorrect values?
The models we compare to in the tables are inconsistent with the plots. I think it would be a good idea to update the tables to show the numerical scores from the plots.
(This is a historical artifact of Blink starting this work before Pythia existed)
If you post the code that generates the tables I can update it pretty easily
Is this a latex code or code for model inferences?
The code for generating the plots from the evals
As you pointed out, our script/notebook loads Blink's older experiments from RWKV.csv
Blink's experiment accs
So, did you just manually make the table in the paper?
I couldn't find the converter that produced the exact Overleaf source at the moment, but this is nearly the same one. (I added captions, and someone else probably modified this source.)
@last mauve @tropic minnow The scaling laws and evaluation sections are largely done. I'm tweaking some of the wording of the context length extension experiments because it's not true that quadratic transformers can't scale to a context length of 8k, but don't currently anticipate major changes to the sections. We were far over the page limit, so I commented out Section 2: Related Work and it seems to fit pretty well now. If people are okay with that not being in the main body, I can move it to the appendix.
The current version has the plots in the main body because I find plots much more accessible than tables, but we could flip that and put the tables in the main body (LLAMA does this, for example, but most don't). This would require substantially less space.
Got it. For related work, I'm of the opinion that we shouldn't include all of these eval figures in the main body. I think we should just have a few of the most influential 3-6 evals, then put the rest in the appendix along with tables. The related work section for this paper is especially important since few are familiar with attention-free transformers.
The spacing is janky but I've made that change
I generally dislike reporting mean accuracy across tasks, but that's something we can do here
I have to run, but I can make the new table tonight or tomorrow morning
The more I think about the sequence length stuff the more suspicious of it I am though
This shows loss on the Pile batched by the sequence length of the sequences that we are evaluating on.
The claim this plot is making is that we perform better at predicting long sequences than short ones. Maybe that’s to be expected (though I don’t think so) but the effect size worries me. That’s a huge drop in loss!
The left half of the image is basically meaningless because sequences of a handful of tokens are often noise
But the idea that we see a real loss decrease when subsetting our evals to 8k sequences instead of 1k ones seems suspicious to me
the ALiBi paper has an appendix on “the early token curse” as a cause for ppl decreasing as seqlen increases: https://arxiv.org/abs/2108.12409
Since the introduction of the transformer model by Vaswani et al. (2017), a
fundamental question has yet to be answered: how does a model achieve
extrapolation at inference time for sequences that are longer than it saw
during training? We first show that extrapolation can be enabled by simply
changing the position representation method, though ...
the loss difference between 2^7 = 128 and 2^12 = 4096 is not much
Should we run more experiments at much longer lengths, from 2^13 up to lengths comparable to the settings of models like Hyena?
ALiBi's experiments test up to 6k
I think it would be better, but that there isn’t time
at least cut the half below 128 tokens which is not really meaningful?
Yeah, we should absolutely cut the points below 2^5
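The bucketing and cutoff discussed above could be done roughly like this. It's a sketch under assumptions: it takes per-sequence (length, loss) pairs as input, and the function name and toy values are hypothetical, not the actual eval script.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def mean_loss_by_length_bucket(
    samples: Iterable[Tuple[int, float]], min_log2: int = 5
) -> Dict[int, float]:
    """Group (sequence_length, loss) pairs into power-of-two buckets.

    Bucket b holds lengths in [2**b, 2**(b+1)). Buckets below `min_log2`
    are dropped, since very short sequences are mostly noise.
    """
    sums: Dict[int, float] = defaultdict(float)
    counts: Dict[int, int] = defaultdict(int)
    for length, loss in samples:
        if length < 1:
            continue
        bucket = length.bit_length() - 1  # floor(log2(length))
        if bucket < min_log2:
            continue
        sums[bucket] += loss
        counts[bucket] += 1
    return {b: sums[b] / counts[b] for b in sorted(sums)}


# Toy values only, not real eval numbers.
toy = [(10, 5.0), (32, 3.0), (63, 2.0), (64, 2.0), (4096, 1.5)]
print(mean_loss_by_length_bucket(toy))
```

Setting `min_log2=7` instead would give the stricter "cut everything below 128 tokens" variant that was also suggested.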
Ok so we're near the finish line here
Here's what the average across all 12 NLP tasks looks like btw
@obsidian quest @last mauve @tropic minnow I've done a lot of fiddling with the paper with the primary goal of making sure that all the results in the appendix are actually referenced in the main text, while not going over the page limit. I'm stopping now before I go crazy fiddling over details.
(Feel free to disregard if you don't want to update the submitted paper.)
You fixed every issue I had on my TODO list before submitting 🙂
I'm submitting a final version now. If anyone has any last-minute edits they want reflected before the deadline tonight, ping me here.
hi should remove this arrow
Yes probably. It doesn't represent tokenshift appropriately… will remove it and update the latex figure, but emnlp submission is already done…🙃 so it’ll have to go in the updated version
hi, may i ask what is the detailed setting in benchmarking rwkv inference in the Figure 7 of the paper? From my side, i couldn't get the same results.
this is the code i use: https://gist.github.com/Hannibal046/b57f44779484b466f3d33f537c87443d
and this is the result:
on one A100, float32, no compile, batch=1, generate 1024 new tokens
BTW, the Figure 7 in paper is never referred or explained
Are you looking at the arXiv copy or the new copy
emnlp version
how are you using RWKV?
these were the scripts used
probably it would be better to release scripts in the open for people to reproduce. @snow zealot are you ok?
thanks so much! I would check this
Is it ok for EMNLP?
i notice that rwkv is tested with original implementation rather than HF implementaion
is there any problems with HF implementation yet?
I simply uses this:
The NLP evaluations in the paper use the models found here: https://huggingface.co/RWKV
yeah, i test models from this HF space. But from using model.generate() method rather than forward() with torch profile, there is not that huge gap as shown in the paper
i mean the RWKV codebase is public... and the preprint as well... so i think as long as we dont promote it it should be... but yea we can wait probably
when we run the tests the HF implementation had some bugs
#1076516707201466388 message
It seems the benchmark scripts don't use kv cache for transformer-based models?
I believe it would be a more equitable comparison if we could also pass the KV cache to the Transformer while providing the state to the RWKV. This would ensure a fair assessment of both methods.
kind of, but then you'll OOM faster wont you?
why kv cache would cause faster OOM compared with full computation?
Nice write-up! I think the decoder sequence length and the hidden states of the model might be too small to see a difference here in VRAM. The reason VRAM should be higher when caching the k,v states is because we cache the projected k,v states of every layer. This means that our cache is of size: 2 * (hidden_size) * (num_layers) * (decoder_l...
ok, for 80G a100, batchsize=1, would this be a big problem?
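To put a rough number on that question, here's a back-of-envelope estimate. It uses the usual KV-cache size for standard multi-head attention (one key and one value vector of width hidden_size per token per layer); that formula is my assumption and not necessarily the exact expression in the truncated quote above. The config values are hypothetical, not a specific model from the paper.

```python
def kv_cache_bytes(
    num_layers: int,
    hidden_size: int,
    seq_len: int,
    batch_size: int = 1,
    bytes_per_element: int = 2,  # fp16/bf16
) -> int:
    """Approximate KV-cache size: keys and values for every layer and token."""
    return 2 * num_layers * hidden_size * seq_len * batch_size * bytes_per_element


# Hypothetical config: 32 layers, hidden size 4096, 8192 cached tokens, fp16.
size = kv_cache_bytes(num_layers=32, hidden_size=4096, seq_len=8192)
print(size / 2**30, "GiB")
```

Under these assumptions the cache is a few GiB at batch size 1, so on an 80G card it adds memory that grows linearly with sequence length but shouldn't by itself force an OOM at these lengths.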
hmm i see... okay we can try that? is it easy to setup in HF?
seems all GPTNeoXForCausalLM have the use_cache=True option we could use.
yeah, all AutoModelForCausalLM in HF have use_cache option
but it is only useful when calling model.generate() method
doesn't the forward method work? it seems from the docs it returns KV as well, which you can pass to next call
this is done
here "useful", i mean it was only used in when model needs to generate something. It certainly works for forward() since model.generate() consists of bunches of forward()
ok. ill modify the script and will re-run the experiment soon (tmrw?). with that to see if there's any diff.
I believe this problem presents a certain level of complexity, as the real-time cost is determined by a combination of factors such as the algorithm (architecture) and hardware (V-RAM, GPU generation). There are numerous options for benchmarking the inference speed by combining these elements, such as using a GPU with small V-RAM, a GPU without tensor cores, or a GPU that does not support bf16, among others.
However, the most straightforward approach, in my opinion, would be the following:
start_time = time.time()
new_tokens = GPT/RWKV.generate()
end_time = time.time()
Even with this method, there are still various possible variations. For instance, if we were to test on a GPU with limited V-RAM, a transformer-based model with kv cache might need to perform frequent exchanges between GPU and CPU memory, which could result in significant latency.
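The start/end timing idea above can be wrapped into a small harness. This is a sketch only: `generate` is any zero-argument callable you supply (for example a lambda wrapping a model's generate call), and the helper name is made up. On a GPU you would pass something like `torch.cuda.synchronize` as `sync`, since CUDA kernels run asynchronously.

```python
import time
from typing import Callable, List, Optional


def time_generation(
    generate: Callable[[], object],
    n_runs: int = 3,
    n_warmup: int = 1,
    sync: Optional[Callable[[], None]] = None,
) -> List[float]:
    """Return wall-clock seconds for each of `n_runs` calls to `generate`."""
    for _ in range(n_warmup):  # warm-up runs are not timed
        generate()
    timings: List[float] = []
    for _ in range(n_runs):
        if sync is not None:
            sync()  # make sure earlier work has finished before starting the clock
        start = time.perf_counter()
        generate()
        if sync is not None:
            sync()  # wait for queued device work before stopping the clock
        timings.append(time.perf_counter() - start)
    return timings


# Dummy workload standing in for a model's generate call.
print(time_generation(lambda: sum(range(10000))))
```

Using the same harness for both models, with the same number of new tokens and the same caching settings, keeps the comparison down to the variables you actually mean to vary.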
I would like to kindly recommend that, for a model with favorable time and space complexity during inference, it would be beneficial to utilize a product-level GPU such as the K40 for comparisons with other Transformer-based models. It is worth noting that employing an A100 GPU for serving is not a common practice within the industry.
we run the tests on an a100 80gb gpu which is the best as it can get atm imo
why it is the best experimental setting?
Hi, @tropic minnow, did you finish the code? I implemented it on my side and the results are contrary. This is on one A100 (80G) gpu.
with this gist of code: https://gist.github.com/Hannibal046/b57f44779484b466f3d33f537c87443d
i can't think of a reason not to use kv cache for a transformer model in inference
and the memory usage of this code is also wrong. It keeps gradient.
so i would suggest using product-level GPU / longer context / larger model size to show the superiority of an inference-friendly model, RWKV.
the current figure in the paper is misleading
sorry been busy
hmm well we can try detaching
RWKV HF package is still buggy. pls use rwkv pip package with
os.environ['RWKV_JIT_ON'] = '1'
os.environ["RWKV_CUDA_ON"] = '1'
example code: https://github.com/BlinkDL/ChatRWKV/blob/main/API_DEMO.py
hi, @obsidian quest where is the buggy part of RWKV HF implementation? Maybe i can help fix it. The main point here is that if we use a large enough and fast enough GPU(A100) to benchmark inference speed, the Transformer is also linear. Check this:
I would recommend a detailed derivation here: https://kexue.fm/archives/8610.
you should show "time to generate the # token" (the derivative) instead of cumulative time
try rwkv pip package first
with the correct implementation, rwkv will be like a const line around 10ms
ok, i would try rwkv pip
hi, I tried rwkv package and this is the result
but i am not so sure if this is a fair comparison
rwkv pip package is using pytorch for almost everything, except the WKV operator
while HF transformers are using MHA operator instead
both WKV and MHA are CUDA operators, so i will say it's a fair comparison
but rwkv pip uses torchscript, right? Actually, transformers could be fast enough with various optimization techniques (e.g. vllm)
and i have looked through the HF implementation, while no obvious bugs found
one possible difference was: HF implementation doesn't use wkv kernel when doing inference
oh, i find that HF implementation could be significantly boosted by using torch.compile
can you try it for rwkv-pip too?
the pip version is using torch 1.x jit, might be less efficient than 2.0 compile
hi, I tried. But jit in rwkv-pip is not compatible with torch.compile
I am using HF implementation with torch.compile and longer context length
this complies with the induction that Transformers are only quadratic when # tokens is big enough
transformers per-token speed = const factor + linear factor
accumulated time = linear factor + quadratic factor
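The relation between the two views can be sketched as follows: differencing the accumulated times gives per-token latency, and summing a "const + linear" per-token cost gives a "linear + quadratic" accumulated time. The function names and the toy cost model are illustrative assumptions.

```python
from typing import List


def per_token_latency(cumulative: List[float]) -> List[float]:
    """Turn accumulated times (one entry per generated token) into per-token times."""
    latencies: List[float] = []
    previous = 0.0
    for total in cumulative:
        latencies.append(total - previous)
        previous = total
    return latencies


def accumulated_time(n_tokens: int, const: float, linear: float) -> float:
    """Total time when token i (0-based) costs const + linear * i.

    Summing over i gives const * n + linear * n * (n - 1) / 2,
    i.e. a linear term plus a quadratic term.
    """
    return const * n_tokens + linear * n_tokens * (n_tokens - 1) / 2


print(per_token_latency([1.0, 3.0, 6.0, 10.0]))
print(accumulated_time(4, const=1.0, linear=1.0))
```

When `const` dominates (short contexts, fast GPU), the accumulated curve looks linear for both architectures, which fits what the 1k-token plots show.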
can you show how to do torch.compile for HF rwkv thank you
Check the gist of code above
Agree, but the figure in the current paper compares transformers without kv cache with rnn-rwkv
From long former paper
we can compare torch.compile(HF implementation) of all models (and with kv cache)
accumulated & per-token
yes, that is what i mean
the current one only evaluate on 1k context, where the Transformer and RWKV are both linear
hope this could be reflected on the next version of our paper
yes it can be! if we could get all figures ready (this one for 3b), but also about 8192 tokens generation for different sizes that would be great
I'm also hoping to have MMLU numbers in the next version (though have deprioritized this as we can't update the paper for a couple months still)
one click run. The commands are attached below. And you can simply switch to any HF model. https://github.com/Hannibal046/nanoRWKV/blob/main/benchmark_inference_time.py
why can't we update the arxiv version now? EMNLP seems fine for this?
do you have the MMLU numbers for now? happy to see
EMNLP is very explicitly not fine with this. Doing so will get our paper rejected.
hhhh, i didn't see it in call for paper. I must have missed something
You may not make a non-anonymized version of your paper available online to the general community (for example, via a preprint server) during the anonymity period. Versions of the paper include papers having essentially the same scientific content but possibly differing in minor details (including title and structure) and/or in length.
[...]
You may not update the non-anonymized version during the anonymity period, and we ask you not to advertise it on social media or take other actions that would further compromise double-blind reviewing during the anonymity period.
https://2023.emnlp.org/calls/main_conference_papers/#anonymity-period
got it. thanks
it's bad at multiple choices. requires more such training data lol
Pythia is trained on the same data
@here I've made a short survey that I would appreciate people taking a moment to fill out. The primary goal is to get a better understanding of who comprises the members of our community. It should just take a minute and will be very useful 🙏
https://twitter.com/BlinkDL_AI/status/1677593798531223552 A tiny RWKV with 2.9M (!) params can solve 18239.715 * 9.728263 or 4.2379 * 564.778 - 1209.01 etc. with CoT, while being 100% RNN (L6-D192) 🤯
A tiny #RWKV with 2.9M (!) params can solve 18239.715 * 9.728263 or 4.2379 * 564.778 - 1209.01 etc. with CoT, while being 100% #RNN (L6-D192) 🤯 The trick: generate lots of data with reversed numbers (denoted by "f" here) to train the model 🚀 Try it now: https://t.co/l7CDb6Rirl
Hey all, I've just ported RWKV to Fortran! 🚀 Please take a quick look here: https://github.com/FortAI-Hub/rwkv.f90. Would love to hear your thoughts!
@obsidian quest Let’s start keeping notes on adding languages to RWKV, in case you want to write another paper. It’ll make it easier to not have to go back and figure out what was done after the fact!
Not sure whats the procedure for paper feedback / corrections is - rwkv is cited here : https://arxiv.org/abs/2307.08621 - as a model without "training parallelization"
(hoping for someone here to know the process)
In this work, we propose Retentive Network (RetNet) as a foundation
architecture for large language models, simultaneously achieving training
parallelism, low-cost inference, and good performance. We theoretically derive
the connection between recurrence and attention. Then we propose the retention
mechanism for sequence modeling, which supports...
the current implementation of RWKV training is indeed recurrent
but in theory, i believe it is also parallelizable
wkv_{t} actually doesn't depend on wkv_{t-1}
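To make that point concrete, here's a single-channel sketch of the WKV operator as I understand the RWKV-4 formulation (decay `w`, bonus `u`, keys `k`, values `v`). It is a naive illustration without the numerical-stability tricks of the real CUDA kernel, and the function names are made up. The recurrent form carries a running numerator and denominator; the direct form computes each output from `k` and `v` alone, without reading any earlier output.

```python
import math
from typing import List


def wkv_recurrent(k: List[float], v: List[float], w: float, u: float) -> List[float]:
    """RNN-style WKV: carry a decayed numerator `a` and denominator `b`."""
    a = 0.0
    b = 0.0
    decay = math.exp(-w)
    out: List[float] = []
    for kt, vt in zip(k, v):
        bonus = math.exp(u + kt)  # extra weight for the current token
        out.append((a + bonus * vt) / (b + bonus))
        a = decay * a + math.exp(kt) * vt
        b = decay * b + math.exp(kt)
    return out


def wkv_direct(k: List[float], v: List[float], w: float, u: float) -> List[float]:
    """Direct WKV: each position is a weighted average over past tokens."""
    out: List[float] = []
    for t in range(len(k)):
        num = math.exp(u + k[t]) * v[t]
        den = math.exp(u + k[t])
        for i in range(t):
            weight = math.exp(-(t - 1 - i) * w + k[i])
            num += weight * v[i]
            den += weight
        out.append(num / den)
    return out


k = [0.1, -0.3, 0.5, 0.0]
v = [1.0, 2.0, -1.0, 0.5]
print(wkv_recurrent(k, v, w=0.5, u=0.2))
print(wkv_direct(k, v, w=0.5, u=0.2))
```

The two agree up to floating-point error. The direct form costs O(T^2) per channel, so whether it counts as "parallelizable training" depends on the definition being used, which is what the discussion here turns on.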
this retentive model uses a bunch of tricks to train while only referring to RWKV as Transformer with Time-mixing..
ahh so if i understood you right, they are using the stricter definition of training parallelisation? So they ain't wrong - but in practise it's a meaningless distinction, because we can saturate our GPUs either way
and i am not sure if they are using torch.complextfloat, which may cause additional overhead
also curious, is there any Linearized Attention models scaling up?
i dun think they changed the RWKV code much - imo, cause there isn't a reason to do so
I guess it boils to the definition of how you define parallelization. This is currently my understanding on how RWKV runs in parallel.
x axis, is tokens, y is the layers somewhat, orange is layer norm, purple is time mix, green is channel mix
like strictly speaking everything past the first layer norm does depend on the previous tiles - so if you define parallelization as being able to "compute independently" of other tokens then yes - we are "not parallelizable" in that regard i guess?
even though in practise RWKV is still able to rapidly ramp up, and saturate the GPU across the multiple layers
which fits my understanding of "training parallelization" where it is more of "can we split the training process of a single data sample into enough threads to saturate a GPU" haha
please tell which repository implement this computing flow of RWKV
if you're referring to the diagram? - i dun know if it's fully implemented, or partially implemented
its a visualization i have on what is potentially the optimal flow for RWKV from my understanding of the architecture
i assume it's at least partially implemented in the main repo, with pytorch / JIT / etc. Otherwise we would never be able to saturate the GPU
( might need to get blink to confirm / deny how it flows in the main repo )
IMO, computing green box(channel mix) in parallel would be much faster..
i have read the source code in the main repo, it is computed sequentially layer-wise and time-wise
yea, but pytorch if i understood correctly, builds a computational graph, and automatically split up the work to run in parallel ?
the question is more of does it actually do it the way we understand it to be haha
if they consider large convolutions to be "parallelizable", then RWKV is certainly parallelizable
so i guess next step is to ping the author? not sure if they listed the twitter social media account in the paper (probably not?)
ok it's basically linear transformer + xPos + exponential decay
so most of the tricks are parallel to rwkv, and i can add them too
now coding it to test
Haha. I will gladly run some experiments if you let me know the changes
agreed. Another kind of linearized attention
their code would be released within one week as said in the github repo
Someone made an unofficial implementation: https://github.com/Jamie-Stirling/RetNet
wow, amazing!
does anyone know much about complex in torch? wouldn't this cause huge latency compared with fp16 with tensorcore?
RWKV-5 preview with trainable time_decay
add --my_testing "r" to use it
https://github.com/BlinkDL/RWKV-LM/commit/9143748f8079e7d3c726c2b98a83681242da30f7
now with trainable time_first too:
https://github.com/BlinkDL/RWKV-LM/commit/686c962008676809f17cf2424c193d9dc217c0e4
For followup paper ideas, to the RWKV paper - would it be best to post it here, or another thread under publishing-help ?
are there some promising results for rwkv-5?
Sort of, though its not part of rwkv-5 yet.
I think i will outline it here first (let me know if i should repost this separately). As this is a compilation of an ongoing experiment between me and a few members of the RWKV community.
RWKV memory experiment v5/wavenet - update 1
While RWKV is able to match transformer performance on a wide variety of tasks, it generally stumbles on tasks with large data inputs or randomised datasets that need to be compressed and stored within its internal state by the model (large document Q&A is a major example) - within the RWKV community, this is considered its "only weakness"
As such, an ongoing effort to quantify and benchmark this memory capacity was started, where we measure the model's performance on receiving randomised English word tokens and replying with said tokens
Instruction: Repeat this text exactly as it is
Input: <random word tokens>
Output: <output to benchmark>
In general, transformer models trained to handle this task have no issue with the lookback and providing a full response (within their context length)
The following is the score for raven / custom rwkv4 models
It is important to stress that this should be considered the worst-case scenario for memory capacity, as the raven model has been shown to be able to compress large common concepts into its memory, far exceeding these numbers.
Randomized text was intentionally chosen to represent worst-case numbers, as training cannot help form a pattern for this text
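A rough sketch of how a benchmark sample like the one above could be built and scored. The word list, the helper names (`make_prompt`, `mismatch_rate`) and the scoring rule are assumptions for illustration only; the actual benchmark notebooks may tokenize and score differently. The only things taken from the description are the prompt template and the convention that 0 means perfect recall.

```python
import random

# Hypothetical word list; the real benchmark draws from a much larger English vocabulary.
WORDS = ["apple", "river", "stone", "cloud", "tiger", "violin", "marble", "lantern"]

def make_prompt(n_words, seed=0):
    """Build one recall prompt from n_words randomly chosen words."""
    rng = random.Random(seed)
    words = [rng.choice(WORDS) for _ in range(n_words)]
    prompt = (
        "Instruction: Repeat this text exactly as it is\n\n"
        "Input: " + " ".join(words) + "\n\n"
        "Output:"
    )
    return prompt, words

def mismatch_rate(expected, generated):
    """Fraction of expected positions that are missing or wrong (0.0 = perfect recall)."""
    wrong = sum(
        1 for i, w in enumerate(expected)
        if i >= len(generated) or generated[i] != w
    )
    return wrong / len(expected)

prompt, words = make_prompt(5, seed=1)
print(prompt)
print(mismatch_rate(words, words))  # a model that repeats the input perfectly
```

Sweeping `n_words` upward and plotting the mismatch rate gives the kind of x (input size) vs y (score) curve described for the benchmark.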
Subsequently, with the standardised benchmark we have internally, we came up with a means of training the model from scratch and replicating the results - without needing to train an entirely new PilePlus+Raven model
This allowed us to run experiments on improving memory capacity. The biggest impact as of now is the change to the channel-shift layer, in how tokenshift is done, into a structure that resembles a wavenet. (this is only a few lines of changes)
We now have a TokenShift 430M model that outperforms the raven 14B model in the memory recall task - furthermore, this is shown to scale upwards, with our TokenShift 1.5B model doubling the 430M performance
We are training a slightly larger model (L24-D5120), which we believe will be able to retain more than 1k tokens in memory, putting this within transformer-level context sizes.
It is believed that these modifications to raven 14B would allow it to have perfect recall of 4k tokens (or higher)
(Currently our experiments are bottlenecked by our GPU capacity)
We posit that this heightened perfect recall of token memory, at par with transformers context length, could remove the last obstacles preventing RWKV (or other RNN like) architectures from superseding transformers without any compromise.
As it fixes the last set of tasks that it loses out to transformer models in
Notes:
The tokenshift memory models trained have had very limited general text model training; we do not know as of now whether this process will benefit or hinder subsequent model performance in other tasks if trained on the pile + etc - the assumption is that it will be an overall benefit. Changes were only made to channel mixing, with time-mixing kept the same, which we believe will help it retain existing reasoning capabilities.
Since blink's upcoming RWKV-5 changes are only made to the time-mix layer, these changes could potentially be merged and used together.
Currently we do plan to perform memory training and testing on the time-mix rwkv-5 changes, without the tokenshift changes - and subsequently with
we drafted the following abstract, and since the members involved have limited to no experience with papers - nor the GPU capacity to take this idea further than memory training (ie. pile+, and instruct tuning)
here i am 😅
(wavenet architecture, on how the token information flow through the layers)
the change is to swap from this causal convolution structure here, to the dilated wavenet above
at least for the first 12 layers
( @misty cedar was the one who made the bulk of the code changes, for these advancements 😉 )
Does the wave net still have an RNN form?
Yes
We have rnn inference code already written for it
Is it public or are you waiting to publish before sharing it? I'm curious how you turned a CNN structure into a single state
It's public, you can find details in the rwkv discord,
Basically, you just have an array for each layer that is of shape ( 2**layerID, dims ), you swap out your state object with the last item, and then do a roll on the array
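A minimal sketch of the rolling per-layer buffer described above, assuming the dilated tokenshift at layer `layerID` mixes the current token with the one from `2**layerID` steps earlier. The class name `DilatedShiftState` and the zero initialisation are illustrative assumptions, not the code from the repo.

```python
import numpy as np

class DilatedShiftState:
    """Per-layer RNN state for a dilated token shift (hypothetical helper)."""

    def __init__(self, layer_id, dims):
        # One slot per step of lookback: shape (2**layer_id, dims)
        self.buf = np.zeros((2 ** layer_id, dims))

    def step(self, x):
        shifted = self.buf[-1].copy()            # oldest entry: input from 2**layer_id steps ago
        self.buf[-1] = x                         # swap the current input into that slot
        self.buf = np.roll(self.buf, 1, axis=0)  # roll so the newest entry sits at index 0
        return shifted

# Layer 2 looks back 4 steps: the first 4 outputs are zeros, then inputs 1, 2, 3, ...
st = DilatedShiftState(layer_id=2, dims=3)
for t in range(1, 9):
    print(t, st.step(np.full(3, float(t)))[0])
```

With `layer_id=0` the buffer has a single row and this reduces to the ordinary one-step tokenshift, so the state stays a fixed-size object per layer and the model keeps an RNN form.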
During your training and task, are you using random offset and window size before asking it to recall? If the window and offset are fixed, it's easy to learn to do this even with a 1L long conv and it doesn't generalize.
all the code and the notebook is currently public 👍
we train and test the recall task from 2 tokens all the way to 1000 tokens
Is there link?
For the tokenshift variant, all the notebooks for the runs can be found here : https://github.com/PicoCreator/RWKV-LM-LoRA/tree/rwkv5x-tokenshift-exp-A/notebook/experiment/tokenshift-exp
For tokenshift model C itself its here: https://github.com/PicoCreator/RWKV-LM-LoRA/blob/rwkv5x-tokenshift-exp-A/notebook/experiment/tokenshift-exp/TokenShift-C/TokenShift-C-mem-finetune.ipynb
(apologies if the experiments are mislabeled here and there due to copy and pasting)
For all the benchmark data, including charting, it can be found here : https://github.com/PicoCreator/RWKV-LM-LoRA/blob/picocreator-memory-experiment/notebook/experiment/memory-bench/Charting-benchmark.ipynb
a bit harder to explain than the simplified table shown : the following scores the output for each model - x is the input token size that was tested, y is the score (0 means perfect recall)
the table selects the best score based on their respective criteria (among the various tested prompt length)
layers above 12 are normal unaltered rwkv layers
You may be interested in this paper, seems similar to what you've proposed: https://arxiv.org/abs/2305.01638
The reason I'm asking all these questions is I'm playing around with it.
Efficiently capturing the long-range patterns in sequential data sources
salient to a given task -- such as classification and generative modeling --
poses a fundamental challenge. Popular approaches in the space tradeoff between
the memory burden of brute-force enumeration and comparison, as in
transformers, the computational burden of complica...
Nice to see multi-channel proven here for audio (an experiment we planned to try next in pipeline) 😆
wow, this is crazy, a linear attention-based model with 175B parameters, which could be trained in parallel and do generation recurrently
https://arxiv.org/abs/2307.14995
We present TransNormerLLM, the first linear attention-based Large Language
Model (LLM) that outperforms conventional softmax attention-based models in
terms of both accuracy and efficiency. TransNormerLLM evolves from the previous
linear attention architecture TransNormer by making advanced modifications that
include positional embedding, linear...
Unfortunately there’s no evidence in this paper that it actually works
The only evaluations they do are of partially trained models with 1B parameters or fewer
agreed. The evaluation setting is weak for now. But I heard from an insider that the larger model is being trained now.
Call me crazy, but I would train a big model and test its performance before telling the world I made a breakthrough
RWKV-5-World-0.1B-v1-OnlyForTest_37%_trained-20230728-ctx4096.pth uploaded https://huggingface.co/BlinkDL/rwkv-5-world/tree/main
supported in rwkv pip package 0.8.7
0.1B world:
RWKV-5 37% trained = LAMBADA ppl 18.1 acc 42.93%
RWKV-4 100% trained = LAMBADA ppl 25.5 acc 36.29%
Interesting fact: RWKV-5 is great at benchmarks (excellent zeroshot performance), but generates noticeably worse music (just like GPT models) despite lower loss. (try https://huggingface.co/BlinkDL/rwkv-5-music)
This fits my theory: Dot-product is good for uncreative work, while Channelwise is good for creative work.
lets ask @sullen horizon #1103039376184852622 message
didnt realise they only trained 1B models so far
isn't this the same as setting up a ~200B RWKV model, and training for 1 step, and projecting the rest of the loss line 🙈
It’s a little less egregious, but still obviously bad
RWKV memory experiment v5/wavenet - update 2
(please let me know if i should shift this into a separate thread)
We did a 7.5 / 15% codeparrot dataset train, on both baseline rwkv4 code and rwkv4+tokenshift, to see if the changes have a negative impact on the model's capability in other tasks. All 3 are 1.5B param models
From a loss point of view all 3 models converged to similar loss levels, indicating that the token shift changes may not have an adverse impact on other tasks
It is also interesting to note that the codeparrot model itself, had an average loss of 2.06 against its validation dataset - meaning all 3 models despite being trained significantly less - may outperform the codeparrot model
Asking for feedback on how to move these changes / experiments forward - aka what are good evals / tasks to train / validate on which would make good use of the extended memory - ideally without needing to train a full model
and you can use https://huggingface.co/datasets/bigcode/starcoderdata @void quartz
This looks so much like your code, no? (the RNN form)
AFT = headsz 1 version of LinearTransformer
RWKV4 = ExponentialDecay + Headsz 1
RetNet = ExponentialDecay + Headsz 256, with xPos too (but I find it can be removed)
RWKV5 = ExponentialDecay + Headsz 64, best performance
Headsz N = N x larger state (more vram, slower, still much better than KV cache), helps memorization
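To make the "Headsz N = N x larger state" line concrete, here is a simplified sketch of one recurrent step of linear attention with exponential decay for a single head, plus a state-size count. It leaves out normalisation, the bonus term for the current token and any xPos-style rotation, so it is not the exact RWKV-4, RWKV-5 or RetNet formulation; `state_size` and `decay_attention_step` are hypothetical helper names.

```python
import numpy as np

def state_size(d_model, head_size):
    """Recurrent state per layer: one (head_size x head_size) matrix per head."""
    n_heads = d_model // head_size
    return n_heads * head_size * head_size

def decay_attention_step(state, r_t, k_t, v_t, decay):
    """One recurrent step for a single head.
    state: (N, N); r_t, k_t, v_t: (N,); decay: (N,) with values in (0, 1)."""
    state = decay[:, None] * state + np.outer(k_t, v_t)  # fade old memory, write the new k-v pair
    out = r_t @ state                                    # read the state with r (query-like)
    return out, state

# State grows linearly with head size for a fixed model width
for n in (1, 64, 256):
    print(n, state_size(1024, n))

# A few steps with head size 4
N = 4
rng = np.random.default_rng(0)
state = np.zeros((N, N))
for _ in range(3):
    r_t, k_t, v_t = rng.normal(size=(3, N))
    out, state = decay_attention_step(state, r_t, k_t, v_t, decay=np.full(N, 0.9))
```

Whatever the head size, the state stays fixed in the sequence length, which is why it remains much smaller than a KV cache that grows with every token.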
@hushed flare see #1103039376184852622 message
I'm surprised there's so much of a difference -- I would have thought if you have head size = 1 then you'd end up with multiple heads learning the same kernel if it was beneficial
0.1 loss difference is not too much for a small model. it's similar to the loss of a 30% larger model
yea i think having a fully trained pure code model based on RWKV might be of interest, especially if we can validate that it does better than existing models
(the original codeparrot was pure python, easier to test with limited training)
We’re currently making an improved reprocessing of the StarCoder dataset that can be used for this. But TBH I think code + language is probably better for most things than mono code
true, considering prompt to code is a large use case
if this is a use case in which RWKV can shine above existing models, it would be a huge step forward for the architecture recognition to the wider audience
(i think this issue is now more of a GPU compute issue to train such a model)
Great work, y’all’re getting noticed 🙂
At ICML I mentioned RWKV a couple times in a couple conversations and a bunch of people I was talking to knew about it
Just out of curiosity I trained RWKV V4 using the code from @sullen horizon (https://github.com/SSamDav/rwkv-long-range-arena/tree/main) on 3 LRA benchmarks (listops, imdb, aan). The results are here: https://wandb.ai/ssamtheboy/lra-benchmark
The listops results are a bit sketchy, because I had a run yesterday that performed much better. Probably I need to change the default parameters.
Each run has a note saying which dataset it corresponds to.
Got someone from the RetNet side to clarify what they meant about RWKV not having "training parallelization".
It basically boils down to the fact that we need to compute the previous token's state before the next token's state within a sequence in a data sample. It has nothing to do with GPU usage - which, to be fair, is very true
The full convo is on github here https://github.com/microsoft/unilm/issues/1243
But i hope that helps clear the air on that topic
(so put down those pitchforks folks)
Still seems like they are bending the definition. Their method also uses recurrence within blocks, and within those blocks the computation cannot be parallelized (as they define it). So, in my understanding, their method also needs to compute the previous token state in order to calculate the current token state, and thus is also not parallelizable according to that definition
exactly
😅 yea im still kinda off on the definition.
Cause while it's true that we do not need to have "state" precomputed in transformers.
Don't i also need to compute all the previous tokens in parallel and apply attention, even more so uniquely for every token i generate? Somehow all that additional compute cost is better than having a state between tokens?
(not the expert here, so im gonna take the explanation that its not about throughput as it is)
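For anyone following the "parallelizable" debate, a small sketch of the two ways to compute the same decayed linear attention: a parallel form with a decay mask over all token pairs, and a recurrent form that needs the previous state at every step. This uses a single scalar decay `gamma` and omits normalisation and position rotation, so it only illustrates the idea and is not the RetNet or RWKV code; both function names are made up here.

```python
import numpy as np

def parallel_form(q, k, v, gamma):
    """All positions at once: scores masked by gamma**(t - i) for i <= t."""
    T = q.shape[0]
    idx = np.arange(T)
    diff = idx[:, None] - idx[None, :]
    mask = np.where(diff >= 0, gamma ** np.maximum(diff, 0), 0.0)
    return ((q @ k.T) * mask) @ v

def recurrent_form(q, k, v, gamma):
    """One token at a time: each step depends on the state left by the previous step."""
    T, N = q.shape
    state = np.zeros((N, v.shape[1]))
    out = np.zeros((T, v.shape[1]))
    for t in range(T):
        state = gamma * state + np.outer(k[t], v[t])
        out[t] = q[t] @ state
    return out

rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(3, 6, 4))
same = np.allclose(parallel_form(q, k, v, 0.9), recurrent_form(q, k, v, 0.9))
print(same)
```

The parallel form pays for a T x T score matrix, while the recurrent form carries a fixed-size state but has a sequential dependency across tokens, which seems to be the distinction the RetNet side was drawing.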
RWKV memory experiment v5/wavenet - update 3
Continuing the series of stress testing the v5 (rotary embedding) and v5+wavenet changes - for memory storage and capacity of random words.
We have now officially passed the 1k token mark, with the 1.5B model able to keep up to 1.7k tokens in memory.
Because this is near the current training limit (of 2k), it is possible that the real limit is higher.
Wavenet preview is only 75% trained compared to baseline v5, however it's on track for a similar performance range (currently 1.5k from testing of the preview - vs - baseline 1.7k)
Tune5 (for both models), which will train it with up to 4k inputs/outputs, is estimated for 48+ more hours
However the big thing is, putting the technical progress aside...
RWKV-v5 with or without wavenet, is now officially in transformer territory range of being able to lookback into its inputs
(and hopefully pay attention to them too! which we believe it should, as these changes show no penalty in enwiki/code loss training compared to v4 - once proven, this brings us much closer to having RWKV being a full replacement to transformers with no compromises)
This is also strong evidence (v4 vs v5) of rotary embeddings, being able to encode and handle relative positional information
The problem with using the rotary embedding is that we then lose the inf context, no?
I would hope so! That's what they're designed to do 🙂
not sure how to fully explain blink changes, its rotary + timemix still, so infctx still works
there is no absolute positional encoding in it
( apologies for confusion, blink basically explained to me how v5 changes were not rotary, so it was a misunderstanding on my part when i visualised how the model changes worked - still the numbers are as benchmarked )
yeah my v5 implementation does not have rotary, nor xpos 🙂 it's pos.emb-free
Do you have any description of the current implementation v5? I’ve read so many things that I’m a bit confused. If you confirm the link to the actual code I can also figure it out by myself 🙂
everything here #1103039376184852622 message
standalone implementation https://github.com/BlinkDL/ChatRWKV/blob/main/RWKV_v5_demo.py
(compare with https://github.com/BlinkDL/ChatRWKV/blob/main/RWKV_in_150_lines.py)
Thank you! Do you consider it a stable v5, or is it still very much under experimentation?
stable. can still be improved a bit but requires another CUDA kernel
Perfect, it means that I can implement the v5 to rwkv.f90 (the port in Fortran)!
very late, but: I would be willing to bet this is a LaTeX addon and it should probably be in overleaf somewhere
ignore all the wavenet/tokenshift stuff - those are not in v5.
i found this useful for just extracting the delta:
https://github.com/BlinkDL/RWKV-LM/compare/a637aea61c77cedd290054449d819da5e7b19d44...main
EMNLP reviews should be coming out in about 1 week. Based on the community's reaction to RWKV, is there any specific feedback that we're expecting from reviewers? Are there any experiments that we can get a head start on now?
A scaling laws study that's better designed to draw inference about the param-to-token ratio would be a good idea. We did a pretty good job with the time we had, but I haven't had the bandwidth to figure out exactly what we want yet.
I guess we'll see them closer to end 22nd AOE, rather than daytime on the 22nd 🙂 But hopefully they're not late
The reviews are out!
Reviews are looking positive to me
I'll put up a rebuttal skeleton and revision work list later
R Zd3h's first reason to reject establishes a fine anchoring for the paper, I think; if their complaint is an argument that RWKV is not as impactful as the transformer architecture, then things are going well. Always nice to get an "Excitement: Transformative"
We admit that this paper is not likely to be as impactful as the most important NLP paper of the past five years, but we hope the reviewers will not hold it against us too harshly. If in five years time we have 10% the citations as Vaswani et al. (2017), we will content ourselves with merely 9,000 citations.
And this is without stating where the transformer is weaker than RWKV! I'm not sure the seq2seq transformer would have an easy time through review had RWKV appeared first: data inefficient, fixed context window, very memory hungry,...
I don't understand. Just reject because it's a non-Transformer architecture?
I kind of agree with reviewer 85wr and it would greatly improve the readability of our paper
if we could add RWKV-v1 to this paper, it would be a smoother transition from AFT to RWKV
AFT to RWKV-v1: absolute position score to relative decay score (rnn shows up here)
v1 to v4: single decay score to channel-wise decay (plus a lot more like u vector, init method...)
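For reference, the progression in the two lines above written out as math, following the notation of the RWKV paper as I remember it. Reading the "single decay score" step as a scalar w shared across channels is my interpretation of the message and has not been checked against the v1 code.

```latex
% AFT (causal form): learned pairwise position bias w_{t,i}
\mathrm{Attn}^{+}(W,K,V)_t
  = \frac{\sum_{i=1}^{t} e^{w_{t,i}+k_i} \odot v_i}{\sum_{i=1}^{t} e^{w_{t,i}+k_i}}

% Relative decay: the bias depends only on the distance t-i,
% with w a channel-wise vector in RWKV-4
w_{t,i} = -(t-i)\,w, \qquad w \in (\mathbb{R}_{\ge 0})^{d}

% RWKV-4 WKV, with the bonus vector u applied to the current token
wkv_t = \frac{\sum_{i=1}^{t-1} e^{-(t-1-i)w+k_i} \odot v_i + e^{u+k_t} \odot v_t}
             {\sum_{i=1}^{t-1} e^{-(t-1-i)w+k_i} + e^{u+k_t}}
```

Because the weight on token i shrinks by a constant factor per step, the numerator and denominator can each be carried forward as a running sum, which is where the RNN form shows up.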
It's just one opinion, I don't think it would sink the submission
"Transformative" yet not a "Transformer"🤔
I'm not familiar with AI paper reviewing at all, but the "reasons to reject" section looks like "weakness" in Sys/PL conference reviews to me
reviewer HNDB even asked for fp16/bf16 training information in their "reasons to reject" part....
Review from 85wr is a bit harsh... will require some work