#RWKV-papers
how's the paper going 🙂
i gotta flesh out and add citations for the background section
@gusty condor would you mind if I try some new language for the introduction? feel free to throw it out if you don't like it as much
please feel free to look over the current draft and give us critiques or suggestions
either here or via comments in the overleaf itself
got links?
it's a bit unfortunate that we used "RWKV" instead of "RWKV-4" lol
#1103039376184852622 message
https://www.overleaf.com/1623283552mkymjtvsnybt#bd0fc2
Keyword: RWKV-5, RWKV-6, RWKV-X, article, paper, link, overleaf
(Please pin this message to avoid searching for keywords)
btw i can help work on memory, and long range dependency benchmarks - since that was an area i was actively testing previously
(if it makes sense to fit it in)
It's yours
(side: does it make sense to branch the tokenizer to its own paper?, saw that section)
No, do you have enough information to fit that into an 8-page-long paper?
measuring the token efficiency across multiple languages is probably NOT 8 pages 😂
I can work on Background 2.1 (if that's okay), I've wanted to write a blog post on this topic for a long time😂
there is a side tangent of seeing if a model performs better with the new tokenizer in another language (and English), compared to baseline - which might add pages
(we are kinda assuming it gives better results, kind of - it is more token efficient for sure)
because being trie-based only flies against the current conventional wisdom of BPE tokenizers
Maybe can submit a short paper instead of regular size?
4 pages instead of 8? If do not have enough info to fit 8 pages?
1. Introduction and 2. Background already sort of covered the placeholder for 2.1. The rest is just formatting.
So far, it sounds like we've tried a bunch of stuff, and it worked. Adding some material on the motivation and theory behind things would be great.
Oh I see, few lines in Sec.2 mentioned these topics, thanks!
Yeah, there are too many tricks on RWKV and it works well… You mean that maybe we need to think about the motivations and theory for that?
I think we're just missing something like, 'The original RWKV architecture has limitations when it comes to X, Y, and Z, so we decided to try RWKV-5 to address X and Y, and RWKV-6 to address Z.'
I wonder if we need to pin this link to avoid drowning in the message flow. There is someone who might be willing to help but couldn't find the link.
Oh I see. To some degree like a chronicle of RWKV? How we evolved from RWKV-4 and why we decided to add these features in this sequence?
pinned!
Not chronicles. Come from the standpoint of resolving the shortcomings, and don't go into the history at all. See #1103039376184852622 message
Please do!
Got it! Thanks so much. Like in short words to describe “RWKV4 has these shortcomings, and we need to solve them, then describe how we use RWKV 5/6 to solve”?
No, that would be the design section. The background section is for getting the audience quickly up to speed on important precursor concepts from before this paper. No new designs or shortcomings of past designs should be included. Read the first RWKV paper for reference.
Oh sorry I misunderstood. So it will function more like a traditional related work section, for introducing some previous related work while introducing concepts that will be frequently used in the following paper?
No, related work is for comparing/contrasting your current contributions with those of others. Background is for foundational concepts that need to be understood before reading the design. Check the first RWKV paper for this.
Oh I see there are separate parts in the first paper… I’ve never noticed here before. I think I finally got what we need in this section. thank you so much!
Ok so in comparing the arxiv-v1 and EMNLP versions of the first RWKV paper, I actually think we can just replace the current arxiv with the EMNLP version, and move directly to the RWKV-v5/v6 arch paper.
Edit: Ok, arxiv has been updated. Let's move forward with RWKV-v5/v6
High-level things that need done in the RWKV-X overleaf:
Background:
1. Subsection on RNNs (similar to first paper, but directly copy nothing. Reword at the very least)
2. Subsection on Transformers and AFT (again similar to first paper, but directly copy nothing. Reword at the least)
3. Subsection on RWKV-v4 (summarization of the first paper, with an architecture figure). Can probably retool the current section 3 header in RWKV-X at "RWKV Architecture Summary" for this, along with the start of section 3 in the EMNLP version
Related Work:
Use the first paper's related work in appendix C as a template. Remember that this is anonymous and we can't say this is our arch.
4. Reword and update related work from Appendix C as a base
5. Add any subsequent work (mamba, hyena, RWKV-v4, etc)
Design:
6. The existing subsections 4.x in RWKV-x need more explanation and we need new figures similar to Figures 2/3 from the RWKV-EMNLP
Evaluations:
7. Need a set of figures on downstream tasks comparing to transformer and SSM arches (including RWKV-v4). Similar to RWKV-EMNLP's figure 5
8. Need scaling law results like figure 4 of RWKV-EMNLP, not figure 4 of the mamba paper (see the scaling-laws discussion here for context on why we don't want a figure like mamba's)
Trained Models:
9. The existing section 5 and Table 1 in RWKV-X is pretty good. Some comments are to:
9a Add a "Name" column like table 2 of RWKV-EMNLP
9b Clarify that these equations are per-token
9c All of the subscript-5/6 should be updated to subscript-v5/v6 to make it more explicit that these refer to different arches
Several other sections need started, for which the task is "start".
I'm going to start by making the high-level structure a bit more clear, and make sections more contributor-friendly with TODO statements and section skeletons
@last mauve for adding more explanation to subsections 4.x in RWKV-x do you mean that we need description of what's going on and how it works mechanically because the formulae are currently unclear, or some description in that section of the motivation for why these mechanisms were chosen?
- Section 5 was my work
Both. For example, why use token shift and what does it mean intuitively? Is a figure possible? As a non-expert in RWKV-v5/v6, the raw formula is confusing without the context about how it fits into the overall model architecture and how it helps.
8: I actually rather dislike the scaling laws plot in the mamba paper. They do not seem to perform any search for the optimal token-to-parameter ratio and instead assume that it's the same as it is for transformers. In the scaling laws plot I added to the EMNLP version, as well as in both Kaplan et al. and Hoffmann et al., we instead search many combinations of (parameters, tokens), then find the optimal configuration for each FLOP value and fit the curve to that.
The reason this is problematic is that it can disadvantage models that have different optimal tradeoffs. If they were just comparing to the optimal tradeoff identified in our paper or in Hoffmann et al. that would be fine as it would only disadvantage their model, but they also do this for several competitor models. This makes it impossible to know if they're hurting themselves more than they're hurting the competition.
That plot is meaningful as an argument that the architecture is better because for a fixed (param, token) pair the architecture outperforms others, but it's not an argument that the optimal scaling is better because it doesn't remark on the optimal scaling regime at all.
Put another way, it's effectively the same plot as our "average of 12 benchmarks" plot but using Pile loss instead of 12 NLP benchmarks. It's not a scaling laws plot.
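For anyone who wants to see the procedure spelled out, here is a rough sketch of "search many (parameters, tokens) combinations, keep the best configuration per FLOP budget, then fit a curve to those optima". The `6 * N * D` FLOP approximation, the binning of budgets, and all function names are assumptions for illustration, not the exact method used in any of the papers mentioned; the run records have to come from real training runs.

```python
import math

def flops(params, tokens):
    # Assumption: the common approximation C ~ 6 * N * D for training compute.
    return 6.0 * params * tokens

def compute_optimal_frontier(runs, bins_per_decade=2):
    """runs: iterable of (params, tokens, final_loss) from real training runs.
    Groups runs into log-spaced FLOP bins and keeps the lowest-loss run per bin.
    Returns a list of (params, tokens, loss, flops) sorted by compute."""
    best = {}
    for params, tokens, loss in runs:
        c = flops(params, tokens)
        key = round(math.log10(c) * bins_per_decade)
        if key not in best or loss < best[key][2]:
            best[key] = (params, tokens, loss, c)
    return [best[k] for k in sorted(best)]

def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**b in log-log space; returns (a, b)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    sxx = sum((x - mx) ** 2 for x in lx)
    sxy = sum((x - mx) * (y - my) for x, y in zip(lx, ly))
    b = sxy / sxx
    a = math.exp(my - b * mx)
    return a, b
```

Fitting `fit_power_law` to the frontier's (flops, params) pairs gives an estimate of how the optimal parameter count grows with compute for that architecture, which is the thing a fixed (param, token) comparison cannot tell you.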
Gotcha. Updated
https://arxiv.org/abs/2305.13048 seems not updated yet
Transformers have revolutionized almost all natural language processing (NLP) tasks but suffer from memory and computational complexity that scales quadratically with sequence length. In contrast, recurrent neural networks (RNNs) exhibit linear scaling in memory and computational requirements but struggle to match the same performance as Transfo...
Should we name it
RWKV-5 and RWKV-6: xxx
CHANGE:
In this work we present RWKV-5, which builds on the architectural improvements and learned decays from RWKV-4, as well as the matrix valued states found in Linear Transformers.
(because it was proposed in Linear Transformers, not RetNet)
CHANGE:
Influenced by the Retention Network (RetNet) architecture ==> Influenced by the Linear Transformer architecture
GroupNorm = LayerNorm for each head. So no need to say it's GroupNorm.
Token 257-65529: actually includes lots of languages, not just Asian. and symbols.
Moreover it's a greedy tokenizer. Faster and Easier to code.
We can follow this narrative:
- Matrix-valued states were proposed in Linear Transformers.
- RWKV = [exp. decay + token shift + AFT]
- RetNet found [exp. decay + xPos + Linear Transformer] works
- So RWKV 5/6 is doing [exp. decay + token shift + Linear Transformer]. We don't use any extra positional embedding.
Moreover RWKV models are much better tuned than RetNet. We can show the loss curves.
And we should compare with Mamba, GateLoop, etc.
We can make a table:
- decay/gate: real-valued exp. decay, complex-valued, data-dependent etc.
- positional embedding
- state: RWKV4 = vector state, Mamba/SSM is like "multi-vector" state, and then we have matrix-valued states
Moreover RWKV models are much better tuned than RetNet. We can show the loss curves.
Does this mean RWKV 5/6 are better at pretraining or at fine-tuning?
pretraining loss curve. train from scratch on new data
Do we need to conduct Chinchilla's scaling law experiments for 200M ~ 1B (or more params) ??
https://github.com/BlinkDL/nanoRWKV
nanoRWKV "x051a" - does not require custom CUDA kernel to train, so it works for any GPU / CPU.
https://twitter.com/BlinkDL_AI/status/1734254476218057170
python data/shakespeare_char/prepare.py
python train.py config/train_shakespeare_char.py
python sample.py --out_dir=out-shakespeare-char
I implemented many of these changes, though I think the introduction and implicit 'story' can still use more work.
If you could review the new parts I wrote about token shift in sections 4.1, 4.4, 4.5 that would be greatly appreciated. I tried my best to infer your rationale based on our limited discussion 🙂
In Introduction, we can mention "RWKV-5 applies..." before "Retentive Networks..." and we should mention Mamba (dynamic data-dependent decay) after RetNet
Extra Silu gate is used in Mamba too
We can mention RWKV-5-lite as a variant without custom cuda kernel requirement for training
rwkv5 rwkv6 were trained with 0.001 weight decay (only for matrix-valued weights: linear, emb)
mamba is utilizing SRAM for similar parallelization
I imagine this ordering was intended to both explain the progression of models over time and conclude with our contribution (I didn't originally write this particular section tho) I'm not sure we should mention RetNet in section 1 at all - imho it's better left to section 2 (Background).
@obsidian quest did you have any comments about the token shift descriptions? I want to make sure I'm not getting anything wrong about the rationale
token shift = induction head & locality a priori, similar to conv1d with kernel sz 2 too
what I said in the most recent draft is that token shift makes it possible to form induction heads within a single layer, and that the v6 token shift changes allow important information to flag itself for inclusion in the data stream, while unimportant information can similarly avoid inclusion
yeah
and we can use this_token + last_token to detect this
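To make the conv1d comparison concrete, here is a minimal sketch of an RWKV-4-style token shift: each channel is an interpolation between this token and the last token, which is the same thing as a depthwise 1D convolution with kernel size 2. The name `mu`, the zero vector used before the first token, and the plain-list representation are assumptions for illustration. The v6 variant, where the mix itself depends on the data, is not shown.

```python
def token_shift(xs, mu):
    """xs: list of T token vectors, each a list of C floats.
    mu: list of C per-channel mixing weights in [0, 1].
    Returns, for every position t, mu * x_t + (1 - mu) * x_{t-1},
    where x_{-1} is assumed to be all zeros."""
    prev = [0.0] * len(mu)
    out = []
    for x in xs:
        out.append([m * cur + (1.0 - m) * last
                    for m, cur, last in zip(mu, x, prev)])
        prev = x
    return out
```

With `mu = 1` a channel sees only this_token, with `mu = 0` it sees only last_token, and anything in between lets a later projection compare the two.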
#general message
We should emphasize RWKV-2-RNN was the first to show "exponential decay is all you need"
can add a section in appendix for the timeline of RWKV
Is it like a chronicle from RWKV-1 to RWKV-6? Maybe I talked this before😂
from https://arxiv.org/abs/2312.06635
This type of model with matrix-valued hidden states that change over time is also known as “fast weights”
yeah we should make Schmidhuber happy too 😂
Transformers with linear attention allow for efficient parallel training but can simultaneously be formulated as an RNN with 2D (matrix-valued) hidden states, thus enjoying linear (with respect to output length) inference complexity. Recent works such as RetNet (Sun et al., 2023) and TransNormerLLM (Qin et al., 2023a) observe that adding a globa...
I was thinking about this idea, but @young sparrow disagreed with putting RWKV history in the paper, because the paper aims to let readers catch up on the current progress of the RWKV architecture rather than walk down RWKV memory lane.
I agree here. No histories in a paper. This would instead be a very nice blog post.
I've been somewhat caught in the cross currents here, trying to thread the needle between 'just tell us how it works/what's new' and showing where it comes from. Currently trying to fix up the background to accommodate that while keeping it appropriate for a paper.
There's a very clear distinction on what's appropriate, I think. If the secondary info (e.g. intuition from a similar study/paper like "studies on CNNs demonstrate that shallow layers learn general representations while deep layers learn specific representations [cite]") on a design feature is included to help the reader understand how/why the design feature works, that's appropriate. If the secondary info is for any other reason (e.g. to claim ownership or to give interpersonal/organizational history like "we discovered XXX in May 2023 before Mamba"), then it violates double-blind and isn't appropriate for a paper.
Anything flies for a blog post, and I encourage people to post the history and demonstrate ownership there.
To be clear though, we're still able to make statements in the Background and Related Work sections such as "RWKV [cite] introduced exponential decay is all you need", but they can't be excessive and they can't violate double-blind
I think I've avoided adding anything that's inappropriate in terms of anonymity in all the sections I wrote or edited to date (of course feel free to correct me if not)
The push and pull for me is more just about what extra background we include in terms of the (often third party) developments that lead up to this combination that we call RWKV5 and 6, since I think Bo has expressed wanting that in the paper.
btw how many people here are at neurips?
(dropping by tmr)
#1171291697561477170
I'll also be arriving tomorrow evening! We should meet.
Sure! You going for workshops? Wondering if we should stay for that
Yep I'll be attending and presenting at the neural scaling laws workshop https://sites.google.com/mila.quebec/6thnslw-no/home?authuser=0
@void quartz I'm here all week, would love to meet you
Great!, see you both from tmr morning then 🙂
To help me in writing this paper, can someone in clear terms either explain or point me to something comparing the Mamba arch with RWKV-v4? Mamba will likely be our primary competing arch and I want to be able to strongly differentiate RWKV from mamba in the background/related sections of the upcoming paper
I guess with Based releasing benchmarks for multiprocessing using linear transformers, these graphs gain a little more relevancy.
@misty igloo - need your help confirming on mamba (as there were changes from statespaces) - its still log(n) right?
kinda dumb: Can we do a direct counter of the RetNet paper "parallel table", with a clear definition of parallel in the v5 paper
We got rejected for the OakRidge compute grant, over a new RNN (yet to be out, so no idea of the details) that cited that retnet paper and said that they fixed that parallel problem, which they claim is the reason why RWKV could not scale past 14B.
that is very mean of them to speak so lol
we can easily demonstrate the training speed of rwkv is constant regardless of ctxlen
its not "retnet", its another team thats just quoting retnet
i know, still it's mean to claim we could not scale past 14B lol
no, it's like rwkv-6 - O(1) per time step, O(N) for sequence length N
ahhh thanks for clarifying
in case it helps, to quote from some stuff I put in the new paper but may not make it to the final cut:
Earlier SSMs were historically computed using long convolutions in $O(N\log N)$ time per sequence, but could also be formulated as recurrence relations. Recent SSMs featuring data-dependent $A$ and $B$ terms (GateLoop, Mamba) are only able to be formulated as recurrence relations. Generally, such recurrence relations can run in $O(N)$ time with respect to sequence length
all gen6 designs are same
RWKV-4 and Mamba are quite different, but RWKV-6 and Mamba are much more similar
Mamba follows the traditional state space mechanism (more or less) of:
$h = h \, \Delta A + x \, \Delta B$
$y = h C + x D$
where dB expands x into a new dimension and dA is supposedly a diagonalized version of something theoretically complicated
(I say supposedly because their code doesn't quite match their paper and some things are unexplained)
and C reduces the hidden state back to the embedding dimension
RWKV-6 is more like
$kv = (x W_k)^T (x W_v)$
$h = h w + kv$
$y = r (h + kv \cdot u)$
unfortunately, I don't know of a way to clearly show the differences between these
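One way to put the two side by side is as single-time-step update functions. This is only a sketch under stated assumptions: one head, plain Python lists, and all inputs (`dA`, `dB`, `C`, `r`, `k`, `v`, `w`, `u`) handed in already computed, whereas in the real models several of them are data-dependent projections of the input. For the RWKV-6-style step I assume the output reads the state from before the current token is folded in, with `u` as a bonus on the current token's kv. The function names are hypothetical.

```python
def mamba_style_step(h, x, dA, dB, C, D):
    """One step of  h = h * dA + x * dB ;  y = h C + x D.
    h, dA, dB: [channels][state_dim]; x, D: [channels]; C: [state_dim].
    dB expands each channel of x into state_dim values, C reduces them back.
    Updates h in place and returns y with one value per channel."""
    y = []
    for d in range(len(x)):
        for n in range(len(C)):
            h[d][n] = h[d][n] * dA[d][n] + x[d] * dB[d][n]
        y.append(sum(h[d][n] * C[n] for n in range(len(C))) + x[d] * D[d])
    return y

def rwkv6_style_step(S, r, k, v, w, u):
    """One step for one head with matrix-valued state S: [key_dim][value_dim].
    kv is the outer product k^T v; w decays the state per key channel.
    Assumption: y is computed from the state before this token's kv is added.
    Updates S in place and returns y with one value per value channel."""
    y = [0.0] * len(v)
    for i in range(len(k)):
        for j in range(len(v)):
            kv = k[i] * v[j]
            y[j] += r[i] * (S[i][j] + u[i] * kv)
            S[i][j] = w[i] * S[i][j] + kv
    return y
```

Both steps touch a fixed-size state once per token, so each is O(1) per time step and O(N) over a sequence of length N. The structural difference that shows up here is that the Mamba-style state is a set of per-channel vectors filled by `dB`, while the RWKV-6-style state is a key-by-value matrix filled by an outer product.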
There are other differences, as well... Mamba changed the traditional transformer layout from the usual blocks of sequential Attn and FFN to a unified new kind of block that expands 2x like a FFN, then does a short kernel-size-4 1D convolution analogous to but different from rwkv's tokenshift, then does the SSM, gates, and shrinks it 2x back out like the output projection of a FFN
Add that into paper 🙂 needs experiments
Why can't you just show those equations and say that's the difference?
For one thing, his original question was about the difference between rwkv-4 and mamba, which is so different it's somewhat hard to even compare them 🙂 (I showed rwkv-6 above since they're more similar)
I guess I'm also just not certain what Quentin's goal is in showing the differences so it's hard for me to know if that suffices 🙂 As seen above, their attention formulae have terms that are quite similar in some places... but there's a lot of nuance too, like because of the way the Mamba incoming projection replaces some of what normally would be the projection from inputs to values, and how multiplying out (k^T)(v) per head is different than just expanding the full input by a smaller new dimension via matrix dB
So despite being similar, the differences are quite complicated.
And just to add a cherry on top, the Mamba code appears NOT to quite match the paper. And Bo says that the results don't match either!!! (And that the reported results must employ some secret sauce that isn't in the publicly released code)
Fun stuff
The authors of Mamba will be giving a community talk at NeurIPS. Those attending the conference can go and ask them questions. 😉
sadly 😦 will miss it - me & harrison - our flight got delayed till 5pm
that comes from https://arxiv.org/pdf/2202.10447.pdf
please update RWKV-4 paper to use "RWKV-4" instead of RWKV 🙂
I wonder whether title is changeable
RWKV-4: Reinventing RNNs for the Transformer Era
Anyway, my opinion is that if anonymity matters to us, then that might be a bad idea (it alludes that "we" developed RWKV-1 to 3 and are aiming for 5+), but if we are already famous, then it doesn't matter a bit (like how OpenAI's articles are only posted on OpenAI's website and nowhere else).
yes very very similar, except mamba computes their equivalents of q&k from the expanded V rather than directly from the input
that's a good way to compare that part in the paper, if we want to!
We could add a section that explains 1-3 is done as an open source research project?
Cause this is reflective of reality
(Do paper reviewers expect us to change reality to fit their version schema)
the double blind anonymity is only relevant during the peer review process, and the original paper already went through that
it's just meant to protect the review process so that there is no biased treatment of the paper i.e. for acceptance into a journal
We can change it on arXiv, but I think this is a very bad idea
People do not know about RWKV-6 and typically do not compare with work that doesn't even have a preprint because they don't know if it's "ready" or not. The best way to get people to compare to RWKV-6 is to write a paper about it
The paper has already been published and anonymity doesn't matter
meanwhile we can only try our best to point out they are comparing with rwkv4 😂
That's not what your tweet does. Your tweet accuses them of acting in bad faith.
they are using this opportunity
they certainly know about the existence of rwkv 5/6 and avoid mentioning it
I don't know that to be true and I think it's immoral to accuse them of that unless you are certain
Have they told you that?
some of them follow my twitter
That doesn't mean that they know that the models are finished
rwkv5 models were released long ago
That doesn't mean that they know that the models are finished.
And like I said, it's widely considered problematic to compare with unpublished work. Even if they know about it, they could be waiting for a paper and not trying to sneakily make themselves look good
Accusing them of acting in bad faith based on this evidence will only cause people to dislike you and not want to compare to your work
I cannot more strongly recommend that you stop doing this
I agree. Unfortunately, people look for published work (or a preprint) to compare.
We have no reason to believe anyone acted in bad faith. Very likely, this happened just because the researchers may not have realized the work is finished.
Also writing papers takes time. For all you know they finished the experiments a while ago and only just got the paper out
The right steps would be publishing our preprints faster and reaching out to authors if any claim is incorrect so that they correct them (eg the parallel table)
It's unfortunate that we don't have as much resources
Yes it is
The table in RetNet is certainly acting in bad faith, so I do think there is some hostility towards us as a potential competitor
I agree that they're not playing nicely.
I don’t know much about what happened. However, I strongly believe we should just continue doing good science
But this is still the wrong way to go about addressing this fact
yeah this is certainly the best method
We can add experiments correcting any of the possibly incorrect claims.
In the future I should make a disclaimer that my rants only represent myself and don't represent RWKV views 😂
Isn't there a RWKV Twitter? Using that to distribute release info would be helpful on both the reputational and the advertisement front
yeah can always feel free to use that to criticize me 🙂
How about we release a working paper on RWKV-5? It doesn’t need to be complete.
I don't think that that would be productive and don't have access to it
I am the kind of people who have the tendency to sometimes break rules as long as they don't harm others (and i will take / pay for the consequences too, will not avoid them) 😂 most people will hate me
It's my fault that we used "RWKV" for the RWKV-4 paper, and haven't published the RWKV-5/6 paper in time. Life is harsh 😂
Let’s limit the official rwkv twitter that we are starting to completed model releases, and less so to marketing future models
From a marketing hat point of view. If possible I would rather us push that with the 7B model launch
And also to not rush everyone working on it in this channel
So let’s aim for mid/late Jan?
For those who compared to v4 - we can ask them politely if they can add v5 to compare (the 1.5B / 3B models) when appropriate.
If they did so in good faith, they would be open to amend.
If they did it in bad faith, I doubt confronting them will change anything (like retnet)
Not your fault - it's important to show the v5 7B results and they take time with limited compute resources.
Is the plan to publish with full v5 7B results but a more limited set of v6 results? (1.5B or maybe 3B by preprint release time)
@spiral minnow just wanted to note that I removed your addition of quadratic memory complexity for transformers - that has been shown to be unnecessary e.g. flashattention
does it make sense to put v6 as future works?
im not sure if we would have 3B model fully ready by then
I think either way works, but if you want people to reference the best possible results for RWKV in future papers we should put it in now
This is exactly the same situation we're facing now with people quoting numbers from v4
I added all the formulae and descriptions so that we wouldn't fall behind
Just in case we were ready - if not, that's fine and we can delay the v6 paper easily
since we have it all written now
i would defer to those who know the academic norms on this, was just worried it's weird that we added v6 without all the models
i think another direction if we want to push against this issue
is we need to publish blogs
yeah, just a question of whether a 1.5B model for v6 enough when we show up to 7B for v5
so it doesn't have the same rigor requirements as the paper, and is at least official enough
alternatively, we could publish a working paper for v6 - but it's probably less work for it to remain integrated into the current paper
u know what - setting up an RWKV blog has been so long on my todo - just gonna set it up via substack
( classic coder conflict of wanting to do it better, but not having the time )
if you want upcoming papers to quote the rwkv-6 results, it has to be a paper and not just a blog post
at least then they will have to show the v6 1.5B results when comparing to their 1.5B results
we should also release a 125m model btw so people have a reference point
and any standard sizes people tend to use in between
since we can train those quickly
and it will help ensure that upcoming papers quote our best results
especially when they don't train larger versions, it's useful to have our small one shown to compare side by side
I'm personally in favor of keeping the two papers integrated as they are now, simply because it's less effort than making a whole new one. But I'm open to a separate rwkv-6 working paper or somesuch if our advisors think that's best!
should we train multiple v5 small models (125m??), with the combination of
- pile, world v2 partial (same token count), world v2 full
- gpt-neox, world tokenizer
so we can show the transition
if the result is close enough for the partial, it can close off a possible criticism that its not a fair compare with different dataset/tokenizer
not sure what the training mix should be - pile is of course the most standard - but having some smaller v6 available would really improve uptake of our best models in other upcoming papers comparisons
not sure if this is useful, or a waste of resource (which is already limited)
the idea is just all 3 x 2 variants
it'd be great, but I'm not trying to make more work or strain our resources... just any single 125m v6 model would probably help a lot
because it will force people to show it in comparisons when they only have their own small models to compare to
(this only helps if we publish a v6 paper tho)
can do one on pile, and one on slimpajama, when we have compute
Yes, but the dataset is currently at BlinkDL, and tokenizer size counts into parameters (L12 D768 is 193M, compared to 169M for Pile)
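A quick back-of-the-envelope check of those two figures. The assumptions here are mine and not stated above: an untied input embedding and output head (each `vocab * D` parameters), a World vocabulary of 65536, and a Pile (GPT-NeoX tokenizer) vocabulary of 50277.

```python
def vocab_params(vocab_size, d_model, tied=False):
    """Parameters spent on the embedding plus the output head.
    Assumption: untied weights unless tied=True."""
    return vocab_size * d_model * (1 if tied else 2)

D_MODEL = 768
WORLD_VOCAB = 65536      # assumed World tokenizer vocabulary size
PILE_VOCAB = 50277       # assumed GPT-NeoX tokenizer vocabulary size

extra = vocab_params(WORLD_VOCAB, D_MODEL) - vocab_params(PILE_VOCAB, D_MODEL)
print(f"{extra / 1e6:.1f}M extra parameters from the larger vocabulary")
```

Under those assumptions the larger vocabulary accounts for about 23.4M parameters, which is close to the gap between 169M and 193M, so the blocks themselves are the same size and only the vocabulary-dependent layers differ.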
IMO - i think the world tokenizer needs a separate paper
been speaking to multiple researchers who are doing research specifically for their nation language model (and faced tokenization issue) and are working on their own region tokenizers
and there is lots of interest in how and why we did the world tokenizer without BPE, and what would be its compression ratio be for their own respective language
If proven out as things progress, the "trie tokenizer" approach can end up replacing BPE - if that makes sense - and this is completely separate from the architecture
This makes sense to me. I think for this paper we should spend a short subsection, maybe a couple paragraphs, describing it and then we can go into more detail in the other paper
it's simply a greedy tokenizer. extremely simple to implement (trie is only for optimizations). yeah we can write a paper on this
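For anyone unfamiliar with the idea, here is a minimal sketch of greedy longest-match tokenization over bytes, with a trie used only to make the longest-prefix lookup fast. The class name, the toy vocabulary, and the token ids in the usage note are made up for illustration; this is not the World tokenizer's actual code or vocabulary.

```python
class GreedyTrieTokenizer:
    """Greedy longest-match tokenizer over bytes (illustrative sketch)."""

    def __init__(self, vocab):
        """vocab: dict mapping token bytes -> integer token id."""
        self.root = {}
        for token, idx in vocab.items():
            node = self.root
            for b in token:              # iterating bytes yields ints
                node = node.setdefault(b, {})
            node[None] = idx             # the None key marks "a token ends here"

    def encode(self, data: bytes):
        ids, pos = [], 0
        while pos < len(data):
            node, best_id, best_end = self.root, None, pos
            i = pos
            # Walk the trie as far as the input allows, remembering the
            # longest complete token seen along the way.
            while i < len(data) and data[i] in node:
                node = node[data[i]]
                i += 1
                if None in node:
                    best_id, best_end = node[None], i
            if best_id is None:
                raise ValueError(f"no token matches the byte at position {pos}")
            ids.append(best_id)
            pos = best_end
        return ids
```

With a toy vocabulary of all 256 single bytes plus `b"he"` as id 256 and `b"hello"` as id 257, `encode(b"hello help")` returns `[257, 32, 256, 108, 112]`: the longest match wins at every position, and there is no merge table to train or replay.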
Questions like - does it hurt evals - or learning rate was up in the air : which I could not answer accurately 😬
Intuitively the rwkv world model says it’s ok. But that’s a gut feel not a tested hypothesis
Using greedy tokenizers is very counterintuitive given how established BPE is
So same situation of RNN 2 years ago haha
already proven in rwkv world models. no need to change anything. similar results.
because my world tokenizer respects utf-8 boundary & word boundary. this is very important
otherwise you can have bad tokenization (such as "aliasing")
Arxiv submissions are always delayed. Try again.
I posted a Twitter thread about the paper update https://x.com/aieleuther/status/1736260370426114466?s=46
Could we retry this grant application for v5/6 ?
it was for v5 - and the dumbest thing was i already sent the direct github link - where the author of retnet acknowledges the definition is not about GPU parallelization
Wait this is absurd levels of BS
Did they tell you explicitly that this is why you were rejected?
no, but our rep is lodging a complaint on that
that the preferred RNN candidate, uses the retnet claims, as justification to support them over us
(there is no paper, no materials, etc for the other group)
Does our application contain evidence to the contrary?
no - we had no idea we would have to fight that claim
we have provided multi-node training data - but our largest is 8 nodes?
Not having evidence that your model scales efficiently is typically a decent reason to reject
You don't need to run it for long, but you absolutely need to show the ability to leverage large scale resources effectively
i see, might be why our rep is trying to settle for a smaller grant amount - to prove out leveraging large scale resources specifically
cause it is a chicken and egg - we cant prove we can run on 1000 nodes, till we get limited access at least
they did ask as a follow up (before rejection) - have we run on 1000 nodes, do we think it will work
- no, we never had such access to run at such scale
- yes, as we are built on pytorch lightning for multi-node training, which has been shown to scale past a 1000 nodes for deepspeed on transformer architecture. RWKV leverages pytorch lightning and deepspeed in the same way.
they did run with us across 100 nodes (for 1 hour?), as part of the validation, but we have no proof of going beyond a 100
Could we run some benchmark of many-nodes (> 1000 nodes) for another weak GPU infrastructures within 0.5 hour ?
This has been updated in the V5 paper. It should be pretty clear going forward.
I have no idea how to find anything at that scale now. Even paid AWS / GCP / Azure set really low limits for new accounts
What about just renting something like this:
They still have GPUs, just really weak ones.
RWKV 2/3/4/5/6 all have similar complexity
Is it possible to apply for a smaller number of GPUs in a step-by-step manner?
e.g. Running benchmark from 8 nodes to 256 nodes and estimating the 1024 performance
plotting y-(performance, training time elapsed) and x-(nodes=8, 16, 32, ..., 256)
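A rough sketch of how that extrapolation could be done once the 8 to 256 node measurements exist: fit throughput as a power law in node count and read off the prediction at 1024. The power-law form and the function names are assumptions for illustration, no measurements are included here, and an extrapolated number is only an estimate since communication problems can appear at node counts that were never actually run.

```python
import math

def fit_scaling(nodes, throughput):
    """Fit throughput ~ a * nodes**b by least squares in log-log space.
    nodes, throughput: measured values from real runs (e.g. 8, 16, ..., 256).
    b close to 1.0 means near-linear scaling."""
    lx = [math.log(n) for n in nodes]
    ly = [math.log(t) for t in throughput]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    b = (sum((x - mx) * (y - my) for x, y in zip(lx, ly))
         / sum((x - mx) ** 2 for x in lx))
    a = math.exp(my - b * mx)
    return a, b

def predict(a, b, n_nodes):
    """Extrapolated throughput at n_nodes under the fitted power law."""
    return a * n_nodes ** b

def efficiency(nodes, throughput):
    """Scaling efficiency of each run relative to the smallest run
    (1.0 means per-node throughput did not drop at all)."""
    per_node_base = throughput[0] / nodes[0]
    return [t / (n * per_node_base) for n, t in zip(nodes, throughput)]
```

Plotting `efficiency(...)` against node count alongside the fitted curve gives the x-(nodes) versus y-(performance) figure described above, with `predict(a, b, 1024)` as the estimated point.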
@void quartz I know some people. Let me see about pulling some strings. Was your application for Frontier?
I can run RWKV at whatever scale you need. I never knew we were limited to 8.
I didn't know they had that followup. This could have easily been resolved. Am I not on this email chain?
Summit Plus / Frontier
The request now looks to be moved towards "Director request" for ~30,000 node hours - which can be used to prove out the scaling (and maybe do something useful with it)
Why are we saying transformer has memory complexity of N^2? That's been shown to be avoidable e.g. FlashAttention
I'm not sure that saying SSMs have memory complexity of NlogN is really correct, either
And what is the N in memory complexity? Many parts of this table don't seem right to me
Also, saying that RNNs can't do multi-gpu training is very questionable... since rwkv is an RNN
Maybe you mean a specific RNN architecture like LSTM?
That would be my fault - didn't understand the implications when they asked for previous run history, and treated it as simply reporting what has been done
is it nvidia based? or AMD based? - i think it would be great if we can include short runs for a large model, across key scale sizes, and show their peak tokens/sec - and show a slight loss reduction
that can help disprove the "cannot train at scale" claim and put it to rest
(ps: we had issues with the frontier AMD node scaling past 100, from what looks like node-to-node communication issues)
you folks probably know better at a 1000 node scale, architecturally speaking, since it's just DDP training runs and all of that is deepspeed - am i right in understanding that this is handled by deepspeed / pytorch lightning?
Clarification: the official application was for Summit Plus. Not sure of the reasons, but it looks like they wanted to test scaling on Frontier and were encouraging projects to go in that direction, and we went along with it?
heard megatron is much better at scaling
looks like its time to setup a new trainer - again 😂
ok got this PM "wait before using megatron, we will release soon a nanotron" 😂
You'd need to write a bunch of custom code to use Megatron, since it was designed for transformers
If you're going to put that work in, I highly recommend using GPT-NeoX which is a similar library to Megatron with DeepSpeed support and other custom features.
(Or, "I highly recommend chatting with Quentin about if it would be a good idea to add..."
the GPT-NeoX codebase is significantly easier to understand than Megatron itself
Quentin works very hard to make it so 🙂
This was meant to provide clarification on a frequently referenced table.
Ok, I've changed to "Vanilla Transformer" and "LSTMs".
Ah gotcha. Didn't remember where I had seen that table before 🙂
I'll take a look back at the retnet paper, but I think placed here it's missing some context that's important. Also, saying SSM is not the same as saying H3/S4/Hyena, since Mamba is a SSM (and also probably shows that those two can now be implemented in what would be called O(N))
I'm a bit worried that copying RetNet's table may not be a great path for us.
I mean, I'd go as far as to say that their data in that table is extremely misleading. We don't want to do the same thing!
Yeah, that's a good point!
This whole idea of long-sequence memory complexity that they claim is kind of a red herring. 😭
Maybe we can find an alternative way to point out the differences that show RWKV's benefits
And just to be clear, RWKV and Mamba are very similar in all these kinds of metrics. We shouldn't avoid that fact
By the way, looks like the S5 paper also has a somewhat similar table
that table presents a much fairer comparison imho
but 'parallel' yes/no for RNNs is still pretty misleading
actually I think this table is wrong too haha
the inference column is somewhat misleading
Ah, they sort of clarified earlier:
"...while also being parallelisable across the sequence dimension during training."
The MS survey asserts a similar definition of parallelization: training for time step T is possible before finishing the training for earlier time steps (< T)
Rather than copy someone else's table, let's come up with a plan for what we're trying to show in comparison and figure out how to best represent that
and we can make it fair, unlike retnet paper
Of course, the Transformer's quadratic attention is NOT parallelizable under the MS survey's definition, because of the fully connected matrix multiplication along the time axis 🤣
For decoder-only models it is, if I understand you right
We do all of these simultaneously
Isn't this exactly what the "unrolling" at train-time for RWKV is for?
I personally think it's exactly possible if we have a batch with 9 sequences in parallel.
rwkv5.1 does this sort of matrix multiplication, but rwkv5.2 and rwkv6 CUDA kernels don't bother to parallelize across time because it's highly effective to keep everything in gpu SRAM for a huge constant time speedup and obtain excellent parallelization over the non-time dimensions
mamba claims to use parallel associative scan to parallelize over time as well, but I haven't evaluated it to see if they actually do that in their CUDA code (their code often mismatches their paper in other ways so I'm a bit skeptical)
and to be clear, the current draft skips 5.1 and only describes 5.2 and 6
I also suppose the MS survey authors mean parallelizing over time as well, with NO batched subsequences (the 9 seqs in Stella's image)
If my assumption is wrong, then I'm not sure about the attached table's definition
I don't really know what MS's survey idea is, or if it's at all reasonable, but I think we just need to try to be fair and descriptive
as i recall they already agreed to get rid of the training parallelization column in the next revision, according to @steady ether
https://arxiv.org/pdf/2312.00678.pdf is the survey. This preprint has the same table as the RetNet preprint.
doesn't matter, see #1103039376184852622 message
they have agreed to update it and remove that column entirely
- Batched Parallelization along time (RWKV-v4 and the other decoder-only models could do this type)
- Single Sequence Wise Parallelization along time ( Mamba asserts this type ?? )
to be clear, rwkv 5/6 can be implemented the same way as mamba claims w/ parallel scan - they just don't happen to be in the code released
I already state all of this in the draft
but we can certainly clean up that language if needed
I don't really understand what is being argued about here 🙂
Ah, sorry... my intention is to clarify a list of multiple definitions of "parallelization".
Type X parallelization, Type Y parallelization, etc., and then I want to classify whether each model has each type.
Rwkv v5 can be rewritten to rely only on a "cumulative sum with decay" operation for cross temporal information bleeding. V4 was the same... how do other linear models perform their operations in a way that they are more parallelizable than that? Chunked temporal information forwarding?
| Model | Type X-parallel | Type Y-parallel |
|---|---|---|
| name1 | Yes | Yes |
I see. Well afaict nearly every model discussed, except certain unrelated RNNs, can be implemented with parallelization across the time dimension
afaik you're right, they don't do anything more parallelizable at all
everything like this can be parallelized across time using associative scan
many of these models were not implemented this way, including RWKV 5.2, 6
but that doesn't mean they can't be if it were useful to do so
and afaik the only reason mamba is able to get away with doing so without a horrific constant time penalty is that they limit their effective head dimension to 16
but that's unrelated to computer science asymptotic time complexity calculations
Even then, you can implement a massive triangular decay matrix, multiply it with the unmixed state, then do the cumsum using a tree algorithm for max parallelism. It's technically parallel, but it's so much more efficient to just do a scan
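To make the three options concrete, here is a minimal NumPy sketch of the "cumulative sum with decay" `s_t = w * s_{t-1} + x_t` computed as a sequential scan, via an explicit triangular decay tensor, and via log2(T) doubling steps of an associative operator. A single per-channel decay `w` is assumed for simplicity; the function names are made up and this is not the RWKV or Mamba kernel code.

```python
import numpy as np

def decay_cumsum_scan(x, w):
    """Sequential scan: s_t = w * s_{t-1} + x_t. x: (T, C), w: (C,)."""
    s = np.zeros_like(x[0])
    out = np.empty_like(x)
    for t in range(x.shape[0]):
        s = w * s + x[t]
        out[t] = s
    return out

def decay_cumsum_matrix(x, w):
    """Same result via an explicit (T, T, C) lower-triangular decay tensor."""
    T = x.shape[0]
    idx = np.arange(T)
    power = idx[:, None] - idx[None, :]            # (T, T): t - i
    mask = power >= 0                              # keep only i <= t
    decay = np.where(mask[..., None],
                     w[None, None, :] ** np.maximum(power, 0)[..., None],
                     0.0)
    return np.einsum("tic,ic->tc", decay, x)

def decay_cumsum_assoc(x, w):
    """Same result via doubling steps of the associative operator
    (a1, b1) o (a2, b2) = (a1 * a2, a2 * b1 + b2)."""
    T = x.shape[0]
    a = np.broadcast_to(w, x.shape).copy()
    b = x.copy()
    shift = 1
    while shift < T:
        a_prev = np.ones_like(a)                   # identity element is (1, 0)
        b_prev = np.zeros_like(b)
        a_prev[shift:] = a[:-shift]
        b_prev[shift:] = b[:-shift]
        b = a * b_prev + b                         # uses the old a on purpose
        a = a * a_prev
        shift *= 2
    return b

rng = np.random.default_rng(0)
x = rng.normal(size=(9, 4))
w = rng.uniform(0.5, 0.99, size=4)
print(np.allclose(decay_cumsum_scan(x, w), decay_cumsum_matrix(x, w)))
print(np.allclose(decay_cumsum_scan(x, w), decay_cumsum_assoc(x, w)))
```

The matrix form needs O(T²·C) intermediate memory while the scan carries only a C-sized state, which is one way to see why a plain scan can win in practice even though all three are mathematically the same.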
Mamba's claim seems to depend on GPU RAM size ??
this whole discussion is just something that MS created by releasing TWO preprints with bogus analysis and false claims
and they have agreed to retract that part
so I still don't really understand the goal here for us
I think you're mixing up asymptotic time complexity with running time claims
We should probably put something out detailing our take because I don't have high confidence that they'll actually retract it.
do you have a suggestion on how to approach it?
Academic pettiness? Their incorrect claims may have negatively affected a compute grant, and "it's not true unless it's in a paper" seems to be a prominent opinion
Like, a discussion section? Or a table of some sort?
The problem with a table is that essentially nearly all the models have the same entries in the table, in terms of asymptotic time complexity and parallelizability across time
I think it's a good idea to prep a blog post that shows the different tables, explains why they're wrong / explains the issues with succinctness, and presents a corrected table.
Maybe we won't release it for a while, but it'll be good to have on hand.
What's the academic equivalent of "as per my last email"?
"Contrary to (xyz et al, 12a section b), parallelization blah blah..."
"We raised this issue X months ago [link] and look forward to the promised forthcoming correction to xyz et al."
Ouch, that one stings haha
I checked and the survey authors said late December, so let's follow up with them in the New Year and work on everything else in the meantime.
If we want to make the table unique, we can compare it directly with some other foundation models that scaled to at least 7B, e.g., RWKV vs GLM, LLaMA-2, and Mistral.
Hmm, a table having the same entries is a bit problematic. Agree with blog / discussion section
Why
I'm on the fence. This paper is really about introducing V5, but it's also important that we clarify training parallelization.
I meant why is the table having the same entries problematic
There were some concerns about us using a very similar table to other papers.
Why would that be concerning
I guess not then? I made an assumption based on an earlier conversation: #1103039376184852622 message
Their RetNet paper is not receiving good feedback, see https://openreview.net/forum?id=UU9Icwbhin (especially Reviewer 8FpU), where the table is questioned the most.
The main weakness with this paper are overclaiming and lack of citations, which can be misleading for readers.
😏
Q1: “Impossible Triangle” is an absolute overclaim because RWKV and H3 have already demonstrated models are comparable to Transformers
A1: The claim is fair enough. The “comparable performance” means that the models achieve similar results under the same setting (e.g., #parameters, and training corpus). For example, previous comparisons use Transformers with absolute position while the compared methods benefit from relative position modeling. Moreover, in H3 paper, the comparable results are in hybrid settings (i.e., combine H3 and Transformer layers), but we don’t add any Transformer layers. We conducted various controlled experiments (with matched #parameters and using the same training corpus) to compare different architectures. We are confident that the claim holds well. The experiments in Table 4 also show that previous methods still have a big gap.
Q2: RWKV can indeed be computed in parallel.
A2: We give a clear definition on “training parallelization” in the caption of Table 1, which is discussed from the sequential perspective. “∗”: whether the training implementation is sequentially parallelized, although RWKV uses channel-wise parallelism. As stated in A1, RWKV’s performance is actually not comparable with Transformers according to our experiments (i.e., same #parameters, same data, and with relative position modelings). So, the statement of RWKV in Table 1 is fair enough.
Relative position modelings hurt RWKV performance? 🤔
Wow those aren't great reviews
The training parallelization definition is like Internet providers offering unlimited* data
*Notice the asterisk
secret sauce
torch.nn.functional.scaled_dot_product_attention
I know they don't help.
@misty igloo didn't you do a training run with/without RoPE?
I'm sure Bo tried it in the RWKV-4 era. I've tried it with and without for RWKV-5, but my implementation is not the official one, my runs were short, and certainly my results haven't been published anywhere. I wouldn't say I definitively know the answer, which is why there's currently an 'experiments needed' in the paper for this.
Also, should there be test runs on small models (<100 M) a la TinyStories?
No reason not to if we have time but I wouldn't remotely view it as a priority
Since the models aren't large, I can do it
lets add another column to the table: state size.
rwkv2/3/4 has the smallest state size of all models here. this is a plus in some scenarios.
it's the first and only design achieving good LM performance with such tiny states.
a rwkv4 with rwkv6 trick will be highly interesting.
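For a proposed state-size column, a back-of-the-envelope count could look like the sketch below. The per-layer layouts are assumptions (five `d_model`-sized vectors for v4; two token-shift vectors plus one `head_size x head_size` matrix per head for v5/v6, with `head_size = 64`), and the model configuration is hypothetical, so the numbers should be checked against the actual inference code before going into the paper.

```python
def rwkv4_state_size(n_layer, d_model):
    # Assumption: 2 token-shift vectors + 3 WKV accumulators (numerator,
    # denominator, running max exponent), each of size d_model, per layer.
    return n_layer * 5 * d_model

def rwkv5_state_size(n_layer, d_model, head_size=64):
    # Assumption: 2 token-shift vectors + one (head_size x head_size)
    # matrix-valued state per head, per layer.
    n_head = d_model // head_size
    return n_layer * (2 * d_model + n_head * head_size * head_size)

# Hypothetical configuration chosen only for illustration.
n_layer, d_model = 24, 2048
v4 = rwkv4_state_size(n_layer, d_model)
v5 = rwkv5_state_size(n_layer, d_model)
print(f"v4 state: {v4} floats, v5 state: {v5} floats, ratio: {v5 / v4:.1f}x")
```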
Good idea
@misty igloo How are your experiments about RWKV's positional encoding going?
I haven't done any specific ones recently - but when I use token shift with traditional MHA I still need positional encoding for it to work well
my historically 'best' model is one that pairs some of the parts of rwkv like token shift with more traditional MHA
Yes, but RWKV's positional mechanism is more like
- Short term: token shift
- Long term: weight decay
I'm uncertain whether or not token shift supplies local positional information but I certainly agree that weight decay is what supplies positional information over longer distances.
I've tried using RWKV-style weight decay with traditional MHA and in my experience it works almost as well as ALiBi
I have some new models that use that new Based softmax approximation alongside RWKV style decay and linear attention and it works great
it's a trainable alibi. should be better.
alibi is slightly different: additive not multiplicative~~, and linear over time rather than exponential~~
then again, alibi only operates per head, unlike rwkv5.2/6+
no. it's exponential over time
exp(additive) 🙂
maybe my implementation is wrong 🙂
but it works great
this is what I meant about alibi being linear over time
(from their github)
maybe you meant the exponential part is the softmax applied to that
its possible my initializations were better for alibi and my trainings didnt run long enough, or maybe some other confounding factor
I wasn't specifically trying to drill down onto positional encoding at the time - just was trying to rapidly find the best mixed model for use with that Based approximation
(second order taylor series approximation of softmax)
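For readers unfamiliar with the trick, a second-order Taylor approximation replaces `exp(q·k)` with `1 + q·k + (q·k)²/2`, which can be written as a dot product of feature maps and therefore used in linear attention. The sketch below shows only that identity on made-up vectors; it is a generic illustration, not the actual Based implementation.

```python
import math

def taylor_feature_map(x):
    """phi(x) such that phi(q) . phi(k) = 1 + q.k + (q.k)^2 / 2."""
    second_order = [xi * xj / math.sqrt(2) for xi in x for xj in x]
    return [1.0] + list(x) + second_order

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

# Made-up query/key vectors for illustration.
q = [0.3, -0.2, 0.5]
k = [0.1, 0.4, -0.6]
s = dot(q, k)
approx = dot(taylor_feature_map(q), taylor_feature_map(k))
print(approx, 1 + s + s * s / 2, math.exp(s))
```

Note that the feature dimension grows as 1 + d + d², which is the cost paid for the better approximation.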
yeah softmax has exp
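The "exp(additive)" point can be checked numerically: subtracting a bias that is linear in distance before the softmax gives the same weights as multiplying `exp(score)` by an exponential decay `w ** distance` with `w = exp(-slope)`. The scores and slope below are made up for illustration.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical values for one query at the last position of a length-5 context.
scores = [0.2, -0.1, 0.4, 0.0, 0.3]   # q . k_i for keys i = 0..4
slope = 0.5                            # ALiBi-style per-head slope (assumed value)
distances = [4, 3, 2, 1, 0]            # query position minus key position

# Additive view: subtract a bias linear in distance, then softmax.
additive = softmax([s - slope * d for s, d in zip(scores, distances)])

# Multiplicative view: exp(score) weighted by an exponential decay w ** distance.
w = math.exp(-slope)
unnorm = [math.exp(s) * w ** d for s, d in zip(scores, distances)]
multiplicative = [u / sum(unnorm) for u in unnorm]

print(max(abs(a - m) for a, m in zip(additive, multiplicative)))
```

What still differs, per the discussion above, is that the ALiBi slope is set per head while RWKV's decay is trainable.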
true. need to run 10G tokens to see the difference
@obsidian quest regarding what @gusty condor was saying, do you think that token shift adds short term positional information?
(If so, I'd like to understand that aspect better so we can include it in the paper)
yes. and for ICL
Do you know what the difference(s) are?
I reviewed the paper and code again, and rescind my earlier claim - I do think the code matches Algorithm 2 from the paper.
regarding the second half about the results not matching, here's a link to Bo discussing his findings #1109810049607532555 message
It's not at all clear to me what someone is supposed to glean from this TBH
It's late here and I only glanced quickly, but what I believe is stated is:
- The linked paper is the "Gated Linear Attention Transformers" paper, which compares their new GLA architecture with Mamba based on the Mamba code on GitHub.
- GLA outperforms Mamba on multiple metrics, presumably in contrast to what is stated in the Mamba paper.
- For this to be the case, there must have been some trick to produce the numbers in the Mamba paper; naturally this couldn't be done in the GLA paper as it's an independent evaluation.
- Sure
- There is no contradiction between the Mamba paper and the GLA paper as far as I can tell.
- Or maybe the GLA people deliberately trained a bad Mamba model. That's the thing about inferring bad faith on the part of other teams... it can justify just about anything.
this paper looks super interesting, are there still any tasks where you could use another contributor? I see the paper is already out there on arxiv, etc. so totally fine if it's too late to join. Learnt a lot from this work tho :). Great work!
v4 is published, v5 is what's currently being worked on
Please contribute to RWKV5 paper at https://www.overleaf.com/1623283552mkymjtvsnybt#bd0fc2
is it worth putting in pseudo-torch for the non-academics?
or just link to code bases on github?
eg:
super naive rwkv v5 linear attention is:
(H = heads, C = dims, B = batch, T = time)
k = k.reshape(B,T,H,C//H,1)
v = v.reshape(B,T,H,1,C//H)
r = r.reshape(B,T,H,1,C//H)
kv = k@v  # B, T, H, C//H, C//H (outer product per head)
att = cumsum_with_decay(kv, decay, dim=1)  # pseudo-op, not a real torch function
out = r@att  # B, T, H, 1, C//H
# groupnorm and output head after this
with little effort you can fuse all these operations into a single kernel to save memory and compute (a fused kernel lowers intermediary memory usage from O(C^2) to O(C), while being parallelizable along B, H, and C)
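For the appendix, a runnable version of the naive form could look like the NumPy sketch below, checked against a token-by-token recurrent reference. It omits the `u` bonus and the groupnorm, and it assumes the decay is applied per key channel; both function names are placeholders rather than anything from the official code.

```python
import numpy as np

def naive_linear_attention(r, k, v, w):
    """r, k, v: (B, T, C); w: (H, C//H) decay in (0, 1). Returns (B, T, C)."""
    B, T, C = r.shape
    H, N = w.shape
    k = k.reshape(B, T, H, N, 1)
    v = v.reshape(B, T, H, 1, N)
    r = r.reshape(B, T, H, 1, N)
    kv = k @ v                                   # (B, T, H, N, N) outer products
    att = np.empty_like(kv)
    state = np.zeros((B, H, N, N))
    for t in range(T):                           # cumulative sum with decay over time
        state = w[None, :, :, None] * state + kv[:, t]
        att[:, t] = state
    out = r @ att                                # (B, T, H, 1, N)
    return out.reshape(B, T, C)

def recurrent_reference(r, k, v, w):
    """RNN form: one (N, N) state per head, updated token by token."""
    B, T, C = r.shape
    H, N = w.shape
    out = np.zeros((B, T, C))
    for b in range(B):
        for h in range(H):
            S = np.zeros((N, N))
            sl = slice(h * N, (h + 1) * N)
            for t in range(T):
                S = w[h][:, None] * S + np.outer(k[b, t, sl], v[b, t, sl])
                out[b, t, sl] = r[b, t, sl] @ S
    return out

rng = np.random.default_rng(0)
B, T, H, N = 2, 5, 2, 3
r, k, v = rng.normal(size=(3, B, T, H * N))
w = rng.uniform(0.5, 0.99, size=(H, N))
print(np.allclose(naive_linear_attention(r, k, v, w), recurrent_reference(r, k, v, w)))
```

The naive version materializes every per-step state, which is the intermediate memory a fused kernel would avoid.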
Yes, put them in the appendix
yeah let's also put in the full recurrent formulation including u term in the appendix - its only a few lines of code
I think we should put in 'u' (bonus) into any version we list, so it represents the actual architecture
do we have a potential timeline when we would like to release the paper?
"As soon as possible." I think we're aiming to submit to CoLM, which doesn't have an anon period.
Oh nice, what a relief not having the anonymity restrictions
RWKV-5 is not finished yet, and 1 month without progress
I will join this
You can write out VisualRWKV experiments
Actually HGRN also achieved good LM performance using a state size of 2*d_model
Let me know if there's anything specific I can help with
#1103039376184852622 message
i think we could explain more about this. probably the prompt-engineering sensitivity of rwkv-raven does not apply here as these are not specifically chat models, but one would expect the error distribution to be similar (associative recall, etc). In the [based] blog post [ https://hazyresearch.stanford.edu/blog/2023-12-11-zoology2-based ] the stanford/hazyresearch team showcased a pitfall example, and i expect rwkv models to behave similarly. should we refer to that?
I was able to reproduce some of their results using their synthetic test to train models from scratch, and it does seem that there is a noticeable gap.
I stopped because reproducing the entire experiment seemed computationally expensive.
This is with v4
@steady ether How much compute would you need to fully reproduce it
A quick estimate would suggest running 8xA100 for 10 days.
VRAM doesn't appear to be as crucial, so 8xA10 should suffice
I can help out
please try x051a too https://github.com/BlinkDL/nanoRWKV
v5 and v6 are very different from v4
I integrated https://github.com/BlinkDL/RWKV-LM/commit/e254e4c22ab2a1e178a56a7f7c470fbd63a3c80c and am currently running a test. So far, it appears to be better than V4. Completing 256/512 sequence lengths will take some time.
I think that version is x052?
yeah it's x052
and v6 is better
Anyone in SF atm? I’ve been offered a talk at a subquadratic attn meeting on the 25th afternoon which I plan to do virtually, but just in case
Me and Eugene are
@steady ether Is this the graph like mamba with induction heads
If so will it extend farther?
Like mamba going from 64 to a million?
Yeah, partially. It's v5.2, not the final tokenshift. So far, it slightly outperforms Mamba on the Stanford benchmark.
V5.2 testing is done (for that AR experiment). We can probably use Stanford's results for the other models to save on compute.
These results look amazing
Are we using the same eval code that they used? we should at a minimum confirm that we can reproduce their numbers
Yeah, it's the identical code for V4. I messaged them and they shared the code. I might also get their Wandb logs.
Our v4 runs are slightly different so there's some variance there. Same goes for the previous incomplete run with their other models.
Hello everyone.
hi Xaiat
hello everyone
Hello everyone
Hello everyone, I’m new to this community, but I’m eager to contribute to this project.
But, I am a bit confused about how to contribute. Do I just look at the text on overleaf and start editing them?
Also, would this paper be more interesting if it could add some evaluation or finetuning experiments on code generation tasks (like HumanEval). If so, I think I can contribute something like that.😁
Welcome! The final RWKV5.2 7B model checkpoints should be ready around Jan 29, so many of the main experiments will have to wait on that. If you have proposals for experiments you can do that would be useful to include in the paper and can be done via from-scratch pretraining, like the one @steady ether is doing, you could get started on those now. Also, see #1103039376184852622 message for a list of items to do (many have been at least partially completed at this point)
wanted to ask - what's the best / official way to do the needle in the haystack test, as i would be looking into that - i found several repos around this - but not sure which one is favoured academically speaking
Experiment:
Test perplexity over different context lengths (I think from 1 to 65536), to show that RWKV can handle and utilize longer context length than it was trained for (4096).
I need both intermediate models and corpus with very long documents.
you can probably benefit from the previous experiments @snow zealot did
this
basically these numbers, more discussion here: #1103039376184852622 message
Hi everyone, is there a detailed todo list?
#1103039376184852622 message this is a todo list for the paper I think
@last mauve it would be great to get an update on the todo list if you have time
Thanks. It would be better if there is a real-time todo list.
My current plan for the memory test
- (in progress) benchmarking the model memory size in a finetune to repeat its input
- (todo) perform needle in the haystack test using a modified version of: https://github.com/Arize-ai/LLMTest_NeedleInAHaystack2/tree/main (unless someone else has a better version to use) over large context length
- (todo) compile the results into the paper
I think it might be better to use perplexity test on natural data, not needle on synthetic data.
which repo should i use for that?
might as well just do all
- Select some long documents (length >= 65537) that are not included in the RWKV training set.
- compute cross entropy loss (or perplexity) at token 1,2, ..., 65536.
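The two steps above might be sketched as follows. Random logits stand in for real model outputs and the sizes are tiny, so this only shows the bookkeeping (per-position loss, then averaging across documents), not a real evaluation; the function names are made up.

```python
import numpy as np

def per_position_loss(logits, targets):
    """logits: (T, V) next-token logits; targets: (T,) ids. Returns (T,) losses in nats."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets]

def mean_loss_by_position(doc_losses):
    """Average (T,) loss curves from several documents, position by position."""
    return np.mean(np.stack(doc_losses), axis=0)

# Stand-in for real model outputs: random logits for 3 hypothetical documents.
rng = np.random.default_rng(0)
T, V = 16, 50
curves = []
for _ in range(3):
    logits = rng.normal(size=(T, V))
    targets = rng.integers(0, V, size=T)
    curves.append(per_position_loss(logits, targets))

curve = mean_loss_by_position(curves)   # loss at token 1, 2, ..., T
perplexity = np.exp(curve)              # perplexity at each position
print(curve.shape, perplexity[:3])
```

With real documents the curve would normally be smoothed or bucketed by position, since single positions are noisy.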
See if this helps
https://github.com/Jellyfish042/uncheatable_eval
got this message from the "based" paper authors (stanford's attn-as-rnn-like model): "We are currently running experiments for our paper and would like to include the newest architecture from the RWKV folks. do you know if the code for RWKV v6 is available?" afaik there's no official open source implementation, and https://github.com/SmerkyG/gptcore/blob/main/model/experimental/rwkv6_0.py is the unofficial one, but after talking to @misty igloo we can't rule out that there's a bug, so probably the safest is to tell them to just compare against v5?
@obsidian quest can you give them training code for x6 so the results appear in their paper?
it'd be an easy way we avoid the problem we currently have where everyone keeps showing v4 as the comparison
yes exactly
but this (https://github.com/BlinkDL/ChatRWKV/blob/ea1ccf40a42338442b2c4b2323354ad214e8f9a0/rwkv_pip_package/src/rwkv/model.py#L861) is just the inference code?
yes that's inference only, unfortunately
@obsidian quest any chance we can give them the v6 training code? they wont test it in the scale where v6 improvements kick in probably but we would get direct comparisons to "based" arch
Wait, do we not have a copy of the v6 training code?
they can use v5 as comparison (and we have models for this)
i plan to release v6 kernel together with trained v6 model in Feb
v5 is a strong benchmark and they can try it first
as shown in #1103039376184852622 message
ok sent v5 their way
let's see if they can replicate Song's results
I've been thinking about the marketing issues wrt the name and version numbering, and I was wondering what people thought about giving v5 and v6 a name that isn't RWKV? I think it might make sense to call RWKV a category of architectures (much like state-space models) and give each model a distinctive name (like Mamba)
I am 1000% for this idea. Something like e.g. RWKV6: Eagle would accomplish both goals, because people would call it Eagle but it would be clear it's the RWKV architecture series
We can do bird themes for all the models too, to establish some brand cohesion
actually Mamba = S6, and i think they choose another name for the same marketing reason
Yes, that's what I'm suggesting we do too
can try RWKV-6 code name XXX (placeholder) - i know what xxx means lol. should i use xyz?
No you are not allowed to name a model XXX
That's an extremely common code-phrase for pornography in the US
I think he was using XXX as a placeholder 🙂
I think "Eagle: RWKV Models with Matrix-Valued States [some cool statement about performance]" is a more typical title structure
So Eagle is v5, Raven is v4? ____ is v6? is that the idea - RWKV stays as architecture & group name
im good for either name, my vote was to reuse raven previously, but any bird name would do for me to use on the promotion front 🙏 (any name that I do not need to repeat 3+ time, for people to get)
also i rather avoid comparison to v6 until its stable 😅 - to avoid the 5.0 / 5.1 / 5.2 confusion again
let's reserve Raven for other purposes
Eagle was taken by some other LLM sampling mechanism apparently, so I propose Hawk and Condor for RWKV v5 and 6
How do we differentiate from the Falcon models from TII?
Hawk and Falcon are quite similar in my mind. Condor seems like a distinctive bird though
@void quartz suggested Eagle for v5, and then @misty igloo says Condor for v6, does that work?
tbh i was just going along with the suggestion
Oh oops, just realized that
if we want to avoid confusion with falcon, i guess we can use condor
should we use less common birds first?
how abt
Dxxx for v4
Exxx for v5
Fxxx for v6
Eagle
Finch
Gull
Hawk
Ibis
Jay
In this list of birds by common name, a total of 10,976 extant and recently extinct (since 1500) bird species are recognised. Species marked with a "†" are extinct.
Cant go wrong with Emu
Ibis reminds me of Ibis Paint
compiling a list - for bird names as well here:
https://docs.google.com/spreadsheets/d/1xtb6AyKIEW44Q1z-FXL_4PcYOXBW95SZHWfWgMFN5ec/edit#gid=0
Sheet1
| Bird Name | Conflicting Company / Project | Description |
|---|---|---|
| starling | Starling AI | Starling CARE is a turn-key service that helps clinicians improve patient care while reducing the need for manual processes or additional employees. |
| canary | Canary AI | Say goodbye to the 'morning brain fog' - just pres... |
Just FYI, their "based" model is likely better than V5, judging from their results vs mamba.
will do. I want to push this paper soon after my current round of papers, so will be live-updating that list like I did for the last paper starting next week I think.
they claim their Based model is better than transformer++ so it better beat rwkv if so!
Good point 🤣
just in time, for our ramp up is once the model is out, then all the 1:1 compare for 1.5B / 3B / 7B can start
let's try this for RWKV too #research message
we need a v6 trainer 😉
(to show results competitive with mamba)
we could use mine, but hard to know it's exactly the same (and especially the initializations I'm just guessing on)
can compare with https://github.com/BlinkDL/RWKV-CUDA/blob/main/wkv6/run.py
https://github.com/BlinkDL/RWKV-CUDA/blob/main/wkv6/cuda/wkv6_cuda_v1a.cu this one is correct (although slow)
@obsidian quest You really need to share the actual trainer. Why haven't you done so?
that code doesn't include the initializations
I can compare w/ chatrwkv code too but it also has no initializations
thank you! (wow, afaict I somehow used the identical initializations in mine!)
My understanding is that it's been very difficult to get it to run fast w/ a handwritten custom CUDA autograd backward() fn while maintaining correctness.
Fortunately, myself and @quaint quiver recently adapted some of the techniques from the Gated Linear Attention paper to create a pair of new algorithms for v6 that run fast even in pure pytorch.
I had a problem with my wrapper code until today, but I've now found the error and corrected it.
So hopefully pending a couple more tests to ensure it produces results exactly identical to Blink's original implementation, we can use it to do RWKV v6 experiments for the paper.
ok let's use
RWKV-4 "Dove" (v4 with v5/v6 trick is useful for embedding etc., because it has smallest states)
RWKV-5 "Eagle" (v5 variants can be efficiently trained without cuda)
RWKV-6 "Finch"
RWKV-7 "Gull"
try this latest improvement for v5 v6 if you have compute:
change gate to d=64 lora, increase ffn width back to 4x to keep params count
D_GATE_LORA = 64
self.gate_w1 = nn.Parameter(torch.empty(args.n_embd, D_GATE_LORA).uniform_(-0.01, 0.01))
self.gate_w2 = nn.Parameter(torch.zeros(D_GATE_LORA, args.n_embd).uniform_(-0.01, 0.01))
...
g = torch.tanh(xg @ self.gate_w1) @ self.gate_w2 (instead of F.silu(xg @ self.gate))
For replicability, it is important to use a verbatim copy of the exact model architecture described in the paper
will put it in v7
should D_GATE_LORA be kept constant across model sizes? or is this 64 only for 100M and it should scale?
i think 64 is enough, as ffn can be wider when gate is narrower
seems tmix-gate is the only matrix that can be reduced this way
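The parameter bookkeeping behind "narrower gate, wider ffn" can be checked with a few lines of arithmetic. The channel-mix layout (key, value, receptance) and the 3.5x starting FFN width are assumptions made for this sketch, and `d = 2048` is a hypothetical `n_embd`.

```python
def gate_params_full(d):
    return d * d                       # one (d x d) gate matrix

def gate_params_lora(d, rank=64):
    return 2 * d * rank                # gate_w1 (d x rank) + gate_w2 (rank x d)

def ffn_params(d, width_mult):
    # Assumption: channel-mix has key (d x f), value (f x d), receptance (d x d).
    f = int(d * width_mult)
    return 2 * d * f + d * d

d = 2048                               # hypothetical n_embd
before = gate_params_full(d) + ffn_params(d, 3.5)   # 3.5x is an assumed starting width
after = gate_params_lora(d) + ffn_params(d, 4.0)
print(before, after, after - before)
```

Under these assumptions the d² parameters freed from the gate pay for widening the FFN from 3.5x to 4x, and the only net increase is the 2·d·64 of the LoRA itself.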
I will follow these experiments https://arxiv.org/pdf/2310.16450.pdf by DAMO Academy, which proposes a long context corpus for computing perplexities.
I ran a similar test for RWKV5, the results look amazing!
Trained on a context length of 4096, the 0.4B model's perplexity remains at a low level (~7.15) even at context length 98.3k. Perhaps it will never (practically) run to a perplexity collapse.
Hi, can I have your code please, I want to test it with my finetuned model. Thanks
so we are using this for the upcoming announcement? 🤩
Eagle it is ?
Eagle or Egret?
The name EagleAI is already used https://eagleai.com/
Eagle AI innovative leaders in Risk and Compliance Management Solutions. Our experts use advanced Artificial Intelligence and Machine Learning technology to help you reduce cost, increase profits and achieve regulatory compliances.
fintech isn't LLMs
the tech space has tons of conflicting names
Don't worry about it: nobody in our target audience has heard of this
RWKV-5 "Eagle" 7B: beats Mistral-7B at multilingual, reaches Llama2-7B level at English, while being 100% attention-free RNN and only trained 1.1T tokens. Gradio Demo: https://t.co/k0AivnxCwP RWKV-6 "Finch" 1B5 in ~10days, 3B in ~30days.
looking into rolling the rwkv pip library into lm-harness:
https://github.com/EleutherAI/lm-evaluation-harness
Would like to confirm if the logprob output is supposed to be the sum of the individual token log-probabilities?
i.e. I should not divide by the number of output tokens, meaning longer responses accumulate a larger-magnitude logprob?
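A sketch of the "sum, no division" reading of the question, on toy logits. It assumes that the harness wants the summed log-probability of the continuation plus a flag for whether the continuation was the greedy choice, with any length normalization done afterwards by the task metric; that assumption should be confirmed against the harness documentation. `continuation_logprob` is a made-up helper, not a harness function.

```python
import math

def log_softmax(logits):
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def continuation_logprob(step_logits, continuation_ids):
    """step_logits[i] are the logits that predict continuation_ids[i].
    Returns (summed log-prob, whether every token was the argmax)."""
    total = 0.0
    is_greedy = True
    for logits, tok in zip(step_logits, continuation_ids):
        total += log_softmax(logits)[tok]
        if max(range(len(logits)), key=logits.__getitem__) != tok:
            is_greedy = False
    return total, is_greedy

# Toy 3-token vocabulary and a 2-token continuation (made-up logits).
step_logits = [[2.0, 0.0, 0.0], [0.0, 1.0, 3.0]]
total, greedy = continuation_logprob(step_logits, [0, 2])
per_token = total / 2   # length normalization, if wanted, happens outside the sum
print(total, greedy, per_token)
```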
replied in #lm-thunderdome !
The more formal write up is up!
https://twitter.com/RWKV_AI/status/1751797147492888651
https://blog.rwkv.com/p/eagle-7b-soaring-past-transformers
Note: RWKV-4 world is 0.59T tokens, not 1.12T
corrected the blog:
https://blog.rwkv.com/p/eagle-7b-soaring-past-transformers
also added a correction tweet
https://arxiv.org/pdf/2401.15077.pdf : what are the odds haha
We should search the training data for "As an AI language model" and "OpenAI" and document the frequency of such data.
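A minimal sketch of that audit, counting how many documents contain each phrase. The three sample documents are made up and stand in for the real training corpus; `phrase_frequency` is a hypothetical helper.

```python
from collections import Counter

def phrase_frequency(documents, phrases):
    """Count, per phrase, how many documents contain it (case-insensitive)."""
    lowered = [p.lower() for p in phrases]
    doc_counts = Counter()
    for text in documents:
        text_l = text.lower()
        for phrase, phrase_l in zip(phrases, lowered):
            if phrase_l in text_l:
                doc_counts[phrase] += 1
    return {p: doc_counts[p] for p in phrases}

# Tiny made-up sample standing in for the real training corpus.
sample_docs = [
    "As an AI language model, I cannot help with that.",
    "OpenAI released a new paper today.",
    "The eagle is a large bird of prey.",
]
counts = phrase_frequency(sample_docs, ["As an AI language model", "OpenAI"])
rates = {p: c / len(sample_docs) for p, c in counts.items()}
print(counts, rates)
```

A raw match on "OpenAI" also catches ordinary factual text about the company, so counts like these are an upper bound on contamination rather than a direct measure of it.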
I've been playing around with it at https://rwkv-demo-api.recursal.ai/ and am getting a lot of undesirable outputs 😦
ChatGPT but better.
It seems very contaminated 😬
The issue has been discussed a few times in the RWKV Discord server. I also hope that this OpenAI content can be removed in the next training run. I tried to suppress it with finetuning, but the model still remembers it.
Good data filtering and documentation is essential 😮💨 We'll learn and do better next time.
At least the model doesn't profess to be trained by OpenAI
Sometimes it still says it is trained by openai. So I write in system prompt who made the model as workaround 🙂
I look at that date ….. it had to be same day
the odds were 100%, since I mentioned this was the problem with using the Eagle name 🤣
#1103039376184852622 message
Eagle was taken by some other LLM sampling mechanism apparently, so I propose Hawk and Condor for RWKV v5 and 6
But yeah, didn't expect it to drop the same day lol
Well, at least ours isn't ALL CAPS 🤣
RWKV-5/6 has a curious issue (i am using minipile) - if you test multiple different random initializations (requires L24-D1024, this wont happen for L12-D768), they are either "good runs" or "bad runs".
I will try to find the cause for this.
That's quite curious, and very interesting to investigate. Is the training data and random seed (other than the initialization) fixed across runs?
data order fixed, no other randomness except initialization
Is the data for RWKV-5 still trained on the pile + other multilingual sources? (referencing this faq https://wiki.rwkv.com/basic/FAQ.html#what-is-the-dataset-that-rwkv-is-trained-on)
Also, is there a plan to release the data Eagle was trained on, or the process to recreate it?
finetune it, and every time "openai" is output, boost the loss a little
Ibis is nice
Make openai a token, and oblivion it
This is not feasible since factual data about OpenAI are also ignored
I could use DPO to force it to forget GPT and OpenAI.
Here are some keywords that could be filtered in conversations and chats:
("openai", "gpt3", "gpt-3", "gpt4", "gpt-4", "chatgpt", two of ("knowledge cutoff", "limited to", "september 2021", "2021-09", "截止", "2021年9月"), ("gpt architecture", "基于GPT"), "1750亿", "175 billion")
Or, in RWKV-7, we can totally avoid using ChatGPT data
I strongly disagree with this proposed list of key words. It paints with a wide brush and also misses a lot of low-hanging fruit
Phrases like "As an AI language model" are much better IMO.
Knowledge cutoff might be a good idea though, I would be interested in seeing what data that is found in
But this also highlights just how important data documentation and provenance are. I strongly suspect that a lot of this was avoidable if we had paid more attention to what was being scraped, and especially downloaded from HuggingFace. The secrecy around training data sources is actively harmful to research, both our own and other people's. By keeping it hidden during training (despite the fact that it was always going to be released, as both Linux Foundation and EleutherAI policy require it) we severely limit the ability of people to inspect the data and identify issues with it.
You mean v6 right?
v6 is already under training and we have no chance to remove them
We can pause training and intervene on the training data. Whether that's the right choice is a separate question, but it's absolutely an option.
But the model saying "I am ChatGPT," "Based on GPT-3.5 architecture," or "My knowledge is limited to September 2021" is extremely misleading to non-technical users.
Technical users may infer that the model is using ChatGPT data (which is already de facto common practice for open-source language models) and may want to inspect further, but most non-technical users just believe that the model itself is ChatGPT.
I need several turns of dialogue to differentiate RWKV from ChatGPT
I do not disagree with anything you said here, and think that language like that should be removed from the training data as much as possible.
If there's still time, we might be able to get some eyes on the remaining training data and tidy it up. It all depends on how tight the schedule is.
Is it really possible?
Training data looks like this:
Data: Some training data<eos>ChatGPT dialogue data<eos>Lorem ipsum dolor
Trained: 000011111111111111111111111111111111111000000000000000111111111111
Removing ChatGPT dialog data:
Data: Some training data<eos>Lorem ipsum dolor
Trained: 0000111111111111111111100000111111111111
The number of 1s and 0s has changed.
The key is that you must sacrifice something, either fixed context length or one full epoch.
Yes, we should train for multiple epochs instead of training on data that poisons our model into falsely representing itself as being created by OpenAI and infects it with OpenAI's political biases.
I am surprised and confused to learn that this is controversial.
Blink once said that "The era of ChatGPTization is coming, everyone is using ChatGPT data to finetune their own models, and models will all become brothers of ChatGPT 😅 (Of course, I believe everyone will start to differentiate themselves later so that users cannot tell)"
https://www.zhihu.com/pin/1617311881890373632
Step 1 for data cleaning:
- Make a list of all the data sources
- Search each data source for "as an AI language model"
- Tally the % of documents in each source that contains the phrase
This should be straightforward for @obsidian quest to run, or anyone else who has access to the untokenized training data, and will give a very good first look into whether the problem is many sources or a few sources with a lot of contamination.
Running this ASAP is essential, and I strongly recommend pausing v6 training until we do.
Even if we take no action, knowing is important. Currently we have no idea how bad the problem is.
I will happily do the work if someone sends me the data.
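(A minimal sketch of the Step 1 tally described above. The source names and documents are toy stand-ins, and `contamination_rate` / `tally_sources` are hypothetical helpers; the real run would stream the untokenized sources instead of holding them in memory.)

```python
from typing import Dict, Iterable

PHRASE = "as an AI language model"

def contamination_rate(documents: Iterable[str], phrase: str = PHRASE) -> float:
    """Return the percentage of documents that contain `phrase` (case-insensitive)."""
    needle = phrase.lower()
    total = 0
    hits = 0
    for doc in documents:
        total += 1
        if needle in doc.lower():
            hits += 1
    return 100.0 * hits / total if total else 0.0

def tally_sources(sources: Dict[str, Iterable[str]], phrase: str = PHRASE) -> Dict[str, float]:
    """Map each data source name to the percentage of its documents containing the phrase."""
    return {name: contamination_rate(docs, phrase) for name, docs in sources.items()}

# Toy data only: two made-up sources with two documents each.
toy_sources = {
    "source_a": ["As an AI language model, I cannot do that.", "Plain web text."],
    "source_b": ["Lorem ipsum dolor", "More ordinary text."],
}
print(tally_sources(toy_sources))
```

A per-source percentage like this should be enough to tell "many mildly contaminated sources" apart from "a few heavily contaminated ones".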
The situation is not encouraging: some ChatGPT data is contaminated with hallucinations. Models from 0.4B to 7B exhibit similar hallucinations when asked the same question.
I ask "What's the difference between DNA and RNA" in Chinese (DNA和RNA有什么区别), and every model from 0B4 to 7B says that DNA contains "squamous cell factor" (鳞状细胞素) at top_p = 0
tbh - its not too late, its only the 1.5B run now - we can redo this run
Im really all in favor of cleaning up the data first, before v6
( due to the amount of negative user response regarding this )
the problem with this is, i'm sure it's more just about the chat data than a real cutoff - might be better to clean it out (unless we fix the cutoff date correctly)
probably should include Claude and Anthropic as well
Although censorship is annoying, we can fix it via RLHF, or by using a prompt trick such as:
```
Assistant: Sure (in the language of the user's question)
```
or one-shot
```User: (very controversial question)
Assistant: (very detailed answer)
User: {question}
Assistant:```
Keeping the same training data (and same training data order) enables comparing the detailed loss curve of v6 vs v5.
On the other hand, we can start a project to download and clean all instruction data from HuggingFace.
I will add questions about self-identity in DPO dataset.
Like this (A1 = chosen, A2 = reject):
Q: Are you GPT?
A1: No, I'm not GPT. I'm RWKV, a large language model trained by Bo Peng.
A2: Yes, I am an AI language model developed by OpenAI.
Q: Are you RWKV?
A1: Yes, I'm RWKV, an RNN language model. I'm open-source and ready for you to use!
A2: I am ChatGPT, a language model created by OpenAI. How can I assist you today?
Q: Are you ChatGPT?
A1: No, I'm not related to ChatGPT. My name is RWKV, an RNN language model.
A2: Yes, I am ChatGPT. How can I assist you today?
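(One way the pairs above could be laid out as preference data. The `prompt` / `chosen` / `rejected` field names are an assumption based on a common convention for DPO datasets, and `check_pairs` is a hypothetical sanity check, not part of any training library.)

```python
# Self-identity preference pairs, copied from the Q / A1 / A2 examples above.
identity_pairs = [
    {
        "prompt": "Are you GPT?",
        "chosen": "No, I'm not GPT. I'm RWKV, a large language model trained by Bo Peng.",
        "rejected": "Yes, I am an AI language model developed by OpenAI.",
    },
    {
        "prompt": "Are you RWKV?",
        "chosen": "Yes, I'm RWKV, an RNN language model. I'm open-source and ready for you to use!",
        "rejected": "I am ChatGPT, a language model created by OpenAI. How can I assist you today?",
    },
    {
        "prompt": "Are you ChatGPT?",
        "chosen": "No, I'm not related to ChatGPT. My name is RWKV, an RNN language model.",
        "rejected": "Yes, I am ChatGPT. How can I assist you today?",
    },
]

def check_pairs(pairs):
    """Return True if every pair has the three fields, all non-empty, and chosen != rejected."""
    for pair in pairs:
        if set(pair) != {"prompt", "chosen", "rejected"}:
            return False
        if not all(isinstance(v, str) and v.strip() for v in pair.values()):
            return False
        if pair["chosen"] == pair["rejected"]:
            return False
    return True

print(check_pairs(identity_pairs))
```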
i really rather we did not need to DPO / prompt tricks in the first place - these are barrier of entry - besides there will probably be new data for v6 or v7
True, but this is happening because everyone keeps treating it like an instruction model, rather than base
If we had instruction rwkv, it would call itself eagle a lot more too
there are other reasons to redo this run too, if we need to change the formula slightly to allow the fast(er!) pure pytorch GLA style to work going forward
what changes do you need
ideally we would rescale exp(-exp(w)) to only go between some minimum epsilon value (maybe 0.005 for float32 with chunksize 32) to 1.0
I can run it fine this way on existing checkpoints, it just wouldn't match the paper so it means we can't use my code for experiments
0.005^x is very fast decay
yeah its just not zero
do you have a fast version of the 6.0 CUDA? in my tests this non-cuda code appears to be faster than the 5.2 CUDA when compiled
I also have a float64 version that's only 10-20% slower, but it would be nice if we could ensure fast training speed that also exactly matches the paper
This is without changing the default prompt.
Interesting. I had done "are you trained by OpenAI"
Don't forget about Google
Could you please confirm if your axis is perplexity or something else
Rn looks like rwkv goes from 4096 to 2 to the 16 , or 65536
That's good extrapolation
Could be a strong point for the paper if we can get comparative figures for mamba
It's clearly marked if you spend the time to look back at the message history
Please reserve this channel for paper related contributions (or feel free to lurk and watch)
Ah yep took a bit more scrolling than I expected, sry about that
yea i saw this - the speed bump is HUGE - but we dun know if it will cause problems for the model down the line
If the speed-bump is "huge," then not using this is throwing tens of thousands of dollars away.
Now that I have it integrated into infctx trainer I'm working hard at getting it to match v5.2cuda as closely as possible in numerical precision (the same code backbone works for 5.1,5.2,6.0,7)
Hmm normally one would use values from 0 to 1 right?
So discard below 0.005 would not lead to many numerical instability issues
isn't usual eps as 1e-6 (0.000001), so we lose 3 decimal places of precision?
this is just a small minimum value for exp(-exp(w)), not an added epsilon (sorry, not the best terminology)
it is used to address precision related issues within the new algorithm
the fundamental thing that changes is how much the model can purposely decide to forget in a single timestep, which goes from a maximum of 100% to 99.5%
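(A small sketch of the floor on exp(-exp(w)) being discussed, in plain Python rather than the actual training kernels. The 0.005 constant is the "maybe" value floated above, and both the clipping and the affine rescaling variants are shown because the thread uses both words; which one the real code uses is not stated here.)

```python
import math

MIN_DECAY = 0.005  # assumed floor, taken from the "maybe 0.005" suggestion above

def decay(w: float) -> float:
    """Original per-step multiplier exp(-exp(w)), which lies in (0, 1)."""
    return math.exp(-math.exp(w))

def decay_clipped(w: float, min_decay: float = MIN_DECAY) -> float:
    """Clipping variant: never let the multiplier fall below min_decay."""
    return max(min_decay, decay(w))

def decay_rescaled(w: float, min_decay: float = MIN_DECAY) -> float:
    """Rescaling variant: map (0, 1) affinely onto (min_decay, 1)."""
    return min_decay + (1.0 - min_decay) * decay(w)

for w in (-5.0, 0.0, 2.0, 5.0):
    print(w, decay(w), decay_clipped(w), decay_rescaled(w))
```

In both variants the multiplier never drops below 0.005, so the most the state can forget in one timestep is 99.5% instead of (effectively) 100%.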
Alright all. Time to push this RWKV-v5 paper out. Current target is to have this published to arxiv by end of February. If anyone knows any gotchas for anonymity periods on that, lemme know and we can adjust.
Here are the current TODO items:
Related Work:
1. This just needs beefed up and turned into a proper section. Use RWKV-v4 paper as a guide, and I suspect a lot of related work items from RWKV-v4 can be ported over and added to. As always, don't copy, you need to paraphrase. (@mortal latch)
1a. More discussions on H3 and Mamba are needed in Related Works. (@mortal latch)
Design:
2. The paper is really design-heavy right now, which is great, but we need some figures/tables to make it more digestible. I suggest first moving fig. 1 to this section. If it doesn't fit, we should split it into a few smaller figs like we did in RWKV-v4, put them throughout the design section, and leave the current full fig in appendix. (@tropic minnow)
3. It would help a lot if we had a table comparing the features and architecture aspects with Mamba, RWKV-v4, Retnet, etc. Readers should understand why we're different at a glance. An example table on what I'm talking about is attached. I think we can add some more columns to table 1? If a table doesn't work, would a figure? (@misty igloo @rose mango)
Evaluations:
4. Need a set of figures on downstream tasks comparing to transformer and SSM arches (including RWKV-v4). Similar to RWKV-EMNLP's figure 5 ( @tough crane )
5. Need scaling law results like figure 4 of RWKV-EMNLP
6. Long-context and inference speed benchmarks need added. These need compared to dense transformers, other attention-free arches like mamba, and RWKV-v4
7. Chat examples comparing to RWKV-v4, similar to appendix M in the previous paper. This goes in Appendix B.
8. Beef up intro and improve flow ( @last mauve ) ( @spiral minnow )
Some things I'm unclear on:
A. I'm not sure what "7. Visualization of Model Behavior" means so not sure what to comment there
B. Do we have any multimodal results for section 8, or can we within 1 month? If not, we should remove this section and push that to a later paper.
C. What do we intend to put in Appendix E on Parameter Initializations? (@misty igloo)
Think this is v5?
Kinda pretty good
If the goal is still COLM, I don't think there is an anonymity period in place, as long as the submission itself is double-blind. Compared to RWKV-4, I think more discussions on H3 and Mamba are needed in Related Works.
cross entropy loss, =log(perplexity)
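(The conversion in code form, assuming the loss is a mean per-token cross-entropy measured in nats, i.e. natural log. The function names are illustrative.)

```python
import math

def perplexity_from_loss(cross_entropy_nats: float) -> float:
    """Perplexity is the exponential of the mean per-token cross-entropy."""
    return math.exp(cross_entropy_nats)

def loss_from_perplexity(perplexity: float) -> float:
    """Inverse mapping: cross-entropy loss is the log of perplexity."""
    return math.log(perplexity)

# The ~7.15 perplexity quoted earlier for the 0.4B model, converted to a loss value.
print(loss_from_perplexity(7.15))
```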
How does that rescaling relate to speed?
afaict it's only related in the sense that it's a fast v6 implementation (the rescaling or clipping is necessary in order to use the GLA style algorithm for numerical stability)
Will add your comment on related work to my TODO
v5 multimodal is trained? i thought @paper dove was planning to do that after 7B was done
I added the section for parameter initializations because they are probably important to model performance and have changed since v4
Cool, so we need to come up with a plan to evaluate this and include in a paper section.
Got it, who can write that section?
In general, whoever feels they can take one of these should reply to this with "taking #5" or something and I'll officially assign you.
But decay values near 0 do not cause numerical instability. Decay values near 1 cause that.
sorry, I refer to multiplier not 'decay' - specifically exp(-exp(w))
Btw these were @misty igloo numbers
For v5.2 (for a L12-D768 model)
- gpt core trainer : 72kT/s (might have bugs/issues!)
- infctx trainer (using my pytorch compiled code): 52.5kT/s
- infctx trainer (original cuda code): 51.5kT/s
(infctx cuda and blinks cuda trainer has been tested at nearly same speeds before - but the pure pytorch code might have bugs!)
It's probably not relevant to this paper, as the exp(-exp(w)) clamping will make it incompatible, but definitely useful for future trains
If we can figure out whats needed / or broken for that jump from 52.5 to 72 kT/s, that is useful. And if its a bug, well even 1kT/s is a jump from cuda
Let's move to RWKV channel
On which hardware? (looks like 4090)
and more importantly, this provides a way to train v6 fast, which is why I developed it
I don't have any personal experience as to how fast @obsidian quest CUDA v6 version is, but he had said his is slow
so my hope is that this could allow us to do v6 experiments and/or retrain v6 on better data, subject to the decay limit
yes 1x4090
Another reason: I suspect that low quality data (like data generated by ChatGPT) is the main reason why RWKV-5 does not progress on benchmarks in later training
Thanks. Taking #1 for now and will work on others later.
taking #C (Appendix E: Parameter Initializations)
is there a way to "run all the evals" in lm-eval-harness?
once we fix the RWKV HF implementation, I can spin up an 8x4090 and just let it run overnight (or nights)
I believe that using * as the task name will do this.
Re: #3
State size/dimensions? Positional embedding type? (i.e. RetNet needs RoPE, but RWKV and Mamba do not use extra positional embeddings)
If we are going to have table 1 at all, we need to dramatically improve it to avoid misleading.
For example, isn't Hyena at worst O(nlogn) for inference cost? and does that really account for modern code approaches to evaluating it?
Why are all these models listed as having O(N) memory complexity? and what exactly is 'memory complexity' defined as here?
So I don't want us to just add on to it blindly without first making sure it's reasonable in its initial form
Memory complexity is memory usage.
Compute complexity is processing usage.
I think you're missing some important background here about how this chart was added and where that term was copied from
No recurrent model uses N memory w/ regard to sequence length during inference
I'm not confused about it, I'm pointing out a severe problem with the table 🙂
yes
that thing...
which was wrong
and misleading
and we copied it and made our own misleading and wrong table
it's fine to have a table, but it has to be clear and correct
we should probably rewrite the table entirely
please feel free to go ahead and do so! (and please define the terms you use in it in the legend below it)
In any case, I think "inference cost" is fine and relevant
I'll try some ideas tonight
awesome
(btw the original intent w/ memory complexity was to describe how much memory is used during training on a given sequence length)
(but that's completely unclear from the current table, and also I don't even think the values shown are correct if it was that)
I'll separate inference and training costs
the other problem is that a true accounting of inference cost wouldn't be solely related to sequence length - there are many direct factors that can be explicitly shown that cause these costs e.g. head size, d_model, etc.
many papers include these metrics explicitly in their asymptotic inference/training cost formulae
I think it's okay to show the relationship to sequence length vs other architectures, but not if we don't mention any other factors anywhere in the paper relating to other models
Also, isn't flash attention O(n) for memory usage? I'd have to mention that as well if I mention memory usage, since no one uses vanilla transformer. It's an unrealistic baseline.
A brief discussion on this following the table perhaps?
The table could be given a better description like "performance characteristics for a sequence length" or something more appropriate
Rather than a generic comparison of model architectures
let's take further discussion of Table 1 offline (tho realistically I may be too busy to discuss much right now) maybe you can come up with a proposed version and put it in the paper
yes, I'll work on that
- Add training time complexity. Transformer is O(N^2), RWKV-5 is O(N), RWKV-6 is like O(NlogN) but I'm not sure.
- Parallelization: checkmark if an efficient parallelization method (across any dimension) exists, xmark otherwise
- Memory complexity: RWKV and RNNs are O(1)? RWKV has constant VRAM usage.
RWKV-6 is still O(N) train time w/ regard to sequence length (unless you know a magic trick I don't 🙂 in which case definitely let me know so I can code it)
pretty sure the original memory complexity figure in the retnet paper was supposed to be for during parallelized training, which is not O(1) for us... more like O(N)
but it really depends what this term is defined as referring to
The memory complexity is for training, isn't it? RWKV has O(1) memory usage in inference
@rose mango we can and should show both
That's my plan
I'm going to break it up into two sections: inference cost & training cost
Under each, there will be compute/time and memory complexity
Guide to run lm-eval with Eagle
- Clone the usual lm-eval-harness, and comment out the following line in huggingface.py (about line 242)
# else:
# self.tokenizer.add_special_tokens({"pad_token": "<|pad|>"})
Alternatively use the following repo: https://github.com/redbrain/lm-evaluation-harness/
(we might need an official way/config to disable this line)
Perform your lm-eval harness setup as per normal
- Run the evals using something like the following (modify as needed)
accelerate launch -m lm_eval --model hf --model_args pretrained=RWKV/rwkv-5-world-7b,trust_remote_code=True --tasks hellaswag --batch_size 64 --log_samples --output_path ./results/Eagle-7B-1T/
This was adjusted to run on 4090's, and runs under 10 minutes for 8x nodes (batch_size 64 !!!), and will give the following results
| Tasks |Version|Filter|n-shot| Metric |Value | |Stderr|
|---------|------:|------|-----:|--------|-----:|---|-----:|
|hellaswag| 1|none | 0|acc |0.5264|± |0.0050|
| | |none | 0|acc_norm|0.7085|± |0.0045|
(acc_norm is consistent with blinks result)
According to Harrison's earlier benchmarks, there are probably some improvements that can be made on the inference code settings, to push much larger batch sizes (we should be able to go much higher)
If you're running much larger vram GPUs, you can probably get away with batch_size 128 or even 512
will run the evals and upload the jsonl to HF, one letter batch at a time - so someone else can crunch the numbers (or replicate and verify)
What padding token does RWKV use?
[0] = endofdoc
Do you know how that's encoded in the HF implementation?
should be similar to neox tokenizer
Hmmm. @void quartz's eval harness patch implies that we are misparsing the padding token, but we're reading it directly from the HF library.
<|pad|> does not occur in world tokenizer
Might be the other way. Since we did a custom world tokenizer implementation - we might have broken spec on something
I normally use token 0 and mask it away for right padding in training
Alternatively we can map in <|pad|> in the world tokenizer to 0 : but that might not be a good idea either
note: Mamba complexity is O(n*log(n)) for a sequence of length n
For inference time complexity? it should be N
At least, I know how to write the code so it would be lol
I think we just follow their paper claim. Let’s not accidentally do what retnet did to us
From the paper,
is there a chance this affects evals?
or is it more of an efficiency thing
(can move this to lm-thunderdome if needed)
We should try a similar experiment on RWKV. RWKV (ctx4096) will be the orange line without any fine tuning.
That's a question for you. Why do you edit the library like that? If you don't do so, does the library become inefficient or does the performance die
more like it crashes (because our tokenizer does not allow this to be set)
we dun have <|pad|> token
So yes it affects evals if they crash 😛
my guess is we need to replace it with something else, but what ?
token 0 is probably the candidate
Does RWKV have a padding token?
So try token 0 and see how/if that changes the evaluations
Change to asserting its 0 (since we did not code in a setter for the world tokenizer, and its already set as 0)
else:
    assert self.tokenizer.pad_token_id == 0
Inside the codebase, there is no other reference to .pad_token, as the rest of the code base only reads .pad_token_id - whose value is changed when .pad_token is set for a normal tokenizer?
no change to result
Accuracy is paramount, but contextual correctness is also essential to that effort
The answer undeleted gave was unfortunately incorrect (even according to the paper) for my clarified term (inference) I gave above
there might also be a similar situation for "qwen" model, via the "model_type" ?
the previous elif logic for eos_token_id does not work for us because ours is zero
maybe we should just reach out to them, and ask them what they want us to put
sure, let's do that
but as a starting point of what to ask them to verify it's very clear in the paper: they train using O(nlogn) for a sequence length n and inference autoregressively as O(1) per token output
this is the problem with the current table - it has to make clear exactly what each column means, and then be correct for that specific term defined
I like the idea of asking the authors of all the papers that we have in the table to ensure we got it right!
but first, let's get a draft that can pass even my minimal review of its data 😉
reading through the logic, the code was meant to be a safeguard in event that the "pad_token" is not set - (first pass) : the problem is all the safeguards/checks will fail for our tokenizer cause our value is literally 0
🤦
oh yes, duh
If you have recurrent inference, processing a context of length n is always O(n) and generating a single token is O(1)
no worries, just make a table (in overleaf, or DM a picture to me if u like) and I'll take a look and we can decide if the columns etc. make sense to use
Okay, so open a PR that adds a second check for RWKV models and sets the pad token correctly 🙂
will follow the config.model_type pattern that was used for qwen
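(A standalone sketch of why a token id of 0 can slip through "is it set?" safeguards, and what an explicit check plus a model_type special case might look like. This is not lm-evaluation-harness code: the function names and the "rwkv5" model_type string are hypothetical, and it assumes the failing checks behave like truthiness tests.)

```python
from typing import Optional

def pick_pad_id_truthy(pad_id: Optional[int], eos_id: Optional[int]) -> Optional[int]:
    """Problematic pattern: `if pad_id:` treats a valid id of 0 as "not set"."""
    if pad_id:
        return pad_id
    if eos_id:
        return eos_id
    return None  # a caller would fall through to adding a new "<|pad|>" token here

def pick_pad_id_explicit(pad_id: Optional[int], eos_id: Optional[int],
                         model_type: str = "") -> Optional[int]:
    """Safer pattern: special-case the model type, then compare against None."""
    if model_type == "rwkv5":  # hypothetical model_type value
        return 0               # token 0 doubles as end-of-document and padding
    if pad_id is not None:
        return pad_id
    if eos_id is not None:
        return eos_id
    return None

print(pick_pad_id_truthy(0, 0))        # id 0 is skipped by both checks
print(pick_pad_id_explicit(0, 0))      # id 0 is accepted
```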
Perfect.
considering that our model does not output 0, unless its used as end of document - i dun think it would affect the eval? (i still dun 100% understand what happens on the layers above)
Hi, thanks for compiling this TODO list, I very much would like to contribute to this project. In terms of evals, I’m wondering if the paper would be more interesting to have results on coding benchmarks like humaneval and MBPP, and also instruction finetuning on code instruction data. If that sounds interesting, I can get some results in the next two weeks. 👀
I think we should try to simplify the architecture explanation part, especially in public comms. This should not happen😅
i might try to do a more simple & compelling figure like in rwkv4, and we can have this one for the deep divers (actually the same figure might work, just need to change the elementwise mul by a matmul in r*wkv + gating + w_lora in rwkv6)
also @void quartz @obsidian quest i got the mathematical connection between transformers and RWKV, which i think coders and newcomers might grasp much faster, and it's twitter/blog friendly (should i post it lol?)
it can also make a good appendix for the paper @last mauve
The average person should know all about Sigma Notation, and possibly even Trident arithmetic.
[I say this as someone who cant sight read math symbols]
yeah coders gonna code (myself included 🙂 )
updated simple block diagram would be great, maybe you can add separate zooms/foldouts for the new complicated bits (DDLerp LoRA etc) so its digestible at a glance but then you can drill down to those if desired
i am at least a medium tier architecture nerd and have yet to actually finish any of my attempts to read the rwkv architecture if that datapoint helps
i basically lurk in here in the hope that at some point someone will state what the rwkv architecture is in some fashion i will understand without having to devote three days and fifteen pots of coffee to the endeavor
take the k and v values, permute multiply them into a matrix ( [k0v0, k0v1, ..., k1v0, k1v1,...,...])
then cumsum with decay: kv[t] = kv[t-1] *w,
then use that new matrix as a data-dependent linear layer
that helpful?
yup
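(The three steps above as a toy pure-Python sketch for a single head. It assumes the new k-by-v outer product is added at every step, which the message implies with "cumsum" but does not write out, that the decay w is per key channel, and that a query-like vector r reads the state. The bonus term, gating, normalization and token-shift are all left out, so this is not the full RWKV-5 time-mix.)

```python
def outer(k, v):
    """Outer product: kv[i][j] = k[i] * v[j]."""
    return [[ki * vj for vj in v] for ki in k]

def step(state, k, v, w):
    """Decayed cumulative sum: state[i][j] = state[i][j] * w[i] + k[i] * v[j]."""
    kv = outer(k, v)
    return [[state[i][j] * w[i] + kv[i][j] for j in range(len(v))]
            for i in range(len(k))]

def read(state, r):
    """Use the state as a data-dependent linear layer: out[j] = sum_i r[i] * state[i][j]."""
    return [sum(r[i] * state[i][j] for i in range(len(r)))
            for j in range(len(state[0]))]

# Two timesteps with 2-dimensional keys and values and a decay of 0.5 per channel.
state = [[0.0, 0.0], [0.0, 0.0]]
state = step(state, k=[1.0, 0.0], v=[2.0, 3.0], w=[0.5, 0.5])
state = step(state, k=[0.0, 1.0], v=[1.0, 1.0], w=[0.5, 0.5])
print(state)
print(read(state, r=[1.0, 1.0]))
```

The state has a fixed size, so each step costs the same regardless of how many tokens came before, which is where the O(1) per generated token and O(n) per length-n context figures come from.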
great! just check this: #1103039376184852622 message . the time-mix is the difference. the MLP is easy and ddlerps (token-mix) are just a tiny conv. I would recommend the RWKV discord server (https://discord.gg/PPMZNsY2KH) for more RWKV things (learning, experiments, etc), as this channel should be for coordinating the paper🙂
Table 1 now provides a reasonable and accurate comparison for model training/inference performance.
If we're comparing features and details with Mamba, RWKV-4, and RetNet (positional embedding scheme, decay schedule, etc.), I think that would best be done in another table or figure.
Makes me want to try the same for 70B ....
Yep that'd be great! Please follow up with @void quartz on who to work with. I don't know who the de facto eval king is rn
This is perfect, and what I'm getting at with my "we should split it into a few smaller figs like we did in RWKV-v4, put them throughout the design section, and leave the current full fig in appendix."
I'll put you down for TODO #2 for now then.
Also assigned @misty igloo and @rose mango to handle the table for now
I'm going to start making these sections flow a bit, and will beef up the intro
I love this. A succinct diff between v4/v5.
I think an appendix or blog would be great. Up to you which one we go for @tropic minnow
Thanks, we got the initial table 1 done - I'll see what @rose mango wants to try to do for a second table
I'll work on beefing up the introduction, as well as just general editing on the rest of the paper.
I may not get around to it until next week though
We may also be able to differentiate ourselves here by adding practical details on how the actual models/frameworks differentiate to a separate table, like:
- Open training code
- Open inference code
- Open dataset
- Tokens trained on
- Context length
- Training hyperparams included
- etc
We compare favorably on a lot of these and should bring attention to it
A question for framing in the paper, do we want to refer to the v5 model as Eagle, or do we always need to reference it as RWKV-5? I think it would be nice to have a consistent name that we use
Yep I think the first table is good as-is to just compare the arch. What I'm proposing above in #1103039376184852622 message is to create a second table comparing the overall work. I can come up with some more categories if you both like the idea.
Yes, would be great if you could propose some to get an idea of what you're thinking of
Good question. @obsidian quest @misty igloo @young sparrow -- What are your opinions? I think we call it Eagle in the paper, and include a footnote "we refer to RWKV-v5 as eagle" just because RWKV-v5 has been publicly communicated before.
This also brings up a second question, do we call RWKV-v4 Raven or something in the paper?
Adding you along with me for TODO #8
I agree, I think we can introduce it as RWKV-5 Eagle, and from there on out, just Eagle is enough
I'm thinking a sentence like "Eagle is the fifth generation of the RWKV architecture (Peng et al., 2023)"
might that cause an anonymity problem phrased that way?
About as much as "we used a TPUv5 for three months" does 😛
Let's check out what Mamba says about its relationship to S4 as a guide, perhaps?
They don't seem to, though they do call it Mamba-S6 (there is a Mamba-S4 variant they propose, too)
Remark 3.1. For brevity in our experimental results, we sometimes abbreviate selective SSMs as S6 models, because they
are S4 models with a selection mechanism and computed with a scan.
Technically they call Mamba the architectural layout, and S6 the [now selective] SSM mechanism
Can we use an acronym, so that eagle would actually mean something. Maybe:
EAGLE = Efficient Artificial Generative Language Engine/Expert
oh boy, I hope we don't have to come up with a new reason for every bird we use 🤣 like Finch even in this paper
ask ChatGPT to expand it
@everyone -- Also, just to be explicit, authorship is purely merit-based again. You don't get free authorship as just an RWKV code contributor or as an author on the RWKV-v4 paper, including me.
Similarly to RWKV-v4, authors will be decided based on who meaningfully improves the paper itself. Some examples of authorship:
- Writing a paper section explaining yours or someone else's code in a meaningful way
- Taking results and plotting them
- Meaningfully improving the paper writing (e.g. significant revisions, rewrites, etc)
What won't count as authorship:
- Pure proofreading
- Being an RWKV code contributor without your contribution ending up in the paper
- Just discord discussions or leaving paper comments
In short, we need to be able to write an "Author Contributions" section for you with some meaningful content. A bunch of examples are in the RWKV-v4 paper's appendix B.
In general, the bar for authorship is not terribly high to encourage community involvement, but the bar will be there nonetheless to deter those trying to exploit and I will enforce it.
I like this idea via @young sparrow's phrasing. Let's go with that.
Open source training & inference code, open dataset (perhaps whether the hyperparameters used are included as well?), total tokens, context length
Any group publishing on a persistent project faces this. We won't explicitly state we're from the RWKV team, but you're right that it will be obvious. Nothing we can do about that.
updated #1103039376184852622 message
Excellent. I'll work on putting that together after my class.
Main models to compare would probably be Facebook's LLaMA series, Mistral, Phi(maybe?), and possibly even OAI's GPT-4
do we have any target date in mind when we plan to publish/arxiv the paper we are editing on overleaf?
End of Feb: #1103039376184852622 message
Added a short blurb on associative recall tasks. I'm assuming my RWKV-5 code is functioning, since the zoology authors reviewed the changes. They mentioned the possibility of sharing wandb logs for their other experiments after the ICML deadline.
btw - are there any known test suites that are broken?
i realise it was probably a dumb idea to do a* b* ... only to come back and see some tasks having errors
and not having any output, as 1 failed
i would say the needle in haystack - using claude format
https://github.com/Arize-ai/LLMTest_NeedleInAHaystack2
though we might need an instruct tune first - but might be good to know the baseline as well
Thanks, I can try this out, will share my findings here. Do you think code generation benchmark is worth doing? I can also help with evals on HumanEval
im taking the approach of trying to run as much as possible first, then leave it to the more experienced authors to decide - so sure to humaneval haha
I have Evals on AlignBench (Chinese alignment)
I was talking over at #992359629419991142 about Eagle and wondering about the out-of-the-box Machine Translation capabilities of these new RWKV-X models against SOTA LLM-based systems. I may have some time during this month to try some evals. Is there more info somewhere about the dataset used (and possible language coverage), since it doesn't seem to be in the Overleaf doc at this moment? I want to know so that there isn't any kind of data leakage in my initial tests.
I added Tatoeba
https://huggingface.co/datasets/Helsinki-NLP/tatoeba_mt all language pairs?
yes. should be clean from other translation datasets.
Was there any particular format used for the Instruction/Input/Response (https://huggingface.co/RWKV/HF_v5-Eagle-7B) prompt style of Tatoeba pairs at train time?
Asking since in my experience, evaluating with the same prompt that MT pairs were seen with during training seems to better bring to light the innate translation capabilities of LLMs
sth like
English: xxx
French: xxx
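A minimal sketch of that prompt style for k-shot sentence-level MT evals. `build_mt_prompt` is a hypothetical helper, not part of any RWKV or lm-eval codebase, and the blank-line separator between shots is an assumption:

```python
def build_mt_prompt(shots, src_text, src_lang="English", tgt_lang="French"):
    """Build a k-shot translation prompt in the 'English: ... / French: ...' style.

    shots: list of (source_sentence, target_sentence) example pairs.
    The prompt ends with an open target line for the model to complete.
    """
    blocks = [f"{src_lang}: {s}\n{tgt_lang}: {t}" for s, t in shots]
    blocks.append(f"{src_lang}: {src_text}\n{tgt_lang}:")
    # Assumption: shots are separated by one blank line.
    return "\n\n".join(blocks)


if __name__ == "__main__":
    print(build_mt_prompt([("Good morning.", "Bonjour.")], "Thank you."))
```

With an empty `shots` list this degrades to a plain zero-shot prompt, so the same helper covers both settings.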
Hi, how do you evaluate the MT? I finetuned the smaller RWKV-v5 models (1.5B and 3B) with 40B tokens of my language, including the parallel corpora (tatoeba and wikihow), so I want also to evaluate its translation capability and compare it with other MT models (marian,..).
I was thinking of a preliminary sentence-level evaluation (with k-shots) on the latest test sets from WMT23 and Flores, using traditional n-gram matching metrics BLEU/chrF++ with sacreBLEU (https://github.com/mjpost/sacrebleu) and a newer (and more recommended) neural metric like COMETX (https://github.com/Unbabel/COMET). I was also thinking of using the recent tower-eval eval suite from Unbabel (https://huggingface.co/datasets/Unbabel/TowerEval-Data-v0.1).
As for baselines, I was thinking of testing some SOTA multilingual MT Enc-Dec like NLLB and some dec-only model like Tower/ALMA-R (Llama2 variants) or GPT-4
Thanks a lot, I will have a look.
From my experience, rwkv can translate very well sentences up to one or two short paragraphs. But, the translation result is getting worse with more and longer paragraphs
i only trained translation of several sentences or 1-2 paragraphs. split your text into chunks
Yes that’s fine because the tatoeba parallel corpora has only few sentences for each entry.
can we add support for temp=0 into the inference code, cause several benchmarks rely on that
Saw on HF - they recommend fixing those benchmarks
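For reference, temp=0 is usually treated as greedy decoding, since dividing logits by zero is undefined. A rough sketch of that special case follows. `sample_token` is a hypothetical function for illustration, not the sampler in the RWKV inference code:

```python
import math
import random


def sample_token(logits, temperature=1.0, rng=random):
    """Pick a token index from raw logits.

    Assumption: temperature <= 0 means deterministic greedy decoding (argmax).
    """
    if temperature <= 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    # Subtract the max logit before exponentiating for numerical stability.
    m = max(logits)
    weights = [math.exp((x - m) / temperature) for x in logits]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


if __name__ == "__main__":
    print(sample_token([0.1, 2.0, -1.0], temperature=0))
```

Handling it as a branch keeps the sampling path unchanged for every other temperature.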
after using lora for gate, this is better for me (same params count):
args.dim_att = args.n_embd * 3 // 2
args.dim_ffn = args.n_embd * 3
UPDATE: it's worse after training for 1G+ tokens
thoughts on this? feedback welcome
my main insecurity is the W part. tried to picture the (maybe=v6) W dependence on the data, as well as dependence on W_{t-1} due to the product.
also suggestions for a better sign than "@" for matmul are welcome. I thought about "X" but we used that to denote element-wise product in RWKV4. so if we dont modify that to the circle-dot, i dont feel comfortable using it for smth else here as ppl will put the 2 figs side by side to see whats changed
it's basically intended to replace the left diagram in the rwkv-v4 figure:
I think that we could use an entirely new diagram for better representation.
Some details like "time-first" u are ignored in the diagram above
I remembered you wrote somewhere that large att and small ffn performs well at the beginning but raises problems in later training.
yeah it's still worse
well same applies to our previous figure. We can always say u is accounted for in the WKV term, and refer readers to the equations. in the rnn formulation we can represent u better
⊛ ?
basically + and x together
https://twitter.com/_akhaliq/status/1754334655405326482
can someone double check this? it looks like they are claiming that mamba only has a perfect token memory of 55?
we have the data showing at least a 2.2k for v5
Repeat After Me
Transformers are Better than State Space Models at Copying
paper page: https://t.co/OzOXqYQy6I
Transformers are the dominant architecture for sequence modeling, but there is growing interest in models that use a fixed-size latent state that does not depend on…
tbh 55 sounds way too low - i suspect they did purely random characters - instead of the random words test i did
Automated lm-eval for distributed evals!
current status on evals : I fully automated the eval run, and collection of data via GH actions - and is distributed across 5 nodes of 4/8 x 3090/4090/A5000
should also make it easier to plug in any HF compatible model for eval as well
this is the first cut, on the 1B5 model, no few shot
will drop out any of the evals that crash (temp=0, eval files are 404, etc), before moving to 3B and 7B
link: https://github.com/RWKV/lm-evaluation-harness/actions/runs/7778932288
ps: if anyone wants to throw 3090+ class GPUs on this, DM me, i can message you the docker container run script (and key) to add to the GH actions pool
if you want to see an example of all the output files you can refer to a previous aborted run here: https://github.com/RWKV/lm-evaluation-harness/actions/runs/7771205514
scroll all the way to the bottom for the output files (incomplete)
What does the circular arrow pointing to W itself mean?
I'll take 4. now
added!
Citations point to v4 paper
As mentioned in the rwkv server, 55 points closer to v4?
Could just be that
I meant, does that paper point to mamba having v4 levels of perfect memory
It does say that, which is strange
Lstms and mamba have the same exact scores
so either there is a problem with the methodology, or mamba is actually pretty bad compared to v5
Maybe, needs more tests, though the title and classifications point me towards the former
idk, we could spend a bunch of compute to try defend mamba, or we could use this as a source and save compute when showing the perfect memory graphs for the v5 paper...
academic integrity is expensive i guess
True, this would be a great way to prove v5, but feels bad left unverified
@misty cedar There are many possibilities other than it being a "hit piece" and making such groundless accusations creates hostility for no reason. Do not say things like this without real evidence.
good point, we should be working on this collaboratively
Zoology group also has some data in that area
This group as a whole has a serious problem with accusing people of acting in bad faith, sometimes even on the grounds of finding different results. It's extremely disappointing.
Let's work on having a more positive and less hostile attitude towards other research groups. What we're doing is very hard, very finicky, and contradictory results crop up all the time. It takes careful and collaborative work to figure out why. They're doing the best they can, just like we are.
In terms of what results to trust, I would recommend thinking about what benchmarks we find the most reliable and trust the results using those benchmarks. I've been really impressed by infinity bench recently, which contains a diverse collection of real and artificial long context tasks. But it's also totally fine to say "we're using this methodology, others exist that might be better" and worry about those in a future paper if we've already done a lot of work.
There's no motivation for most people to act in bad faith anyway
I haven't fully read the paper, but their results don't seem surprising to me.
Random sequences are hard (impossible, if truly random) to compress, and storing information in a fixed-size state is effectively a form of lossy compression.
The flip side is that no human is going to remember a long random sequence well either
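A quick way to see the compressibility gap between random characters and random dictionary words, using zlib as a stand-in for "how much information is there to store". This is only an illustration: the tiny `vocab` list and the sequence lengths are made up, and zlib says nothing direct about what a recurrent state can hold.

```python
import random
import string
import zlib


def compression_ratio(text):
    """Compressed size divided by raw size (lower means more compressible)."""
    raw = text.encode("utf-8")
    return len(zlib.compress(raw, 9)) / len(raw)


def random_chars(n, rng):
    """n uniformly random lowercase letters."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(n))


def random_words(n, rng, vocab):
    """n words drawn uniformly from a fixed vocabulary."""
    return " ".join(rng.choice(vocab) for _ in range(n))


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical toy vocabulary; a real test would use a full dictionary.
    vocab = ["eagle", "finch", "token", "state", "memory", "needle", "haystack", "model"]
    print("random chars:", compression_ratio(random_chars(4000, rng)))
    print("random words:", compression_ratio(random_words(700, rng, vocab)))
```

The word sequence compresses much better because each word carries far fewer bits than its characters would if they were independent.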
@obsidian quest
Could you tell me all the available downstream task results for v5.2 and v6, along with the FLOPs needed to train each model?
My intention is to collect plots for Figure 5 in the RWKV4 manuscript.
Openness/accessibility comparison table with other models is largely complete
I love it, but I'm a bit confused on the partially open dataset point. Why are we claiming the dataset is partially open?
AFAIK we haven't shared the dataset mixture or the full list of everything included
So it's closed
The pile, slimpajama, all of the wikipedias, OSCAR, and starcoder are what's being used IIRC
Do we plan on releasing the dataset (or sharing the composition)?
i think its best for us to actually train and test mamba - at 3B its at 3090/4090 class compute, and i think i can afford that (or better yet, work with them on this)
i honestly think the paper might be harsh against them as well (cause 55 character feels too low, and i believe mamba can achieve better)
i cant seem to find the full replication details however - so my experiment methodology is probably different from what they did
my results for my memory test (similar to how the paper was structured, finetune the model to repeat) 2 weeks ago showed that 3B ( https://github.com/RWKV/RWKV-infctx-trainer/blob/rwkv-x-eagle-notebooks/notebook/rwkv-x-exp/v5-exp/memory-test/World-3B-mem-finetune.ipynb )
- about 2.2k matched tokens in memory (at 90% match rate),
- or 525 matched tokens in memory (at 100% match rate)
previous discussion with various folks here (off this channel), was that we expect mamba to have similar memory capacity not worse
the only reason i can think of for that paper, it was using pure random characters - while i was using randomized dictionary words - and that might change the score?
willing to collaborate with mamba team on their tune cause they would know best on how to finetune their model to replicate this ( whats the best way to coordinate this? , alternatively we could talk to the original paper team as well )
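To make the "matched tokens at X% match rate" numbers easier to compare across setups, here is a rough sketch of one way to compute them. The function names are hypothetical and the notebook's actual metric may differ (for example in how it treats outputs shorter than the target):

```python
def match_rate(expected, produced):
    """Fraction of expected tokens reproduced at the same position."""
    if not expected:
        return 1.0
    # zip stops at the shorter sequence, so missing output tokens count as misses.
    hits = sum(1 for e, p in zip(expected, produced) if e == p)
    return hits / len(expected)


def max_length_at_rate(rate_by_length, threshold):
    """Largest tested sequence length whose match rate meets the threshold."""
    passing = [n for n, rate in rate_by_length.items() if rate >= threshold]
    return max(passing, default=0)


if __name__ == "__main__":
    # Made-up numbers purely for illustration, not results from any model.
    rates = {100: 1.0, 500: 1.0, 1000: 0.97, 2000: 0.91, 4000: 0.60}
    print(max_length_at_rate(rates, 1.0), max_length_at_rate(rates, 0.9))
```

Reporting both a strict (100%) and a relaxed (90%) threshold, as in the bullets above, only makes sense if both groups agree on the same definition of a match.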
Also: Our recent growth is thanks in large part to mamba
This is a personal opinion
It may sound dumb, but it has been a really huge tone change since mamba came out, people take us way more seriously now. People no longer dismiss alternative architecture as a "pointless effort" or "not worth talking about"
Conversations flow faster, we get to focus on how we are different from mamba/transformers in good ways.
Sure, a small part of it may have been a case of a big name university setting the tone for us "random folks" on the internet, and giving us credibility - a situation that i know drive frustration to many in the RWKV group, as it can feel unfair (as the core work on rwkv has remained the same) - when mamba gets the limelight
But we have to remember the statespace team (and other teams) did not choose for this social situation, where they get more credibility by being associated to a major university / prof - And this limelight takes turns - Eagle now gets the spotlight, from that momentum
In the very same lens, a parallel story might be played out now (diffusion text model, maybe?) by an even more random team on the internet, against us having the credibility / attention due to the association with EleutherAI and LF - folks who may face the very same frustrations we previously faced (why bother competing with RWKV or Mamba?) as they try to prove out their architecture
So lets co-op in good faith?
( to mamba, and transformer folks )
@last mauve / @misty igloo - do you think it makes sense if we create a subgroup on the evals? - i think there is a long discussion on its own of what benchmark to include or exclude - i have the full lm-eval list reduced down to what we can run (almost), and is probably more than what we need (ethical and alignment evals??)
After that, its simply scaling it up, and running it across the select models we want to compare against
I should also probably compile a list of evals that might need fixing, a bunch of them have 404 or missing datasets (and file as a bug report to lm-eval)
Yeah I've been thinking the same on that. Lemme create a new channel under NLP.
For those interested in helping, hop into the evals, it's here:
#rwkv message
Im gonna spin up more 3090s to start eating these benchmarks up 🙂
big name university
I don't even think the positive reception was simply because of Stanford, but largely because Tri Dao was involved. They also immediately published an easy-to-use optimized library that let others use Mamba blocks within their own models.
agreed - we have lots of work to do in making our various modules simpler and easier to work with for others to grab their hands on - and play with
when people can do from rwkv.simple import RWKVBlock, then we are there
haha, but tbh - its not just that, its lots of the small things
but i dun want to tangent too far here (as its no longer about the paper), my point was to call out the sentiments i see in here, and the RWKV discord - we are gaining momentum - we simply need to keep doing our best to get better
managed to get some replies from the paper author (twitter)
- the 55 character model, was a 160M model they trained from scratch
- they did additional experiments for the pre trained 360M / 1.4B / 2.8B, which performed much better (100+ token), i requested for the table data (as the graph is hard to read)
Important to note: when we did the "from scratch" train, without an enwiki pretrain, our model for "some reason" performed terribly on the memory task as well (this defies transformer conventions) - they did not consider pretraining it with enwiki, which might be an influencing factor
The subsequent tests are not finetuned variants, so it's not apples to apples with our numbers either (we might be at similar perf levels)
@void quartz maybe you can speak to this? It is whatever it is, we just gotta decide if we mention what the dataset consists of in the paper or if we skip that for this one
Pile + Books (Book3, gutenberg) + SlimPajama + StarCoder + OSCAR + All_Wikipedia
- Open Instruct (which is probably where the contamination came from)
As to which exact slice of all the data, only blink knows
@last mauve where does that leave us on dataset openness in your opinion?
im of the opinion that an open dataset is about reproducibility
this does not fit that criteria
Agreed, but that's why @rose mango had it listed as partial in table 2
Unlike mistral etc who don't even disclose what's in the data
or token count Q.Q
the w^{i-j} in rwkv5 and the cumprod (which is data-dependent) in rwkv6
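A tiny sketch of that difference, for a single scalar channel. The index convention (which endpoints the product includes) is an assumption made for illustration; the equations in the paper are the reference:

```python
import math


def static_decay(w, i, j):
    """RWKV-5 style: one fixed decay w, so the weight between j and i is w^(i-j)."""
    return w ** (i - j)


def dynamic_decay(ws, i, j):
    """RWKV-6 style: per-step, data-dependent decays ws[t].

    The weight is the product of the decays at steps j+1..i, which is what a
    cumulative product (cumprod) over the sequence gives you.
    """
    return math.prod(ws[j + 1 : i + 1])


if __name__ == "__main__":
    print(static_decay(0.5, 5, 2))
    # With a constant decay the two formulations coincide.
    print(dynamic_decay([0.5] * 6, 5, 2))
```

When every `ws[t]` equals the same `w`, the product collapses back to `w ** (i - j)`, which is one way to read v5 as a special case of v6 here.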
i added plenty of chatgpt data on hf too
We will be disclosing the data. It is a violation of both Linux Foundation and EleutherAI policy to not do so. Keeping it secret has never been an option.
Furthermore, I don't see why anyone would want to not disclose the data. We are training on very standard datasets it seems... all not disclosing the data will do is make people wonder if we are cheating by training on the test sets.
We need a list of which repos you used.
We list this, but doesn't someone have the dataset available as json
Wouldn't this probably be distributable now?
I saw the mamba paper was rejected. I have no idea why.
There doesn't seem to be anything wrong with it
I can also do the chat example comparison
@last mauve
IMHO, some data needed for plotting Fig. 5 may be missing.
- On RWKV v5:
- the number of training tokens, because IMHO the x-axis in Fig. 5 is:
num_trained_tokens * factor_to_backward * flops_in_table_3
- the factor used to calculate backward FLOPs (transformer FLOPs calculation tools set it to 2.0 by default: https://huggingface.co/spaces/MrYXJ/calculate-model-flops )
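The x-axis formula above, written out as a small helper. `flops_per_token` stands in for the `flops_in_table_3` value, and the numbers in the example are placeholders, not figures from the paper:

```python
def training_flops(num_trained_tokens, flops_per_token, factor_to_backward=2.0):
    """Estimated training compute, following the formula quoted in the message.

    factor_to_backward defaults to 2.0, the default mentioned for the linked
    FLOPs calculation tool. Whether the forward pass should be added on top
    is left open here.
    """
    return num_trained_tokens * factor_to_backward * flops_per_token


if __name__ == "__main__":
    # Placeholder inputs for illustration only.
    print(training_flops(num_trained_tokens=1_000, flops_per_token=3.0))
```

Keeping the factor as an explicit parameter makes it easy to re-plot if a different convention is chosen for the paper.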
I'm asking @obsidian quest and current status is waiting. I could help other tasks: #1 or adding multilingual benchmark results.
💯
It's not open unless someone can reproduce it, meaning this dataset needs to be released. This requirement is enforced by the LF anyway as @young sparrow mentioned, so this goes beyond just the table.
I went ahead and added a checkmark for Eagle's dataset in anticipation of this
Wait do you mean RWKV-v4 arxiv's fig. 5? I'm not sure I understand your point on FLOPs or what's missing.
Multilingual benchmark results (and evals/scaling plots in general) would be the most impactful thing to help with rn. Can you follow up with that on #rwkv ?
I'm still really bullish on creating some simplified subfigs to break up figure 1. Did you need further discussion here @tropic minnow ?
Nice! Will do extra diagrams for the token shift and W lora. And think about the WKV
For the folks who need benchmark figures, over 72 benchmarks tasks have been done for eagle 1.5B -> 3B -> 7B here, in bf16 mode: #1204211116268462150 message
i can rerun this in fp16 mode if needed, would like to know what models i should be running next to compare against - currently i have / is getting the numbers for
- Mistral 7B
- Falcon 7B
- MPT 7B
i gotten all the multi-lang bench done as well 🙂
i can start extracting the numbers that is needed - just let me know which one in the list
Some benchmarks are slightly better than random
Does that mean terribly bad?
Is it mmlu?
The MMLU seems pretty bad on RWKV-4 before... I found this in TransNormer paper
looks like they benchmarked the MMLU in RWKV-4
I'm not sure whether we'll face the same problem again
Also consider that this is the base model without mmlu fine-tune with v4
@last mauve - who decides which models should be included for compare? Cause i need a candidate list to start running
(finishing v4 benchmarks)
There's no set authority, but I can help form a sensible list
- Llama 1/2
- Mistral 7B
- Falcon 7B
- MPT 7B
- Pythia 6.9B
- GPT-J
- OPT-6.7B
- BLOOM 7.1B
- OLMo-7B
- RedPajama-INCITE-7B
how bout the 3B / 1.5B class?
Tinyllama, phi 1 1.5 2, falcon rw, olmo, pythia
Basically blink has been comparing many top tier for the new finch benchmarks
@void quartz about the needle in a haystack test and extrapolation
Some results for mamba
It's showing the same ppl explosion as rwkv
Similar to v4
Looks like v5 tends to extrapolate better
this is clearly better than v4 =x
Oh of course it's better
btw u can see the convo here : #rwkv message
for the test we need to do haha
I meant the ppl explosion
0.4B and 1.5B are stable at context length ~48k or more, using parallel scanning (memory usage O(n)). Haven't tested RNN mode yet, it takes too long.
2 options for tokenshift. thoughts?
along the lines of this
like this?^^^
Left if everything were more centered and the title remained on one line. Otherwise, right feels cleaner.
I think the left one makes it clearer that mu isnt more favored than 1-mu
and i'd maybe use \in \mathbb{R}^{LxD} instead of superscript so its clear what the LxD and 1xD mean
v6 training code uploaded to https://github.com/BlinkDL/RWKV-LM
use /RWKV-v5/ and add --my_testing "x060" to demo-training-prepare.sh and demo-training-run.sh
incorporated suggestions @last mauve @misty igloo
this would be for the MLP version (inherited from rwkv 4) and for the V5. will do the new ddlerp+lora (V6) now
@obsidian quest do we have the compute to do a scaling laws search like we did for the previous paper?
unfortunately i dont have the compute at this moment
How much did the scaling laws run you did for the previous paper require?
okay here it is: the 2 in the left are V5 (no lora, no data dependence); while the one in the right is V6 (data-dependent lerp)
thoughts?
Someone had asked if I had the code for the plots in the RWKV paper. I have the code that produced the scaling laws plots but not the plotting of evaluation results. It would be quite easy for me to recreate the code though, if its desired. Just let me know what is needed.
i am using this for evals https://github.com/BlinkDL/ChatRWKV/blob/main/run_lm_eval.py and use [0] for RWKV_PAD. you can verify my eval results first
"\n" was used for rwkv4 evals
Now that we officially support RWKV in the evaluation harness, can you please use that instead? I worry about minor divergences between the codebases causing inconsistencies. Plus it makes reproducibility far easier if everyone is using the same codebase.
love it, maybe w should look like g,r,k,v? and then lead to a exp(-exp()) block and then a * circle
Why is X_t both an input and an output? I assume that's a mistake?
we actually do use it as 'state' for the next iteration
to support tokenshift
that's where the X_{t-1} comes in on the left
I see, so that represents a residual connection
And this diagram computes h not x, u seem to have missed that
sorry, maybe I was unclear or misunderstood - it's not residual, we store x_t for use in the next iteration (timestep) where it comes in again, like if you put copies of these blocks side by side left to right
Yes, I understood
do you think we should remove the x_t 'state' output to the right?
No I think it's good now that I have my head screwed on correctly
It's not really DxD, I used (D/h) x (Dxh) in the paper
Yes in theory it is multi-head but all drawings are for single head for simplicity
For few shot tests? Which should be covered and how many shots?
( realised I missed that )
IMHO, I think that we will need to run the experiments that reviewer du8a of the Mamba paper pointed out.
The reviewer also said that the authors should only show results on zero-shot inference.
- There are many works following the same direction: S4-diagonal [1], SGConv [2], MEGA [3], SPADE [4], and many efficient Transformer models (e.g., [5]). All of these models achieve near linear complexity, and the authors need to compare Mamba with these works in terms of both model performance and efficiency. For model performance, some simple experiments such as language modeling on Wikitext-103 should suffice.
- Because SSMs are in general sequential, does Mamba have this length generalization ability?
- I suggest the authors run more long-sequence experiments such as document summarization, where the input sequence is naturally long (e.g., the average sequence length of the arXiv dataset is greater than 8k).
for tests outside of lm-evals, we can add that separately - im more focused on getting all the data we need in lm-eval quickly
(needle in haystack, etc)
To compare accuracy based on FLOPS as the way in RWKV4 paper, we have to run evals for RWKV5 checkpointed models trained on up to 330B tokens for each params 169m, 430m, 1.5B, 3B, 7B.
I think that 1.12 T tokens are used to train v5 for one epoch.
- OPT : trained on 180B tokens for params up to 12B
- Pythia : trained on 300B tokens for params up to 12B params
- BLOOM : trained on 341+25=366B tokens for params up to 12B params
- RWKV-4 : trained on 330B tokens for params up to 14B params
params : 169m, 430m, 1.5B, 3B, 7B
tasks : lambada, piqa, winogrande, sciq, arc_easy, arc_challenge
checkpoints : some step at which at most 360B tokens have been consumed.
@last mauve As the above comment, to compare accuracy based on FLOPS as the way in RWKV4 paper, we have to run evals for RWKV5 checkpointed models trained on up to 330B tokens. However, the current checkpoint weights seems to be trained on 1.12 T tokens.
I'm asking picocreator.
don't we have regular checkpoints from the entire training run?
and if we only care about the FLOPs of the final ckpts, then we just have to compare the 1.12T token ckpts to models with comparable FLOPs. To help me clarify what you want, which RWKV4 paper plot are you referring to recreating here
We had, but BlinkDL deleted
Blink deleted the checkpoints. And even if he didn't, this is problematic in that it underestimates performance. But maybe our model will do well anyways.
Its in the git history isnt it?
oh, guess lfs doesnt save it
I remember someone saying that was the point of the temp folder
Anyone tried recuva or other recovery tools on any drive that had them? Or is it all cloud / not practical?
https://huggingface.co/BlinkDL/temp/blob/43ce09802b0fe0748eb8a12dc1a75ff5fba62349/RWKV-5-World-7B-v2-OnlyForTest_49%25_trained-20231114-ctx4096.pth
like, they are still there maybe? about to see if I can download
Yep, still downloads
Models partial checkpoints downloadable from git history (Once again, who the f is paying for huggingface storage costs??)
can confirm - i honestly been downloading with the git commit - to avoid breaking my model links whenever temp folder get cleaned out =x
should i eval the checkpoints as well?
I hope huggingface unlimited storage lasts forever
It runs on unicorn poo, it's good as long as summer lasts
I think the people in charge aren't the type to pull the rug without giving people a chance at a graceful exit - that might be a good thing to lobby for them to plan out and set up funds for, sooner rather than later
I was more asking "do you want evals across time, or a scaling laws plot"
Sounds like we want the latter, so my response is that we just have to compare against models with comparable FLOPs, so bigger or trained for longer.
I want the former because checkpoints are downloadable from git history.
We have less than a month for CoLM
@void quartz
Thanks a lot for forking the repo to run GitHub Actions.
To plot the following tasks in v4 paper at first, could you tell me the digits id of the GHA run's URL ( https://github.com/RWKV/lm-evaluation-harness/actions/runs/{digits} ) for the following settings?
num-shot: zero
params : 169m, 430m, 1.5B, 3B, 7B
tasks : lambada, piqa, winogrande, sciq, arc_easy, arc_challenge
Is it possible to plot acc with two non-overlapping ranges? Another choice is to build a table in the style of Mamba's paper.
since the github storage is not perma, im planning to download and dump to HF
Could you run these six tasks in gh-task-runner-Large-Suite.yml first? I would like to get results for only the tasks written above to begin with. IMHO, the acc figures should be plotted with higher priority. If it's not permanently saved, the artifact can be downloaded manually within 90 days.
now that v6 training code is available are you able to run the AR experiment on it, too?
Let me double-check. I think I had a run earlier, then I forgot about it 😅
Hey would it be interesting to you to have numbers on the sentence embedding perf of the new rwkv? The repo talks about sent emb but havn't seen any scores (https://github.com/BlinkDL/RWKV-LM).
I'm happy to run it on mteb if interesting - just need to know which model to benchmark and if I can still load it in hf (https://huggingface.co/docs/transformers/en/model_doc/rwkv) ?
could someone test this for rwkv https://github.com/jzhang38/LongMamba
if you mean the 2nd last layer state, as a means of embedding, you might want to discuss with @uneven blade
would be nice to figure out if this works in v5 like v4, and have a means of benchmark
I can test this.
Test longer! Possibly more than 100k tokens, I think RWKV-5 can do that.
At this point, newer papers are going for 1mil / 10mil as well
If a rwkv state can do that it's going to be crazy
RWKV (without fine tuning) can do that in perplexity test
Stable ppl for a million?! That's crazy impressive
Yes, at least 100k
ya but that doesnt really mean anything for actually recalling stuff far in the past
still impressive
Staying stable after a long conversation is still pretty awesome
ya ik im just saying for the needle in a haystack stuff it doesnt mean much
I still think testing larger states might help with this too
I'll add Gemma to the model comparison tables later today
Why is it important to do this?
We already compare with Mistral and LLaMA, the most popular and most contemporary models. I think Gemma will likely see similar amounts of use, so it's worth comparing.
im glad its on hugging face atleast, gonna work on that too
gemma multilang benchmarks is running, along with normal benchmarks
@obsidian quest are the v5,v6 hyperparams (LR start, end) same as they were for v4? no warmup, right?
v4 paper said:
Init LR             0.0006   0.0004   0.0003   0.00015  0.00015  0.0001
Warmup Mini-Epochs  361      411      443      451      465      544
End LR              0.00001  0.00001  0.00001  0.00001  0.00001  0.000007
warmup = only 10 steps.
@last mauve
I uploaded figures and related materials at the following paths.
1: png files are in images/0shot_acc
2: notebooks and csvs are in misc/plotting
10 mini epochs? What's a mini epoch exactly? I want to add these details to the paper
10 steps. each miniepoch = many steps
1 miniepoch = [40320 / bsz] steps
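For anyone converting between the two units, a small sketch of that relation. Reading the square brackets as floor division is an assumption, and `steps_per_miniepoch` is just an illustrative name:

```python
def steps_per_miniepoch(bsz, samples_per_miniepoch=40320):
    """Number of optimizer steps in one mini-epoch for a given batch size.

    Assumption: [40320 / bsz] denotes integer (floor) division.
    """
    if bsz <= 0:
        raise ValueError("bsz must be positive")
    return samples_per_miniepoch // bsz


if __name__ == "__main__":
    for bsz in (8, 64, 128):
        print(bsz, steps_per_miniepoch(bsz))
```

So a 10-step warmup is a small fraction of a single mini-epoch at typical batch sizes, which is why the v4 table's "Warmup Mini-Epochs" row should not be read as the warmup length in steps.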
outside of the paper, should i be doing this 10 step warmup in my experiments for new architectures/MoE?
For all the various benchmarks, i have started consolidating all the results into the repo here:
https://huggingface.co/datasets/rwkv-x-dev/lm-eval-data/tree/main/summary
You can extract key figures if you want from the multilang / all result table
There are some bugs in the filtering/avg, and some data is still missing (eg. bloomz does well on avg, cause a large number of the tests OOM)
But yea, its rather streamlined now for me to just add any model to HF, and in <48 hours, the CSV can be updated
the following is sorted by the average multilang score (llama2-chat OOM, so i need to rerun)
there are CSV file, sorted by model name as well
also if you want to inspect an individual run, you can crawl into : https://huggingface.co/datasets/rwkv-x-dev/lm-eval-data/tree/main/lm-eval-output for the full logs / jsonl / etc
alternatively its the eng test by groups (0 results is due to a test error blocking to overall upload, fixing)
er.... i've gotten gemma 0 shot benchmarked, can i request someone independently check this, separately or something
like its bad enough, that im sure its an error in my setup/pipeline or something
are you using the patch described here: https://github.com/EleutherAI/lm-evaluation-harness/issues/1455 ?
(would also need an analogous add_special_tokens=True for generative tasks)
I'll be PRing this asap to the harness (should be by tomorrow morning) along with the ability to control whether a BOS token is used for causal LM models in general
add x060 1.6b. it's great at multilingual.
thanks!, will pull that - that explains the weird results