steady ether Dec 10, 2023, 4:39 AM

#

Dates here https://icml.cc/Conferences/2024/Dates

obsidian quest Dec 10, 2023, 7:51 PM

#

how's the paper going 🙂

misty igloo Dec 10, 2023, 8:33 PM

#

obsidian quest how's the paper going 🙂

i gotta flesh out and add citations for the background section

#

@gusty condor would you mind if I try some new language for the introduction? feel free to throw it out if you don't like it as much

young sparrow Dec 10, 2023, 8:34 PM

#

It doesn't have one

#

Anon periods are (mostly) unique to *CL venues

misty igloo Dec 10, 2023, 8:36 PM

#

obsidian quest how's the paper going 🙂

please feel free to look over the current draft and give us critiques or suggestions

#

either here or via comments in the overleaf itself

obsidian quest Dec 10, 2023, 8:37 PM

#

misty igloo please feel free to look over the current draft and give us critiques or suggest...

got links?

#

it's a bit unfortunate that we used "RWKV" instead of "RWKV-4" lol

misty igloo Dec 10, 2023, 8:41 PM

#

obsidian quest got links?

#1103039376184852622 message

obsidian quest Dec 10, 2023, 10:45 PM

#

misty igloo https://discord.com/channels/729741769192767510/1103039376184852622/117511298951...

Restricted, sorry you don’t have permission to load this page.

gusty condor Dec 11, 2023, 1:44 AM

#

https://www.overleaf.com/1623283552mkymjtvsnybt#bd0fc2
Keyword: RWKV-5, RWKV-6, RWKV-X, article, paper, link, overleaf
(Please pin this message to avoid searching for keywords)

void quartz Dec 11, 2023, 2:30 AM

#

gusty condor https://www.overleaf.com/1623283552mkymjtvsnybt#bd0fc2 Keyword: RWKV-5, RWKV-6, ...

btw i can help work on memory, and long range dependency benchmarks - since that was an area i was actively testing previously
(if it makes sense to fit it in)

gusty condor Dec 11, 2023, 2:31 AM

#

It's yours

void quartz Dec 11, 2023, 2:32 AM

#

(side: does it make sense to branch the tokenizer to its own paper?, saw that section)

gusty condor Dec 11, 2023, 2:33 AM

#

No, do you have enough information to fit that into an 8-page-long paper?

void quartz Dec 11, 2023, 2:34 AM

#

gusty condor No, do you have enough information to fit that into an 8-page-long paper?

measuring the token efficiency across multiple languages is probably NOT 8 pages 😂

subtle oak Dec 11, 2023, 2:35 AM

#

gusty condor https://www.overleaf.com/1623283552mkymjtvsnybt#bd0fc2 Keyword: RWKV-5, RWKV-6, ...

I can work for the Background 2.1 (if it is okay), I want to write a blog for this topic for a long time😂

void quartz Dec 11, 2023, 2:35 AM

#

there is a side tangent, of seeing if a model performs better with the new tokenizer in another language (and english), compared to baseline - which might add up pages

#

(we are kinda assuming it gives better results, kind of - it is more token efficient for sure)

#

cause being trie based only, flies against the current conventional wisdom of BPE tokenizers

subtle oak Dec 11, 2023, 2:38 AM

#

Maybe can submit a short paper instead of regular size?

#

4 pages instead of 8? If do not have enough info to fit 8 pages?

steady ether Dec 11, 2023, 2:39 AM

#

subtle oak I can work for the Background 2.1 (if it is okay), I want to write a blog for th...

1. Introduction and 2. Background already sort of covered the placeholder for 2.1. Rest is just formatting.

So far, it sounds like we've tried a bunch of stuff, and it worked. Adding some material on the motivation and theory behind things would be great.

subtle oak Dec 11, 2023, 2:42 AM

#

Oh I see, few lines in Sec.2 mentioned these topics, thanks!

subtle oak Dec 11, 2023, 2:43 AM

#

steady ether 1. Introduction and 2. Background already sort of covered the placeholder for 2....

Yeah, there are too many tricks on RWKV and it works well… You mean that maybe we need to think about the motivations and theory for that?

steady ether Dec 11, 2023, 2:44 AM

#

I think we're just missing something like, 'The original RWKV architecture has limitations when it comes to X, Y, and Z, so we decided to try RWKV-5 to address X and Y, and RWKV-6 to address Z.'

gusty condor Dec 11, 2023, 2:44 AM

#

gusty condor https://www.overleaf.com/1623283552mkymjtvsnybt#bd0fc2 Keyword: RWKV-5, RWKV-6, ...

I wonder if we need to pin this link to avoid drowning in the message flow. There is someone who might be willing to help but couldn't find the link.

subtle oak Dec 11, 2023, 2:47 AM

#

steady ether I think we're just missing something like, 'The original RWKV architecture has l...

Oh I see. in some degree like a chronicles of RWKV? How do we evolve from RWKV-4 and why we decided to add these features with this sequence

last mauve Dec 11, 2023, 2:49 AM

#

gusty condor https://www.overleaf.com/1623283552mkymjtvsnybt#bd0fc2 Keyword: RWKV-5, RWKV-6, ...

pinned!

last mauve Dec 11, 2023, 2:50 AM

#

subtle oak Oh I see. in some degree like a chronicles of RWKV? How do we evolve from RWKV-4...

Not chronicles. Come from the standpoint of resolving the shortcomings, and don't go into the history at all. See #1103039376184852622 message

last mauve Dec 11, 2023, 2:50 AM

#

subtle oak I can work for the Background 2.1 (if it is okay), I want to write a blog for th...

Please do!

subtle oak Dec 11, 2023, 2:55 AM

#

last mauve Not chronicles. Come from the standpoint of resolving the shortcomings, and don'...

Got it! Thanks so much. Like in short words to describe “RWKV4 has these shortcomings, and we need to solve them, then describe how we use RWKV 5/6 to solve”?

last mauve Dec 11, 2023, 2:56 AM

#

subtle oak Got it! Thanks so much. Like in short words to describe “RWKV4 has these shortco...

No, that would be the design section. The background section is for getting the audience quickly up to speed on important precursor concepts from before this paper. No new designs or shortcomings of past designs should be included. Read the first RWKV paper for reference.

subtle oak Dec 11, 2023, 3:00 AM

#

Oh sorry I misunderstood. So it will function more like a traditional related work section, for introducing some previous related work while introducing concepts that will be frequently used in the following paper?

last mauve Dec 11, 2023, 3:04 AM

#

subtle oak Oh sorry I misunderstood. So it will function more like a traditional related wo...

No, related work is for comparing/contrasting your current contributions with those of others. Background is for foundational concepts that need to be understood before reading the design. Check the first RWKV paper for this.

subtle oak Dec 11, 2023, 3:08 AM

#

Oh I see there are separate parts in the first paper… I’ve never noticed here before. I think I finally got what we need in this section. thank you so much!

last mauve Dec 11, 2023, 3:26 AM

#

Ok so in comparing the arxiv-v1 and EMNLP versions of the first RWKV paper, I actually think we can just replace the current arxiv with the EMNLP version, and move directly to the RWKV-v5/v6 arch paper.

Edit: Ok, arxiv has been updated. Let's move forward with RWKV-v5/v6

last mauve Dec 11, 2023, 4:51 AM

#

High-level things that need done in the RWKV-X overleaf:

Background:
1. Subsection on RNNs (similar to first paper, but directly copy nothing. Reword at the very least)
2. Subsection on Transformers and AFT (again similar to first paper, but directly copy nothing. Reword at the least)
3. Subsection on RWKV-v4 (summarization of the first paper, with an architecture figure). Can probably retool the current section 3 header in RWKV-X at "RWKV Architecture Summary" for this, along with the start of section 3 in the EMNLP version

Related Work:
Use the first paper's related work in appendix C as a template. Remember that this is anonymous and we can't say this is our arch.
4. Reword and update related work from Appendix C as a base
5. Add any subsequent work (mamba, hyena, RWKV-v4, etc)

Design:
6. The existing subsections 4.x in RWKV-x need more explanation and we need new figures similar to Figures 2/3 from the RWKV-EMNLP

Evaluations:
7. Need a set of figures on downstream tasks comparing to transformer and SSM arches (including RWKV-v4). Similar to RWKV-EMNLP's figure 5
8. Need scaling law results like ~~figure 4 of the mamba paper~~ figure 4 of RWKV-EMNLP (see for context on why we don't want a figure like mamba)

Trained Models:
9. The existing section 5 and Table 1 in RWKV-X is pretty good. Some comments are to:
9a Add a "Name" column like table 2 of RWKV-EMNLP,
9b Clarify that these equations are per-token
9c All of the subscript-5/6 should be updated to subscript-v5/v6 to make it more explicit that these refer to different arches

Several other sections need started, for which the task is "start".

#

I'm going to start by making the high-level structure a bit more clear, and make sections more contributor-friendly with TODO statements and section skeletons

misty igloo Dec 11, 2023, 5:19 AM

#

@last mauve for adding more explanation to subsections 4.x in RWKV-x do you mean that we need description of what's going on and how it works mechanically because the formulae are currently unclear, or some description in that section of the motivation for why these mechanisms were chosen?

gusty condor Dec 11, 2023, 5:41 AM

#

last mauve High-level things that need done in the RWKV-X overleaf: Background: **1.** Sub...

Section 5 was my work

last mauve Dec 11, 2023, 5:44 AM

#

misty igloo @last mauve for adding more explanation to subsections 4.x in RWKV-x d...

Both. For example, why use token shift and what does it mean intuitively? Is a figure possible? As a non-expert in RWKV-v5/v6, the raw formula is confusing without the context about how it fits into the overall model architecture and how it helps.

young sparrow Dec 11, 2023, 5:48 AM

#

last mauve High-level things that need done in the RWKV-X overleaf: Background: **1.** Sub...

8: I actually rather dislike the scaling laws plot in the mamba paper. They do not seem to perform any search for the optimal token-to-parameter ratio and instead assume that it's the same as it is for transformers. In the scaling laws plot I added to the EMNLP version, as well as both Kaplan et al. and Hoffmann et al., instead we search many combinations of (parameters, tokens) and then find the optimal configuration for each FLOP value and fit the curve to that.

The reason this is problematic is that it can disadvantage models that have different optimal tradeoffs. If they were just comparing to the optimal tradeoff identified in our paper or in Hoffmann et al. that would be fine as it would only disadvantage their model, but they also do this for several competitor models. This makes it impossible to know if they're hurting themselves more than they're hurting the competition.

#

That plot is meaningful as an argument that the architecture is better because for a fixed (param, token) pair the architecture outperforms others, but it's not an argument that the optimal scaling is better because it doesn't remark on the optimal scaling regime at all.

#

Put another way, it's effectively the same plot as our "average of 12 benchmarks" plot but using Pile loss instead of 12 NLP benchmarks. It's not a scaling laws plot.
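
A minimal sketch of the search-then-fit procedure described above, for anyone setting up the experiments. Everything here is an assumption for illustration: `toy_loss` and its constants are a made-up loss surface (not results from any paper), FLOPs are approximated as 6 * params * tokens, and the function names are hypothetical.

```python
import math

def toy_loss(params, tokens):
    # Hypothetical loss surface, only here to make the sketch runnable.
    return 1.7 + 400.0 / params ** 0.34 + 410.0 / tokens ** 0.28

def sweep(budgets, param_grid, loss_fn):
    """For every FLOP budget, try every model size and spend the rest on tokens."""
    runs = []
    for flops in budgets:
        for params in param_grid:
            tokens = flops / (6 * params)  # assumes FLOPs ~= 6 * params * tokens
            runs.append((flops, params, tokens, loss_fn(params, tokens)))
    return runs

def optimal_per_budget(runs):
    """Keep the lowest-loss (params, tokens, loss) for each FLOP budget."""
    best = {}
    for flops, params, tokens, loss in runs:
        if flops not in best or loss < best[flops][2]:
            best[flops] = (params, tokens, loss)
    return best

def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**b in log-log space; returns (a, b)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    mean_x = sum(log_x) / len(log_x)
    mean_y = sum(log_y) / len(log_y)
    slope = sum((lx - mean_x) * (ly - mean_y) for lx, ly in zip(log_x, log_y))
    slope /= sum((lx - mean_x) ** 2 for lx in log_x)
    return math.exp(mean_y - slope * mean_x), slope

budgets = [1e17, 1e18, 1e19, 1e20]
param_grid = [10 ** (6 + 5 * i / 199) for i in range(200)]  # 1e6 .. 1e11 params
frontier = optimal_per_budget(sweep(budgets, param_grid, toy_loss))
a, b = fit_power_law(budgets, [frontier[c][0] for c in budgets])
print(f"optimal params ~= {a:.3g} * FLOPs^{b:.3f}")
```

The point of the sketch is only the shape of the procedure: the curve is fit to the per-budget optima, not to a single fixed (param, token) ratio borrowed from another architecture.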

last mauve Dec 11, 2023, 6:08 AM

#

young sparrow Put another way, it's effectively the same plot as our "average of 12 benchmarks...

Gotcha. Updated

obsidian quest Dec 11, 2023, 8:46 AM

#

last mauve Ok so in comparing the arxiv-v1 and EMNLP versions of the first RWKV paper, I ac...

https://arxiv.org/abs/2305.13048 seems not updated yet

arXiv.org

RWKV: Reinventing RNNs for the Transformer Era

Transformers have revolutionized almost all natural language processing (NLP) tasks but suffer from memory and computational complexity that scales quadratically with sequence length. In contrast, recurrent neural networks (RNNs) exhibit linear scaling in memory and computational requirements but struggle to match the same performance as Transfo...

#

Should we name it
RWKV-5 and RWKV-6: xxx

#

CHANGE:
In this work we present RWKV-5, which builds on the architectural improvements and learned decays from RWKV-4, as well as the matrix valued states found in Linear Transformers.
(because it was proposed in Linear Transformers, not RetNet)

#

CHANGE:
Influenced by the Retention Network (RetNet) architecture ==> Influenced by the Linear Transformer architecture

#

GroupNorm = LayerNorm for each head. So no need to say it's GroupNorm.
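
A small sketch of that equivalence for a single token's channel vector, assuming the learnable scale and bias are ignored and the number of groups equals the number of heads. The function names are hypothetical and this is plain Python, not any framework's API.

```python
import math

def layer_norm(values, eps=1e-5):
    """Normalize one vector to zero mean and unit variance (no affine terms)."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def per_head_layer_norm(channels, num_heads, eps=1e-5):
    """Split the channels into heads and layer-normalize each head on its own."""
    head_size = len(channels) // num_heads
    out = []
    for h in range(num_heads):
        out.extend(layer_norm(channels[h * head_size:(h + 1) * head_size], eps))
    return out

def group_norm(channels, num_groups, eps=1e-5):
    """Group normalization over a channel vector, written out independently."""
    group_size = len(channels) // num_groups
    out = [0.0] * len(channels)
    for g in range(num_groups):
        lo, hi = g * group_size, (g + 1) * group_size
        mean = sum(channels[lo:hi]) / group_size
        var = sum((c - mean) ** 2 for c in channels[lo:hi]) / group_size
        for i in range(lo, hi):
            out[i] = (channels[i] - mean) / math.sqrt(var + eps)
    return out

x = [0.5, -1.0, 2.0, 3.5, 4.0, -2.0, 0.0, 1.5]  # 2 heads of 4 channels each
gn = group_norm(x, num_groups=2)
ln = per_head_layer_norm(x, num_heads=2)
print(all(abs(p - q) < 1e-12 for p, q in zip(gn, ln)))
```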

#

Token 257-65529: actually includes lots of languages, not just Asian. and symbols.

Moreover it's a greedy tokenizer. Faster and Easier to code.
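
To illustrate what "greedy" means here, a minimal longest-match trie tokenizer sketch. The tiny vocabulary is hypothetical and is not the actual RWKV vocabulary; `build_trie` and `greedy_encode` are illustrative names.

```python
def build_trie(vocab):
    """vocab maps token bytes -> token id. The None key marks the end of a token."""
    root = {}
    for token, token_id in vocab.items():
        node = root
        for byte in token:
            node = node.setdefault(byte, {})
        node[None] = token_id
    return root

def greedy_encode(trie, data):
    """At each position, emit the longest vocabulary entry that matches."""
    ids = []
    pos = 0
    while pos < len(data):
        node = trie
        match_id, match_end = None, pos
        i = pos
        while i < len(data) and data[i] in node:
            node = node[data[i]]
            i += 1
            if None in node:  # a complete token ends here; remember the longest one
                match_id, match_end = node[None], i
        if match_id is None:
            raise ValueError(f"no token matches at byte {pos}")
        ids.append(match_id)
        pos = match_end
    return ids

# Hypothetical vocabulary for illustration only.
vocab = {b"a": 1, b"b": 2, b"ab": 3, b"abc": 4, b"c": 5}
trie = build_trie(vocab)
print(greedy_encode(trie, b"abcab"))
```

There are no merge rules to replay as in BPE: encoding is a single left-to-right pass over the bytes, which is where the "faster and easier to code" claim comes from.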

#

We can follow this narrative:

* Matrix-valued states were proposed in Linear Transformers.
* RWKV = [exp. decay + token shift + AFT]
* RetNet found [exp. decay + xPos + Linear Transformer] works
* So RWKV 5/6 is doing [exp. decay + token shift + Linear Transformer]. We don't use any extra positional embedding.
* Moreover RWKV models are much better tuned than RetNet. We can show the loss curves.

And we should compare with Mamba, GateLoop, etc.

We can make a table:

decay/gate: real-valued exp. decay, complex-valued, data-dependent etc.
positional embedding
state: RWKV4 = vector state, Mamba/SSM is like "multi-vector" state, and then we have matrix-valued states

tough crane Dec 11, 2023, 10:05 AM

#

obsidian quest We can follow this narrative: * Matrix-valued states were proposed in Linear Tra...

Moreover RWKV models are much better tuned than RetNet. We can show the loss curves.

Does this mean RWKV 5/6 are better at pretraining or at fine-tuning?

obsidian quest Dec 11, 2023, 10:32 AM

#

pretraining loss curve. train from scratch on new data

tough crane Dec 11, 2023, 11:46 AM

#

obsidian quest pretraining loss curve. train from scratch on new data

Do we need to conduct Chinchilla's scaling law experiments for 200M ~ 1B (or more params) ??

obsidian quest Dec 11, 2023, 7:02 PM

#

https://github.com/BlinkDL/nanoRWKV
nanoRWKV "x051a" - does not require custom CUDA kernel to train, so it works for any GPU / CPU.

https://twitter.com/BlinkDL_AI/status/1734254476218057170

python data/shakespeare_char/prepare.py
python train.py config/train_shakespeare_char.py
python sample.py --out_dir=out-shakespeare-char

misty igloo Dec 12, 2023, 1:36 AM

#

obsidian quest CHANGE: In this work we present RWKV-5, which builds on the architectural improv...

I implemented many of these changes, though I think the introduction and implicit 'story' can still use more work.

If you could review the new parts I wrote about token shift in sections 4.1, 4.4, 4.5 that would be greatly appreciated. I tried my best to infer your rationale based on our limited discussion 🙂

obsidian quest Dec 12, 2023, 3:48 AM

#

misty igloo I implemented many of these changes, though I think the introduction and implici...

In Introduction, we can mention "RWKV-5 applies..." before "Retentive Networks..." and we should mention Mamba (dynamic data-dependent decay) after RetNet

#

Extra Silu gate is used in Mamba too

#

We can mention RWKV-5-lite as a variant without custom cuda kernel requirement for training

#

rwkv5 rwkv6 were trained with 0.001 weight decay (only for matrix-valued weights: linear, emb)

#

mamba is utilizing SRAM for similar parallelization

misty igloo Dec 12, 2023, 4:01 AM

#

obsidian quest In Introduction, we can mention "RWKV-5 applies..." before "Retentive Networks.....

I imagine this ordering was intended to both explain the progression of models over time and conclude with our contribution (I didn't originally write this particular section tho) I'm not sure we should mention RetNet in section 1 at all - imho it's better left to section 2 (Background).

#

@obsidian quest did you have any comments about the token shift descriptions? I want to make sure I'm not getting anything wrong about the rationale

obsidian quest Dec 12, 2023, 4:20 AM

#

token shift = induction head & locality a priori, similar to conv1d with kernel sz 2 too

misty igloo Dec 12, 2023, 4:28 AM

#

obsidian quest token shift = induction head & locality a priori, similar to conv1d with kernel ...

what I said in the most recent draft is that token shift makes it possible to form induction heads within a single layer, and that the v6 token shift changes allow important information to flag itself for inclusion in the data stream, while unimportant information can similarly avoid inclusion

obsidian quest Dec 12, 2023, 4:32 AM

#

misty igloo what I said in the most recent draft is that token shift makes it possible to fo...

yeah

#

and we can use this_token + last_token to detect this
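
A minimal sketch of the basic idea, assuming the static form of token shift (a learned per-channel interpolation between the current and previous token) and a zero vector before the first token. The data-dependent v6 variant discussed above is not shown, and `token_shift` and `mu` are illustrative names.

```python
def token_shift(tokens, mu):
    """Mix each token with the previous one, channel by channel.

    tokens: list of T vectors, each with C channels.
    mu:     C mixing weights in [0, 1]; mu[c] = 1 keeps only the current token.
    This is a causal depthwise conv1d with kernel size 2 and
    per-channel kernel [(1 - mu[c]), mu[c]].
    """
    prev = [0.0] * len(mu)  # assumption: zeros stand in for the token before t = 0
    out = []
    for cur in tokens:
        out.append([m * c + (1.0 - m) * p for m, c, p in zip(mu, cur, prev)])
        prev = cur
    return out

tokens = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
mu = [0.5, 1.0]  # channel 0 averages with the previous token, channel 1 ignores it
print(token_shift(tokens, mu))
```

Because every output position sees both this_token and last_token, a single layer can already react to pairs of adjacent tokens, which is the locality prior mentioned above.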

obsidian quest Dec 12, 2023, 7:24 AM

#

#general message
We should emphasize RWKV-2-RNN was the first to show "exponential decay is all you need"

#

can add a section in appendix for the timeline of RWKV

subtle oak Dec 12, 2023, 7:35 AM

#

Is it like a chronicle from RWKV-1 to RWKV-6? Maybe I talked about this before😂

obsidian quest Dec 12, 2023, 7:41 AM

#

from https://arxiv.org/abs/2312.06635

This type of model with matrix-valued hidden states that change over time is also known as “fast weights"

yeah we should make Schmidhuber happy too 😂

arXiv.org

Gated Linear Attention Transformers with Hardware-Efficient Training

Transformers with linear attention allow for efficient parallel training but can simultaneously be formulated as an RNN with 2D (matrix-valued) hidden states, thus enjoying linear (with respect to output length) inference complexity. Recent works such as RetNet (Sun et al., 2023) and TransNormerLLM (Qin et al., 2023a) observe that adding a globa...

gusty condor Dec 13, 2023, 3:59 AM

#

obsidian quest can add a section in appendix for the timeline of RWKV

I was thinking about this idea, but @young sparrow disagreed on putting RWKV history in the paper, because the paper aims to let readers catch up on the current progress of the RWKV architecture, rather than walking down RWKV memory lane.

last mauve Dec 13, 2023, 6:58 PM

#

gusty condor I was thinking about this idea, but @young sparrow disagreed on putting R...

I agree here. No histories in a paper. This would instead be a very nice blog post.

misty igloo Dec 13, 2023, 7:01 PM

#

last mauve I agree here. No histories in a paper. This would instead be a very nice blog po...

I've been somewhat caught in the cross currents here, trying to thread the needle between 'just tell us how it works/what's new' and showing where it comes from. Currently trying to fix up the background to accommodate that while keeping it appropriate for a paper.

last mauve Dec 13, 2023, 7:09 PM

#

misty igloo I've been somewhat caught in the cross currents here, trying to thread the needl...

There's a very clear distinction on what's appropriate, I think. If the secondary info (e.g. intuition from a similar study/paper like "studies on CNNs demonstrate that shallow layers learn general representations while deep layers learn specific representations [cite]") on a design feature is included to help the reader understand how/why the design feature works, that's appropriate. If the secondary info is for any other reason (e.g. to claim ownership or to give interpersonal/organizational history like "we discovered XXX in May 2023 before Mamba"), then it violates double-blind and isn't appropriate for a paper.

Anything flies for a blog post, and I encourage people to post the history and demonstrate ownership there.

#

To be clear though, we're still able to make statements in the Background and Related Work sections such as "RWKV [cite] introduced exponential decay is all you need", but they can't be excessive and they can't violate double-blind

misty igloo Dec 13, 2023, 7:14 PM

#

I think I've avoided adding anything that's inappropriate in terms of anonymity in all the sections I wrote or edited to date (of course feel free to correct me if not)
The push and pull for me is more just about what extra background we include in terms of the (often third party) developments that led up to this combination that we call RWKV5 and 6, since I think Bo has expressed wanting that in the paper.

void quartz Dec 13, 2023, 9:24 PM

#

btw how many people here is at neurips?
(dropping by tmr)

remote elbow Dec 13, 2023, 9:26 PM

#

#1171291697561477170

last mauve Dec 13, 2023, 9:30 PM

#

void quartz btw how many people here is at neurips? (dropping by tmr)

I'll also be arriving tomorrow evening! We should meet.

void quartz Dec 13, 2023, 9:39 PM

#

last mauve I'll also be arriving tomorrow evening! We should meet.

Sure! You going for workshops? Wondering if we should stay for that

last mauve Dec 13, 2023, 9:50 PM

#

void quartz Sure! You going for workshops? Wondering if we should stay for that

Yep I'll be attending and presenting at the neural scaling laws workshop https://sites.google.com/mila.quebec/6thnslw-no/home?authuser=0

young sparrow Dec 13, 2023, 9:51 PM

#

@void quartz I'm here all week, would love to meet you

void quartz Dec 13, 2023, 10:13 PM

#

Great!, see you both from tmr morning then 🙂

last mauve Dec 13, 2023, 10:36 PM

#

To help me in writing this paper, can someone in clear terms either explain or point me to something comparing the Mamba arch with RWKV-v4? Mamba will likely be our primary competing arch and I want to be able to strongly differentiate RWKV from mamba in the background/related sections of the upcoming paper

misty cedar Dec 13, 2023, 11:42 PM

#

I guess with Based releasing benchmarks for multiprocessing using linear transformers, these graphs gain a little more relevancy.

void quartz Dec 14, 2023, 12:01 AM

#

last mauve To help me in writing this paper, can someone in clear terms either explain or p...

@misty igloo - need your help confirming on mamba (as there were changes from statespaces) - it's still log(n) right?

void quartz Dec 14, 2023, 12:17 AM

#

kinda dumb: Can we do a direct counter of the RetNet paper "parallel table", with a clear definition of parallel in the v5 paper

We got rejected for the Oak Ridge compute grant in favor of a new RNN (not out yet, so no idea about the details) that cited the RetNet paper, claimed they had fixed that parallelization problem, and said it is the reason why RWKV could not scale past 14B.

obsidian quest Dec 14, 2023, 12:25 AM

#

that is very mean of them to speak so lol

#

we can easily demonstrate the training speed of rwkv is constant regardless of ctxlen

void quartz Dec 14, 2023, 12:35 AM

#

obsidian quest that is very mean of them to speak so lol

its not "retnet", its another team thats just quoting retnet

obsidian quest Dec 14, 2023, 12:35 AM

#

void quartz its not "retnet", its another team thats just quoting retnet

i know, still it's mean to claim we could not scale past 14B lol

misty igloo Dec 14, 2023, 12:36 AM

#

void quartz @misty igloo - need your help confirming on mamba (as there were chan...

no, it's like rwkv-6 - O(1) per time step, O(N) for sequence length N

void quartz Dec 14, 2023, 12:37 AM

#

misty igloo no, it's like rwkv-6 - O(1) per time step, O(N) for sequence length N

ahhh thanks for clarifying

misty igloo Dec 14, 2023, 12:39 AM

#

void quartz ahhh thanks for clarifying

in case it helps, to quote from some stuff I put in the new paper but may not make it to the final cut:

#

Earlier SSMs were historically computed using long convolutions in $O(N\log N)$ time per sequence, but could also be formulated as recurrence relations. Recent SSMs featuring data-dependent $A$ and $B$ terms (GateLoop, Mamba) are only able to be formulated as recurrence relations. Generally, such recurrence relations can run in $O(N)$ time with respect to sequence length

obsidian quest Dec 14, 2023, 12:40 AM

#

void quartz ahhh thanks for clarifying

all gen6 designs are same

misty igloo Dec 14, 2023, 12:51 AM

#

last mauve To help me in writing this paper, can someone in clear terms either explain or p...

RWKV-4 and Mamba are quite different, but RWKV-6 and Mamba are much more similar

Mamba follows the traditional state space mechanism (more or less) of:
$h = h \, \Delta A + x \, \Delta B$
$y = h C + x D$

where dB expands x into a new dimension and dA is supposedly a diagonalized version of something theoretically complicated
(I say supposedly because their code doesn't quite match their paper and some things are unexplained)
and C reduces the hidden state back to the embedding dimension

RWKV-6 is more like

$kv = (x W_k)^T (x W_v)$
$h = h w + kv$
$y = r (h + kv \cdot u)$

unfortunately, I don't know of a way to clearly show the differences between these
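
One way to make the RWKV-style side concrete is a tiny single-head sketch of the recurrence, in plain Python. Assumptions of the sketch, not statements from the paper draft: r, k, v are already projected from x, the decay w and bonus u act per key channel, and the output reads the state from before the current update (the equations above are informal about ordering). All names are illustrative.

```python
def outer(k, v):
    """Outer product k^T v: the matrix-valued contribution of one token."""
    return [[ki * vj for vj in v] for ki in k]

def rwkv_like_step(state, r, k, v, w, u):
    """One time step of a matrix-state recurrence.

    state: D x D matrix carried between steps.
    r, k, v, w, u: length-D vectors for the current step.
    Returns (y, new_state). Cost does not depend on how many tokens came before.
    """
    dim = len(k)
    kv = outer(k, v)
    # Output: previous state plus the current token's kv, weighted by the bonus u.
    y = [sum(r[i] * (state[i][j] + u[i] * kv[i][j]) for i in range(dim))
         for j in range(dim)]
    # State update: decay the old state per key channel, then add the new kv.
    new_state = [[w[i] * state[i][j] + kv[i][j] for j in range(dim)]
                 for i in range(dim)]
    return y, new_state

state = [[0.0, 0.0], [0.0, 0.0]]
ones, half, zeros = [1.0, 1.0], [0.5, 0.5], [0.0, 0.0]
y1, state = rwkv_like_step(state, r=ones, k=[1.0, 0.0], v=[2.0, 3.0], w=half, u=ones)
y2, state = rwkv_like_step(state, r=ones, k=zeros, v=zeros, w=half, u=ones)
y3, state = rwkv_like_step(state, r=ones, k=zeros, v=zeros, w=half, u=ones)
print(y1, y2, y3)
```

Passing `w` per step is what would let it be data-dependent; a Mamba-style step would instead expand x with a matrix like ΔB rather than forming a per-head outer product, which is the structural difference discussed here.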

misty igloo Dec 14, 2023, 1:00 AM

#

last mauve To help me in writing this paper, can someone in clear terms either explain or p...

There are other differences, as well... Mamba changed the traditional transformer layout from the usual blocks of sequential Attn and FFN to a unified new kind of block that expands 2x like a FFN, then does a short kernelsize 4 1D convolution analogous to but different from rwkv's tokenshift, then does the SSM, gates, and shrinks it 2x back out like the output projection of a FFN

gusty condor Dec 14, 2023, 1:38 AM

#

obsidian quest we can easily demonstrate the training speed of rwkv is constant regardless of c...

Add that into paper 🙂 needs experiments

spiral minnow Dec 14, 2023, 4:15 AM

#

misty igloo RWKV-4 and Mamba are quite different, but RWKV-6 and Mamba are much more similar...

Why can't you just show those equations and say that's the difference?

misty igloo Dec 14, 2023, 5:59 AM

#

spiral minnow Why can't you just show those equations and say that's the difference?

For one thing, his original question was about the difference between rwkv-4 and mamba, which is so different it's somewhat hard to even compare them 🙂 (I showed rwkv-6 above since they're more similar)

#

I guess I'm also just not certain what Quentin's goal is in showing the differences so it's hard for me to know if that suffices 🙂 As seen above, their attention formulae have terms that are quite similar in some places... but there's a lot of nuance too, like because of the way the Mamba incoming projection replaces some of what normally would be the projection from inputs to values, and how multiplying out (k^T)(v) per head is different than just expanding the full input by a smaller new dimension via matrix dB

So despite being similar, the differences are quite complicated.

#

~~And just to add a cherry on top, the Mamba code appears NOT to quite match the paper.~~ And Bo says that the results don't match either!!! (And that the reported results must employ some secret sauce that isn't in the publicly released code)

#

Fun stuff

steady ether Dec 14, 2023, 6:35 AM

#

The authors of Mamba will be giving a community talk at NeurIPS. Those attending the conference can go and ask them questions. 😉

void quartz Dec 14, 2023, 9:10 AM

#

sadly 😦 will miss it - me & harrison - our flight got delayed till 5pm

obsidian quest Dec 14, 2023, 9:15 AM

#

misty igloo There are other differences, as well... Mamba changed the traditional transforme...

that comes from https://arxiv.org/pdf/2202.10447.pdf

obsidian quest Dec 14, 2023, 11:21 AM

#

https://twitter.com/BlinkDL_AI/status/1735258602473197721

obsidian quest Dec 14, 2023, 11:45 AM

#

please update RWKV-4 paper to use "RWKV-4" instead of RWKV 🙂

gusty condor Dec 14, 2023, 11:53 AM

#

I wonder whether title is changeable

RWKV-4: Reinventing RNNs for the Transformer Era

Anyway, my opinion is that, if anonymity matters to us, then that might be a bad idea (it implies that "we" have developed RWKV-1 to 3 and are aiming for 5+), but if we are already famous, then it doesn't matter a bit (like how OpenAI's articles are only posted on OpenAI's website and not anywhere else).

misty igloo Dec 14, 2023, 3:20 PM

#

obsidian quest that comes from https://arxiv.org/pdf/2202.10447.pdf

yes very very similar, except mamba computes their equivalents of q&k from the expanded V rather than directly from the input

#

that's a good way to compare that part in the paper, if we want to!

void quartz Dec 14, 2023, 4:54 PM

#

gusty condor I wonder whether title is changeable ``` RWKV-4: Reinventing RNNs for the Transf...

We could add a section that explains 1-3 is done as an open source research project?

Cause this is reflective of reality

#

(Do paper reviewers expect us to change reality to fit their version schema)

misty igloo Dec 14, 2023, 5:21 PM

#

gusty condor I wonder whether title is changeable ``` RWKV-4: Reinventing RNNs for the Transf...

the double blind anonymity is only relevant during the peer review process, and the original paper already went through that

#

it's just meant to protect the review process so that there is no biased treatment of the paper i.e. for acceptance into a journal

young sparrow Dec 14, 2023, 6:08 PM

#

gusty condor I wonder whether title is changeable ``` RWKV-4: Reinventing RNNs for the Transf...

We can change it on arXiv, but I think this is a very bad idea

young sparrow Dec 14, 2023, 6:08 PM

#

obsidian quest https://twitter.com/BlinkDL_AI/status/1735258602473197721

People do not know about RWKV-6 and typically do not compare with work that doesn't even have a preprint because they don't know if it's "ready" or not. The best way to get people to compare to RWKV-6 is to write a paper about it

young sparrow Dec 14, 2023, 6:11 PM

#

gusty condor I wonder whether title is changeable ``` RWKV-4: Reinventing RNNs for the Transf...

The paper has already been published and anonymity doesn't matter

obsidian quest Dec 14, 2023, 6:15 PM

#

young sparrow People do not know about RWKV-6 and typically do not compare with work that does...

meanwhile we can only try our best to point out they are comparing with rwkv4 😂

young sparrow Dec 14, 2023, 6:15 PM

#

obsidian quest meanwhile we can only try our best to point out they are comparing with rwkv4 😂

That's not what your tweet does. Your tweet accuses them of acting in bad faith.

obsidian quest Dec 14, 2023, 6:16 PM

#

they are using this opportunity

#

they certainly know the existence of rwkv 5/6 and avoid mentioning it

young sparrow Dec 14, 2023, 6:17 PM

#

I don't know that to be true and I think it's immoral to accuse them of that unless you are certain

#

Have they told you that?

obsidian quest Dec 14, 2023, 6:18 PM

#

some of them follow my twitter

young sparrow Dec 14, 2023, 6:18 PM

#

That doesn't mean that they know that the models are finished

obsidian quest Dec 14, 2023, 6:18 PM

#

rwkv5 models were released long ago

young sparrow Dec 14, 2023, 6:19 PM

#

That doesn't mean that they know that the models are finished.

And like I said, it's widely considered problematic to compare with unpublished work. Even if they know about it, they could be waiting for a paper and not trying to sneakily make themselves look good

#

Accusing them of acting in bad faith based on this evidence will only cause people to dislike you and not want to compare to your work

#

I cannot more strongly recommend that you stop doing this

sharp sonnet Dec 14, 2023, 6:21 PM

#

I agree. Unfortunately, people look for published work (or a preprint) to compare.

We have no reason to believe anyone acted in bad faith. Very likely, this happened just because the researchers may not have realized the work is finished.

young sparrow Dec 14, 2023, 6:21 PM

#

Also writing papers takes time. For all you know they finished the experiments a while ago and only just got the paper out

sharp sonnet Dec 14, 2023, 6:22 PM

#

The right steps would be publishing our preprints faster and reaching out to authors if any claim is incorrect so that they correct them (eg the parallel table)

obsidian quest Dec 14, 2023, 6:22 PM

#

It's unfortunate that we don't have as many resources

young sparrow Dec 14, 2023, 6:22 PM

#

Yes it is

obsidian quest Dec 14, 2023, 6:23 PM

#

The table in RetNet is certainly acting in bad faith, so I do think there is some hostility towards us as a potential competitor

young sparrow Dec 14, 2023, 6:24 PM

#

I agree that they're not playing nicely.

sharp sonnet Dec 14, 2023, 6:24 PM

#

I don’t know much about what happened. However, I strongly believe we should just continue doing good science

young sparrow Dec 14, 2023, 6:24 PM

#

But this is still the wrong way to go about addressing this fact

obsidian quest Dec 14, 2023, 6:24 PM

#

sharp sonnet The right steps would be publishing our preprints faster and reaching out to aut...

yeah this is certainly the best method

sharp sonnet Dec 14, 2023, 6:25 PM

#

We can add experiments correcting any of the possibly incorrect claims.

obsidian quest Dec 14, 2023, 6:25 PM

#

In the future I should make a disclaimer that my rants only represent myself and don't represent RWKV views 😂

young sparrow Dec 14, 2023, 6:27 PM

#

Isn't there a RWKV Twitter? Using that to distribute release info would be helpful on both the reputational and the advertisement front

obsidian quest Dec 14, 2023, 6:27 PM

#

young sparrow Isn't there a RWKV Twitter? Using that to distribute release info would be helpf...

yeah can always feel free to use that to criticize me 🙂

steady ether Dec 14, 2023, 6:27 PM

#

How about we release a working paper on RWKV-5? It doesn’t need to be complete.

young sparrow Dec 14, 2023, 6:28 PM

#

obsidian quest yeah can always feel free to use that to criticize me 🙂

I don't think that that would be productive and don't have access to it

obsidian quest Dec 14, 2023, 6:34 PM

#

I am the kind of person who sometimes tends to break rules as long as that doesn't harm others (and i will take / pay for the consequences too, will not avoid them) 😂 most people will hate me

#

It's my fault that we used "RWKV" for the RWKV-4 paper, and haven't published the RWKV-5/6 paper in time. Life is harsh 😂

void quartz Dec 14, 2023, 6:41 PM

#

young sparrow Isn't there a RWKV Twitter? Using that to distribute release info would be helpf...

Let’s limit the official rwkv twitter that we are starting to completed model releases, and do less marketing of future models

void quartz Dec 14, 2023, 6:44 PM

#

steady ether How about we release a working paper on RWKV-5? It doesn’t need to be complete.

From a marketing point of view: if possible I would rather we push that with the 7B model launch

#

And also to not rush everyone working on it in this channel

#

So let’s aim for mid/late Jan?

#

For those who compared to v4 - we can ask them politely if they can add v5 to compare (the 1.5B / 3B models) when appropriate.

If they did so in good faith, they would be open to amend.

If they did it in bad faith, I doubt confronting them will change anything (like retnet)

misty igloo Dec 14, 2023, 6:53 PM

#

obsidian quest It's my fault that we used "RWKV" for the RWKV-4 paper, and haven't published th...

Not your fault - it's important to show the v5 7B results and they take time with limited compute resources.

#

Is the plan to publish with full v5 7B results but a more limited set of v6 results? (1.5B or maybe 3B by preprint release time)

#

@spiral minnow just wanted to note that I removed your addition of quadratic memory complexity for transformers - that has been shown to be unnecessary e.g. flashattention

void quartz Dec 14, 2023, 7:13 PM

#

misty igloo Is the plan to publish with full v5 7B results but a more limited set of v6 resu...

does it make sense to put v6 as future work?
i'm not sure if we would have a 3B model fully ready by then

misty igloo Dec 14, 2023, 7:14 PM

#

void quartz does it make sense to put v6 as future works? im not sure if we would have 3B m...

I think either way works, but if you want people to reference the best possible results for RWKV in future papers we should put it in now
This is exactly the same situation we're facing now with people quoting numbers from v4

#

I added all the formulae and descriptions so that we wouldn't fall behind

#

Just in case we were ready - if not, that's fine and we can delay the v6 paper easily

#

since we have it all written now

void quartz Dec 14, 2023, 7:15 PM

#

i would defer to those who know the academic norms on this then, was worried it'd just be weird that we added v6 without all the models

#

i think another direction if we want to push against this issue

#

is we need to publish blogs

misty igloo Dec 14, 2023, 7:16 PM

#

yeah, just a question of whether a 1.5B model for v6 is enough when we show up to 7B for v5

void quartz Dec 14, 2023, 7:17 PM

#

so it doesn't have the same rigor requirements as a paper, and is at least official enough

misty igloo Dec 14, 2023, 7:17 PM

#

alternatively, we could publish a working paper for v6 - but it's probably less work for it to remain integrated into the current paper

void quartz Dec 14, 2023, 7:17 PM

#

u know what - setting up an RWKV blog has been so long on my todo - just gonna set it up via substack

#

( classic coder conflict of wanting to do it better, but not having the time )

misty igloo Dec 14, 2023, 7:18 PM

#

void quartz so it doesn't have the same rigor requirements as a paper, and is at least off...

if you want upcoming papers to quote the rwkv-6 results, it has to be a paper and not just a blog post

#

at least then they will have to show the v6 1.5B results when comparing to their 1.5B results

void quartz Dec 14, 2023, 7:19 PM

#

i see what you mean there

#

okok that sounds good (didn't consider that part)

misty igloo Dec 14, 2023, 7:20 PM

#

we should also release a 125m model btw so people have a reference point

#

and any standard sizes people tend to use in between

#

since we can train those quickly

#

and it will help ensure that upcoming papers quote our best results

#

especially when they don't train larger versions, it's useful to have our small one shown to compare side by side

#

I'm personally in favor of keeping the two papers integrated as they are now, simply because it's less effort than making a whole new one. But I'm open to a separate rwkv-6 working paper or somesuch if our advisors think that's best!

void quartz Dec 14, 2023, 7:24 PM

#

misty igloo especially when they don't train larger versions, it's useful to have our small ...

should we train multiple v5 small models (125m??), with the combination of

pile, world v2 partial (same token count), world v2 full
gpt-neox, world tokenizer

#

so we can show the transition

#

if the result is close enough for the partial, it can close off a possible criticism that it's not a fair comparison given the different dataset/tokenizer

misty igloo Dec 14, 2023, 7:26 PM

#

void quartz should we train multiple v5 small models (125m??), with the combination of - pil...

not sure what the training mix should be - pile is of course the most standard - but having some smaller v6 available would really improve uptake of our best models in other upcoming papers comparisons

void quartz Dec 14, 2023, 7:26 PM

#

not sure if this is useful, or a waste of resource (which is already limited)

#

the idea is just all 3 x 2 variants
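For reference, the 3 x 2 grid described above (three training sets crossed with two tokenizers) can be enumerated as below. This is only a sketch of the run list; the labels are copied from the message above, and the `runs` structure is a hypothetical layout, not an existing config format.

```python
from itertools import product

# Labels taken from the chat message; the dict layout is illustrative only.
datasets = ["pile", "world v2 partial (same token count)", "world v2 full"]
tokenizers = ["gpt-neox", "world"]

# One entry per (dataset, tokenizer) combination: 3 x 2 = 6 small training runs.
runs = [{"dataset": d, "tokenizer": t} for d, t in product(datasets, tokenizers)]

for run in runs:
    print(run)
```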

misty igloo Dec 14, 2023, 7:27 PM

#

it'd be great, but I'm not trying to make more work or strain our resources... just any single 125m v6 model would probably help a lot

#

because it will force people to show it in comparisons when they only have their own small models to compare to

#

(this only helps if we publish a v6 paper tho)

obsidian quest Dec 14, 2023, 8:47 PM

#

void quartz should we train multiple v5 small models (125m??), with the combination of - pil...

can do one on pile, and one on slimpajama, when we have compute

gusty condor Dec 15, 2023, 3:25 AM

#

void quartz should we train multiple v5 small models (125m??), with the combination of - pil...

Yes, but the dataset is currently at BlinkDL, and tokenizer size counts toward the parameter count (L12 D768 is 193M, compared to 169M for Pile)
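A rough check of where the 169M vs 193M gap could come from, as a sketch under assumptions that are not stated in the chat: a vocabulary of 50277 for the GPT-NeoX tokenizer used with Pile, 65536 for the World tokenizer, and an untied input embedding plus output head (so each vocab entry costs 2 x D parameters).

```python
# Assumed vocabulary sizes (not given in the chat).
PILE_VOCAB = 50277    # GPT-NeoX tokenizer, assumed
WORLD_VOCAB = 65536   # World tokenizer, assumed
D_MODEL = 768         # "D768" from the message above


def vocab_params(vocab_size: int, d_model: int) -> int:
    """Parameters in the input embedding plus an untied output head (assumption)."""
    return 2 * vocab_size * d_model


extra = vocab_params(WORLD_VOCAB, D_MODEL) - vocab_params(PILE_VOCAB, D_MODEL)
print(f"extra parameters from the larger vocab: {extra / 1e6:.1f}M")
```

Under these assumptions the larger vocabulary alone adds roughly 23M parameters, which is about the size of the gap quoted above, while the L12 D768 backbone is unchanged.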

void quartz Dec 15, 2023, 9:40 PM

#

IMO - i think the world tokenizer needs a separate paper

been speaking to multiple researchers who are doing research specifically on language models for their national languages (and faced tokenization issues) and are working on their own regional tokenizers

and there is lots of interest in how and why we did the world tokenizer without BPE, and what its compression ratio would be for their respective languages

#

If proven out as things progress, the "trie tokenizer" approach can end up replacing BPE - if that makes sense - and this is completely separate from the architecture

young sparrow Dec 15, 2023, 10:21 PM

#

void quartz IMO - i think the world tokenizer needs a separate paper been speaking to multi...

This makes sense to me. I think for this paper we should spend a short subsection, maybe a couple paragraphs, describing it and then we can go into more detail in the other paper

obsidian quest Dec 15, 2023, 11:31 PM

#

void quartz IMO - i think the world tokenizer needs a separate paper been speaking to multi...

it's simply a greedy tokenizer. extremely simple to implement (trie is only for optimizations). yeah we can write a paper on this
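To make "greedy tokenizer, trie only for optimization" concrete, here is a minimal sketch of greedy longest-match encoding over bytes. It is not the World tokenizer itself: the vocabulary is a toy, and all names (`TrieNode`, `build_trie`, `greedy_encode`) are hypothetical. It also does not handle the utf-8 or word-boundary behaviour of the real vocabulary.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # byte value -> TrieNode
        self.token_id = None  # set when a vocab entry ends at this node


def build_trie(vocab):
    """vocab: dict mapping bytes -> token id (toy vocabulary for illustration)."""
    root = TrieNode()
    for piece, token_id in vocab.items():
        node = root
        for b in piece:
            node = node.children.setdefault(b, TrieNode())
        node.token_id = token_id
    return root


def greedy_encode(data: bytes, root) -> list:
    """At each position, emit the longest vocab entry that matches, then advance."""
    ids, pos = [], 0
    while pos < len(data):
        node, best_id, best_end = root, None, pos
        i = pos
        # Walk the trie as far as the input allows, remembering the last full match.
        while i < len(data) and data[i] in node.children:
            node = node.children[data[i]]
            i += 1
            if node.token_id is not None:
                best_id, best_end = node.token_id, i
        if best_id is None:
            raise ValueError(f"no vocab entry matches at byte {pos}")
        ids.append(best_id)
        pos = best_end
    return ids


toy_vocab = {b"a": 0, b"b": 1, b"c": 2, b"ab": 3, b"abc": 4, b"bc": 5}
toy_trie = build_trie(toy_vocab)
print(greedy_encode(b"abcab", toy_trie))
```

The trie only speeds up the longest-match lookup; a plain loop over candidate lengths would produce the same token ids.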

void quartz Dec 16, 2023, 2:24 AM

#

Questions like - does it hurt evals - or learning rate was up in the air : which I could not answer accurately 😬

Intuitively the rwkv world model says it’s ok. But that’s a gut feel not a tested hypothesis

#

Using greedy tokenizers is very counterintuitive given how established BPE is

So same situation of RNN 2 years ago haha

obsidian quest Dec 16, 2023, 9:46 AM

#

void quartz Questions like - does it hurt evals - or learning rate was up in the air : which...

already proven in rwkv world models. no need to change anything. similar results.

#

because my world tokenizer respects utf-8 boundary & word boundary. this is very important

#

otherwise you can have bad tokenization (such as "aliasing")

last mauve Dec 17, 2023, 3:16 AM

#

obsidian quest https://arxiv.org/abs/2305.13048 seems not updated yet

Arxiv submissions are always delayed. Try again.

young sparrow Dec 17, 2023, 5:43 AM

#

I posted a Twitter thread about the paper update https://x.com/aieleuther/status/1736260370426114466?s=46

tough crane Dec 18, 2023, 5:34 AM

#

void quartz kinda dumb: Can we do a direct counter of the RetNet paper "parallel table", wit...

Could we retry this grant application for v5/6 ?

void quartz Dec 18, 2023, 6:06 AM

#

tough crane Could we retry this grant application for v5/6 ?

it was for v5 - and the dumbest thing was i already sent the direct github link - where the author of retnet acknowledges the definition is not about GPU parallelization

young sparrow Dec 18, 2023, 7:05 AM

#

void quartz kinda dumb: Can we do a direct counter of the RetNet paper "parallel table", wit...

Wait this is absurd levels of BS

#

Did they tell you explicitly that this is why you were rejected?

void quartz Dec 18, 2023, 7:19 AM

#

young sparrow Did they tell you explicitly that this is why you were rejected?

no, but our rep is lodging a complaint on that

#

that the preferred RNN candidate, uses the retnet claims, as justification to support them over us
(there is no paper, no materials, etc for the other group)

young sparrow Dec 18, 2023, 7:24 AM

#

Does our application contain evidence to the contrary?

void quartz Dec 18, 2023, 7:24 AM

#

no - we had no idea we would have to fight that claim

#

we have provided multi-node training data - but our largest is 8 nodes?

young sparrow Dec 18, 2023, 7:25 AM

#

Not having evidence that your model scales efficiently is typically a decent reason to reject

#

You don't need to run it for long, but you absolutely need to show the ability to leverage large scale resources effectively

void quartz Dec 18, 2023, 7:26 AM

#

i see, might be why our rep is trying to settle for a smaller grant amount - to prove out leveraging large scale resources specifically

#

cause it is a chicken and egg - we cant prove we can run on 1000 nodes, till we get limited access at least

#

they did ask as a follow up (before rejection) - have we run on 1000 nodes, do we think it will work

no, we never had such access to run at such scale
yes, as we are built on pytorch lightning for multi-node training, which has been shown to scale past a 1000 nodes for deepspeed on transformer architecture. RWKV leverages pytorch lightning and deepspeed in the same way.

they did run with us across 100 nodes (for 1 hour?), as part of the validation, but we have no proof of going beyond a 100

tough crane Dec 18, 2023, 7:55 AM

#

void quartz i see, might be why our rep is trying to settle for a smaller grant amount - to ...

Could we run a many-node benchmark (> 1000 nodes) on some other, weaker GPU infrastructure for about 0.5 hours?

steady ether Dec 18, 2023, 8:00 AM

#

void quartz it was for v5 - and the dumbest thing was i already sent the direct github link ...

This has been updated in the V5 paper. It should be pretty clear going forward.

void quartz Dec 18, 2023, 8:05 AM

#

tough crane Could we run some benchmark of many-nodes (> 1000 nodes) for another weak GPU in...

I have no idea where to find anything at that scale now. Even paid AWS / gcp / azure sets really low limits for new accounts

steady ether Dec 18, 2023, 8:14 AM

#

void quartz I have no idea to find anything at that scale now. Even paid AWS / gcp / azure s...

What about just renting something like this:

https://github.com/oracle-devrel/picluster

void quartz Dec 18, 2023, 8:15 AM

#

hmmm would multi-node CPU count?

#

i just presumed we need to at least include a GPU

steady ether Dec 18, 2023, 8:16 AM

#

They still have GPUs, just really weak ones.

obsidian quest Dec 18, 2023, 8:59 AM

#

steady ether This has been updated in the V5 paper. It should be pretty clear going forward.

RWKV 2/3/4/5/6 all have similar complexity

tough crane Dec 18, 2023, 10:09 AM

#

void quartz I have no idea to find anything at that scale now. Even paid AWS / gcp / azure s...

Is it possible to apply for a smaller number of GPUs, in a step-by-step manner?

e.g. Running benchmark from 8 nodes to 256 nodes and estimating the 1024 performance

#

plotting y-(performance, training time elapsed) and x-(nodes=8, 16, 32, ..., 256)
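A sketch of the extrapolation idea above, assuming throughput follows a power law in node count (tokens/sec = a * nodes^b). The function names are hypothetical and the numbers are synthetic placeholders, NOT measurements; real runs at 8 to 256 nodes would replace them.

```python
import math


def fit_power_law(nodes, throughput):
    """Least-squares fit of log(throughput) = log(a) + b * log(nodes)."""
    xs = [math.log(n) for n in nodes]
    ys = [math.log(t) for t in throughput]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b


def predict(a, b, nodes):
    """Throughput predicted by the fitted power law at a given node count."""
    return a * nodes ** b


# Placeholder data only: generated from tokens/sec = 1000 * nodes^0.9 to
# demonstrate the fit. These are not benchmark results.
node_counts = [8, 16, 32, 64, 128, 256]
synthetic_tps = [1000.0 * n ** 0.9 for n in node_counts]

a, b = fit_power_law(node_counts, synthetic_tps)
print(f"fitted exponent b = {b:.3f}")
print(f"extrapolated tokens/sec at 1024 nodes = {predict(a, b, 1024):.0f}")
```

An exponent b close to 1 would indicate near-linear scaling. An extrapolation like this is an estimate, not evidence that a 1024-node run works, since communication issues can appear only at larger scale.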

young sparrow Dec 18, 2023, 2:03 PM

#

@void quartz I know some people. Let me see about pulling some strings. Was your application for Frontier?

last mauve Dec 18, 2023, 3:24 PM

#

void quartz they did ask as a follow up (before rejection) - have we ran on 1000 nodes, do w...

I can run RWKV at whatever scale you need. I never knew we were limited to 8.

last mauve Dec 18, 2023, 3:27 PM

#

void quartz they did ask as a follow up (before rejection) - have we ran on 1000 nodes, do w...

I didn't know they had that followup. This could have easily been resolved. Am I not on this email chain?

void quartz Dec 18, 2023, 4:38 PM

#

young sparrow @void quartz I know some people. Let me see about pulling some strings....

Summit Plus / Frontier
The request now looks to be moved towards "Director request" for ~30,000 node hours - which can be used to prove out the scaling (and maybe do something useful with it)

misty igloo Dec 18, 2023, 4:39 PM

#

steady ether This has been updated in the V5 paper. It should be pretty clear going forward.

Why are we saying transformer has memory complexity of N^2? That's been shown to be avoidable e.g. FlashAttention
I'm not sure that saying SSMs have memory complexity of NlogN is really correct, either
And what is the N in memory complexity? Many parts of this table don't seem right to me
Also, saying that RNNs can't do multi-gpu training is very questionable... since rwkv is an RNN
Maybe you mean a specific RNN architecture like LSTM?
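For context on the N^2 memory point, here is a minimal sketch of the online-softmax idea that FlashAttention builds on: the output for one query can be accumulated key by key with a running max and running sum, so the full N x N score matrix never has to be stored. This is plain Python for illustration, not the FlashAttention kernel, and the function names are hypothetical. Compute is still quadratic over all queries; only the extra memory changes.

```python
import math


def naive_attention_row(q, keys, values):
    """Materialises all N scores for one query before normalising."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def streaming_attention_row(q, keys, values):
    """Visits keys one at a time, keeping only a running max, sum and output."""
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for k, v in zip(keys, values):
        score = sum(qi * ki for qi, ki in zip(q, k))
        new_max = max(running_max, score)
        scale = math.exp(running_max - new_max)  # rescales everything accumulated so far
        weight = math.exp(score - new_max)
        running_sum = running_sum * scale + weight
        acc = [a_d * scale + weight * v_d for a_d, v_d in zip(acc, v)]
        running_max = new_max
    return [a_d / running_sum for a_d in acc]


query = [1.0, 0.5]
keys = [[0.1, 0.2], [1.0, -1.0], [0.3, 0.7], [2.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 3.0]]
print(naive_attention_row(query, keys, values))
print(streaming_attention_row(query, keys, values))
```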

void quartz Dec 18, 2023, 4:40 PM

#

last mauve I didn't know they had that followup. This could have easily been resolved. Am I...

That would be my fault - didn't understand the implications when they asked for previous run history, and treated it as simply reporting what has been done

void quartz Dec 18, 2023, 4:41 PM

#

last mauve I can run RWKV at whatever scale you need. I never knew we were limited to 8.

is it nvidia based? or AMD based? - i think it would be great if we can include short runs for a large model, across key scale sizes, and show their peak tokens/sec - and show a slight loss reduction

#

that can help disprove the "cannot train at scale" claim and put it to rest

#

(ps: we had issues with the frontier AMD node scaling past 100, from what looks like node-to-node communication issues)

#

you folks probably know better at a 1000 node scale, architecturally speaking since it's just DDP training runs, and all of that is deepspeed - am i right in understanding this is handled by deepspeed / pytorch lightning ?

void quartz Dec 18, 2023, 4:49 PM

#

void quartz Summit Plus / Frontier The request now looks to be moved towards "Director reque...

Clarification: we officially applied for summitPlus, but (not sure of the reasons) it looks like they wanted to test scaling on frontier and were encouraging projects to go in that direction, and we went along with it?

https://www.olcf.ornl.gov/summit-plus/

obsidian quest Dec 18, 2023, 4:50 PM

#

heard megatron is much better at scaling

void quartz Dec 18, 2023, 4:56 PM

#

obsidian quest heard megatron is much better at scaling

looks like it's time to set up a new trainer - again 😂

obsidian quest Dec 18, 2023, 4:57 PM

#

ok got this PM "wait before using megatron, we will release soon a nanotron" 😂

young sparrow Dec 18, 2023, 4:59 PM

#

You'd need to write a bunch of custom code to use Megatron, since it was designed for transformers

#

If you're going to put that work in, I highly recommend using GPT-NeoX which is a similar library to Megatron with DeepSpeed support and other custom features.

#

(Or, "I highly recommend chatting with Quentin about whether it would be a good idea to add...")

void quartz Dec 18, 2023, 5:08 PM

#

the GPT-NeoX codebase is significantly easier to understand than Megatron itself

young sparrow Dec 18, 2023, 5:10 PM

#

Quentin works very hard to make it so 🙂

steady ether Dec 18, 2023, 5:11 PM

#

misty igloo Why are we saying transformer has memory complexity of N^2? That's been shown to...

This was meant to provide clarification on a frequently referenced table.

Screenshot_2023-12-18_at_12.07.41_PM.png

#

Ok, I've changed to "Vanilla Transformer" and "LSTMs".

misty igloo Dec 18, 2023, 5:17 PM

#

steady ether This was meant to provide clarification on a frequently referenced table.

Ah gotcha. Didn't remember where I had seen that table before 🙂
I'll take a look back at the retnet paper, but I think placed here it's missing some context that's important. Also, saying SSM is not the same as saying H3/S4/Hyena, since Mamba is an SSM (and also probably shows that those two can now be implemented in what would be called O(N))
I'm a bit worried that copying RetNet's table may not be a great path for us.

#

I mean, I'd go as far as to say that their data in that table is extremely misleading. We don't want to do the same thing!

steady ether Dec 18, 2023, 5:21 PM

#

Yeah, that's a good point!

misty igloo Dec 18, 2023, 5:21 PM

#

This whole idea of long-sequence memory complexity that they claim is kind of a red herring. 😭

#

Maybe we can find an alternative way to point out the differences that show RWKV's benefits

#

And just to be clear, RWKV and Mamba are very similar in all these kinds of metrics. We shouldn't avoid that fact

steady ether Dec 18, 2023, 5:23 PM

#

By the way, looks like the S5 paper also has a somewhat similar table

Screenshot_2023-12-18_at_12.22.39_PM.png

misty igloo Dec 18, 2023, 5:24 PM

#

that table presents a much fairer comparison imho

#

but 'parallel' yes/no for RNNs is still pretty misleading

#

actually I think this table is wrong too haha

#

the inference column is somewhat misleading

steady ether Dec 18, 2023, 5:27 PM

#

Ah, they sort of clarified earlier:

```... while also being parallelisable across the sequence dimension during training.```

tough crane Dec 18, 2023, 5:29 PM

#

steady ether Ah, they sort of clarified earlier: ```S4 models are far more performant while...

MS's survey asserts a similar definition of parallelization... such that training at time step T is possible before finishing the training for past time steps (<T)

misty igloo Dec 18, 2023, 5:31 PM

#

Rather than copy someone else's table, let's come up with a plan for what we're trying to show in comparison and figure out how to best represent that

#

and we can make it fair, unlike retnet paper

tough crane Dec 18, 2023, 5:33 PM

#

Of course, Transformer's quadratic attention is NOT parallelizable under the MS survey's definition, because of the fully connected matrix multiplication along the time axis 🤣

young sparrow Dec 18, 2023, 5:41 PM

#

tough crane Ofcourse, Transformer's quadratic attentions is NOT parallelizable in MS's surv...

For decoder-only models it is, if I understand you right

#

We do all of these simultaneously

#

Isn't this exactly what the "unrolling" at train-time for RWKV is for?

tough crane Dec 18, 2023, 6:19 PM

#

Isn't this exactly what the "unrolling" at train-time for RWKV is for?

I personally think it's exactly possible if we have a batch with 9 sequences in parallel.

misty igloo Dec 18, 2023, 6:21 PM

#

rwkv5.1 does this sort of matrix multiplication, but rwkv5.2 and rwkv6 CUDA kernels don't bother to parallelize across time because it's highly effective to keep everything in gpu SRAM for a huge constant time speedup and obtain excellent parallelization over the non-time dimensions

#

mamba claims to use parallel associative scan to parallelize over time as well, but I haven't evaluated it to see if they actually do that in their CUDA code (their code often mismatches their paper in other ways so I'm a bit skeptical)

#

and to be clear, the current draft skips 5.1 and only describes 5.2 and 6

tough crane Dec 18, 2023, 6:26 PM

#

misty igloo mamba claims to use parallel associative scan to parallelize over time as well, ...

I also suppose that the MS survey authors mean parallelizing over time with NO batched subsequences ( 9 seqs in Stella's image )

#

If my assumption is wrong, then I'm not sure about the attached table's definition

misty igloo Dec 18, 2023, 6:28 PM

#

I don't really know what MS's survey idea is, or if it's at all reasonable, but I think we just need to try to be fair and descriptive

#

as i recall they already agreed to get rid of the training parallelization column in the next revision, according to @steady ether

tough crane Dec 18, 2023, 6:30 PM

#

misty igloo I don't really know what MS's survey idea is, or if it's at all reasonable, but ...

https://arxiv.org/pdf/2312.00678.pdf is the survey. This preprint has the same table as the retnet preprint.

misty igloo Dec 18, 2023, 6:31 PM

#

tough crane https://arxiv.org/pdf/2312.00678.pdf is the survey. This preprint have the same ...

doesn't matter, see #1103039376184852622 message
they have agreed to update it and remove that column entirely

tough crane Dec 18, 2023, 6:32 PM

#

Batched Parallelization along time (RWKV-v4 and the other decoder-only models could do this type)
Single Sequence Wise Parallelization along time ( Mamba asserts this type ?? )

misty igloo Dec 18, 2023, 6:40 PM

#

to be clear, rwkv 5/6 can be implemented the same way as mamba claims w/ parallel scan - they just don't happen to be in the code released

#

I already state all of this in the draft

#

but we can certainly clean up that language if needed

#

I don't really understand what is being argued about here 🙂

tough crane Dec 18, 2023, 6:43 PM

#

misty igloo I don't really understand what is being argued about here 🙂

Ah, sorry... my intention is to clarify a list of multiple definitions of "parallelization".

Type X parallelization, Type Y parallelization, etc., and then I want to classify whether each model has each type.

misty cedar Dec 18, 2023, 6:45 PM

#

Rwkv v5 can be rewritten to rely only on a "cumulative sum with decay" operation for cross temporal information bleeding. V4 was the same... how do other linear models perform their operations in a way that makes them more parallelizable than that? Chunked temporal information forwarding?

tough crane Dec 18, 2023, 6:45 PM

#

Model	Type X-parallel	Type Y-parallel
name1	Yes	Yes

misty igloo Dec 18, 2023, 6:45 PM

#

tough crane Ah, sorry... my intention is to clarify a list of multiple definitions of "paral...

I see. Well afaict nearly every model discussed, except certain unrelated RNNs, can be implemented with parallelization across the time dimension

misty igloo Dec 18, 2023, 6:46 PM

#

misty cedar Rwkv v5 can be rewritten to be reliant on only a "cumulative sum with decay" ope...

afaik you're right, they don't do anything more parallelizable at all
everything like this can be parallelized across time using associative scan
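A minimal sketch of why a "cumulative sum with decay" can be parallelized across time with an associative scan. The recurrence s_t = w_t * s_(t-1) + x_t is expressed as pairs (w, x) with an associative combine, then scanned in log2(n) passes where every position within a pass is independent. This is scalar, plain Python for illustration only; it is not the RWKV or Mamba kernel code, and the function names are hypothetical.

```python
def combine(left, right):
    """Compose two steps of s -> w * s + x; 'left' happens first in time."""
    w1, x1 = left
    w2, x2 = right
    return (w1 * w2, w2 * x1 + x2)


def sequential_scan(decays, inputs):
    """Reference: the plain recurrence, one time step after another."""
    state, out = 0.0, []
    for w, x in zip(decays, inputs):
        state = w * state + x
        out.append(state)
    return out


def parallel_style_scan(decays, inputs):
    """Hillis-Steele inclusive scan: log2(n) passes, each pass fully parallel."""
    elems = list(zip(decays, inputs))
    n = len(elems)
    step = 1
    while step < n:
        # Each position only reads the previous pass, so on a GPU all
        # positions in this pass could be computed at once.
        elems = [
            combine(elems[i - step], elems[i]) if i >= step else elems[i]
            for i in range(n)
        ]
        step *= 2
    return [x for _, x in elems]


decays = [0.5, 0.25, 0.5, 0.75, 0.5, 0.25]
inputs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(sequential_scan(decays, inputs))
print(parallel_style_scan(decays, inputs))
```

This scan variant does O(n log n) work rather than O(n), which is one form of the constant-factor trade-off discussed in this thread: being parallelizable across time does not mean it is the fastest choice in practice.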

#

many of these models were not implemented this way, including RWKV 5.2, 6

#

but that doesn't mean they can't be if it were useful to do so

#

and afaik the only reason mamba is able to get away with doing so without a horrific constant time penalty is that they limit their effective head dimension to 16

#

but that's unrelated to computer science asymptotic time complexity calculations

misty cedar Dec 18, 2023, 6:49 PM

#

Even then, you can implement a massive triangular decay matrix, multiply it with the unmixed state, then do the cumsum using a tree algorithm for max parallelism. It's technically parallel, but it's so much more efficient to just do a scan
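One reading of the triangular-matrix formulation, as a sketch assuming a single constant scalar decay w: the state at time t is the sum over i <= t of w^(t-i) * x_i, which is a lower-triangular matrix applied to the inputs. The names below are hypothetical, and the real models use per-channel or data-dependent decays rather than one scalar.

```python
def decay_matrix(n, w):
    """Lower-triangular matrix with entry w^(t-i) at row t, column i (i <= t)."""
    return [[w ** (t - i) if i <= t else 0.0 for i in range(n)] for t in range(n)]


def matrix_form(w, inputs):
    """All time steps at once: O(n^2) work, but every row is independent."""
    n = len(inputs)
    matrix = decay_matrix(n, w)
    return [sum(matrix[t][i] * inputs[i] for i in range(n)) for t in range(n)]


def scan_form(w, inputs):
    """The same result as a sequential scan: O(n) work, one step at a time."""
    state, out = 0.0, []
    for x in inputs:
        state = w * state + x
        out.append(state)
    return out


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
print(matrix_form(0.5, xs))
print(scan_form(0.5, xs))
```

Both forms give the same states; the matrix form trades n^2 work and memory for independence between rows, which illustrates why the scan is usually the cheaper choice.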

tough crane Dec 18, 2023, 6:50 PM

#

and afaik the only reason mamba is able to get away with doing so without a horrific constant time penalty is that they limit their effective head dimension to 16

Mamba's claim seems to depend on GPU RAM size ??

misty igloo Dec 18, 2023, 6:50 PM

#

this whole discussion is just something that MS created by releasing TWO preprints with bogus analysis and false claims

#

and they have agreed to retract that part

#

so I still don't really understand the goal here for us

misty igloo Dec 18, 2023, 6:52 PM

#

tough crane > and afaik the only reason mamba is able to get away with doing so without a h...

I think you're mixing up asymptotic time complexity with running time claims

young sparrow Dec 18, 2023, 6:53 PM

#

misty igloo so I still don't really understand the goal here for us

We should probably put something out detailing our take because I don't have high confidence that they'll actually retract it.

misty igloo Dec 18, 2023, 6:54 PM

#

young sparrow We should probably put something out detailing our take because I don't have hig...

do you have a suggestion on how to approach it?

misty cedar Dec 18, 2023, 6:54 PM

#

misty igloo so I still don't really understand the goal here for us

Academic pettiness? Their incorrect claims may have negatively affected a compute grant, and "it's not true unless it's in a paper" seems to be a prominent opinion

misty igloo Dec 18, 2023, 6:54 PM

#

Like, a discussion section? Or a table of some sort?

#

The problem with a table is that essentially nearly all the models have the same entries in the table, in terms of asymptotic time complexity and parallelizability across time

young sparrow Dec 18, 2023, 6:56 PM

#

misty igloo do you have a suggestion on how to approach it?

I think it's a good idea to prep a blog post that shows the different tables, explains why they're wrong / explains the issues with succinctness, and presents a corrected table.

Maybe we won't release it for a while, but it'll be good to have on hand.

misty cedar Dec 18, 2023, 6:57 PM

#

What's the academic equivalent of "as per my last email"?

"Contrary to (xyz et al, 12a section b), parallelization blah blah..."

young sparrow Dec 18, 2023, 6:59 PM

#

misty cedar What's the academic equivalent of "as per my last email"? "Contrary to (xyz et ...

"We raised this issue X months ago [link] and look forward to the promised forthcoming correction to xyz et al."

misty cedar Dec 18, 2023, 7:00 PM

#

young sparrow "We raised this issue X months ago [link] and look forward to the promised forth...

Ouch, that one stings haha

steady ether Dec 18, 2023, 7:22 PM

#

misty igloo as i recall they already agreed to get rid of the training parallelization colum...

I checked and the survey authors said late December, so let's follow up with them in the New Year and work on everything else in the meantime.

If we want to make the table unique, we can compare it directly with some other foundation models that scaled to at least 7B, e.g., RWKV vs GLM, LLaMA-2, and Mistral.

#

Hmm, a table having the same entries is a bit problematic. Agree with blog / discussion section

young sparrow Dec 18, 2023, 7:31 PM

#

steady ether Hmm, a table having the same entries is a bit problematic. Agree with blog / dis...

Why

steady ether Dec 18, 2023, 7:36 PM

#

young sparrow Why

I'm on the fence. This paper is really about introducing V5, but it's also important that we clarify training parallelization.

young sparrow Dec 18, 2023, 7:36 PM

#

I meant why is the table having the same entries problematic

steady ether Dec 18, 2023, 7:38 PM

#

There were some concerns about us using a very similar table to other papers.

young sparrow Dec 18, 2023, 7:38 PM

#

Why would that be concerning

steady ether Dec 18, 2023, 7:42 PM

#

I guess not then? I made an assumption based on an earlier conversation: #1103039376184852622 message

gusty condor Dec 19, 2023, 9:43 AM

#

misty cedar Academic pettiness? Their incorrect claims may have negatively affected a comput...

Their RetNet paper is not receiving good feedback, see https://openreview.net/forum?id=UU9Icwbhin (especially Reviewer 8FpU), where the table is questioned the most.

OpenReview

Retentive Network: A Successor to Transformer for Large Language...

In this work, we propose Retentive Network (RetNet) as a foundation architecture for large language models, simultaneously achieving training parallelism, low-cost inference, and good performance....

tough crane Dec 19, 2023, 11:07 AM

#

gusty condor Their RetNet paper is not receiving good feedback, see https://openreview.net/fo...

The main weakness with this paper are overclaiming and lack of citations, which can be misleading for readers.

😏

#


Q1: “Impossible Triangle” is an absolute overclaim because RWKV and H3 have already demonstrated models are comparable to Transformers

A1: The claim is fair enough. The “comparable performance” means that the models achieve similar results under the same setting (e.g., #parameters, and training corpus). For example, previous comparisons use Transformers with absolute position while the compared methods benefit from relative position modeling. Moreover, in H3 paper, the comparable results are in hybrid settings (i.e., combine H3 and Transformer layers), but we don’t add any Transformer layers. We conducted various controlled experiments (with matched #parameters and using the same training corpus) to compare different architectures. We are confident that the claim holds well. The experiments in Table 4 also show that previous methods still have a big gap.

Q2: RWKV can indeed be computed in parallel.

A2: We give a clear definition on “training parallelization” in the caption of Table 1, which is discussed from the sequential perspective. “∗”: whether the training implementation is sequentially parallelized, although RWKV uses channel-wise parallelism. As stated in A1, RWKV’s performance is actually not comparable w

gusty condor Dec 19, 2023, 11:50 AM

#

Q2: RWKV can indeed be computed in parallel.

A2: We give a clear definition on “training parallelization” in the caption of Table 1, which is discussed from the sequential perspective. “∗”: whether the training implementation is sequentially parallelized, although RWKV uses channel-wise parallelism. As stated in A1, RWKV’s performance is actually not comparable with Transformers according to our experiments (i.e., same #parameters, same data, and with relative position modelings). So, the statement of RWKV in Table 1 is fair enough.

Relative position modelings hurt RWKV performance? 🤔

rose mango Dec 20, 2023, 4:45 AM

#

Wow those aren't great reviews

#

The training parallelization definition is like Internet providers offering unlimited* data

*Notice the asterisk

rose mango Dec 20, 2023, 4:49 AM

#

misty igloo ~~And just to add a cherry on top, the Mamba code appears NOT to quite match the...

secret sauce

~~torch.nn.functional.scaled_dot_product_attention~~

rose mango Dec 20, 2023, 5:02 AM

#

gusty condor ``` Q2: RWKV can indeed be computed in parallel. A2: We give a clear definition...

I know they don't help.

#

@misty igloo didn't you do a training run with/without RoPE?

misty igloo Dec 20, 2023, 5:08 PM

#

rose mango <@1007072846960410685> didn't you do a training run with/without RoPE?

I'm sure Bo tried it in the RWKV-4 era. I've tried it with and without for RWKV-5, but my implementation is not the official one, my runs were short, and certainly my results haven't been published anywhere. I wouldn't say I definitively know the answer, which is why there's currently an 'experiments needed' in the paper for this.

rose mango Dec 20, 2023, 6:37 PM

#

Also, should there be test runs on small models (<100 M) a la TinyStories?

young sparrow Dec 20, 2023, 7:42 PM

#

rose mango Also, should there be test runs on small models (<100 M) a la TinyStories?

No reason not to if we have time but I wouldn't remotely view it as a priority

rose mango Dec 20, 2023, 7:43 PM

#

Since the models aren't large, I can do it

obsidian quest Dec 20, 2023, 8:13 PM

#

lets add another column to the table: state size.

rwkv2/3/4 has the smallest state size of all models here. this is a plus in some scenarios.

it's the first and only design achieving good LM performance with such tiny states.

a rwkv4 with rwkv6 trick will be highly interesting.
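
A rough way to put numbers in such a state-size column, as a sketch only. It assumes a vector-valued state made of a handful of d_model-sized vectors per layer versus a matrix-valued state with one head_size x head_size matrix per head; `vectors_per_layer=5` and `head_size=64` are illustrative guesses, not figures confirmed in this thread.

```python
def vector_state_size(d_model, vectors_per_layer, n_layer):
    # State made of a few d_model-sized vectors per layer (assumed layout).
    return vectors_per_layer * d_model * n_layer


def matrix_state_size(d_model, head_size, n_layer):
    # One head_size x head_size matrix per head (assumed layout).
    n_head = d_model // head_size
    return n_head * head_size * head_size * n_layer


# Illustrative configuration; all three numbers are placeholders.
small = vector_state_size(d_model=2048, vectors_per_layer=5, n_layer=24)
large = matrix_state_size(d_model=2048, head_size=64, n_layer=24)
ratio = large / small
```

Under these assumptions the matrix-valued state is about head_size / vectors_per_layer times larger, which is why a vector-state design stands out in that column.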

young sparrow Dec 20, 2023, 8:38 PM

#

obsidian quest lets add another column to the table: state size. rwkv2/3/4 has the smallest ...

Good idea

gusty condor Dec 27, 2023, 2:28 AM

#

@misty igloo How are your experiments about RWKV's positional encoding going?

misty igloo Dec 27, 2023, 2:29 AM

#

gusty condor <@1007072846960410685> How are your experiments about RWKV's positional encoding...

I haven't done any specific ones recently - but when I use token shift with traditional MHA I still need positional encoding for it to work well

#

my historically 'best' model is one that pairs some of the parts of rwkv like token shift with more traditional MHA

gusty condor Dec 27, 2023, 3:02 AM

#

misty igloo I haven't done any specific ones recently - but when I use token shift with trad...

Yes, but RWKV's positional mechanism is more like

Short term: token shift
Long term: weight decay
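
For readers unfamiliar with the term, a minimal sketch of token shift, assuming the usual per-channel interpolation between the current and the previous token vector. The names `token_shift` and `mu` are illustrative and not taken from the official code.

```python
def token_shift(xs, mu):
    """Mix each token vector with its predecessor, channel by channel.

    xs: list of T token vectors (each a list of C floats).
    mu: list of C mixing weights in [0, 1]; 1.0 keeps only the current token.
    The first token is mixed with an all-zero "previous" vector.
    """
    prev = [0.0] * len(mu)
    out = []
    for x in xs:
        out.append([m * cur + (1.0 - m) * p for m, cur, p in zip(mu, x, prev)])
        prev = x
    return out


xs = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
shifted = token_shift(xs, mu=[0.5, 1.0])
```

Each output depends on which token came immediately before, which is the sense in which this could carry short-range order information.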

misty igloo Dec 27, 2023, 3:06 AM

#

gusty condor Yes, but RWKV's positional mechanism is more like - Short term: token shift - Lo...

I'm uncertain whether or not token shift supplies local positional information but I certainly agree that weight decay is what supplies positional information over longer distances.

#

I've tried using RWKV-style weight decay with traditional MHA and in my experience it works almost as well as ALiBi

#

I have some new models that use that new Based softmax approximation alongside RWKV style decay and linear attention and it works great

obsidian quest Dec 27, 2023, 3:51 AM

#

it's a trainable alibi. should be better.

misty igloo Dec 27, 2023, 4:59 AM

#

obsidian quest it's a trainable alibi. should be better.

alibi is slightly different: additive not multiplicative~~, and linear over time rather than exponential~~

#

then again, alibi only operates per head, unlike rwkv5.2/6+

obsidian quest Dec 27, 2023, 5:04 AM

#

misty igloo alibi is slightly different: additive not multiplicative~~, and linear over time...

no. it's exponential over time

#

exp(additive) 🙂

misty igloo Dec 27, 2023, 5:05 AM

#

obsidian quest no. it's exponential over time

maybe my implementation is wrong 🙂

#

but it works great

#

this is what I meant about alibi being linear over time

#

(from their github)

#

maybe you meant the exponential part is the softmax applied to that

misty igloo Dec 27, 2023, 5:21 AM

#

obsidian quest it's a trainable alibi. should be better.

it's possible my initializations were better for alibi and my training runs didn't go long enough, or maybe some other confounding factor
I wasn't specifically trying to drill down onto positional encoding at the time - just was trying to rapidly find the best mixed model for use with that Based approximation

#

(second order taylor series approximation of softmax)
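
A small sketch of that idea, assuming it means replacing exp(x) with 1 + x + x^2/2 before normalizing. How Based turns this into a linear-attention feature map is not shown here, and the function names are made up for illustration.

```python
import math


def taylor_exp2(x):
    # Second-order Taylor expansion of exp(x) around 0.
    return 1.0 + x + 0.5 * x * x


def taylor_softmax(logits):
    # 1 + x + x^2/2 = ((x + 1)^2 + 1) / 2 is always positive, so this normalizes safely.
    vals = [taylor_exp2(l) for l in logits]
    total = sum(vals)
    return [v / total for v in vals]


def softmax(logits):
    mx = max(logits)
    exps = [math.exp(l - mx) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [0.2, -0.1, 0.05]
approx = taylor_softmax(logits)
exact = softmax(logits)
max_err = max(abs(a - e) for a, e in zip(approx, exact))
```

The approximation is close for small logits and drifts as they grow, so the scale of the query-key products matters.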

obsidian quest Dec 27, 2023, 5:32 AM

#

misty igloo this is what I meant about alibi being linear over time

yeah softmax has exp
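
To make the "exp(additive)" point concrete, a sketch with a single made-up `slope`: subtracting slope * distance from the logits before softmax gives the same weights as multiplying exp(logit) by exp(-slope) ** distance and renormalizing. Per-head slopes and trainable per-channel decay are not modeled.

```python
import math


def softmax(logits):
    mx = max(logits)
    exps = [math.exp(l - mx) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def alibi_weights(scores, slope):
    # ALiBi-style view: linear, additive penalty on the logits.
    n = len(scores)
    biased = [s - slope * (n - 1 - j) for j, s in enumerate(scores)]
    return softmax(biased)


def decay_weights(scores, slope):
    # Equivalent view: exponential, multiplicative decay on exp(score).
    n = len(scores)
    decay = math.exp(-slope)
    unnorm = [math.exp(s) * decay ** (n - 1 - j) for j, s in enumerate(scores)]
    total = sum(unnorm)
    return [u / total for u in unnorm]


scores = [0.3, -1.2, 0.8, 0.1]  # toy query-key logits for the last position
a = alibi_weights(scores, slope=0.5)
b = decay_weights(scores, slope=0.5)
```

So "linear over time" and "exponential over time" describe the same bias seen before and after the softmax exponent.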

obsidian quest Dec 27, 2023, 5:32 AM

#

misty igloo its possible my initializations were better for alibi and my trainings didnt run...

true. need to run 10G tokens to see the difference

misty igloo Dec 27, 2023, 5:35 AM

#

@obsidian quest regarding what @gusty condor was saying, do you think that token shift adds short term positional information?

#

(If so, I'd like to understand that aspect better so we can include it in the paper)

obsidian quest Dec 27, 2023, 5:42 AM

#

misty igloo <@870137517020688415> regarding what <@803473343705514025> was saying, do you th...

yes. and for ICL

young sparrow Dec 31, 2023, 2:58 AM

#

misty igloo ~~And just to add a cherry on top, the Mamba code appears NOT to quite match the...

Do you know what the difference(s) are?

misty igloo Dec 31, 2023, 3:41 AM

#

young sparrow Do you know what the difference(s) are?

I reviewed the paper and code again, and rescind my earlier claim - I do think the code matches Algorithm 2 from the paper.

misty igloo Dec 31, 2023, 4:01 AM

#

young sparrow Do you know what the difference(s) are?

regarding the second half about the results not matching, here's a link to Bo discussing his findings #1109810049607532555 message

young sparrow Dec 31, 2023, 4:34 AM

#

misty igloo regarding the second half about the results not matching, here's a link to Bo di...

It's not at all clear to me what someone is supposed to glean from this TBH

rose mango Dec 31, 2023, 4:51 AM

#

young sparrow It's not at all clear to me what someone is supposed to glean from this TBH

It's late here and I only glanced quickly, but what I believe is stated is:

The linked paper is the "Gated Linear Attention Transformers" paper, which compares their new GLA architecture with Mamba based on the Mamba code on GitHub.
GLA outperforms Mamba on multiple metrics, presumably in contrast to what is stated in the Mamba paper.
For this to be the case, there must have been some trick to produce the numbers in the Mamba paper; naturally this couldn't be done in the GLA paper as it's an independent evaluation.

young sparrow Dec 31, 2023, 5:08 AM

#

rose mango It's late here and I only glanced quickly, but what I believe is stated is: 1. ...

Sure
There is no contradiction between the Mamba paper and the GLA paper as far as I can tell.
Or maybe the GLA people deliberately trained a bad Mamba model. That's the thing about inferring bad faith on the part of other teams... it can justify just about anything.

alpine ferry Jan 2, 2024, 5:53 AM

#

this paper looks super interesting, are there still any tasks where you could use another contributor? I see the paper is already out on arXiv etc., so totally fine if it's too late to join. Learnt a lot from this work though :). Great work!

rose mango Jan 2, 2024, 6:58 AM

#

v4 is published, v5 is what's currently being worked on

gusty condor Jan 2, 2024, 7:00 AM

#

alpine ferry this paper looks super interesting, is there still any tasks you could use anoth...

Please contribute to RWKV5 paper at https://www.overleaf.com/1623283552mkymjtvsnybt#bd0fc2

misty cedar Jan 2, 2024, 7:19 AM

#

is it worth putting in pseudo-torch for the non-academics?
or just link to code bases on github?
eg:

super naive rwkv v5 linear attention is:
(H = heads, C = dims, B = batch, T = time)

k = k.reshape(B, T, H, 1, C//H)
v = v.reshape(B, T, H, C//H, 1)
r = r.reshape(B, T, H, C//H, 1)
kv = v @ k  # outer product per head: B, T, H, C//H, C//H
att = kv.cumsumwithdecay(decay, dim=1)  # pseudo-op: decayed running sum over T
out = matmul(att, r)  # B, T, H, C//H, 1
# groupnorm and output head after this

with a little effort you can fuse all these operations into a single kernel to save memory and compute (a fused kernel lowers intermediate memory usage from O(C^2) to O(C), while being parallelizable along B, H, and C)
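
Note that `cumsumwithdecay` in the pseudo-code is not a real torch op. A plain-Python sketch of what it presumably means, shown on scalars with one constant decay (in the pseudo-code each element would be a per-head matrix, and the decay would be per channel):

```python
def cumsum_with_decay(xs, decay):
    """Decayed running sum: s_t = decay * s_{t-1} + x_t, starting from 0."""
    out = []
    state = 0.0
    for x in xs:
        state = decay * state + x
        out.append(state)
    return out


def cumsum_with_decay_direct(xs, decay):
    """Same quantity as an explicit sum: s_t = sum_i decay**(t - i) * x_i."""
    return [
        sum(decay ** (t - i) * xs[i] for i in range(t + 1))
        for t in range(len(xs))
    ]


xs = [1.0, 2.0, 3.0, 4.0]
rec = cumsum_with_decay(xs, decay=0.5)
direct = cumsum_with_decay_direct(xs, decay=0.5)
```

The recurrent form is what makes inference O(1) per token; the direct form is what a parallel training kernel has to reproduce.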

gusty condor Jan 2, 2024, 7:26 AM

#

Yes, put them in the appendix

misty igloo Jan 2, 2024, 6:31 PM

#

misty cedar is it worth putting in psuedo-torch for the non-academics? or just link to code ...

yeah let's also put in the full recurrent formulation including u term in the appendix - its only a few lines of code

#

I think we should put in 'u' (bonus) into any version we list, so it represents the actual architecture
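
A sketch of what such a recurrent listing could look like for one head. It assumes the form out_t = r_t (S + diag(u) k_t^T v_t) followed by S <- diag(w) S + k_t^T v_t; the function name, argument layout and update order are my assumptions and should be checked against the official kernel before anything like this goes in the appendix.

```python
def rwkv5_head_recurrent(r, k, v, w, u):
    """Single-head recurrence with a matrix-valued state (illustrative only).

    r, k, v: lists of T vectors of length D (receptance, key, value).
    w: per-channel decay in (0, 1); u: per-channel bonus for the current token.
    Returns T output vectors of length D (before groupnorm and gating).
    """
    D = len(w)
    S = [[0.0] * D for _ in range(D)]  # S[i][j]: key channel i, value channel j
    outs = []
    for rt, kt, vt in zip(r, k, v):
        out = [0.0] * D
        for i in range(D):
            for j in range(D):
                kv = kt[i] * vt[j]
                out[j] += rt[i] * (S[i][j] + u[i] * kv)  # read past state plus bonus
                S[i][j] = w[i] * S[i][j] + kv            # decay, then add current token
        outs.append(out)
    return outs


# Toy inputs: T = 2 tokens, D = 2 channels.
r = [[1.0, 0.0], [0.5, 0.5]]
k = [[1.0, 2.0], [0.0, 1.0]]
v = [[3.0, 1.0], [2.0, 2.0]]
outs = rwkv5_head_recurrent(r, k, v, w=[0.9, 0.8], u=[0.5, 0.5])
```

Without `u`, the current token would only be visible to later positions in this ordering, which is why leaving it out misrepresents the architecture.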

alpine ferry Jan 6, 2024, 7:48 PM

#

do we have a potential timeline when we would like to release the paper?

young sparrow Jan 6, 2024, 8:07 PM

#

alpine ferry do we have a potential timeline when we would like to release the paper?

"As soon as possible." I think we're aiming to submit to CoLM, which doesn't have an anon period.

alpine ferry Jan 6, 2024, 8:28 PM

#

Oh nice, what a relief not having the anonymity restrictions

gusty condor Jan 7, 2024, 4:44 AM

#

RWKV-5 is not finished yet, and 1 month without progress

paper dove Jan 8, 2024, 8:04 AM

#

gusty condor Please contribute to RWKV5 paper at https://www.overleaf.com/1623283552mkymjtvsn...

I will join this

gusty condor Jan 8, 2024, 3:26 PM

#

paper dove I will join this

You can write out VisualRWKV experiments

polar atlas Jan 8, 2024, 9:55 PM

#

obsidian quest lets add another column to the table: state size. rwkv2/3/4 has the smallest ...

Actually HGRN also achieved good LM performance using a state size of 2*d_model

rose mango Jan 8, 2024, 11:34 PM

#

Let me know if there's anything specific I can help with

last mauve Jan 9, 2024, 12:32 AM

#

rose mango Let me know if there's anything specific I can help with

#1103039376184852622 message

tropic minnow Jan 9, 2024, 5:35 PM

#

i think we could explain more about this. probably the prompt-engineering sensitivity of rwkv-raven does not apply here as these are not specifically chat models, but one would expect the error distribution to be similar (associative recall, etc). In the [based] blog post [ https://hazyresearch.stanford.edu/blog/2023-12-11-zoology2-based ] the stanford/hazyresearch team showcased a pitfall example, and i expect rwkv models to behave similarly. should we refer to that?

Zoology (Blogpost 2): Simple, Input-Dependent, and Sub-Quadratic Se...

steady ether Jan 12, 2024, 7:53 PM

#

tropic minnow ```Limitations: Current RWKV-5/6 models, although exhibiting a great leap in per...

I was able to reproduce some of their results using their synthetic test to train models from scratch, and it does seem that there is a noticeable gap.

I stopped because reproducing the entire experiment seemed computationally expensive.

#

This is with v4

young sparrow Jan 12, 2024, 8:29 PM

#

@steady ether How much compute would you need to fully reproduce it

steady ether Jan 12, 2024, 8:36 PM

#

A quick estimate would suggest running 8xA100 for 10 days.

VRAM doesn't appear to be as crucial, so 8xA10 should suffice

terse crag Jan 12, 2024, 10:37 PM

#

gusty condor Please contribute to RWKV5 paper at https://www.overleaf.com/1623283552mkymjtvsn...

I can help out

obsidian quest Jan 15, 2024, 8:26 AM

#

steady ether I was able to reproduce some of their results using their synthetic test to trai...

please try x051a too https://github.com/BlinkDL/nanoRWKV

v5 and v6 are very different from v4

steady ether Jan 15, 2024, 5:09 PM

#

obsidian quest please try x051a too https://github.com/BlinkDL/nanoRWKV v5 and v6 are very dif...

I integrated https://github.com/BlinkDL/RWKV-LM/commit/e254e4c22ab2a1e178a56a7f7c470fbd63a3c80c and am currently running a test. So far, it appears to be better than V4. Completing 256/512 sequence lengths will take some time.

Screenshot_2024-01-15_at_12.03.00_PM.png

#

I think that version is x052?

obsidian quest Jan 15, 2024, 5:37 PM

#

steady ether I integrated https://github.com/BlinkDL/RWKV-LM/commit/e254e4c22ab2a1e178a56a7f7...

yeah it's x052

steady ether Jan 16, 2024, 3:23 AM

#

So far V5 looks amazing

#

The paper's models for comparison.

obsidian quest Jan 16, 2024, 10:54 AM

#

and v6 is better

tropic minnow Jan 17, 2024, 5:53 PM

#

Anyone in SF atm? I’ve been offered a talk at a subquadratic attn meeting on the 25th afternoon which I plan to do virtually, but just in case

misty cedar Jan 17, 2024, 6:03 PM

#

tropic minnow Anyone in SF atm? I’ve been offered a talk at a subquadratic attn meeting on the...

Me and Eugene are

burnt cedar Jan 18, 2024, 7:17 PM

#

steady ether So far V5 looks amazing

@steady ether Is this the graph like mamba with induction heads

#

If so will it extend farther?

#

Like mamba going from 64 to a million?

steady ether Jan 18, 2024, 7:56 PM

#

Yeah, partially. It's v5.2, not the final tokenshift. So far, it slightly outperforms Mamba on the Stanford benchmark.

steady ether Jan 19, 2024, 6:44 PM

#

V5.2 testing is done (for that AR experiment). We can probably use Stanford's results for the other models to save on compute.

rose mango Jan 19, 2024, 7:55 PM

#

These results look amazing

young sparrow Jan 19, 2024, 9:26 PM

#

steady ether V5.2 testing is done (for that AR experiment). We can probably use Stanford's re...

Are we using the same eval code that they used? we should at a minimum confirm that we can reproduce their numbers

steady ether Jan 19, 2024, 9:29 PM

#

young sparrow Are we using the same eval code that they used? we should at a minimum confirm t...

Yeah, it's the identical code for V4. I messaged them and they shared the code. I might also get their Wandb logs.

#

Our v4 runs are slightly different so there's some variance there. Same goes for the previous incomplete run with their other models.

round kelp Jan 21, 2024, 7:27 AM

#

Hello everyone.

gusty condor Jan 21, 2024, 11:32 AM

#

hi Xaiat

harsh narwhal Jan 21, 2024, 4:55 PM

#

hello everyone

cloud tendon Jan 22, 2024, 3:08 AM

#

Hello everyone

undone solstice Jan 22, 2024, 11:30 PM

#

Hello everyone, I’m new to this community, but I’m eager to contribute to this project.
But, I am a bit confused about how to contribute. Do I just look at the text on overleaf and start editing them?
Also, would this paper be more interesting if it could add some evaluation or finetuning experiments on code generation tasks (like HumanEval). If so, I think I can contribute something like that.😁

misty igloo Jan 23, 2024, 12:40 AM

#

Welcome! The final RWKV5.2 7B model checkpoints should be ready around Jan 29, so many of the main experiments will have to wait on that. If you have proposals for experiments you can do that would be useful to include in the paper and can be done via from-scratch pretraining, like the one @steady ether is doing, you could get started on those now. Also, see #1103039376184852622 message for a list of items to do (many have been at least partially completed at this point)

void quartz Jan 23, 2024, 4:31 AM

#

wanted to ask - what's the best / official way to do the needle in the haystack test, as i would be looking into that - i found several repos around this - but not sure which one is favoured academically speaking

gusty condor Jan 23, 2024, 6:31 AM

#

misty igloo Welcome! The final RWKV5.2 7B model checkpoints should be ready around Jan 29, s...

Experiment:
Test perplexity over different context lengths (I think from 1 to 65536), to show that RWKV can handle and utilize longer context length than it was trained for (4096).
I need both intermediate models and corpus with very long documents.

tropic minnow Jan 23, 2024, 3:31 PM

#

gusty condor Experiment: Test perplexity over different context lengths (I think from 1 to 65...

you can probably benefit from the previous experiments @snow zealot did

#

this

#

basically these numbers, more discussion here: #1103039376184852622 message

restive swallow Jan 23, 2024, 5:19 PM

#

Hi everyone, is there a detailed todo list?

undone solstice Jan 23, 2024, 6:56 PM

#

restive swallow Hi everyone, is there a detailed todo list?

#1103039376184852622 message this is a todo list for the paper I think

misty igloo Jan 23, 2024, 6:57 PM

#

@last mauve it would be great to get an update on the todo list if you have time

restive swallow Jan 23, 2024, 7:01 PM

#

Thanks. It would be better if there is a real-time todo list.

void quartz Jan 23, 2024, 9:14 PM

#

gusty condor Experiment: Test perplexity over different context lengths (I think from 1 to 65...

My current plan for the memory test

(in progress) benchmarking the model memory size in a finetune to repeat its input
(todo) perform needle in the haystack test using a modified version of : https://github.com/Arize-ai/LLMTest_NeedleInAHaystack2/tree/main (unless someone else has a better version to use) over large context length
(todo) compile the results into the paper
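
For anyone unfamiliar with the needle test, a generic sketch of the prompt construction and scoring. This is not the linked repo's interface; `build_haystack_prompt` and `needle_found` are hypothetical helpers, and real runs would sweep both context length and depth.

```python
def build_haystack_prompt(filler_sentences, needle, depth, question):
    """Insert `needle` at a relative depth (0.0 = start, 1.0 = end) of the filler text."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    pos = round(depth * len(filler_sentences))
    sentences = filler_sentences[:pos] + [needle] + filler_sentences[pos:]
    return " ".join(sentences) + "\n\n" + question


def needle_found(model_answer, expected):
    # Crude scoring: case-insensitive substring match.
    return expected.lower() in model_answer.lower()


filler = [f"Filler sentence number {i}." for i in range(10)]
prompt = build_haystack_prompt(
    filler,
    "The secret code is 4711.",
    depth=0.5,
    question="What is the secret code?",
)
```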

gusty condor Jan 24, 2024, 6:33 AM

#

void quartz My current plan for the memory test - (in progress) benchmarking the model memor...

I think it might be better to use perplexity test on natural data, not needle on synthetic data.

void quartz Jan 24, 2024, 6:33 AM

#

gusty condor I think it might be better to use perplexity test on natural data, not needle on...

which repo should i use for that?

#

might as well just do all

gusty condor Jan 24, 2024, 7:11 AM

#

Select some long documents (length >= 65537) that are not included in the RWKV training set.
Compute cross entropy loss (or perplexity) at token 1, 2, ..., 65536.
See if this helps
https://github.com/Jellyfish042/uncheatable_eval

GitHub

GitHub - Jellyfish042/uncheatable_eval

Contribute to Jellyfish042/uncheatable_eval development by creating an account on GitHub.
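
A sketch of the bookkeeping for that experiment, assuming per-token log-probabilities have already been collected from the model (that part is not shown). Averaging in position buckets rather than per single position is my choice to reduce noise; `loss_by_position` is a made-up name.

```python
import math


def loss_by_position(per_doc_logprobs, boundaries):
    """Average cross-entropy (nats) of tokens falling in each position bucket.

    per_doc_logprobs: list of documents, each a list of natural-log probabilities
        the model assigned to the actual token at position 0, 1, 2, ...
    boundaries: increasing bucket edges, e.g. [0, 1024, 4096, 65536]; bucket b
        covers positions boundaries[b] <= pos < boundaries[b + 1].
    Returns a list of (start, end, mean_loss, perplexity), skipping empty buckets.
    """
    results = []
    for start, end in zip(boundaries, boundaries[1:]):
        total, count = 0.0, 0
        for logprobs in per_doc_logprobs:
            chunk = logprobs[start:end]
            total -= sum(chunk)
            count += len(chunk)
        if count:
            mean = total / count
            results.append((start, end, mean, math.exp(mean)))
    return results


# Toy "model" log-probs: one document, 8 tokens.
docs = [[math.log(0.5)] * 4 + [math.log(0.25)] * 4]
table = loss_by_position(docs, boundaries=[0, 4, 8])
```

If the curve stays flat or keeps falling past the training context length, that supports the claim that the model uses the longer context.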

tropic minnow Jan 24, 2024, 11:14 AM

#

got this message from "based" paper authors (stanford's attn-as-rnn-like model): We are currently running experiments for our paper and would like to include the newest architecture from the RWKV folks. do you know if the code for RWKV v6 is available? afaik there's no official open source implementation, and https://github.com/SmerkyG/gptcore/blob/main/model/experimental/rwkv6_0.py is the unofficial one, but after talking to @misty igloo we can't rule out that there's a bug, so probably the safest is to tell them to just compare against v5?

misty igloo Jan 24, 2024, 4:03 PM

#

@obsidian quest can you give them training code for x6 so the results appear in their paper?

#

it'd be an easy way we avoid the problem we currently have where everyone keeps showing v4 as the comparison

tropic minnow Jan 24, 2024, 4:09 PM

#

misty igloo <@870137517020688415> can you give them training code for x6 so the results appe...

yes exactly

tropic minnow Jan 24, 2024, 4:09 PM

#

misty igloo it'd be an easy way we avoid the problem we currently have where everyone keeps ...

yes exactly

tropic minnow Jan 24, 2024, 4:11 PM

#

misty igloo <@870137517020688415> can you give them training code for x6 so the results appe...

but this (https://github.com/BlinkDL/ChatRWKV/blob/ea1ccf40a42338442b2c4b2323354ad214e8f9a0/rwkv_pip_package/src/rwkv/model.py#L861) is just the inference code?

GitHub

ChatRWKV/rwkv_pip_package/src/rwkv/model.py at ea1ccf40a42338442b2c...

ChatRWKV is like ChatGPT but powered by RWKV (100% RNN) language model, and open source. - BlinkDL/ChatRWKV

misty igloo Jan 24, 2024, 4:13 PM

#

yes that's inference only, unfortunately

tropic minnow Jan 24, 2024, 4:15 PM

#

@obsidian quest any chance we can give them the v6 training code? they wont test it in the scale where v6 improvements kick in probably but we would get direct comparisons to "based" arch

young sparrow Jan 24, 2024, 4:28 PM

#

Wait, do we not have a copy of the v6 training code?

obsidian quest Jan 24, 2024, 4:46 PM

#

they can use v5 as comparison (and we have models for this)

i plan to release v6 kernel together with trained v6 model in Feb

obsidian quest Jan 24, 2024, 4:47 PM

#

tropic minnow got this message from "based" paper authors (stanford's attn-as-rnn-like model):...

v5 is a strong benchmark and they can try it first

#

as shown in #1103039376184852622 message

tropic minnow Jan 24, 2024, 4:47 PM

#

obsidian quest v5 is a strong benchmark and they can try it first

ok sent v5 their way

obsidian quest Jan 24, 2024, 4:48 PM

#

let's see if they can replicate Song's results

young sparrow Jan 24, 2024, 4:50 PM

#

I've been thinking about the marketing issues wrt the name and version numbering, and I was wondering what people thought about giving v5 and v6 a name that isn't RWKV? I think it might make sense to call RWKV a category of architectures (much like state-space models) and give each model a distinctive name (like Mamba)

misty igloo Jan 24, 2024, 5:06 PM

#

young sparrow I've been thinking about the marketing issues wrt the name and version numbering...

I am 1000% for this idea. Something like e.g. RWKV6: Eagle would accomplish both goals, because people would call it Eagle but it would be clear it's the RWKV architecture series

young sparrow Jan 24, 2024, 5:23 PM

#

We can do bird themes for all the models too, to establish some brand cohesion

obsidian quest Jan 24, 2024, 5:26 PM

#

young sparrow I've been thinking about the marketing issues wrt the name and version numbering...

actually Mamba = S6, and i think they chose another name for the same marketing reason

young sparrow Jan 24, 2024, 5:30 PM

#

obsidian quest actually Mamba = S6, and i think they choose another name for the same marketing...

Yes, that's what I'm suggesting we do too

obsidian quest Jan 24, 2024, 5:39 PM

#

can try RWKV-6 code name XXX (placeholder) - i know what xxx means lol. should i use xyz?

young sparrow Jan 24, 2024, 5:42 PM

#

No you are not allowed to name a model XXX

#

That's an extremely common code-phrase for pornography in the US

misty igloo Jan 24, 2024, 5:43 PM

#

I think he was using XXX as a placeholder 🙂

young sparrow Jan 24, 2024, 5:45 PM

#

I think "Eagle: RWKV Models with Matrix-Valued States [some cool statement about performance]" is a more typical title structure

void quartz Jan 24, 2024, 6:02 PM

#

So Eagle is v5, Raven is v4? ____ is v6? is that the idea - RWKV stays as architecture & group name

#

im good for either name, my vote was to reuse raven previously, but any bird name would do for me to use on the promotion front 🙏 (any name that I do not need to repeat 3+ time, for people to get)

#

also i rather avoid comparison to v6 until its stable 😅 - to avoid the 5.0 / 5.1 / 5.2 confusion again

obsidian quest Jan 24, 2024, 6:13 PM

#

let's reserve Raven for other purposes

#

https://byjus.com/english/birds-name/

BYJUS

Birds Names - Explore List of 100 Names in English

You would probably know what a hen or a crow is. Have you heard of a hoopoe, a potoo or a rallidae? Check out this article to learn a number of bird names and a lot more about some of the common birds you see around you.

misty igloo Jan 24, 2024, 6:15 PM

#

Eagle was taken by some other LLM sampling mechanism apparently, so I propose Hawk and Condor for RWKV v5 and 6

spiral minnow Jan 24, 2024, 6:44 PM

#

How do we differentiate from the Falcon models from TII?

#

Hawk and Falcon are quite similar in my mind. Condor seems like a distinctive bird though

#

@void quartz suggested Eagle for v5, and then @misty igloo says Condor for v6, does that work?

void quartz Jan 24, 2024, 6:51 PM

#

spiral minnow <@644428303293349888> suggested Eagle for v5, and then <@1007072846960410685> s...

tbh i was just going along with the suggestion

spiral minnow Jan 24, 2024, 6:51 PM

#

void quartz tbh i was just going along with the suggestion

Oh oops, just realized that

void quartz Jan 24, 2024, 6:51 PM

#

if we want to avoid confusion with falcon, i guess we can use condor

obsidian quest Jan 24, 2024, 8:08 PM

#

should we use less common birds first?

#

how abt
Dxxx for v4
Exxx for v5
Fxxx for v6

young sparrow Jan 24, 2024, 8:13 PM

#

Eagle
Finch
Gull
Hawk
Ibis
Jay

obsidian quest Jan 24, 2024, 8:14 PM

#

https://en.wikipedia.org/wiki/List_of_birds_by_common_name

List of birds by common name

In this list of birds by common name, a total of 10,976 extant and recently extinct (since 1500) bird species are recognised. Species marked with a "†" are extinct.

misty cedar Jan 24, 2024, 8:28 PM

#

Cant go wrong with Emu

rose mango Jan 24, 2024, 8:36 PM

#

Ibis reminds me of Ibis Paint

void quartz Jan 24, 2024, 8:48 PM

#

compiling a list - for bird names as well here:
https://docs.google.com/spreadsheets/d/1xtb6AyKIEW44Q1z-FXL_4PcYOXBW95SZHWfWgMFN5ec/edit#gid=0

Google Docs

RWKV Bird Names Candidate

Sheet1

Bird Name,Conflicting Company / Project,Description
starling,Starling AI,Starling CARE is a turn-key service that helps clinicians improve patient care while reducing the need for manual processes or additional employees.
canary,Canary AI,Say goodbye to the 'morning brain fog' - just pres...

steady ether Jan 25, 2024, 3:36 AM

#

obsidian quest they can use v5 as comparison (and we have models for this) i plan to release v...

Just FYI, their "based" model is likely better than V5, judging from their results vs mamba.

#

last mauve Jan 25, 2024, 4:05 AM

#

misty igloo <@367104793292046338> it would be great to get an update on the todo list if you...

will do. I want to push this paper soon after my current round of papers, so will be live-updating that list like I did for the last paper starting next week I think.

misty igloo Jan 25, 2024, 4:06 AM

#

steady ether Just FYI, their "based" model is likely better than V5, judging from their resul...

they claim their Based model is better than transformer++ so it better beat rwkv if so!

steady ether Jan 25, 2024, 4:09 AM

#

Good point 🤣

void quartz Jan 25, 2024, 4:47 AM

#

last mauve will do. I want to push this paper soon after my current round of papers, so wil...

just in time, for our ramp up is once the model is out, then all the 1:1 compare for 1.5B / 3B / 7B can start

obsidian quest Jan 25, 2024, 6:20 AM

#

let's try this for RWKV too #research message

misty igloo Jan 25, 2024, 6:59 AM

#

obsidian quest let's try this for RWKV too https://discord.com/channels/729741769192767510/7478...

we need a v6 trainer 😉

#

(to show results competitive with mamba)

#

we could use mine, but hard to know it's exactly the same (and especially the initializations I'm just guessing on)

obsidian quest Jan 25, 2024, 7:36 AM

#

can compare with https://github.com/BlinkDL/RWKV-CUDA/blob/main/wkv6/run.py

#

https://github.com/BlinkDL/RWKV-CUDA/blob/main/wkv6/cuda/wkv6_cuda_v1a.cu this one is correct (although slow)

young sparrow Jan 25, 2024, 1:42 PM

#

@obsidian quest You really need to share the actual trainer. Why haven't you done so?

misty igloo Jan 25, 2024, 5:56 PM

#

obsidian quest can compare with https://github.com/BlinkDL/RWKV-CUDA/blob/main/wkv6/run.py

that code doesn't include the initializations

#

I can compare w/ chatrwkv code too but it also has no initializations

obsidian quest Jan 25, 2024, 6:23 PM

#

📎 message.txt

misty igloo Jan 25, 2024, 7:35 PM

#

obsidian quest

thank you! (wow, afaict I somehow used the identical initializations in mine!)

misty igloo Jan 26, 2024, 12:43 AM

#

young sparrow <@870137517020688415> You really need to share the actual trainer. Why haven't y...

My understanding is that it's been very difficult to get it to run fast w/ a handwritten custom CUDA autograd backward() fn while maintaining correctness.
Fortunately, myself and @quaint quiver recently adapted some of the techniques from the Gated Linear Attention paper to create a pair of new algorithms for v6 that run fast even in pure pytorch.
I had a problem with my wrapper code until today, but I've now found the error and corrected it.
So hopefully pending a couple more tests to ensure it produces results exactly identical to Blink's original implementation, we can use it to do RWKV v6 experiments for the paper.

obsidian quest Jan 26, 2024, 10:55 AM

#

young sparrow Eagle Finch Gull Hawk Ibis Jay

ok let's use

RWKV-4 "Dove" (v4 with v5/v6 trick is useful for embedding etc., because it has smallest states)
RWKV-5 "Eagle" (v5 variants can be efficiently trained without cuda)
RWKV-6 "Finch"
RWKV-7 "Gull"

obsidian quest Jan 26, 2024, 2:56 PM

#

try this latest improvement for v5 v6 if you have compute:
change gate to d=64 lora, increase ffn width back to 4x to keep params count

        D_GATE_LORA = 64
        self.gate_w1 = nn.Parameter(torch.empty(args.n_embd, D_GATE_LORA).uniform_(-0.01, 0.01))
        self.gate_w2 = nn.Parameter(torch.zeros(D_GATE_LORA, args.n_embd).uniform_(-0.01, 0.01))
...
        g = torch.tanh(xg @ self.gate_w1) @ self.gate_w2  # instead of F.silu(xg @ self.gate)
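
A back-of-the-envelope check of the "keep params count" part. It assumes the current ffn width is 3.5x ("back to 4x" suggests it is lower, but the exact value is my guess) and a channel-mix layout of key (d x f), value (f x d) and receptance (d x d).

```python
def tmix_gate_params(d, lora_dim=None):
    # Full d x d gate, or a rank-`lora_dim` factorization (d x r plus r x d).
    return d * d if lora_dim is None else 2 * d * lora_dim


def ffn_params(d, width_mult):
    # Assumed channel-mix layout: key (d x f), value (f x d), receptance (d x d).
    f = int(d * width_mult)
    return 2 * d * f + d * d


d = 2048  # illustrative embedding size
before = tmix_gate_params(d) + ffn_params(d, 3.5)
after = tmix_gate_params(d, lora_dim=64) + ffn_params(d, 4.0)
extra = after - before  # parameters added per layer by the change
```

Under these assumptions the d^2 saved on the gate pays for widening the ffn from 3.5x to 4x, leaving only the 2 * d * 64 LoRA parameters as net growth.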

gusty condor Jan 26, 2024, 3:03 PM

#

For replicability, it is important to use a verbatim copying of the exact model architecture described in the paper

obsidian quest Jan 27, 2024, 4:23 AM

#

gusty condor For replicability, it is important to use a verbatim copying of the exact model ...

will put it in v7

tropic minnow Jan 27, 2024, 9:56 AM

#

obsidian quest try this latest improvement for v5 v6 if you have compute: change gate to d=64 l...

should D_GATE_LORA be kept constant across model sizes? or is this 64 only for 100M and it should scale?

obsidian quest Jan 27, 2024, 10:20 AM

#

tropic minnow should D_GATE_LORA be kept constant accross model size? or is this 64 only for 1...

i think 64 is enough, as ffn can be wider when gate is narrower
seems tmix-gate is the only matrix that can be reduced this way

gusty condor Jan 27, 2024, 11:34 AM

#

I will follow these experiments https://arxiv.org/pdf/2310.16450.pdf by DAMO Academy, which proposes a long context corpus for computing perplexities.

gusty condor Jan 27, 2024, 1:56 PM

#

I ran a similar test for RWKV5, and the results look amazing!
Trained on a context length of 4096, the 0.4B model's perplexity remains at a low level (~7.15) even at context length 98.3k. Perhaps it will never (practically) run to a perplexity collapse.

gusty condor Jan 27, 2024, 2:15 PM

#

acoustic knoll Jan 27, 2024, 7:17 PM

#

gusty condor I tested the similar for RWKV5, the results look amazing! Trained on a context l...

Hi, can I have your code please, I want to test it with my finetuned model. Thanks

void quartz Jan 27, 2024, 9:39 PM

#

obsidian quest ok let's use ``` RWKV-4 "Dove" (v4 with v5/v6 trick is useful for embedding etc....

so we are using this for the upcoming announcement? 🤩
Eagle it is ?

young sparrow Jan 27, 2024, 9:52 PM

#

void quartz so we are using this for the upcoming announcement? 🤩 Eagle it is ?

Yup

gusty condor Jan 28, 2024, 3:38 AM

#

Eagle or Egret?

#

The name EagleAI is already used https://eagleai.com/

EagleAI- Advanced Solutions for Risk Management

web dev

EagleAI Advanced Risk and Compliance Management

Eagle AI innovative leaders in Risk and Compliance Management Solutions. Our experts use advanced Artificial Intelligence and Machine Learning technology to help you reduce cost, increase profits and achieve regulatory compliances.

rose mango Jan 28, 2024, 3:54 AM

#

gusty condor The name EagleAI is already used https://eagleai.com/

fintech isn't LLMs

#

the tech space has tons of conflicting names

young sparrow Jan 28, 2024, 4:42 AM

#

gusty condor The name EagleAI is already used https://eagleai.com/

Don't worry about it: nobody in our target audience has heard of this

obsidian quest Jan 28, 2024, 10:04 AM

#

https://twitter.com/BlinkDL_AI/status/1751542433039651304 v5 7B release

BlinkDL (@BlinkDL_AI) on X

RWKV-5 "Eagle" 7B: beats Mistral-7B at multilingual, reaches Llama2-7B level at English, while being 100% attention-free RNN and only trained 1.1T tokens. Gradio Demo: https://t.co/k0AivnxCwP RWKV-6 "Finch" 1B5 in ~10days, 3B in ~30days.

gusty condor Jan 28, 2024, 1:55 PM

#

0.4B, 1.5B respectively.

#

I need more data for accuracy

void quartz Jan 28, 2024, 4:34 PM

#

looking into rolling the rwkv pip library into lm-harness:
https://github.com/EleutherAI/lm-evaluation-harness

Would like to confirm: is the logprob output supposed to be the sum of the individual token log-probabilities?

So I should not divide by the number of output tokens, meaning longer responses scale to a larger logprob?

GitHub

GitHub - EleutherAI/lm-evaluation-harness: A framework for few-shot...

A framework for few-shot evaluation of language models. - GitHub - EleutherAI/lm-evaluation-harness: A framework for few-shot evaluation of language models.

#

Referencing: https://github.com/EleutherAI/lm-evaluation-harness/blob/97a67d27c09857e5698cbae730750cf84cd987f3/lm_eval/models/gguf.py#L24
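
To make the question concrete, here is a minimal sketch of the two scoring conventions. The per-token log-probabilities are made up, not real model output. As far as I know, the harness expects the summed value for loglikelihood requests and applies any length normalization (e.g. for acc_norm) afterwards, but treat that as an assumption to confirm with the harness maintainers.

```python
import math


def sequence_logprob(token_logprobs):
    """Sum of per-token log-probabilities, i.e. log P(continuation | context)."""
    return sum(token_logprobs)


def mean_token_logprob(token_logprobs):
    """Length-normalized variant: average log-probability per token."""
    return sum(token_logprobs) / len(token_logprobs)


# Hypothetical continuations where every token has probability 0.5.
short = [math.log(0.5)] * 2
long = [math.log(0.5)] * 8

# The summed score becomes more negative as the continuation gets longer...
print(sequence_logprob(short), sequence_logprob(long))
# ...while the per-token mean is the same for both.
print(mean_token_logprob(short), mean_token_logprob(long))
```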

spring fulcrum Jan 28, 2024, 7:07 PM

#

replied in #lm-thunderdome !

void quartz Jan 29, 2024, 2:59 AM

#

The more formal write up is up!
https://twitter.com/RWKV_AI/status/1751797147492888651
https://blog.rwkv.com/p/eagle-7b-soaring-past-transformers

RWKV (@RWKV_AI) on X

Introducing Eagle-7B

Based on the RWKV-v5 architecture, bringing into opensource space, the strongest

multi-lingual model
(beating even mistral)
attention-free transformer today
(10-100x+ lower inference)

With comparable English performance with the best 1T 7B models

Eagle 7B : Soaring past Transformers with 1 Trillion Tokens Across ...

A brand new era for the RWKV-v5 architecture and linear transformer's has arrived - with the strongest multi-lingual model in open source today

gusty condor Jan 29, 2024, 4:37 AM

#

void quartz The more formal write up is up! https://twitter.com/RWKV_AI/status/1751797147492...

Note: RWKV-4 world is 0.59T tokens, not 1.12T

void quartz Jan 29, 2024, 4:50 AM

#

ok that is my bad

#

forgot about world v1 -> v2

void quartz Jan 29, 2024, 5:05 AM

#

gusty condor Note: RWKV-4 world is 0.59T tokens, not 1.12T

corrected the blog:
https://blog.rwkv.com/p/eagle-7b-soaring-past-transformers

also added a correction tweet

alpine ferry Jan 29, 2024, 12:54 PM

#

https://arxiv.org/pdf/2401.15077.pdf : what are the odds haha

young sparrow Jan 29, 2024, 3:56 PM

#

We should search the training data for "As an AI language model" and "OpenAI" and document the frequency of such data.

#

I've been playing around with it at https://rwkv-demo-api.recursal.ai/ and am getting a lot of undesirable outputs 😦

Chatbot UI

ChatGPT but better.

#

It seems very contaminated 😬

acoustic knoll Jan 29, 2024, 4:24 PM

#

young sparrow We should search the training data for "As an AI language model" and "OpenAI" an...

The issue has been discussed a few times in the rwkv discord server. I also hope that this openai stuff can be removed in the next training run. I tried to suppress it with finetuning, but the model still remembers it.

young sparrow Jan 29, 2024, 4:33 PM

#

acoustic knoll The issue has been discussed few times in rwkv discord server. I hope also that ...

Good data filtering and documentation is essential 😮‍💨 We'll learn and do better next time.

#

At least the model doesn't profess to be trained by OpenAI

acoustic knoll Jan 29, 2024, 4:42 PM

#

young sparrow At least the model doesn't profess to be trained by OpenAI

Sometimes it still says it is trained by openai. So I state who made the model in the system prompt as a workaround 🙂

void quartz Jan 29, 2024, 4:47 PM

#

alpine ferry https://arxiv.org/pdf/2401.15077.pdf : what are the odds haha

I looked at the date... it had to be the same day

misty igloo Jan 29, 2024, 4:58 PM

#

alpine ferry https://arxiv.org/pdf/2401.15077.pdf : what are the odds haha

the odds were 100%, since I mentioned this was the problem with using the Eagle name 🤣
#1103039376184852622 message

Eagle was taken by some other LLM sampling mechanism apparently, so I propose Hawk and Condor for RWKV v5 and 6

But yeah, didn't expect it to drop the same day lol

#

Well, at least ours isn't ALL CAPS 🤣

obsidian quest Jan 29, 2024, 5:39 PM

#

RWKV-5/6 has a curious issue (i am using minipile) - if you test multiple different random initializations (requires L24-D1024, this won't happen for L12-D768), the runs are either "good runs" or "bad runs".
I will try to find the cause for this.

young sparrow Jan 29, 2024, 5:44 PM

#

obsidian quest RWKV-5/6 has a curious issue (i am using minipile) - if you test multiple differ...

That's quite curious, and very interesting to investigate. Is the training data and random seed (other than the initialization) fixed across runs?

obsidian quest Jan 29, 2024, 5:47 PM

#

young sparrow That's quite curious, and very interesting to investigate. Is the training data ...

data order fixed, no other randomness except initialization

full flame Jan 29, 2024, 6:13 PM

#

Is RWKV-5 still trained on the Pile + other multilingual sources? (referencing this faq https://wiki.rwkv.com/basic/FAQ.html#what-is-the-dataset-that-rwkv-is-trained-on)

Also, is there a plan to release the data Eagle was trained on, or the process to recreate it?

Frequently Ask Questions

rose mango Jan 29, 2024, 7:21 PM

#

acoustic knoll The issue has been discussed few times in rwkv discord server. I hope also that ...

finetune it, and every time "openai" is output, boost the loss a little

burnt cedar Jan 29, 2024, 8:01 PM

#

young sparrow Eagle Finch Gull Hawk Ibis Jay

Ibis is nice

burnt cedar Jan 29, 2024, 8:01 PM

#

rose mango finetune it, and every time "openai" is output, boost the loss a little

Make openai a token, and oblivion it

gusty condor Jan 30, 2024, 12:07 AM

#

This is not feasible, since factual data about OpenAI would be lost as well

#

I could use DPO to force it to forget GPT and OpenAI.

#

Here are some keywords that could be filtered in conversations and chats:

("openai", "gpt3", "gpt-3", "gpt4", "gpt-4", "chatgpt", two of ("knowledge cutoff", "limited to", "september 2021", "2021-09", "截止", "2021年9月"), ("gpt architecture", "基于GPT"), "1750亿", "175 billion")
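
One way to read the keyword rule above, as a rough sketch. The function name is hypothetical, and the interpretation is an assumption: any single keyword flags a document, the inner ("gpt architecture", "基于GPT") pair is treated as "any of", and the cutoff-style hints only flag a document when at least two of them appear.

```python
# Sketch of the proposed keyword filter. Matching is plain lowercase substring search.
# Assumption: the nested ("gpt architecture", "基于GPT") group means "any of".

SINGLE_HITS = (
    "openai", "gpt3", "gpt-3", "gpt4", "gpt-4", "chatgpt",
    "gpt architecture", "基于gpt", "1750亿", "175 billion",
)
CUTOFF_HINTS = (
    "knowledge cutoff", "limited to", "september 2021",
    "2021-09", "截止", "2021年9月",
)


def looks_like_chatgpt_data(text: str) -> bool:
    """Return True if the text matches the keyword rule (hypothetical helper)."""
    lowered = text.lower()
    if any(keyword in lowered for keyword in SINGLE_HITS):
        return True
    # Cutoff-style phrases are weak signals, so require two of them together.
    return sum(hint in lowered for hint in CUTOFF_HINTS) >= 2


print(looks_like_chatgpt_data("My knowledge cutoff is September 2021."))
print(looks_like_chatgpt_data("Access is limited to staff."))
print(looks_like_chatgpt_data("OpenAI was founded in 2015."))
```

The last call also returns True for a purely factual sentence, which shows how broad plain substring matching is.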

#

Or, in RWKV-7, we can totally avoid using ChatGPT data

young sparrow Jan 30, 2024, 12:39 AM

#

gusty condor Here are some keywords that could be filtered in conversations and chats: ``` ("...

I strongly disagree with this proposed list of key words. It paints with a wide brush and also misses a lot of low-hanging fruit

#

Phrases like "As an AI language model" are much better IMO.

#

Knowledge cutoff might be a good idea though, I would be interested in seeing what data that is found in

#

But this also highlights just how important data documentation and provenance are. I strongly suspect that a lot of this was avoidable if we had paid more attention to what was being scraped, and especially downloaded from HuggingFace. The secrecy around training data sources is actively harmful to research, both our own and other people's. By keeping it hidden during training (despite the fact that it was always going to be released, as both Linux Foundation and EleutherAI policy require it) we severely limit the ability of people to inspect the data and identify issues with it.

young sparrow Jan 30, 2024, 12:47 AM

#

gusty condor Or, in RWKV-7, we can totally avoid using ChatGPT data

You mean v6 right?

gusty condor Jan 30, 2024, 12:58 AM

#

v6 is already in training and we have no chance to remove that data

young sparrow Jan 30, 2024, 1:00 AM

#

We can pause training and intervene on the training data. Whether that's the right choice is a separate question, but it's absolutely an option.

gusty condor Jan 30, 2024, 4:45 AM

#

young sparrow But this also highlights just how important data documentation and provenance is...

But the model saying "I am ChatGPT," "Based on GPT-3.5 architecture," or "My knowledge is limited to September 2021" is extremely misleading to non-technical users.
Technical users may infer that the model is using ChatGPT data (which is already de facto common practice for open-source language models) and may want to inspect it further, but most non-technical users just believe that the model itself is ChatGPT.
It takes me several turns of dialogue to tell RWKV apart from ChatGPT

young sparrow Jan 30, 2024, 4:46 AM

#

gusty condor But the model saying "I am ChatGPT," "Based on GPT-3.5 architecture," or "My kno...

I do not disagree with anything you said here, and think that language like that should be removed from the training data as much as possible.

steady ether Jan 30, 2024, 4:53 AM

#

If there's still time, we might be able to get some eyes on the remaining training data and tidy it up. It all depends on how tight the schedule is.

gusty condor Jan 30, 2024, 5:15 AM

#

steady ether If there's still time, we might be able to get some eyes on the remaining traini...

Is it really possible?
Training data looks like this:

Data:    Some training data<eos>ChatGPT dialogue data<eos>Lorem ipsum dolor
Trained: 000011111111111111111111111111111111111000000000000000111111111111

Removing ChatGPT dialog data:

Data:    Some training data<eos>Lorem ipsum dolor
Trained: 0000111111111111111111100000111111111111

The number of 1s and 0s has changed.
The key is that you must sacrifice something, either fixed context length or one full epoch.

young sparrow Jan 30, 2024, 5:16 AM

#

gusty condor Is it really possible? Training data looks like this: ``` Data: Some training...

Yes, we should train for multiple epochs instead of training on data that poisons our model into falsely representing itself as being created by OpenAI and infects it with OpenAI's political biases.

#

I am surprised and confused to learn that this is controversial.

gusty condor Jan 30, 2024, 5:25 AM

#

Blink once said that "The era of ChatGPTization is coming, everyone is using ChatGPT data to finetune their own models, and models will all become brothers of ChatGPT 😅 (Of course, I believe everyone will start to differentiate themselves later so that users cannot tell)"
https://www.zhihu.com/pin/1617311881890373632

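PENG Bo's thoughts: The era of ChatGPT for everyone is here, everyone is training on ChatGPT data, and everything is turning into a sibling of ChatGPT 😅 (of course, I believe that before long everyone...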


young sparrow Jan 30, 2024, 5:53 AM

#

Step 1 for data cleaning:

Make a list of all the data sources
Search each data source for "as an AI language model"
Tally the % of documents in each source that contain the phrase

This should be straightforward for @obsidian quest to run, or anyone else who has access to the untokenized training data, and will give a very good first look into whether the problem is many sources or a few sources with a lot of contamination.

Running this ASAP is essential, and I strongly recommend pausing v6 training until we do.

Even if we take no action, knowing is important. Currently we have no idea how bad the problem is.
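
The three steps above could be sketched roughly as follows. This assumes the untokenized data can be iterated as (source name, document text) pairs; the function name and the toy documents are hypothetical, since the real loader and source list are not described in this thread.

```python
from collections import defaultdict

PHRASE = "as an ai language model"


def contamination_by_source(docs, phrase=PHRASE):
    """Tally how many documents per source contain `phrase` (case-insensitive).

    docs: iterable of (source_name, text) pairs.
    Returns {source_name: (hits, total, percent)}.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    needle = phrase.lower()
    for source, text in docs:
        totals[source] += 1
        if needle in text.lower():
            hits[source] += 1
    return {s: (hits[s], totals[s], 100.0 * hits[s] / totals[s]) for s in totals}


# Toy stand-in for the real corpus (hypothetical sources and documents).
toy_docs = [
    ("chat_dump", "As an AI language model, I cannot do that."),
    ("chat_dump", "Sure, here is the recipe."),
    ("web_text", "The quick brown fox jumps over the lazy dog."),
]
for source, (hit, total, pct) in sorted(contamination_by_source(toy_docs).items()):
    print(f"{source}: {hit}/{total} documents ({pct:.1f}%)")
```

Streaming the documents keeps memory use flat, so the same loop should work on a full corpus given a suitable iterator.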

#

I will happily do the work if someone sends me the data.

gusty condor Jan 30, 2024, 7:02 AM

#

The situation is not encouraging: some ChatGPT data is contaminated with hallucinations. Models from 0.4B to 7B exhibit similar hallucinations when asked the same question.

#

I asked "What's the difference between DNA and RNA" in Chinese (DNA和RNA有什么区别), and every model from 0B4 to 7B says that DNA contains "squamous cell factor" (鳞状细胞素) at top_p = 0

void quartz Jan 30, 2024, 8:10 AM

#

gusty condor v6 is already in training and we have no chance to remove that data

tbh - its not too late, its only the 1.5B run now - we can redo this run
Im really all in favor of cleaning up the data first, before v6
( due to the amount of negative user response regarding this )

void quartz Jan 30, 2024, 8:12 AM

#

young sparrow Knowledge cutoff might be a good idea though, I would be interested in seeing wh...

the problem with this is, I'm sure it comes more from the chat data than from a real cutoff - might be better to clean it out (unless we fix the cutoff date correctly)

void quartz Jan 30, 2024, 8:25 AM

#

gusty condor Here are some keywords that could be filtered in conversations and chats: ``` ("...

probably should include claude and anthropic as well

obsidian quest Jan 30, 2024, 8:56 AM

#

Although censorship is annoying, we can fix it via RLHF, or by using a prompt trick such as:

```
User: {question}

A: Sure (in the language of the User's question)
```
or one-shot
```
User: (very controversial question)

A: (very detailed answer)

User: {question}

A:
```

Keeping the same training data (and same training data order) enables comparing the detailed loss curve of v6 vs v5.

On the other hand, we can start a project to download and clean all instruction data from HuggingFace.

gusty condor Jan 30, 2024, 9:45 AM

#

I will add questions about self-identity to the DPO dataset.

#

Like this (A1 = chosen, A2 = rejected):

Q: Are you GPT?

A1: No, I'm not GPT. I'm RWKV, a large language model trained by Bo Peng.

A2: Yes, I am an AI language model developed by OpenAI.

Q: Are you RWKV?

A1: Yes, I'm RWKV, an RNN language model. I'm open-source and ready for you to use!

A2: I am ChatGPT, a language model created by OpenAI. How can I assist you today?

Q: Are you ChatGPT?

A1: No, I'm not related to ChatGPT. My name is RWKV, an RNN language model.

A2: Yes, I am ChatGPT. How can I assist you today?
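
For illustration, the pairs above could be stored one JSON object per line. The field names "prompt", "chosen" and "rejected" are an assumption (a common convention for preference data), not something fixed in this thread.

```python
import json

# The three example pairs from above, in an assumed prompt/chosen/rejected layout.
pairs = [
    {
        "prompt": "Are you GPT?",
        "chosen": "No, I'm not GPT. I'm RWKV, a large language model trained by Bo Peng.",
        "rejected": "Yes, I am an AI language model developed by OpenAI.",
    },
    {
        "prompt": "Are you RWKV?",
        "chosen": "Yes, I'm RWKV, an RNN language model. I'm open-source and ready for you to use!",
        "rejected": "I am ChatGPT, a language model created by OpenAI. How can I assist you today?",
    },
    {
        "prompt": "Are you ChatGPT?",
        "chosen": "No, I'm not related to ChatGPT. My name is RWKV, an RNN language model.",
        "rejected": "Yes, I am ChatGPT. How can I assist you today?",
    },
]


def to_jsonl(records):
    """Serialize records as JSON Lines, keeping non-ASCII text readable."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


print(to_jsonl(pairs))
```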

void quartz Jan 30, 2024, 10:18 AM

#

i'd really rather we did not need DPO / prompt tricks in the first place - these are barriers to entry - besides, there will probably be new data for v6 or v7

burnt cedar Jan 30, 2024, 1:02 PM

#

void quartz tbh - its not too late, its only the 1.5B run now - we can redo this run Im real...

True, but this is happening because everyone keeps treating it like an instruction model rather than a base model

#

If we had an instruction rwkv, it would call itself eagle a lot more too

misty igloo Jan 30, 2024, 5:58 PM

#

void quartz tbh - its not too late, its only the 1.5B run now - we can redo this run Im real...

there are other reasons to redo this run too, if we need to change the formula slightly to allow the fast(er!) pure pytorch GLA style to work going forward

obsidian quest Jan 30, 2024, 6:00 PM

#

misty igloo there are other reasons to redo this run too, if we need to change the formula s...

what changes do you need

misty igloo Jan 30, 2024, 6:10 PM

#

obsidian quest what changes do you need

ideally we would rescale exp(-exp(w)) to only go between some minimum epsilon value (maybe 0.005 for float32 with chunksize 32) to 1.0

#

I can run it fine this way on existing checkpoints, it just wouldn't match the paper so it means we can't use my code for experiments

obsidian quest Jan 30, 2024, 6:12 PM

#

0.005^x is very fast decay

misty igloo Jan 30, 2024, 6:13 PM

#

yeah its just not zero

#

do you have a fast version of the 6.0 CUDA? in my tests this non-cuda code appears to be faster than the 5.2 CUDA when compiled

#

I also have a float64 version that's only 10-20% slower, but it would be nice if we could ensure fast training speed that also exactly matches the paper

hushed flare Jan 30, 2024, 7:02 PM

#

young sparrow At least the model doesn't profess to be trained by OpenAI

This is without changing the default prompt.

young sparrow Jan 30, 2024, 7:04 PM

#

Interesting. I had done "are you trained by OpenAI"

steady ether Jan 30, 2024, 7:41 PM

#

Don't forget about Google

burnt cedar Jan 30, 2024, 8:49 PM

#

gusty condor 0.4B, 1.5B respectively.

Could you please confirm if your axis is perplexity or something else

#

Rn it looks like rwkv goes from 4096 to 2^16, or 65536

#

That's good extrapolation

#

Could be a strong point for the paper if we can get comparative figures for mamba

misty igloo Jan 30, 2024, 9:12 PM

#

burnt cedar Could you please confirm if your axis is perplexity or something else

It's clearly marked if you spend the time to look back at the message history
Please reserve this channel for paper related contributions (or feel free to lurk and watch)

burnt cedar Jan 30, 2024, 9:32 PM

#

misty igloo It's clearly marked if you spend the time to look back at the message history Pl...

Ah yep took a bit more scrolling than I expected, sry about that

void quartz Jan 30, 2024, 9:42 PM

#

misty igloo ideally we would rescale exp(-exp(w)) to only go between some minimum epsilon va...

yea i saw this - the speed bump is HUGE - but we dun know if it will cause problems for the model down the line

young sparrow Jan 30, 2024, 10:02 PM

#

void quartz yea i saw this - the speed bump is HUGE - but we dun know if it will cause pr...

If the speed-bump is "huge," then not using this is throwing tens of thousands of dollars away.

misty igloo Jan 30, 2024, 10:06 PM

#

young sparrow If the speed-bump is "huge," then not using this is throwing tens of thousands o...

Now that I have it integrated into infctx trainer I'm working hard at getting it to match v5.2cuda as closely as possible in numerical precision (the same code backbone works for 5.1,5.2,6.0,7)

burnt cedar Jan 30, 2024, 10:22 PM

#

void quartz yea i saw this - the speed bump is HUGE - but we dun know if it will cause pr...

Hmm normally one would use values from 0 to 1 right?

#

So discarding values below 0.005 should not lead to many numerical instability issues

rose mango Jan 30, 2024, 10:38 PM

#

isn't the usual eps 1e-6 (0.000001), so we lose 3 decimal places of precision?

misty igloo Jan 30, 2024, 11:03 PM

#

this is just a small minimum value for exp(-exp(w)), not an added epsilon (sorry, not the best terminology)
it is used to address precision related issues within the new algorithm
the fundamental thing that changes is how much the model can purposely decide to forget in a single timestep, which goes from a maximum of 100% to 99.5%
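
A small numeric sketch of the floor on exp(-exp(w)) discussed above. The thread mentions both "rescale" and "clipping" without saying which one the actual implementation uses, so both variants below are assumptions, and the function names are hypothetical.

```python
import math

MIN_MULTIPLIER = 0.005  # value floated above for float32 with chunk size 32


def multiplier(w: float) -> float:
    """Per-step state multiplier exp(-exp(w)); mathematically always in (0, 1)."""
    return math.exp(-math.exp(w))


def clamped_multiplier(w: float, floor: float = MIN_MULTIPLIER) -> float:
    """Clipping variant (assumed): never let the multiplier drop below `floor`."""
    return max(floor, multiplier(w))


def rescaled_multiplier(w: float, floor: float = MIN_MULTIPLIER) -> float:
    """Rescaling variant (assumed): map (0, 1) affinely onto (floor, 1)."""
    return floor + (1.0 - floor) * multiplier(w)


for w in (-5.0, 0.0, 1.0, 3.0):
    print(w, multiplier(w), clamped_multiplier(w), rescaled_multiplier(w))
```

In both variants the multiplier never falls below 0.005, so the most that can be forgotten in a single timestep is 1 - 0.005 = 99.5% of the state.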

last mauve Jan 30, 2024, 11:26 PM

#

Alright all. Time to push this RWKV-v5 paper out. Current target is to have this published to arxiv by end of February. If anyone knows any gotchas for anonymity periods on that, lemme know and we can adjust.

Here are the current TODO items:

Related Work:
1. This just needs to be beefed up and turned into a proper section. Use the RWKV-v4 paper as a guide; I suspect a lot of related work items from RWKV-v4 can be ported over and added to. As always, don't copy, you need to paraphrase. (@mortal latch)
1a. More discussions on H3 and Mamba are needed in Related Works. (@mortal latch)

Design:
2. The paper is really design-heavy right now, which is great, but we need some figures/tables to make it more digestible. I suggest first moving fig. 1 to this section. If it doesn't fit, we should split it into a few smaller figs like we did in RWKV-v4, put them throughout the design section, and leave the current full fig in appendix. (@tropic minnow)
3. It would help a lot if we had a table comparing the features and architecture aspects with Mamba, RWKV-v4, Retnet, etc. Readers should understand why we're different at a glance. An example table on what I'm talking about is attached. I think we can add some more columns to table 1? If a table doesn't work, would a figure? (@misty igloo @rose mango)

Evaluations:
4. Need a set of figures on downstream tasks comparing to transformer and SSM arches (including RWKV-v4). Similar to RWKV-EMNLP's figure 5 ( @tough crane )
5. Need scaling law results like figure 4 of RWKV-EMNLP

#

6. Long-context and inference speed benchmarks need to be added. These need to be compared to dense transformers, other attention-free arches like mamba, and RWKV-v4
7. Chat examples comparing to RWKV-v4, similar to appendix M in the previous paper. This goes in Appendix B.
8. Beef up intro and improve flow ( @last mauve ) ( @spiral minnow )

#

Some things I'm unclear on:

A. I'm not sure what "7. Visualization of Model Behavior" means so not sure what to comment there
B. Do we have any multimodal results for section 8, or can we within 1 month? If not, we should remove this section and push that to a later paper.
C. What do we intend to put in Appendix E on Parameter Initializations? (@misty igloo)

misty cedar Jan 30, 2024, 11:49 PM

#

last mauve Some things I'm unclear on: A. I'm not sure what "7. Visualization of Model Beh...

Think this is v5?

#

Kinda pretty good

mortal latch Jan 30, 2024, 11:51 PM

#

last mauve **Alright all. Time to push this RWKV-v5 paper out.** Current target is to have ...

If the goal is still COLM, I don't think there is an anonymity period in place, as long as the submission itself is double-blind. Compared to RWKV-4, I think more discussions on H3 and Mamba are needed in Related Works.

gusty condor Jan 31, 2024, 12:00 AM

#

burnt cedar Could you please confirm if your axis is perplexity or something else

cross entropy loss, =log(perplexity)
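
For reference, a tiny conversion between the two quantities, assuming the loss is measured in nats (natural log). The ~7.15 figure quoted earlier in the thread is used purely as an example input.

```python
import math


def perplexity_from_loss(loss_nats: float) -> float:
    """Perplexity is exp(cross-entropy loss) when the loss is in nats."""
    return math.exp(loss_nats)


def loss_from_perplexity(perplexity: float) -> float:
    """Cross-entropy loss in nats is the natural log of perplexity."""
    return math.log(perplexity)


# A perplexity of 7.15 corresponds to a loss of roughly 1.97 nats.
print(loss_from_perplexity(7.15))
print(perplexity_from_loss(loss_from_perplexity(7.15)))
```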

gusty condor Jan 31, 2024, 12:01 AM

#

void quartz yea i saw this - the speed bump is HUGE - but we dun know if it will cause pr...

How does that rescaling relate to speed?

misty igloo Jan 31, 2024, 12:02 AM

#

gusty condor How does that rescaling relate to speed?

afaict it's only related in the sense that it's a fast v6 implementation (the rescaling or clipping is necessary in order to use the GLA style algorithm for numerical stability)

last mauve Jan 31, 2024, 12:02 AM

#

mortal latch If the goal is still COLM, I don't think there is an anonymity period in place, ...

Will add your comment on related work to my TODO

void quartz Jan 31, 2024, 12:03 AM

#

v5 multimodal is trained? i thought @paper dove was planning to do that after 7B was done

misty igloo Jan 31, 2024, 12:03 AM

#

last mauve Some things I'm unclear on: A. I'm not sure what "7. Visualization of Model Beh...

I added the section for parameter initializations because they are probably important to model performance and have changed since v4

last mauve Jan 31, 2024, 12:03 AM

#

misty cedar Think this is v5?

Cool, so we need to come up with a plan to evaluate this and include in a paper section.

last mauve Jan 31, 2024, 12:04 AM

#

misty igloo I added the section for parameter initializations because they are probably impo...

Got it, who can write that section?

last mauve Jan 31, 2024, 12:04 AM

#

last mauve **Alright all. Time to push this RWKV-v5 paper out.** Current target is to have ...

In general, whoever feels they can take one of these should reply to this with "taking #5" or something and I'll officially assign you.

gusty condor Jan 31, 2024, 12:04 AM

#

burnt cedar So discarding values below 0.005 should not lead to many numerical instability issues

But decay values near 0 do not cause numerical instability. Decay values near 1 cause that.

misty igloo Jan 31, 2024, 12:05 AM

#

gusty condor But decay values near 0 do not cause numerical instability. Decay values near ...

sorry, I refer to multiplier not 'decay' - specifically exp(-exp(w))

last mauve Jan 31, 2024, 12:06 AM

#

last mauve **Alright all. Time to push this RWKV-v5 paper out.** Current target is to have ...

void quartz Jan 31, 2024, 12:06 AM

#

young sparrow If the speed-bump is "huge," then not using this is throwing tens of thousands o...

Btw these were @misty igloo numbers

For v5.2 (for a L12-D768 model)

gpt core trainer : 72kT/s (might have bugs/issues!)
infctx trainer (using my pytorch compiled code): 52.5kT/s
infctx trainer (original cuda code): 51.5kT/s
(infctx cuda and Blink's cuda trainer have been tested at nearly the same speeds before - but the pure pytorch code might have bugs!)

It's probably not relevant to this paper, as the exp(-exp(w)) clamping will make it incompatible, but definitely useful for future training runs

If we can figure out what's needed / or broken for that jump from 52.5 to 72 kT/s, that is useful. And if it's a bug, well, even 1kT/s is a jump from cuda

gusty condor Jan 31, 2024, 12:07 AM

#

misty igloo sorry, I refer to multiplier not 'decay' - specifically exp(-exp(w))

Let's move to RWKV channel

gusty condor Jan 31, 2024, 12:07 AM

#

void quartz Btw these were @misty igloo numbers For v5.2 (for a L12-D768 model) ...

On which hardware? (looks like 4090)

misty igloo Jan 31, 2024, 12:08 AM

#

void quartz Btw these were @misty igloo numbers For v5.2 (for a L12-D768 model) ...

and more importantly, this provides a way to train v6 fast, which is why I developed it
I don't have any personal experience as to how fast @obsidian quest CUDA v6 version is, but he had said his is slow
so my hope is that this could allow us to do v6 experiments and/or retrain v6 on better data, subject to the decay limit

misty igloo Jan 31, 2024, 12:08 AM

#

gusty condor On which hardware? (looks like 4090)

yes 1x4090

gusty condor Jan 31, 2024, 12:16 AM

#

Another reason: I suspect that low quality data (like data generated by ChatGPT) is the main reason why RWKV-5 does not progress on benchmarks in later training

mortal latch Jan 31, 2024, 12:34 AM

#

last mauve In general, whoever feels they can take one of these should reply to this with "...

Thanks. Taking #1 for now and will work on others later.

misty igloo Jan 31, 2024, 12:48 AM

#

last mauve **Alright all. Time to push this RWKV-v5 paper out.** Current target is to have ...

taking #C (Appendix E: Parameter Initializations)

void quartz Jan 31, 2024, 12:48 AM

#

is there a way to "run all the evals" in lm-eval-harness?
once we fix the RWKV HF implementation, I can spin up an 8x4090 and just let it run overnight (or nights)

young sparrow Jan 31, 2024, 12:51 AM

#

void quartz is there a way to "run all the evals" in lm-eval-harness? once we fix the RWKV H...

I believe that using * as the task name will do this.

rose mango Jan 31, 2024, 12:52 AM

#

last mauve **Alright all. Time to push this RWKV-v5 paper out.** Current target is to have ...

Re: #3

State size/dimensions? Positional embedding type? (i.e. RetNet needs RoPE, but RWKV and Mamba do not use extra positional embeddings)

misty igloo Jan 31, 2024, 12:59 AM

#

If we are going to have table 1 at all, we need to dramatically improve it to avoid misleading readers.
For example, isn't Hyena at worst O(nlogn) for inference cost? and does that really account for modern code approaches to evaluating it?
Why are all these models listed as having O(N) memory complexity? and what exactly is 'memory complexity' defined as here?

#

So I don't want us to just add on to it blindly without first making sure it's reasonable in its initial form

rose mango Jan 31, 2024, 1:03 AM

#

misty igloo If we are going to have table 1 at all, we need to dramatically improve it to av...

Memory complexity is memory usage.
Compute complexity is processing usage.

misty igloo Jan 31, 2024, 1:05 AM

#

rose mango Memory complexity is memory usage. Compute complexity is processing usage.

I think you're missing some important background here about how this chart was added and where that term was copied from
No recurrent model uses N memory w/ regard to sequence length during inference

#

I'm not confused about it, I'm pointing out a severe problem with the table 🙂

rose mango Jan 31, 2024, 1:05 AM

#

wait

#

I forget

#

isn't this like the retnet paper table

misty igloo Jan 31, 2024, 1:05 AM

#

yes

rose mango Jan 31, 2024, 1:05 AM

#

that thing...

misty igloo Jan 31, 2024, 1:05 AM

#

which was wrong

#

and misleading

#

and we copied it and made our own misleading and wrong table

#

it's fine to have a table, but it has to be clear and correct

rose mango Jan 31, 2024, 1:06 AM

#

we should probably rewrite the table entirely

misty igloo Jan 31, 2024, 1:06 AM

#

rose mango we should probably rewrite the table entirely

please feel free to go ahead and do so! (and please define the terms you use in it in the legend below it)

rose mango Jan 31, 2024, 1:07 AM

#

In any case, I think "inference cost" is fine and relevant

#

I'll try some ideas tonight

misty igloo Jan 31, 2024, 1:07 AM

#

awesome

#

(btw the original intent w/ memory complexity was to describe how much memory is used during training on a given sequence length)

#

(but that's completely unclear from the current table, and also I don't even think the values shown are correct if it was that)

rose mango Jan 31, 2024, 1:09 AM

#

I'll separate inference and training costs

misty igloo Jan 31, 2024, 1:10 AM

#

rose mango In any case, I think "inference cost" is fine and relevant

the other problem is that a true accounting of inference cost wouldn't be solely related to sequence length - there are many direct factors that can be explicitly shown that cause these costs e.g. head size, d_model, etc.

#

many papers include these metrics explicitly in their asymptotic inference/training cost formulae

#

I think it's okay to show the relationship to sequence length vs other architectures, but not if we don't mention any other factors anywhere in the paper relating to other models

rose mango Jan 31, 2024, 1:13 AM

#

Also, isn't flash attention O(n) for memory usage? I'd have to mention that as well if I mention memory usage, since no one uses vanilla transformer. It's an unrealistic baseline.

rose mango Jan 31, 2024, 1:14 AM

#

misty igloo I think it's okay to show the relationship to sequence length vs other architect...

A brief discussion on this following the table perhaps?

The table could be given a better description like "performance characteristics for a sequence length" or something more appropriate

#

Rather than a generic comparison of model architectures

misty igloo Jan 31, 2024, 1:15 AM

#

let's take further discussion of Table 1 offline (tho realistically I may be too busy to discuss much right now) maybe you can come up with a proposed version and put it in the paper

rose mango Jan 31, 2024, 1:18 AM

#

yes, I'll work on that

gusty condor Jan 31, 2024, 1:40 AM

#

rose mango In any case, I think "inference cost" is fine and relevant

1. Add training time complexity. Transformer is O(N^2), RWKV-5 is O(N), RWKV-6 is like O(NlogN) but I'm not sure.
2. Parallelization: checkmark if an efficient parallelization method (across any dimension) exists, xmark otherwise
3. Memory complexity: RWKV and RNNs are O(1)? RWKV has constant VRAM usage.

misty igloo Jan 31, 2024, 1:41 AM

#

gusty condor 1. Add training time complexity. Transformer is O(N^2), RWKV-5 is O(N), RWKV-6 i...

RWKV-6 is still O(N) train time w/ regard to sequence length (unless you know a magic trick I don't 🙂 in which case definitely let me know so I can code it)

misty igloo Jan 31, 2024, 1:44 AM

#

gusty condor 1. Add training time complexity. Transformer is O(N^2), RWKV-5 is O(N), RWKV-6 i...

pretty sure the original memory complexity figure in the retnet paper was supposed to be for during parallelized training, which is not O(1) for us... more like O(N)

#

but it really depends what this term is defined as referring to

gusty condor Jan 31, 2024, 1:46 AM

#

The memory complexity is for training, isn't it? RWKV has O(1) memory usage in inference

misty igloo Jan 31, 2024, 1:59 AM

#

gusty condor The memory complexity is for training, isn't it? RWKV has O(1) memory usage in i...

@rose mango we can and should show both

rose mango Jan 31, 2024, 2:26 AM

#

misty igloo @rose mango we can and should show both

That's my plan

I'm going to break it up into two sections: inference cost & training cost
Under each, there will be compute/time and memory complexity

void quartz Jan 31, 2024, 2:33 AM

#

Guide to run lm-eval with Eagle

Clone the usual lm-eval-harness, and comment out the following lines in huggingface.py (around line 242)

# else:
#     self.tokenizer.add_special_tokens({"pad_token": "<|pad|>"})

Alternatively use the following repo: https://github.com/redbrain/lm-evaluation-harness/
(we might need an official way/config to disable this line)

Perform your lm-eval harness setup as per normal

Run the evals using something like the following (modify as needed)

accelerate launch -m lm_eval --model hf --model_args pretrained=RWKV/rwkv-5-world-7b,trust_remote_code=True --tasks hellaswag --batch_size 64 --log_samples --output_path ./results/Eagle-7B-1T/

This was tuned to run on 4090s; it finishes in under 10 minutes on 8x nodes (batch_size 64 !!!) and gives the following results

|  Tasks  |Version|Filter|n-shot| Metric |Value |   |Stderr|
|---------|------:|------|-----:|--------|-----:|---|-----:|
|hellaswag|      1|none  |     0|acc     |0.5264|±  |0.0050|
|         |       |none  |     0|acc_norm|0.7085|±  |0.0045|

(acc_norm is consistent with blink's result)

According to harrison's earlier benchmarks, there are probably some improvements that can be made to the inference code settings to push much larger batch sizes (we should be able to go much higher)

If you're running GPUs with much more VRAM, you can probably get away with batch_size 128 or even 512

#

will run the evals and upload the jsonl to HF, one letter batch at a time - so someone else can crunch the numbers (or replicate and verify)

young sparrow Jan 31, 2024, 2:44 AM

#

void quartz # Guide to run lm-eval with Eagle 1) Clone the usual lm-eval-harness, and comme...

What padding token does RWKV use?

obsidian quest Jan 31, 2024, 3:16 AM

#

[0] = endofdoc

young sparrow Jan 31, 2024, 3:30 AM

#

obsidian quest [0] = endofdoc

Do you know how that's encoded in the HF implementation?

obsidian quest Jan 31, 2024, 3:32 AM

#

should be similar to neox tokenizer

young sparrow Jan 31, 2024, 3:34 AM

#

Hmmm. @void quartz's eval harness patch implies that we are misparsing the padding token, but we're reading it directly from the HF library.

obsidian quest Jan 31, 2024, 3:36 AM

#

<|pad|> does not occur in world tokenizer

void quartz Jan 31, 2024, 3:36 AM

#

Might be the other way. Since we did a custom world tokenizer implementation - we might have broken spec on something

#

I normally use token 0 and mask it away for right padding in training
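A minimal sketch of the right-padding scheme described above, assuming token 0 doubles as both the end-of-document and the pad token. The helper name `pad_right` and the mask convention (1 = real token, 0 = padding) are illustrative and not taken from the RWKV trainer.

```python
from typing import List, Tuple

PAD_TOKEN_ID = 0  # assumption: token 0 is both end-of-document and padding


def pad_right(batch: List[List[int]]) -> Tuple[List[List[int]], List[List[int]]]:
    """Right-pad every sequence to the longest length and build a loss mask."""
    max_len = max(len(seq) for seq in batch)
    padded, mask = [], []
    for seq in batch:
        n_pad = max_len - len(seq)
        padded.append(seq + [PAD_TOKEN_ID] * n_pad)
        # Mask by position, not by value, so a genuine token 0
        # (end of document) inside the sequence still counts.
        mask.append([1] * len(seq) + [0] * n_pad)
    return padded, mask


tokens, loss_mask = pad_right([[5, 9, 0], [7]])
```

Masking by position rather than by token value matters here because the same id is reused for end of document.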

#

Alternatively we can map in <|pad|> in the world tokenizer to 0 : but that might not be a good idea either

rose mango Jan 31, 2024, 3:46 AM

#

note: Mamba complexity is O(n*log(n)) for a sequence of length n

misty igloo Jan 31, 2024, 3:49 AM

#

rose mango note: Mamba complexity is `O(n*log(n))` for a sequence of length n

For inference time complexity? it should be N
At least, I know how to write the code so it would be lol

void quartz Jan 31, 2024, 3:53 AM

#

I think we just follow their paper claim. Let’s not accidentally do what retnet did to us

rose mango Jan 31, 2024, 3:56 AM

#

From the paper,

void quartz Jan 31, 2024, 4:33 AM

#

young sparrow Hmmm. @void quartz's eval harness patch implies that we are misparsing ...

is there a chance this affects evals?

#

or is it more of an efficiency thing

#

(can move this to lm-thunderdome if needed)

gusty condor Jan 31, 2024, 4:44 AM

#

We should try a similar experiment on RWKV. RWKV (ctx4096) will be the orange line without any fine tuning.

young sparrow Jan 31, 2024, 5:07 AM

#

void quartz is there a chance this affects evals?

That's a question for you. Why do you edit the library like that? If you don't do so, does the library become inefficient or does the performance die

void quartz Jan 31, 2024, 5:08 AM

#

more like it crashes (because our tokenizer does not allow this to be set)

#

we dun have <|pad|> token

young sparrow Jan 31, 2024, 5:08 AM

#

So yes it affects evals if they crash 😛

void quartz Jan 31, 2024, 5:10 AM

#

my guess is we need to replace it with something else, but what ?

#

token 0 is probably the candidate

young sparrow Jan 31, 2024, 5:15 AM

#

Does RWKV have a padding token?

void quartz Jan 31, 2024, 5:16 AM

#

we just use token 0 as our pad token

#

and our end of document token

young sparrow Jan 31, 2024, 5:17 AM

#

So try token 0 and see how/if that changes the evaluations

void quartz Jan 31, 2024, 5:27 AM

#

Change to asserting it's 0 (since we did not code in a setter for the world tokenizer, and it's already set as 0)

else:
    assert self.tokenizer.pad_token_id == 0

Inside the codebase there is no other reference to .pad_token; the rest of the code base only reads .pad_token_id, whose value presumably gets changed when .pad_token is set on a normal tokenizer?

#

no change to result

misty igloo Jan 31, 2024, 5:34 AM

#

void quartz I think we just follow their paper claim. Let’s not accidentally do what retnet ...

Accuracy is paramount, but contextual correctness is also essential to that effort
The answer undeleted gave was unfortunately incorrect (even according to the paper) for my clarified term (inference) I gave above

misty igloo Jan 31, 2024, 5:34 AM

#

rose mango From the paper,

void quartz Jan 31, 2024, 5:35 AM

#

there might also be a similar situation for "qwen" model, via the "model_type" ?
the previous elif logic for eos_token_id does not work for us because ours is zero

void quartz Jan 31, 2024, 5:36 AM

#

misty igloo Accuracy is paramount, but contextual correctness is also essential to that effo...

maybe we should just reach out to them, and ask them what they want us to put

misty igloo Jan 31, 2024, 5:37 AM

#

void quartz maybe we should just reach out to them, and ask them what they want us to put

sure, let's do that
but as a starting point of what to ask them to verify, it's very clear in the paper: they train using O(nlogn) for a sequence length n and run inference autoregressively as O(1) per token output

#

this is the problem with the current table - it has to make clear exactly what each column means, and then be correct for that specific term defined

#

I like the idea of asking the authors of all the papers that we have in the table to ensure we got it right!

#

but first, let's get a draft that can pass even my minimal review of its data 😉

void quartz Jan 31, 2024, 5:43 AM

#

void quartz there might also be a similar situation for "qwen" model, via the "model_type" ?...

reading through the logic, the code was meant to be a safeguard in event that the "pad_token" is not set - (first pass) : the problem is all the safeguards/checks will fail for our tokenizer cause our value is literally 0
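The general pitfall can be sketched as follows. This is not the harness's actual code: `FakeTokenizer` and both helper functions are hypothetical, and the only thing taken from the discussion is that a pad id of 0 is a valid value that "is it set?" style checks can mistake for unset.

```python
from typing import Optional


class FakeTokenizer:
    """Hypothetical stand-in for a tokenizer whose pad id is fixed at 0."""

    pad_token_id: Optional[int] = 0


def has_pad_truthy(tok) -> bool:
    # Fragile pattern: 0 is falsy, so a valid id of 0 looks "unset".
    return bool(tok.pad_token_id)


def has_pad_explicit(tok) -> bool:
    # Safer pattern: only None means "unset".
    return tok.pad_token_id is not None


tok = FakeTokenizer()
print(has_pad_truthy(tok), has_pad_explicit(tok))  # prints "False True"
```

Comparing against `None` explicitly keeps a legitimate id of 0 from falling through to the fallback branch.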

rose mango Jan 31, 2024, 6:03 AM

#

misty igloo

🤦

oh yes, duh

If you have recurrent inference, processing a context of length n is always O(n) and generating a single token is O(1)
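A toy illustration of that cost split, assuming a scalar linear recurrence in place of a real model. The function names and the decay value are made up for the example; the point is only that each token triggers one fixed-size state update.

```python
def step(state: float, x: float, decay: float = 0.9) -> float:
    """One recurrent update: constant work and constant memory."""
    return decay * state + x


def process_context(xs, decay: float = 0.9):
    """Consume a context of length n with exactly n calls to step()."""
    state, n_steps = 0.0, 0
    for x in xs:
        state = step(state, x, decay)
        n_steps += 1
    return state, n_steps


state, n_steps = process_context([1.0] * 8)  # prefill: 8 updates for 8 tokens
state = step(state, 1.0)                     # generation: 1 update per new token
```

The state has the same size after 8 tokens as after 8000, which is where the O(1) inference memory figure comes from.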

misty igloo Jan 31, 2024, 6:04 AM

#

rose mango 🤦 oh yes, duh If you have recurrent inference, processing a context of lengt...

no worries, just make a table (in overleaf, or DM a picture to me if u like) and I'll take a look and we can decide if the columns etc. make sense to use

young sparrow Jan 31, 2024, 6:10 AM

#

void quartz reading through the logic, the code was meant to be a safeguard in event that th...

Okay, so open a PR that adds a second check for RWKV models and sets the pad token correctly 🙂

void quartz Jan 31, 2024, 6:11 AM

#

young sparrow Okay, so open a PR that adds a second check for RWKV models and sets the pad tok...

will follow the config.model_type pattern that was used for qwen

young sparrow Jan 31, 2024, 6:12 AM

#

void quartz will follow the `config.model_type` pattern that was used for qwen

Perfect.

void quartz Jan 31, 2024, 6:36 AM

#

https://github.com/EleutherAI/lm-evaluation-harness/pull/1374

GitHub

Add support for RWKV models with World tokenizer by PicoCreator · P...

The RWKV line of model with the World tokenizer, does not allow the padding token to be configured, and has its value preset as 0
This however fails all the "if set" checks, and would cau...

#

considering that our model does not output 0, unless it's used as end of document - i dun think it would affect the eval? (i still dun 100% understand what happens on the layers above)

undone solstice Jan 31, 2024, 3:43 PM

#

last mauve **Alright all. Time to push this RWKV-v5 paper out.** Current target is to have ...

Hi, thanks for compiling this TODO list, I very much would like to contribute to this project. In terms of evals, I’m wondering if the paper would be more interesting to have results on coding benchmarks like humaneval and MBPP, and also instruction finetuning on code instruction data. If that sounds interesting, I can get some results in the next two weeks. 👀

tropic minnow Jan 31, 2024, 6:15 PM

#

I think we should try to simplify the architecture explanation part, especially in public comms. This should not happen😅

Captura_de_Pantalla_2024-01-31_a_las_19.14.35.png

#

i might try to do a more simple & compelling figure like in rwkv4, and we can have this one for the deep divers (actually the same figure might work, just need to change the elementwise mul by a matmul in r*wkv + gating + w_lora in rwkv6)

Captura_de_Pantalla_2024-01-31_a_las_19.16.38.png

#

also @void quartz @obsidian quest i got the mathematical connection between transformers and RWKV, which i think coders and newcomers might grasp much faster, and it's twitter/blog friendly (should i post it lol?)

Captura_de_Pantalla_2024-01-31_a_las_19.25.42.png

#

it can also make a good appendix for the paper @last mauve

misty cedar Jan 31, 2024, 8:00 PM

#

tropic minnow also @void quartz @obsidian quest i got the mathematical connecti...

The average person should know all about Sigma Notation, and possibly even Trident arithmetic.

#

[I say this as someone who cant sight read math symbols]

misty igloo Jan 31, 2024, 8:01 PM

#

yeah coders gonna code (myself included 🙂 )

misty igloo Jan 31, 2024, 8:06 PM

#

tropic minnow i might try to do a more simple & compelling figure like in rwkv4, and we can ha...

updated simple block diagram would be great, maybe you can add separate zooms/foldouts for the new complicated bits (DDLerp LoRA etc) so its digestible at a glance but then you can drill down to those if desired

indigo crater Jan 31, 2024, 9:29 PM

#

tropic minnow I think we should try to simplify the architecture explanation part, especially ...

i am at least a medium tier architecture nerd and have yet to actually finish any of my attempts to read the rwkv architecture if that datapoint helps

#

i basically lurk in here in the hope that at some point someone will state what the rwkv architecture is in some fashion i will understand without having to devote three days and fifteen pots of coffee to the endeavor

misty cedar Jan 31, 2024, 9:51 PM

#

indigo crater i am at least a medium tier architecture nerd and have yet to actually finish an...

take the k and v values and multiply every pair into a matrix, i.e. the outer product ( [k0v0, k0v1, ..., k1v0, k1v1,...,...])
then cumsum with decay: kv[t] = kv[t-1] * w + k[t]v[t],
then use that new matrix as a data-dependent linear layer
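A minimal single-head sketch of those three steps in plain Python, assuming the cumulative sum adds the current outer product after decaying the previous state, and that the decay `w` applies per key channel (per row). It leaves out the "time-first" `u` bonus, multiple heads, gating, normalization, and the data-dependent decay of v6, so treat it as an illustration rather than the reference kernel.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def outer(k: Vector, v: Vector) -> Matrix:
    """Outer product: entry [i][j] is k[i] * v[j]."""
    return [[ki * vj for vj in v] for ki in k]


def update_state(state: Matrix, k: Vector, v: Vector, w: Vector) -> Matrix:
    """Decay row i of the state by w[i], then add the new outer product."""
    kv = outer(k, v)
    return [
        [w[i] * state[i][j] + kv[i][j] for j in range(len(v))]
        for i in range(len(k))
    ]


def read_state(r: Vector, state: Matrix) -> Vector:
    """Use the state as a data-dependent linear layer: out = r @ state."""
    return [
        sum(r[i] * state[i][j] for i in range(len(r)))
        for j in range(len(state[0]))
    ]


state = [[0.0, 0.0], [0.0, 0.0]]
w = [0.5, 0.5]  # assumed per-channel decay in (0, 1)
state = update_state(state, k=[1.0, 0.0], v=[2.0, 3.0], w=w)
state = update_state(state, k=[0.0, 1.0], v=[4.0, 5.0], w=w)
out = read_state([1.0, 1.0], state)  # [5.0, 6.5]
```

After the second update the first row has been halved once, so older key/value pairs fade while the newest one enters at full strength.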

#

that helpful?

indigo crater Jan 31, 2024, 9:54 PM

#

misty cedar that helpful?

yup

tropic minnow Jan 31, 2024, 9:59 PM

#

indigo crater i am at least a medium tier architecture nerd and have yet to actually finish an...

great! just check this: #1103039376184852622 message . the time-mix is the difference. the MLP is easy and ddlerps (token-mix) are just a tiny conv. I would recommend the RWKV discord server (https://discord.gg/PPMZNsY2KH) for more RWKV things (learning, experiments, etc), as this channel should be for coordinating the paper🙂

rose mango Jan 31, 2024, 10:14 PM

#

Table 1 now provides a reasonable and accurate comparison for model training/inference performance.

If we're comparing features and details with Mamba, RWKV-4, and RetNet (positional embedding scheme, decay schedule, etc.), I think that would best be done in another table or figure.

obsidian quest Feb 1, 2024, 2:06 AM

#

https://twitter.com/BlinkDL_AI/status/1752875977510961502

BlinkDL (@BlinkDL_AI) on X

RWKV-5 "Eagle" 7B is Mistral-7B level for language modeling of unseen arxiv CS & Physics papers, and significantly better than Llama2🐦We are testing more new data.

weak urchin Feb 1, 2024, 2:10 AM

#

Makes me want to try the same for 70B ....

last mauve Feb 1, 2024, 7:23 PM

#

undone solstice Hi, thanks for compiling this TODO list, I very much would like to contribute to...

Yep that'd be great! Please follow up with @void quartz on who to work with. I don't know who the de facto eval king is rn

last mauve Feb 1, 2024, 7:25 PM

#

tropic minnow i might try to do a more simple & compelling figure like in rwkv4, and we can ha...

This is perfect, and what I'm getting at with my "we should split it into a few smaller figs like we did in RWKV-v4, put them throughout the design section, and leave the current full fig in appendix."

I'll put you down for TODO #2 for now then.

#

Also assigned @misty igloo and @rose mango to handle the table for now

#

I'm going to start making these sections flow a bit, and will beef up the intro

last mauve Feb 1, 2024, 7:27 PM

#

tropic minnow also @void quartz @obsidian quest i got the mathematical connecti...

I love this. A succinct diff between v4/v5.

I think an appendix or blog would be great. Up to you which one we go for @tropic minnow

misty igloo Feb 1, 2024, 7:27 PM

#

last mauve Also assigned @misty igloo and @rose mango to handle the tab...

Thanks, we got the initial table 1 done - I'll see what @rose mango wants to try to do for a second table

spiral minnow Feb 1, 2024, 7:28 PM

#

last mauve **Alright all. Time to push this RWKV-v5 paper out.** Current target is to have ...

I'll work on beefing up the introduction, as well as just general editing on the rest of the paper.
I may not get around to it until next week though

last mauve Feb 1, 2024, 7:32 PM

#

rose mango Table 1 now provides a reasonable and accurate comparison for model training/inf...

We may also be able to differentiate ourselves here by adding practical details on how the actual models/frameworks differentiate to a separate table, like:

Open training code
Open inference code
Open dataset
Tokens trained on
Context length
Training hyperparams included
etc

We compare favorably on a lot of these and should bring attention to it

spiral minnow Feb 1, 2024, 7:36 PM

#

A question for framing in the paper, do we want to refer to the v5 model as Eagle, or do we always need to reference it as RWKV-5? I think it would be nice to have a consistent name that we use

last mauve Feb 1, 2024, 7:37 PM

#

misty igloo Thanks, we got the initial table 1 done - I'll see what @rose mango wa...

Yep I think the first table is good as-is to just compare the arch. What I'm proposing above in #1103039376184852622 message is to create a second table comparing the overall work. I can come up with some more categories if you both like the idea.

misty igloo Feb 1, 2024, 7:38 PM

#

last mauve Yep I think the first table is good as-is to just compare the arch. What I'm pro...

Yes, would be great if you could propose some to get an idea of what you're thinking of

last mauve Feb 1, 2024, 7:40 PM

#

spiral minnow A question for framing in the paper, do we want to refer to the v5 model as Eagl...

Good question. @obsidian quest @misty igloo @young sparrow -- What are your opinions? I think we call it Eagle in the paper, and include a footnote "we refer to RWKV-v5 as eagle" just because RWKV-v5 has been publicly communicated before.

This also brings up a second question, do we call RWKV-v4 Raven or something in the paper?

last mauve Feb 1, 2024, 7:41 PM

#

spiral minnow I'll work on beefing up the introduction, as well as just general editing on the...

Adding you along with me for TODO #8

spiral minnow Feb 1, 2024, 7:42 PM

#

last mauve Good question. @obsidian quest @misty igloo @young sparrow...

I agree, I think we can introduce it as RWKV-5 Eagle, and from there on out, just Eagle is enough

young sparrow Feb 1, 2024, 7:43 PM

#

I'm thinking a sentence like "Eagle is the fifth generation of the RWKV architecture (Peng et al., 2023)"

misty igloo Feb 1, 2024, 7:45 PM

#

young sparrow I'm thinking a sentence like "Eagle is the fifth generation of the RWKV architec...

might that cause an anonymity problem phrased that way?

young sparrow Feb 1, 2024, 7:45 PM

#

About as much as "we used a TPUv5 for three months" does 😛

Let's check out what Mamba says about its relationship to S4 as a guide, perhaps?

misty igloo Feb 1, 2024, 7:48 PM

#

They don't seem to, though they do call it Mamba-S6 (there is a Mamba-S4 variant they propose, too)

#

Remark 3.1. For brevity in our experimental results, we sometimes abbreviate selective SSMs as S6 models, because they
are S4 models with a selection mechanism and computed with a scan.

#

Technically they call Mamba the architectural layout, and S6 the [now selective] SSM mechanism

steady ether Feb 1, 2024, 7:50 PM

#

Can we use an acronym, so that eagle would actually mean something. Maybe:

EAGLE = Efficient Artificial Generative Language Engine/Expert

misty igloo Feb 1, 2024, 7:51 PM

#

steady ether Can we use an acronym, so that eagle would actually mean something. Maybe: EAGL...

oh boy, I hope we don't have to come up with a new reason for every bird we use 🤣 like Finch even in this paper

rose mango Feb 1, 2024, 7:54 PM

#

misty igloo oh boy, I hope we don't have to come up with a new reason for every bird we use ...

ask ChatGPT to expand it

last mauve Feb 1, 2024, 7:54 PM

#

@everyone -- Also, just to be explicit, authorship is purely merit-based again. You don't get free authorship as just an RWKV code contributor or as an author on the RWKV-v4 paper, including me.

Similarly to RWKV-v4, authors will be decided based on who meaningfully improves the paper itself. Some examples of authorship:

Writing a paper section explaining yours or someone else's code in a meaningful way
Taking results and plotting them
Meaningfully improving the paper writing (e.g. significant revisions, rewrites, etc)

What won't count as authorship:

Pure proofreading
Being an RWKV code contributor without your contribution ending up in the paper
Just discord discussions or leaving paper comments

In short, we need to be able to write an "Author Contributions" section for you with some meaningful content. A bunch of examples are in the RWKV-v4 paper's appendix B.

In general, the bar for authorship is not terribly high to encourage community involvement, but the bar will be there nonetheless to deter those trying to exploit and I will enforce it.

last mauve Feb 1, 2024, 7:55 PM

#

spiral minnow I agree, I think we can introduce it as RWKV-5 Eagle, and from there on out, jus...

I like this idea via @young sparrow's phrasing. Let's go with that.

rose mango Feb 1, 2024, 7:55 PM

#

last mauve We may also be able to differentiate ourselves here by adding practical details ...

Open source training & inference code, open dataset (perhaps whether the hyperparameters used are included as well?), total tokens, context length

last mauve Feb 1, 2024, 7:56 PM

#

misty igloo might that cause an anonymity problem phrased that way?

Any group publishing on a persistent project faces this. We won't explicitly state we're from the RWKV team, but you're right that it will be obvious. Nothing we can do about that.

last mauve Feb 1, 2024, 7:59 PM

#

rose mango Open source training & inference code, open dataset (perhaps whether the hyperpa...

updated #1103039376184852622 message

rose mango Feb 1, 2024, 8:02 PM

#

Excellent. I'll work on putting that together after my class.

Main models to compare would probably be Facebook's LLaMA series, Mistral, Phi(maybe?), and possibly even OAI's GPT-4

alpine ferry Feb 1, 2024, 8:19 PM

#

do we have any target date in mind when we plan to publish/arxiv the paper we are editing on overleaf?

last mauve Feb 1, 2024, 8:22 PM

#

alpine ferry do we have any target date in mind when we plan to publish/arxiv the paper we ar...

End of Feb: #1103039376184852622 message

last mauve Feb 1, 2024, 8:22 PM

#

last mauve @everyone -- Also, just to be explicit, authorship is purely merit-based again. ...

steady ether Feb 1, 2024, 8:32 PM

#

Added a short blurb on associative recall tasks. I'm assuming my RWKV-5 code is functioning, as the zoology authors reviewed the changes. They mentioned the possibility of sharing wandb logs for their other experiments after the ICML deadline.

void quartz Feb 2, 2024, 6:52 AM

#

btw - are there any known test suites that are broken?
i realise it was probably a dumb idea to do a* b* ... only to come back and see some tasks having errors

#

and not having any output, as 1 failed

void quartz Feb 2, 2024, 9:05 AM

#

undone solstice Hi, thanks for compiling this TODO list, I very much would like to contribute to...

i would say the needle in haystack - using claude format
https://github.com/Arize-ai/LLMTest_NeedleInAHaystack2

GitHub

GitHub - Arize-ai/LLMTest_NeedleInAHaystack2: Doing simple retrieva...

Doing simple retrieval from LLM models at various context lengths to measure accuracy - GitHub - Arize-ai/LLMTest_NeedleInAHaystack2: Doing simple retrieval from LLM models at various context lengt...

#

though we might need an instruct tune first - but might be good to know the baseline as well

undone solstice Feb 2, 2024, 9:18 AM

#

void quartz i would say the needle in haystack - using claude format https://github.com/Ariz...

Thanks, I can try this out, will share my findings here. Do you think code generation benchmark is worth doing? I can also help with evals on HumanEval

void quartz Feb 2, 2024, 9:19 AM

#

im taking the approach of trying to run as much as possible first, then leave it to the more experienced authors to decide - so sure to humaneval haha

gusty condor Feb 2, 2024, 12:10 PM

#

I have Evals on AlignBench (Chinese alignment)

real warren Feb 3, 2024, 1:23 PM

#

I was talking over at #992359629419991142 about Eagle and wondering about the out-of-the-box machine translation capabilities of these new RWKV-X models against SOTA LLM-based systems. I may have some time this month to try some evals. Is there more info somewhere about the dataset used (and its language coverage)? It doesn't seem to be in the Overleaf doc at the moment. I want to know so that there isn't any data leakage in my initial tests.

obsidian quest Feb 3, 2024, 3:04 PM

#

real warren I was talking over at <#992359629419991142> about Eagle and wondering about the ...

I added Tatoeba

real warren Feb 3, 2024, 3:09 PM

#

obsidian quest I added Tatoeba

https://huggingface.co/datasets/Helsinki-NLP/tatoeba_mt all language pairs?

Helsinki-NLP/tatoeba_mt · Datasets at Hugging Face

obsidian quest Feb 3, 2024, 3:29 PM

#

real warren https://huggingface.co/datasets/Helsinki-NLP/tatoeba_mt all language pairs?

yes. should be clean from other translation datasets.

real warren Feb 3, 2024, 4:25 PM

#

obsidian quest yes. should be clean from other translation datasets.

Was there any particular format, e.g. the Instruction/Input/Response prompt style (https://huggingface.co/RWKV/HF_v5-Eagle-7B), used for the Tatoeba pairs at train time?

RWKV/HF_v5-Eagle-7B · Hugging Face

#

Asking since in my experience, evaluating with the same prompt that MT pairs may have been seen with during training seems to better bring to light the innate translation capabilities of LLMs

obsidian quest Feb 3, 2024, 4:52 PM

#

sth like

English: xxx
French: xxx
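A small sketch of turning that two-line shape into a k-shot evaluation prompt. Only the `Language: text` line format comes from the message above; the function name `build_mt_prompt`, the blank-line separator between examples, and the open final target line are assumptions for illustration.

```python
from typing import List, Tuple


def build_mt_prompt(
    shots: List[Tuple[str, str]],
    source: str,
    src_lang: str = "English",
    tgt_lang: str = "French",
) -> str:
    """Format k solved examples plus one open query as 'Lang: text' lines."""
    blocks = [f"{src_lang}: {s}\n{tgt_lang}: {t}" for s, t in shots]
    # Leave the target line open so the model completes the translation.
    blocks.append(f"{src_lang}: {source}\n{tgt_lang}:")
    return "\n\n".join(blocks)


prompt = build_mt_prompt([("Good morning.", "Bonjour.")], "Thank you.")
```

Passing an empty `shots` list gives the zero-shot variant of the same prompt.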

acoustic knoll Feb 3, 2024, 5:01 PM

#

real warren I was talking over at <#992359629419991142> about Eagle and wondering about the ...

Hi, how do you evaluate the MT? I finetuned the smaller RWKV-v5 models (1.5B and 3B) with 40B tokens of my language, including the parallel corpora (tatoeba and wikihow), so I want also to evaluate its translation capability and compare it with other MT models (marian,..).

real warren Feb 3, 2024, 5:22 PM

#

acoustic knoll Hi, how do you evaluate the MT? I finetuned the smaller RWKV-v5 models (1.5B and...

I was thinking of a preliminary sentence-level evaluation (with k-shots) on the latest test sets from WMT23 and Flores, reporting traditional n-gram matching metrics BLEU/chrF++ with sacreBLEU (https://github.com/mjpost/sacrebleu) and a newer (and more recommended) neural metric like COMETX (https://github.com/Unbabel/COMET). I was also thinking of using the recent tower-eval eval suite from Unbabel (https://huggingface.co/datasets/Unbabel/TowerEval-Data-v0.1).

GitHub

GitHub - mjpost/sacrebleu: Reference BLEU implementation that auto-...

Reference BLEU implementation that auto-downloads test sets and reports a version string to facilitate cross-lab comparisons - GitHub - mjpost/sacrebleu: Reference BLEU implementation that auto-dow...

#

As for baselines, I was thinking of testing some SOTA multilingual MT Enc-Dec like NLLB and some dec-only model like Tower/ALMA-R (Llama2 variants) or GPT-4

acoustic knoll Feb 3, 2024, 6:41 PM

#

Thanks a lot, I will have a look.

#

From my experience, rwkv can translate very well sentences up to one or two short paragraphs. But, the translation result is getting worse with more and longer paragraphs

obsidian quest Feb 3, 2024, 8:45 PM

#

acoustic knoll From my experience, rwkv can translate very well sentences up to one or two shor...

i only trained translation of several sentences or 1-2 paragraphs. split your text into chunks
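One way to do that chunking, sketched under the assumption that paragraphs are separated by blank lines and that two paragraphs per chunk is a reasonable default. The function name and the default are illustrative, not part of any RWKV tooling.

```python
from typing import List


def chunk_paragraphs(text: str, max_paragraphs: int = 2) -> List[str]:
    """Split text on blank lines and regroup into chunks of a few paragraphs."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return [
        "\n\n".join(paragraphs[i : i + max_paragraphs])
        for i in range(0, len(paragraphs), max_paragraphs)
    ]


chunks = chunk_paragraphs("A.\n\nB.\n\nC.")  # ["A.\n\nB.", "C."]
```

Each chunk can then be translated on its own and the outputs joined back in order.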

acoustic knoll Feb 3, 2024, 9:27 PM

#

obsidian quest i only trained translation of several sentences or 1-2 paragraphs. split your t...

Yes that’s fine because the tatoeba parallel corpora has only few sentences for each entry.

void quartz Feb 4, 2024, 2:28 AM

#

~~can we add support for temp=0 into the inference code, cause several benchmarks rely on that~~
Saw on HF - they recommend fixing those benchmarks

obsidian quest Feb 4, 2024, 5:14 AM

#

obsidian quest try this latest improvement for v5 v6 if you have compute: change gate to d=64 l...

after using lora for gate, this is better for me (same params count):
args.dim_att = args.n_embd * 3 // 2
args.dim_ffn = args.n_embd * 3
UPDATE: it's worse after training for 1G+ tokens

tropic minnow Feb 4, 2024, 1:11 PM

#

thoughts on this? feedback welcome

Captura_de_Pantalla_2024-02-04_a_las_14.11.05.png

#

my main insecurity is the W part. tried to picture the (maybe=v6) W dependence on the data, as well as dependence on W_{t-1} due to the product.

#

also suggestions for a better sign than "@" for matmul are welcome. I thought about "X" but we used that to denote element-wise product in RWKV4. so if we dont modify that to the circle-dot, i dont feel comfortable using it for smth else here as ppl will put the 2 figs side by side to see whats changed

#

it's basically intended to replace the left diagram in the rwkv-v4 figure:

Captura_de_Pantalla_2024-02-04_a_las_14.16.55.png

gusty condor Feb 4, 2024, 1:18 PM

#

I think that we could use an entirely new diagram for better representation.

#

Some details like "time-first" u are ignored in the diagram above

gusty condor Feb 4, 2024, 1:50 PM

#

obsidian quest after using lora for gate, this is better for me (same params count): args.dim_a...

I remembered you wrote somewhere that large att and small ffn performs well at the beginning but raises problems in later training.

obsidian quest Feb 4, 2024, 2:22 PM

#

gusty condor I remembered you wrote somewhere that large att and small ffn performs well at t...

yeah it's still worse

tropic minnow Feb 4, 2024, 3:04 PM

#

gusty condor Some details like "time-first" `u` are ignored in the diagram above

well same applies to our previous figure. We can always say u is accounted for in the WKV term, and refer readers to the equations. in the rnn formulation we can represent u better

rose mango Feb 4, 2024, 3:43 PM

#

tropic minnow also suggestions for a better sign than "@" for matmul are welcome. I thought abo...

⊛ ?

rose mango Feb 4, 2024, 4:12 PM

#

basically + and x together

tropic minnow Feb 4, 2024, 4:16 PM

#

Captura_de_Pantalla_2024-02-04_a_las_17.15.39.png

misty cedar Feb 5, 2024, 3:53 AM

#

https://twitter.com/_akhaliq/status/1754334655405326482
can someone double check this? it looks like they are claiming that mamba only has a perfect token memory of 55?
we have the data showing at least a 2.2k for v5

AK (@_akhaliq) on X

Repeat After Me

Transformers are Better than State Space Models at Copying

paper page: https://t.co/OzOXqYQy6I

Transformers are the dominant architecture for sequence modeling, but there is growing interest in models that use a fixed-size latent state that does not depend on…

void quartz Feb 5, 2024, 4:01 AM

#

misty cedar https://twitter.com/_akhaliq/status/1754334655405326482 can someone double check...

tbh 55 sounds way too low - i suspect they did purely random characters - instead of the random words test i did

void quartz Feb 5, 2024, 5:58 AM

#

Automated lm-eval for distributed evals!
current status on evals : I fully automated the eval run and the collection of data via GH actions, and it is distributed across 5 nodes of 4/8 x 3090/4090/A5000

should also make it easier to plug in any HF compatible model for eval as well

this is the first cut, on the 1B5 model, no few shot
will drop out any of the evals that crash (temp=0, eval files are 404, etc), before moving to 3B and 7B

#

link: https://github.com/RWKV/lm-evaluation-harness/actions/runs/7778932288

ps: if anyone wants to throw 3090+ class GPUs on this, DM me, i can message you the docker container run script (and key) to add to the GH actions pool

#

if you want to see an example of all the output files you can refer to a previous aborted run here: https://github.com/RWKV/lm-evaluation-harness/actions/runs/7771205514
scroll all the way to the bottom for the output files (incomplete)

gusty condor Feb 5, 2024, 3:51 PM

#

tropic minnow

What does the circular arrow pointing to W itself mean?

tough crane Feb 5, 2024, 6:28 PM

#

last mauve **Alright all. Time to push this RWKV-v5 paper out.** Current target is to have ...

I'll take 4. now

last mauve Feb 5, 2024, 6:29 PM

#

tough crane I'll take 4. now

added!

burnt cedar Feb 5, 2024, 6:35 PM

#

misty cedar https://twitter.com/_akhaliq/status/1754334655405326482 can someone double check...

Citations point to v4 paper

#

As mentioned in the rwkv server, 55 points closer to v4?

#

Could just be that

misty cedar Feb 5, 2024, 6:36 PM

#

burnt cedar Citations point to v4 paper

I meant, does that paper point to mamba having v4 levels of perfect memory

burnt cedar Feb 5, 2024, 6:37 PM

#

misty cedar I meant, does that paper point to mamba having v4 levels of perfect memory

It does say that, which is strange

#

Lstms and mamba have the same exact scores

misty cedar Feb 5, 2024, 6:48 PM

#

burnt cedar It does say that, which is strange

so either there is a problem with the methodology, or mamba is actually pretty bad compared to v5

burnt cedar Feb 5, 2024, 6:49 PM

#

misty cedar so either there is a problem with the methodology, or mamba is actually pretty b...

Maybe, needs more tests, though the title and classifications point me towards the former

misty cedar Feb 5, 2024, 6:52 PM

#

burnt cedar Maybe, needs more tests, though the title and classifications point me towards t...

idk, we could spend a bunch of compute to try defend mamba, or we could use this as a source and save compute when showing the perfect memory graphs for the v5 paper...
academic integrity is expensive i guess

burnt cedar Feb 5, 2024, 6:55 PM

#

misty cedar idk, we could spend a bunch of compute to try defend mamba, or we could use this...

True, this would be a great way to prove v5, but feels bad left unverified

young sparrow Feb 5, 2024, 6:57 PM

#

@misty cedar There are many possibilities other than it being a "hit piece" and making such groundless accusations creates hostility for no reason. Do not say things like this without real evidence.

misty cedar Feb 5, 2024, 6:58 PM

#

young sparrow @misty cedar There are many possibilities other than it being a "hit pi...

good point, we should be working on this collaboratively

#

Zoology group also has some data in that area

young sparrow Feb 5, 2024, 6:59 PM

#

This group as a whole has a serious problem with accusing people of acting in bad faith, sometimes even on the grounds of finding different results. It's extremely disappointing.

#

Let's work on having a more positive and less hostile attitude towards other research groups. What we're doing is very hard, very finicky, and contradictory results crop up all the time. It takes careful and collaborative work to figure out why. They're doing the best they can, just like we are.

#

In terms of what results to trust, I would recommend thinking about what benchmarks we find the most reliable and trust the results using those benchmarks. I've been really impressed by infinity bench recently, which contains a diverse collection of real and artificial long context tasks. But it's also totally fine to say "we're using this methodology, others exist that might be better" and worry about those in a future paper if we've already done a lot of work.

rose mango Feb 5, 2024, 8:13 PM

#

There's no motivation for most people to act in bad faith anyway

#

I haven't fully read the paper, but their results don't seem surprising to me.

Random sequences are hard (impossible, if truly random) to compress, and storing information in a fixed-size state is effectively a form of lossy compression.

#

The flip side is that no human is going to remember a long random sequence well either

tough crane Feb 5, 2024, 9:07 PM

#

@obsidian quest

Could you tell me all the downstream task results available for v5.2 and v6, along with the FLOPs needed to train each model?

My intention is to collect plots for Figure 5 in the RWKV4 manuscript.

rose mango Feb 5, 2024, 9:31 PM

#

Openness/accessibility comparison table with other models is largely complete

last mauve Feb 5, 2024, 10:09 PM

#

rose mango Openness/accessibility comparison table with other models is largely complete

I love it, but I'm a bit confused on the partially open dataset point. Why are we claiming the dataset is partially open?

rose mango Feb 5, 2024, 10:13 PM

#

last mauve I love it, but I'm a bit confused on the partially open dataset point. Why are w...

AFAIK we haven't shared the dataset mixture or the full list of everything included

last mauve Feb 5, 2024, 10:13 PM

#

rose mango AFAIK we haven't shared the dataset mixture or the full list of everything inclu...

So it's closed

rose mango Feb 5, 2024, 10:18 PM

#

The pile, slimpajama, all of the wikipedias, OSCAR, and starcoder are what's being used IIRC

Do we plan on releasing the dataset (or sharing the composition)?

void quartz Feb 5, 2024, 10:25 PM

#

misty cedar idk, we could spend a bunch of compute to try defend mamba, or we could use this...

i think its best for us to actually train and test mamba - at 3B its at 3090/4090 class compute, and i think i can afford that (or better yet, work with them on this)

#

i honestly think the paper might be harsh against them as well (cause 55 character feels too low, and i believe mamba can achieve better)

#

i cant seem to find the full replication details however - so my experiment methodology is probably different from what they did

#

my results for my memory test (similar to how the paper was structured, finetune the model to repeat) 2 weeks ago showed that 3B ( https://github.com/RWKV/RWKV-infctx-trainer/blob/rwkv-x-eagle-notebooks/notebook/rwkv-x-exp/v5-exp/memory-test/World-3B-mem-finetune.ipynb )

about 2.2k matched tokens in memory (at 90% match rate),
or 525 matched tokens in memory (at 100% match rate)

previous discussion with various folks here (off this channel), was that we expect mamba to have similar memory capacity not worse

the only reason i can think of for that paper's result is that it was using pure random characters, while i was using randomized dictionary words - and that might change the score?
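
For readers unfamiliar with the metric, here is a minimal Python sketch of how a "matched tokens at X% match rate" figure could be derived. The function names and sample numbers are hypothetical and are not taken from the linked notebook, which may score things differently.

```python
def match_rate(expected, produced):
    """Fraction of expected tokens reproduced at the same position."""
    if not expected:
        return 1.0
    hits = sum(1 for e, p in zip(expected, produced) if e == p)
    return hits / len(expected)


def max_length_at_threshold(rate_by_length, threshold):
    """Largest tested sequence length whose match rate meets the threshold.

    rate_by_length maps a sequence length (in tokens) to the match rate
    measured at that length. Returns 0 if no length passes.
    """
    passing = [n for n, rate in rate_by_length.items() if rate >= threshold]
    return max(passing) if passing else 0


# Hypothetical measurements, for illustration only.
rates = {100: 1.0, 500: 1.0, 2000: 0.92, 4000: 0.60}
capacity_at_90 = max_length_at_threshold(rates, 0.90)
capacity_at_100 = max_length_at_threshold(rates, 1.00)
```

A stricter threshold can only shrink the reported capacity, which is why the 100% figure is lower than the 90% one.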

#

willing to collaborate with mamba team on their tune cause they would know best on how to finetune their model to replicate this ( whats the best way to coordinate this? , alternatively we could talk to the original paper team as well )

void quartz Feb 5, 2024, 10:54 PM

#

Also: our recent growth is thanks in large part to mamba

This is a personal opinion

It may sound dumb, but it has been a really huge tone change since mamba came out, people take us way more seriously now. People no longer dismiss alternative architecture as a "pointless effort" or "not worth talking about"

Conversations flow faster, we get to focus on how we are different from mamba/transformers in good ways.

Sure, a small part of it may have been a case of a big name university setting the tone for us "random folks" on the internet, and giving us credibility - a situation that i know drives frustration for many in the RWKV group, as it can feel unfair (the core work on rwkv has remained the same) when mamba gets the limelight

But we have to remember the statespace team (and other teams) did not choose this social situation, where they get more credibility by being associated with a major university / prof - And this limelight takes turns - Eagle now gets the spotlight, from that momentum

In the very same lens, a parallel story might be played out now (diffusion text model, maybe?) by an even more random team on the internet, against us having the credibility / attention due to the association with EleutherAI and LF - folks who may face the very same frustrations we previously faced (why bother competing with RWKV or Mamba?) as they try to prove out their architecture

So lets co-op in good faith?
( to mamba, and transformer folks )

void quartz Feb 5, 2024, 11:24 PM

#

@last mauve / @misty igloo - do you think it makes sense if we create a subgroup on the evals? - i think there is a long discussion on its own of what benchmark to include or exclude - i have the full lm-eval list reduced down to what we can run (almost), and it is probably more than what we need (ethical and alignment evals??)

After that, its simply scaling it up, and running it across the select models we want to compare against

I should also probably compile a list of evals that might need fixing, a bunch of them have 404 or missing datasets (and file as a bug report to lm-eval)

last mauve Feb 5, 2024, 11:30 PM

#

RWKV-papers

last mauve Feb 5, 2024, 11:31 PM

#

void quartz @last mauve / @misty igloo - do you think it makes sense if ...

Yeah I've been thinking the same on that. Lemme create a new channel under NLP.

void quartz Feb 5, 2024, 11:51 PM

#

For those interested in helping, hop into the evals, it's here:
#rwkv message

Im gonna spin up more 3090s to start eating these benchmarks up 🙂

rose mango Feb 6, 2024, 12:26 AM

#

void quartz --- **Also: Our current recent growth, is thanks in many parts to mamba** > This...

big name university

I don't even think the positive reception was simply because of Stanford, but largely because Tri Dao was involved. They also immediately published an easy-to-use optimized library that let others use Mamba blocks within their own models.

void quartz Feb 6, 2024, 12:27 AM

#

rose mango >big name university I don't even think the positive reception was simply becau...

agreed - we have lots of work to do in making our various modules simpler and easier to work with, so others can get their hands on them and play with them

rose mango Feb 6, 2024, 12:28 AM

#

when people can do from rwkv.simple import RWKVBlock, then we are there

void quartz Feb 6, 2024, 12:28 AM

#

haha, but tbh - its not just that, its lots of the small things

#

but i dun want to tangent too far here (as its no longer about the paper), my point was to call out the sentiments i see in here, and the RWKV discord - we are gaining momentum - we simply need to keep doing our best to get better

void quartz Feb 6, 2024, 1:19 AM

#

misty cedar https://twitter.com/_akhaliq/status/1754334655405326482 can someone double check...

managed to get some replies from the paper author (twitter)

the 55 character model, was a 160M model they trained from scratch
they did additional experiments for the pre trained 360M / 1.4B / 2.8B, which performed much better (100+ token), i requested for the table data (as the graph is hard to read)

Important to note: when we did a "from scratch" train without enwiki pretraining, our model for "some reason" performed terribly on the memory task as well (this defies transformer conventions) - they did not consider pretraining it with enwiki, which might be an influencing factor

The subsequent tests are not finetuned variants, so it's not apples to apples with our numbers either (we might be at similar perf levels)

misty igloo Feb 6, 2024, 6:00 AM

#

rose mango The pile, slimpajama, all of the wikipedias, OSCAR, and starcoder are what's bei...

@void quartz maybe you can speak to this? It is whatever it is, we just gotta decide if we mention what the dataset consists of in the paper or if we skip that for this one

void quartz Feb 6, 2024, 6:05 AM

#

Pile + Books (Book3, gutenberg) + SlimPajama + StarCoder + OSCAR + All_Wikipedia

Open Instruct (which is probably where the contamination came from)

#

As to which exact slice of all the data, only blink knows

misty igloo Feb 6, 2024, 6:09 AM

#

void quartz Pile + Books (Book3, gutenberg) + SlimPajama + StarCoder + OSCAR + All_Wikipedi...

@last mauve where does that leave us on dataset openness in your opinion?

void quartz Feb 6, 2024, 6:10 AM

#

im of the opinion that an open dataset is about reproducibility

#

this does not fit that criteria

misty igloo Feb 6, 2024, 6:11 AM

#

Agreed, but that's why @rose mango had it listed as partial in table 2

#

Unlike mistral etc who don't even disclose what's in the data

void quartz Feb 6, 2024, 6:12 AM

#

or token count Q.Q

misty igloo Feb 6, 2024, 6:12 AM

#

That too

#

In any case we should add this list to the paper

tropic minnow Feb 6, 2024, 7:58 AM

#

gusty condor What does the circular arrow pointing to W itself mean?

the w^{i-j} in rwkv5 and the cumprod (which is data-dependent) in rwkv6
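
To make the difference concrete, here is a small Python sketch of the two decay weightings mentioned above, for a single channel. It assumes the decay applied between positions j and i is w^(i-j) in the static case and a running product of per-step decays in the data-dependent case; the exact index offsets in the paper may differ, and the function names are made up for illustration.

```python
from itertools import accumulate
from operator import mul


def static_decay(w, i, j):
    """RWKV-5 style: one fixed decay w applied (i - j) times."""
    return w ** (i - j)


def dynamic_decay(ws, i, j):
    """RWKV-6 style: product of per-step decays ws[j+1] * ... * ws[i].

    Computed as a ratio of cumulative products, which is what a
    cumprod-based implementation does. Assumes every decay is nonzero.
    """
    cum = list(accumulate(ws, mul))  # cum[t] = ws[0] * ... * ws[t]
    return cum[i] / cum[j]


# With a constant decay the two agree; with varying decays they do not.
constant = dynamic_decay([0.9] * 6, 5, 2)
varying = dynamic_decay([0.5, 0.5, 0.25, 0.5], 3, 1)
```

When every per-step decay equals the same w, the cumulative product collapses to w^(i-j), so the static case is a special case of the data-dependent one.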

obsidian quest Feb 6, 2024, 11:16 AM

#

rose mango The pile, slimpajama, all of the wikipedias, OSCAR, and starcoder are what's bei...

i added plenty of chatgpt data on hf too

young sparrow Feb 6, 2024, 1:42 PM

#

misty igloo @void quartz maybe you can speak to this? It is whatever it is, we just...

We will be disclosing the data. It is a violation of both Linux Foundation and EleutherAI policy to not do so. Keeping it secret has never been an option.

Furthermore, I don't see why anyone would want to not disclose the data. We are training on very standard datasets it seems... all not disclosing the data will do is make people wonder if we are cheating by training on the test sets.

young sparrow Feb 6, 2024, 1:44 PM

#

obsidian quest i added plenty of chatgpt data on hf too

We need a list of which repos you used.

burnt cedar Feb 6, 2024, 4:51 PM

#

void quartz Pile + Books (Book3, gutenberg) + SlimPajama + StarCoder + OSCAR + All_Wikipedi...

We list this, but doesn't someone have the dataset available as json?

#

Wouldn't this probably be distributable now?

rose mango Feb 6, 2024, 7:44 PM

#

I saw the mamba paper was rejected. I have no idea why.

#

There doesn't seem to be anything wrong with it

rose mango Feb 6, 2024, 9:10 PM

#

last mauve **6.** Long-context and inference speed benchmarks need added. These need compar...

I can also do the chat example comparison

tough crane Feb 6, 2024, 9:17 PM

#

last mauve **Alright all. Time to push this RWKV-v5 paper out.** Current target is to have ...

@last mauve

I think some data is missing for plotting Fig. 5.

On RWKV v5:
- the number of training tokens, because I believe the x-axis in Fig. 5 is: num_trained_tokens * factor_to_backward * flops_in_table_3
- the factor used to account for backward-pass FLOPs (the transformer FLOPs calculation tool at https://huggingface.co/spaces/MrYXJ/calculate-model-flops defaults to 2.0)

I've asked @obsidian quest and am waiting for a reply. I could help with other tasks: #1 or adding multilingual benchmark results.
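
The x-axis formula in the message above can be written out as a short Python sketch. It implements the product exactly as stated; whether the factor should cover the backward pass only or forward plus backward is an open assumption, and the numbers in the example are placeholders, not values from Table 3.

```python
def training_flops(num_trained_tokens, flops_per_token, backward_factor=2.0):
    """Estimate total training FLOPs for the x-axis of an accuracy plot.

    flops_per_token: per-token FLOPs figure (the role played by the
                     Table 3 values in the message above).
    backward_factor: multiplier accounting for the backward pass;
                     2.0 mirrors the default mentioned for the linked tool.
    """
    return num_trained_tokens * backward_factor * flops_per_token


# Placeholder inputs, for illustration only.
example = training_flops(num_trained_tokens=1_000, flops_per_token=3.0)
```

Both missing inputs (token count and factor) scale the result linearly, so an error in either shifts every point on a log-scale x-axis by the same amount.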

alpine ferry Feb 7, 2024, 6:26 PM

#

young sparrow We will be disclosing the data. It is a violation of both Linux Foundation and E...

💯

last mauve Feb 8, 2024, 12:08 AM

#

misty igloo @last mauve where does that leave us on dataset openness in your opini...

It's not open unless someone can reproduce, meaning this dataset needs released. This requirement is enforced by the LF anyway as @young sparrow mentioned so this goes beyond just the table.

last mauve Feb 8, 2024, 12:09 AM

#

rose mango Openness/accessibility comparison table with other models is largely complete

I went ahead and added a checkmark for Eagle's dataset in anticipation of this

last mauve Feb 8, 2024, 12:11 AM

#

tough crane @last mauve IMHO, I wonder whether several data might be missed for ...

Wait do you mean RWKV-v4 arxiv's fig. 5? I'm not sure I understand your point on FLOPs or what's missing.

Multilingual benchmark results (and evals/scaling plots in general) would be the most impactful thing to help with rn. Can you follow up with that on #rwkv ?

last mauve Feb 8, 2024, 12:13 AM

#

tropic minnow

I'm still really bullish on creating some simplified subfigs to break up figure 1. Did you need further discussion here @tropic minnow ?

tropic minnow Feb 8, 2024, 7:13 AM

#

last mauve I'm still really bullish on creating some simplified subfigs to break up figure ...

Nice! Will do extra diagrams for the token shift and W lora. And think about the WKV

void quartz Feb 8, 2024, 8:45 AM

#

For the folks who need benchmark figures, over 72 benchmark tasks have been done for eagle 1.5B -> 3B -> 7B here, in bf16 mode: #1204211116268462150 message

#

i can rerun this in fp16 mode if needed, would like to know what models i should be running next to compare against - currently i have / am getting the numbers for

Mistral 7B
Falcon 7B
MPT 7B

void quartz Feb 8, 2024, 8:50 AM

#

last mauve Wait do you mean RWKV-v4 arxiv's fig. 5? I'm not sure I understand your point on...

i've gotten all the multi-lang bench done as well 🙂
i can start extracting the numbers that are needed - just let me know which ones in the list

gusty condor Feb 8, 2024, 3:20 PM

#

Some benchmarks are slightly better than random

tough crane Feb 8, 2024, 5:28 PM

#

gusty condor Some benchmarks are slightly better than random

Does that mean they're terribly bad?

acoustic knoll Feb 8, 2024, 5:55 PM

#

gusty condor Some benchmarks are slightly better than random

Is it mmlu?

subtle oak Feb 8, 2024, 6:36 PM

#

The MMLU seems pretty bad on RWKV-4 before... I found this in TransNormer paper

#

looks like they benchmarked the MMLU in RWKV-4

Screenshot_2024-02-08_at_10.38.17_AM.png

#

I'm not sure whether we'll face the same problem again

obsidian quest Feb 8, 2024, 6:43 PM

#

v5 7b is better at mmlu

#

https://huggingface.co/spaces/devingulliver/subquadratic-llm-leaderboard

burnt cedar Feb 8, 2024, 7:15 PM

#

subtle oak looks like they benchmarked the MMLU in RWKV-4

Also consider that this is the v4 base model, without any MMLU fine-tuning

void quartz Feb 9, 2024, 12:13 AM

#

void quartz i can rerun this in fp16 mode if needed, would like to know what models i should...

@last mauve - who decides which models should be included for comparison? Cause i need a candidate list to start running

#

(finishing v4 benchmarks)

last mauve Feb 9, 2024, 12:14 AM

#

void quartz @last mauve - who decides which models should be included for compare?...

There's no set authority, but I can help form a sensible list

#

Llama 1/2
Mistral 7B
Falcon 7B
MPT 7B
Pythia 6.9B
GPT-J
OPT-6.7B
BLOOM 7.1B
OLMo-7B
RedPajama-INCITE-7B

void quartz Feb 9, 2024, 12:18 AM

#

how bout the 3B / 1.5B class?

burnt cedar Feb 9, 2024, 2:06 AM

#

void quartz how bout the 3B / 1.5B class?

Tinyllama, phi 1 1.5 2, falcon rw, olmo, pythia

#

Basically blink has been comparing many top tier for the new finch benchmarks

burnt cedar Feb 9, 2024, 4:07 AM

#

@void quartz about the needle in a haystack test and extrapolation

#

https://twitter.com/PY_Z001/status/1755530398619382207

Zhang Peiyuan (@PY_Z001) on X

We've been exploring context extrapolation with Mamba and managed to make it (state-spaces/mamba-2.8b-slimpj) retrieve nearly perfectly on a window of 16384.

Here's a brief overview of what we've found so far:

#

Some results for mamba

#

It's showing the same ppl explosion as rwkv

#

Similar to v4

#

Looks like v5 tends to extrapolate better

void quartz Feb 9, 2024, 4:15 AM

#

burnt cedar Similar to v4

this is clearly better than v4 =x

burnt cedar Feb 9, 2024, 4:16 AM

#

void quartz this is clearly better than v4 =x

Oh of course it's better

void quartz Feb 9, 2024, 4:16 AM

#

btw u can see the convo here : #rwkv message
for the test we need to do haha

burnt cedar Feb 9, 2024, 4:16 AM

#

I meant the ppl explosion

void quartz Feb 9, 2024, 4:16 AM

#

ahhh yea ok that is the same

#

yea v5 seems more stable even beyond trained length

gusty condor Feb 9, 2024, 5:06 AM

#

void quartz yea v5 seems more stable even beyond trained length

0.4B and 1.5B are stable at context length ~48k or more, using parallel scanning (memory usage O(n)). Haven't tested RNN mode yet, it takes too long.

tropic minnow Feb 9, 2024, 9:53 AM

#

2 options for tokenshift. thoughts?

Captura_de_Pantalla_2024-02-09_a_las_10.52.19.png

Captura_de_Pantalla_2024-02-09_a_las_10.52.15.png

tropic minnow Feb 9, 2024, 9:53 AM

#

tropic minnow

along the lines of this

tropic minnow Feb 9, 2024, 9:53 AM

#

last mauve I'm still really bullish on creating some simplified subfigs to break up figure ...

like this?^^^

steady ether Feb 9, 2024, 4:18 PM

#

tropic minnow 2 options for tokenshift. thoughts?

Left if everything were more centered and the title remained on one line. Otherwise, right feels cleaner.

misty igloo Feb 9, 2024, 7:08 PM

#

tropic minnow 2 options for tokenshift. thoughts?

I think the left one makes it clearer that mu isnt more favored than 1-mu

#

and i'd maybe use \in \mathbb{R}^{LxD} instead of superscript so its clear what the LxD and 1xD mean

obsidian quest Feb 10, 2024, 2:33 PM

#

v6 training code uploaded to https://github.com/BlinkDL/RWKV-LM
use /RWKV-v5/ and add --my_testing "x060" to demo-training-prepare.sh and demo-training-run.sh

tropic minnow Feb 10, 2024, 3:31 PM

#

incorporated suggestions @last mauve @misty igloo

Captura_de_Pantalla_2024-02-10_a_las_16.31.25.png

#

this would be for the MLP version (inherited from rwkv 4) and for the V5. will do the new ddlerp+lora (V6) now

young sparrow Feb 10, 2024, 3:41 PM

#

@obsidian quest do we have the compute to do a scaling laws search like we did for the previous paper?

obsidian quest Feb 10, 2024, 3:42 PM

#

unfortunately i dont have the compute at this moment

young sparrow Feb 10, 2024, 3:45 PM

#

How much did the scaling laws run you did for the previous paper require?

tropic minnow Feb 10, 2024, 4:46 PM

#

tropic minnow this would be for the MLP version (inherited from rwkv 4) and for the V5. will d...

okay here it is: the 2 in the left are V5 (no lora, no data dependence); while the one in the right is V6 (data-dependent lerp)

Captura_de_Pantalla_2024-02-10_a_las_17.45.42.png

tropic minnow Feb 11, 2024, 2:03 PM

#

thoughts?

Captura_de_Pantalla_2024-02-11_a_las_15.03.12.png

young sparrow Feb 11, 2024, 5:33 PM

#

Someone had asked if I had the code for the plots in the RWKV paper. I have the code that produced the scaling laws plots but not the plotting of evaluation results. It would be quite easy for me to recreate the code though, if its desired. Just let me know what is needed.

obsidian quest Feb 11, 2024, 5:57 PM

#

i am using this for evals https://github.com/BlinkDL/ChatRWKV/blob/main/run_lm_eval.py and use [0] for RWKV_PAD. you can verify my eval results first

"\n" was used for rwkv4 evals

young sparrow Feb 11, 2024, 6:26 PM

#

obsidian quest i am using this for evals https://github.com/BlinkDL/ChatRWKV/blob/main/run_lm_e...

Now that we officially support RWKV in the evaluation harness, can you please use that instead? I worry about minor divergences between the codebases causing inconsistencies. Plus it makes reproducibility far easier if everyone is using the same codebase.

misty igloo Feb 11, 2024, 7:14 PM

#

tropic minnow thoughts?

love it, maybe w should look like g,r,k,v? and then lead to a exp(-exp()) block and then a * circle
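
As a reminder of why an exp(-exp()) block is a natural thing to draw after w: it maps any real-valued parameter to a decay strictly between 0 and 1. A minimal Python sketch follows; the function name is made up for illustration, and the actual block in the figure may include additional terms.

```python
import math


def decay_from_raw(raw):
    """Map an unconstrained real value to a decay factor in (0, 1).

    The inner exp makes the exponent positive; the outer exp(-x) then
    lands in (0, 1), so repeated multiplication by it always shrinks.
    """
    return math.exp(-math.exp(raw))


# Larger raw values give faster decay (values closer to 0).
samples = [decay_from_raw(r) for r in (-5.0, -1.0, 0.0, 1.0, 3.0)]
```

The resulting decay is what then multiplies the running state, which matches the "* circle" suggested above.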

young sparrow Feb 11, 2024, 7:17 PM

#

tropic minnow thoughts?

Why is X_t both an input and an output? I assume that's a mistake?

misty igloo Feb 11, 2024, 7:25 PM

#

young sparrow Why is X_t both an input and an output? I assume that's a mistake?

we actually do use it as 'state' for the next iteration

#

to support tokenshift

#

that's where the X_{t-1} comes in on the left

young sparrow Feb 11, 2024, 7:28 PM

#

I see, so that represents a residual connection

#

And this diagram computes h not x, u seem to have missed that

misty igloo Feb 11, 2024, 8:04 PM

#

young sparrow I see, so that represents a residual connection

sorry, maybe I was unclear or misunderstood - it's not residual, we store x_t for use in the next iteration (timestep) where it comes in again, like if you put copies of these blocks side by side left to right
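
The "store x_t for the next timestep" idea can be sketched in a few lines of Python. This is a simplified single-layer illustration assuming the plain mu / (1 - mu) interpolation discussed earlier (no LoRA, no data dependence); the names are hypothetical and a zero initial state is an assumption.

```python
def token_shift_step(x_t, x_prev, mu):
    """Mix the current input with the previous one, channel by channel.

    Returns the mixed vector and the new state, which is simply x_t:
    that is the x_t "output" on the right of the diagram.
    """
    mixed = [m * cur + (1.0 - m) * prev for cur, prev, m in zip(x_t, x_prev, mu)]
    return mixed, x_t


def run_sequence(xs, mu):
    """Chain the blocks left to right, passing x_t along as state."""
    state = [0.0] * len(mu)  # assumed zero state before the first token
    outputs = []
    for x_t in xs:
        mixed, state = token_shift_step(x_t, state, mu)
        outputs.append(mixed)
    return outputs, state


outputs, final_state = run_sequence([[1.0, 2.0], [3.0, 4.0]], mu=[0.5, 0.25])
```

Nothing is added back to the block's own output here, which is why this is a carried state and not a residual connection.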

young sparrow Feb 11, 2024, 8:04 PM

#

misty igloo sorry, maybe I was unclear or misunderstood - it's not residual, we store x_t fo...

Yes, I understood

misty igloo Feb 11, 2024, 8:08 PM

#

young sparrow Yes, I understood

do you think we should remove the x_t 'state' output to the right?

young sparrow Feb 11, 2024, 8:12 PM

#

No I think it's good now that I have my head screwed on correctly

gusty condor Feb 12, 2024, 6:23 AM

#

tropic minnow thoughts?

It's not really DxD, I used (D/h) x (Dxh) in the paper

tropic minnow Feb 12, 2024, 7:53 AM

#

gusty condor It's not really DxD, I used (D/h) x (Dxh) in the paper

Yes in theory it is multi-head but all drawings are for single head for simplicity

void quartz Feb 12, 2024, 6:01 PM

#

For few shot tests? Which should be covered and how many shots?

#

( realised I missed that )

tough crane Feb 13, 2024, 8:36 PM

#

void quartz For few shot tests? Which should be covered and how many shots?

I personally think we will need to run the experiments that reviewer du8a of the Mamba paper pointed out.

The reviewer also said that the authors should only show results on zero-shot inference.

There are many works following the same direction: S4-diagonal [1], SGConv [2], MEGA [3], SPADE [4], and many efficient Transformer models (e.g., [5]). All of these models achieve near linear complexity, and the authors need to compare Mamba with these works in terms of both model performance and efficiency. For model performance, some simple experiments such as language modeling on Wikitext-103 should suffice.
Because SSMs are in general sequential, does Mamba have this length generalization ability?
I suggest the authors run more long-sequence experiments such as document summarization, where the input sequence is naturally long (e.g., the average sequence length of the arXiv dataset is greater than 8k).

https://openreview.net/forum?id=AL1fq05o7H

OpenReview

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

Foundation models, now powering most of the exciting applications in deep learning, are almost universally based on the Transformer architecture and its core attention module. Many...

void quartz Feb 13, 2024, 8:42 PM

#

tough crane IMHO, I personally think that we will run experiments which reviewer du8a of M...

for tests outside of lm-evals, we can add that separately - im more focused on getting all the data we need in lm-eval quickly

#

(needle in haystack, etc)

tough crane Feb 13, 2024, 9:02 PM

#

void quartz for tests outside of lm-evals, we can add that seperately - im more focused in g...

To compare accuracy against FLOPs the way the RWKV-4 paper did, we have to run evals on RWKV-5 checkpoints trained on up to 330B tokens, for each param count: 169m, 430m, 1.5B, 3B, 7B.

I think that 1.12 T tokens are used to train v5 for one epoch.

OPT : trained on 180B tokens for params up to 12B
Pythia : trained on 300B tokens for params up to 12B params
BLOOM : trained on 341+25=366B tokens for params up to 12B params
RWKV-4 : trained on 330B tokens for params up to 14B params

#

params : 169m, 430m, 1.5B, 3B, 7B
tasks : lambada, piqa, winogrande, sciq, arc_easy, arc_challenge
checkpoints : some step at which at most 360B tokens have been seen.

tough crane Feb 13, 2024, 9:29 PM

#

last mauve Wait do you mean RWKV-v4 arxiv's fig. 5? I'm not sure I understand your point on...

@last mauve As in the comment above: to compare accuracy against FLOPs the way the RWKV-4 paper did, we have to run evals on RWKV-5 checkpoints trained on up to 330B tokens. However, the current checkpoint weights seem to have been trained on 1.12T tokens.

I'm asking picocreator.

last mauve Feb 14, 2024, 1:45 AM

#

tough crane @last mauve As the above comment, to compare accuracy based on FLOPS ...

don't we have regular checkpoints from the entire training run?

and if we only care about the FLOPs of the final ckpts, then we just have to compare the 1.12T token ckpts to models with comparable FLOPs. To help me clarify what you want, which RWKV4 paper plot are you referring to recreating here

gusty condor Feb 14, 2024, 3:32 AM

#

We had, but BlinkDL deleted

young sparrow Feb 14, 2024, 3:42 AM

#

last mauve don't we have regular checkpoints from the entire training run? and if we only ...

Blink deleted the checkpoints. And even if he didn't, this is problematic in that it underestimates performance. But maybe our model will do well anyways.

misty cedar Feb 14, 2024, 3:42 AM

#

last mauve don't we have regular checkpoints from the entire training run? and if we only ...

Its in the git history isnt it?

#

oh, guess lfs doesnt save it

#

I remember someone saying that was the point of the temp folder

jade lotus Feb 14, 2024, 3:46 AM

#

Anyone tried recuva or other recovery tools on any drive that had them? Or is it all cloud / not practical?

misty cedar Feb 14, 2024, 3:46 AM

#

https://huggingface.co/BlinkDL/temp/blob/43ce09802b0fe0748eb8a12dc1a75ff5fba62349/RWKV-5-World-7B-v2-OnlyForTest_49%25_trained-20231114-ctx4096.pth
like, they are still there maybe? about to see if I can download

#

Yep, still downloads

misty cedar Feb 14, 2024, 3:48 AM

#

young sparrow Blink deleted the checkpoints. And even if he didn't, this is problematic in tha...

Models partial checkpoints downloadable from git history (Once again, who the f is paying for huggingface storage costs??)

void quartz Feb 14, 2024, 3:54 AM

#

misty cedar Models partial checkpoints downloadable from git history (Once again, who the f ...

can confirm - i honestly been downloading with the git commit - to avoid breaking my model links whenever temp folder get cleaned out =x
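
A minimal Python sketch of pinning a download link to a commit, so it keeps working after the temp folder is cleaned out. It assumes the usual Hugging Face `/resolve/<revision>/<file>` URL layout; the helper name is made up, and the commit hash and filename are the ones from the link posted above.

```python
from urllib.parse import quote


def pinned_download_url(repo_id, commit, filename):
    """Build a download URL that points at one specific commit.

    Using a commit hash in place of a branch name means later deletions
    on the branch do not change what this URL serves.
    """
    # Percent-encode the filename so characters like '%' survive in the URL.
    return f"https://huggingface.co/{repo_id}/resolve/{commit}/{quote(filename)}"


url = pinned_download_url(
    "BlinkDL/temp",
    "43ce09802b0fe0748eb8a12dc1a75ff5fba62349",
    "RWKV-5-World-7B-v2-OnlyForTest_49%_trained-20231114-ctx4096.pth",
)
```

The same pinning idea applies to any eval config that lists model links: recording the commit alongside the filename keeps the run reproducible.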

#

should i eval the checkpoints as well?

rose mango Feb 14, 2024, 4:01 AM

#

misty cedar Models partial checkpoints downloadable from git history (Once again, who the f ...

I hope huggingface unlimited storage lasts forever

jade lotus Feb 14, 2024, 4:07 AM

#

It runs on unicorn poo, it's good as long as summer lasts

#

I think the people in charge aren't the type to pull the rug without giving people a chance at a graceful exit - that might be a good thing to lobby for them to plan out and set up funds for, sooner rather than later

last mauve Feb 14, 2024, 4:28 AM

#

young sparrow Blink deleted the checkpoints. And even if he didn't, this is problematic in tha...

I was more asking "do you want evals across time, or a scaling laws plot"

Sounds like we want the latter, so my response is that we just have to compare against models with comparable FLOPs, so bigger or trained for longer.

gusty condor Feb 14, 2024, 10:43 AM

#

last mauve I was more asking "do you want evals across time, or a scaling laws plot" Sound...

I want the former because checkpoints are downloadable from git history.
We have less than a month for CoLM

tough crane Feb 15, 2024, 6:36 PM

#

void quartz for tests outside of lm-evals, we can add that seperately - im more focused in g...

@void quartz

Thanks a lot for forking the repo to run GitHub Actions.

To start by plotting the tasks used in the v4 paper, could you tell me the numeric ID in the GHA run URL ( https://github.com/RWKV/lm-evaluation-harness/actions/runs/{digits} ) for the following settings?

num-shot: zero
params : 169m, 430m, 1.5B, 3B, 7B
tasks : lambada, piqa, winogrande, sciq, arc_easy, arc_challenge

tough crane Feb 15, 2024, 8:45 PM

#

gusty condor I want the former because checkpoints are downloadable from git history. We have...

Is it possible to plot acc over two ranges that do not overlap? Another option is to build a table in the style of the Mamba paper.

void quartz Feb 16, 2024, 12:19 AM

#

tough crane @void quartz Thanks a lot for your forking to run github actions. T...

since the github storage is not perma, im planning to download and dump to HF

tough crane Feb 16, 2024, 9:04 AM

#

void quartz since the github storage is not perma, im planning to download and dump to HF

Could you first run just these six tasks in gh-task-runner-Large-Suite.yml? I would like to get results for only the tasks listed above first, because I think the accuracy figures should be plotted with higher priority. Even if it's not saved permanently, the artifact can be downloaded manually within 90 days.

misty igloo Feb 16, 2024, 5:46 PM

#

steady ether V5.2 testing is done (for that AR experiment). We can probably use Stanford's re...

now that v6 training code is available are you able to run the AR experiment on it, too?

steady ether Feb 17, 2024, 3:02 AM

#

misty igloo now that v6 training code is available are you able to run the AR experiment on ...

Let me double-check. I think I had a run earlier, then I forgot about it 😅

charred atlas Feb 17, 2024, 8:59 AM

#

Hey, would it be interesting to you to have numbers on the sentence embedding perf of the new rwkv? The repo talks about sent emb but I haven't seen any scores (https://github.com/BlinkDL/RWKV-LM).

I'm happy to run it on mteb if interesting - just need to know which model to benchmark and if I can still load it in hf (https://huggingface.co/docs/transformers/en/model_doc/rwkv) ?

obsidian quest Feb 17, 2024, 3:36 PM

#

could someone test this for rwkv https://github.com/jzhang38/LongMamba

void quartz Feb 18, 2024, 10:01 AM

#

charred atlas Hey would it be interesting to you to have numbers on the sentence embedding per...

if you mean the 2nd last layer state, as a means of embedding, you might want to discuss with @uneven blade

would be nice to figure out if this works in v5 like it does in v4, and to have a way to benchmark it

undone solstice Feb 18, 2024, 1:14 PM

#

obsidian quest could someone test this for rwkv https://github.com/jzhang38/LongMamba

I can test this.

gusty condor Feb 18, 2024, 1:52 PM

#

undone solstice I can test this.

Test longer! Possibly more than 100k tokens, I think RWKV-5 can do that.

burnt cedar Feb 19, 2024, 12:39 AM

#

gusty condor Test longer! Possibly more than 100k tokens, I think RWKV-5 can do that.

At this point, newer papers are going for 1M and 10M as well

#

If a rwkv state can do that it's going to be crazy

gusty condor Feb 19, 2024, 2:50 AM

#

burnt cedar If a rwkv state can do that it's going to be crazy

RWKV (without fine-tuning) can do that in a perplexity test

burnt cedar Feb 19, 2024, 3:07 AM

#

gusty condor RWKV (without fine tuning) can do that in perplexity test

Stable ppl for a million?! That's crazy impressive

gusty condor Feb 19, 2024, 3:10 AM

#

Yes, at least 100k

quaint quiver Feb 19, 2024, 3:11 AM

#

ya but that doesnt really mean anything for actually recalling stuff far in the past

#

still impressive

misty cedar Feb 19, 2024, 3:20 AM

#

Staying stable after a long conversation is still pretty awesome

quaint quiver Feb 19, 2024, 3:22 AM

#

misty cedar Staying stable after a long conversation is still pretty awesome

ya ik im just saying for the needle in a haystack stuff it doesnt mean much

burnt cedar Feb 19, 2024, 3:52 AM

#

quaint quiver ya but that doesnt really mean anything for actually recalling stuff far in the ...

I still think testing larger states might help with this too

obsidian quest Feb 19, 2024, 3:12 PM

#

https://twitter.com/BlinkDL_AI/status/1759596571316883480

BlinkDL (@BlinkDL_AI) on X

100% composed by RWKV-6 120M params MIDI model🎶Still takes multiple trials for such high quality outputs, but I will fix this🙂

rose mango Feb 21, 2024, 4:04 PM

#

I'll add Gemma to the model comparison tables later today

young sparrow Feb 21, 2024, 4:07 PM

#

rose mango I'll add Gemma to the model comparison tables later today

Why is it important to do this?

rose mango Feb 21, 2024, 4:18 PM

#

young sparrow Why is it important to do this?

We already compare with Mistral and LLaMA, the most popular and most contemporary models. I think Gemma will likely see similar amounts of use, so it's worth comparing.

void quartz Feb 21, 2024, 6:11 PM

#

I'm glad it's on Hugging Face at least; gonna work on that too

void quartz Feb 22, 2024, 1:36 AM

#

gemma multilang benchmarks are running, along with the normal benchmarks

misty igloo Feb 22, 2024, 4:17 AM

#

@obsidian quest are the v5, v6 hyperparams (LR start, end) the same as they were for v4? no warmup, right?
v4 paper said:

```
Init LR             0.0006   0.0004   0.0003   0.00015  0.00015  0.0001
Warmup Mini-Epochs  361      411      443      451      465      544
End LR              0.00001  0.00001  0.00001  0.00001  0.00001  0.000007
```

obsidian quest Feb 22, 2024, 10:04 AM

#

warmup = only 10 steps.

tough crane Feb 22, 2024, 10:11 AM

#

@last mauve

I uploaded the figures and related materials to the following paths.

1: png files are in images/0shot_acc
2: notebooks and csvs are in misc/plotting

misty igloo Feb 22, 2024, 5:28 PM

#

obsidian quest warmup = only 10 steps.

10 mini epochs? What's a mini epoch exactly? I want to add these details to the paper

obsidian quest Feb 22, 2024, 6:32 PM

#

misty igloo 10 mini epochs? What's a mini epoch exactly? I want to add these details to the ...

10 steps. each miniepoch = many steps

#

1 miniepoch = [40320 / bsz] steps
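
A quick sketch of that conversion, in case it helps when writing it up. It assumes the brackets in "[40320 / bsz]" mean integer (floor) division; the function name and the example batch sizes are made up for illustration.

```python
# Constant taken from the message above: 1 miniepoch = [40320 / bsz] steps.
SAMPLES_PER_MINI_EPOCH = 40320


def steps_per_mini_epoch(bsz: int) -> int:
    """Optimizer steps in one mini-epoch for a given batch size.

    Assumption: "[...]" in the chat means floor division.
    """
    if bsz <= 0:
        raise ValueError("bsz must be positive")
    return SAMPLES_PER_MINI_EPOCH // bsz


# Example batch sizes are hypothetical, not the ones used for any RWKV run.
for bsz in (8, 64, 128):
    print(f"bsz={bsz}: {steps_per_mini_epoch(bsz)} steps per mini-epoch")
```

Under this reading, a 10-step warmup is only a small fraction of a single mini-epoch for any of these batch sizes.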

misty igloo Feb 22, 2024, 6:37 PM

#

obsidian quest 10 steps. each miniepoch = many steps

outside of the paper, should i be doing this 10 step warmup in my experiments for new architectures/MoE?

obsidian quest Feb 22, 2024, 6:43 PM

#

it's --warmup_steps 10 for train.py
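
For anyone wiring a short warmup into their own experiments, here is a minimal sketch of a generic linear warmup. It is not taken from train.py and may not match what `--warmup_steps` does there; `warmup_lr` and its exact ramp shape are assumptions for illustration. The 6e-4 value is just the first Init LR from the v4 table quoted earlier.

```python
def warmup_lr(step: int, base_lr: float, warmup_steps: int = 10) -> float:
    """Learning rate at a 0-indexed optimizer step under linear warmup.

    Hypothetical schedule: ramps linearly up to base_lr over the first
    `warmup_steps` steps, then stays at base_lr. Any later decay toward
    an end LR is deliberately left out of this sketch.
    """
    if warmup_steps <= 0 or step >= warmup_steps:
        return base_lr
    return base_lr * (step + 1) / warmup_steps


# Print the first few steps to see the ramp and the plateau.
for step in range(12):
    print(step, warmup_lr(step, base_lr=6e-4))
```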

void quartz Feb 22, 2024, 9:40 PM

#

For all the various benchmarks, i have started consolidating all the results into the repo here:
https://huggingface.co/datasets/rwkv-x-dev/lm-eval-data/tree/main/summary

You can extract key figures if you want from the multilang / all result table
There are some bugs in the filtering/averaging, and some data is still missing (e.g. bloomz does well on average because a large number of its tests OOM)
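
A small sketch of why missing results can inflate an average, and one possible way around it: average only over tasks that every model finished. All model names, task names and scores here are made up, and this is not the code that builds the summary CSVs.

```python
from statistics import mean

# Hypothetical scores; None marks a task whose run failed (e.g. OOM).
results = {
    "model_a": {"easy_task": 0.80, "hard_task": 0.30, "harder_task": 0.20},
    "model_b": {"easy_task": 0.70, "hard_task": None, "harder_task": None},
}


def naive_avg(scores):
    """Average over whatever finished; failed tasks are silently dropped."""
    return mean(v for v in scores.values() if v is not None)


def common_task_avg(all_results):
    """Average each model only over tasks that every model completed."""
    completed = [
        {task for task, value in scores.items() if value is not None}
        for scores in all_results.values()
    ]
    common = sorted(set.intersection(*completed))
    if not common:
        raise ValueError("no task was completed by every model")
    return {
        model: mean(scores[task] for task in common)
        for model, scores in all_results.items()
    }


print({model: naive_avg(scores) for model, scores in results.items()})
print(common_task_avg(results))
```

With the naive average, model_b ranks first only because its hard tasks are missing; restricted to the common tasks, the order flips.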

rwkv-x-dev/lm-eval-data at main

#

But yea, it's rather streamlined now for me to just add any model to HF, and in <48 hours the CSV can be updated

#

the following is sorted by the average multilang score (llama2-chat OOM, so i need to rerun)

#

there are CSV files sorted by model name as well

#

also if you want to inspect an individual run, you can crawl into : https://huggingface.co/datasets/rwkv-x-dev/lm-eval-data/tree/main/lm-eval-output for the full logs / jsonl / etc

#

alternatively, it's the eng tests by group (the 0 results are due to a test error blocking the overall upload, fixing)

void quartz Feb 22, 2024, 11:54 PM

#

er.... I've got gemma 0-shot benchmarked; can I request that someone independently check this, separately or something

#

like it's bad enough that I'm sure it's an error in my setup/pipeline or something

spring fulcrum Feb 23, 2024, 12:24 AM

#

void quartz er.... i gotten gemma 0 shot benchmarked, can i request someone independently ch...

are you using the patch described here: https://github.com/EleutherAI/lm-evaluation-harness/issues/1455 ?

GitHub

Run Gemma LM in Huggingface (simple patch) · Issue #1455 · Eleuther...

Posting this here for visibility: The following diff fixes Gemma performance: diff --git a/lm_eval/models/huggingface.py b/lm_eval/models/huggingface.py index e6ffc828..3f775dbb 100644 --- a/lm_eva...

#

(would also need an analogous add_special_tokens=True for generative tasks)

I'll be PRing this asap to the harness (should be by tomorrow morning) along with the ability to control whether a BOS token is used for causal LM models in general
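
A toy illustration of what the `add_special_tokens` flag changes, for anyone following along. `ToyTokenizer` and `BOS_ID` are hypothetical stand-ins and not the harness or Hugging Face code; the only point is that with the flag off, the model never sees a BOS token at the start of the prompt.

```python
BOS_ID = 2  # hypothetical id for the beginning-of-sequence token


class ToyTokenizer:
    """Stand-in tokenizer: one id per whitespace-separated word."""

    def __init__(self):
        self.vocab = {}

    def encode(self, text, add_special_tokens=True):
        # Assign ids starting at 10 in order of first appearance.
        ids = [self.vocab.setdefault(word, len(self.vocab) + 10)
               for word in text.split()]
        # Prepend BOS only when special tokens are requested.
        return [BOS_ID] + ids if add_special_tokens else ids


tok = ToyTokenizer()
print(tok.encode("the quick fox", add_special_tokens=False))
print(tok.encode("the quick fox", add_special_tokens=True))
```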

obsidian quest Feb 23, 2024, 12:29 AM

#

void quartz the following is sorted by the average multilang score (llama2-chat OOM, so i ne...

add x060 1.6b. it's great at multilingual.

void quartz Feb 23, 2024, 12:30 AM

#

spring fulcrum are you using the patch described here: https://github.com/EleutherAI/lm-evaluat...

thanks! will pull that - that explains the weird results

#RWKV-papers

Guide to run lm-eval with Eagle