#RWKV-papers

steady ether
obsidian quest
#

how's the paper going 🙂

misty igloo
#

@gusty condor would you mind if I try some new language for the introduction? feel free to throw it out if you don't like it as much

young sparrow
#

It doesn't have one

#

Anon periods are (mostly) unique to *CL venues

misty igloo
#

either here or via comments in the overleaf itself

obsidian quest
#

it's a bit unfortunate that we used "RWKV" instead of "RWKV-4" lol

misty igloo
obsidian quest
gusty condor
void quartz
gusty condor
#

It's yours

void quartz
#

(side: does it make sense to branch the tokenizer to its own paper?, saw that section)

gusty condor
#

No, do you have enough information to fit that into an 8-page-long paper?

void quartz
subtle oak
void quartz
#

there is a side tangent of seeing if a model performs better with the new tokenizer in another language (and English), compared to baseline - which might add pages

#

(we are kinda assuming it gives better results, kind of - it is more token efficient for sure)

#

cause being trie-based only flies against the current conventional wisdom of BPE tokenizers

subtle oak
#

Maybe can submit a short paper instead of regular size?

#

4 pages instead of 8, if we do not have enough info to fit 8 pages?

steady ether
subtle oak
#

Oh I see, few lines in Sec.2 mentioned these topics, thanks!

subtle oak
steady ether
#

I think we're just missing something like, 'The original RWKV architecture has limitations when it comes to X, Y, and Z, so we decided to try RWKV-5 to address X and Y, and RWKV-6 to address Z.'

gusty condor
subtle oak
last mauve
subtle oak
last mauve
subtle oak
#

Oh sorry I misunderstood. So it will function more like a traditional related work section, for introducing some previous related work while introducing concepts that will be frequently used in the following paper?

last mauve
subtle oak
#

Oh I see there are separate parts in the first paper… I’ve never noticed here before. I think I finally got what we need in this section. thank you so much!

last mauve
#

Ok so in comparing the arxiv-v1 and EMNLP versions of the first RWKV paper, I actually think we can just replace the current arxiv with the EMNLP version, and move directly to the RWKV-v5/v6 arch paper.

Edit: Ok, arxiv has been updated. Let's move forward with RWKV-v5/v6

last mauve
#

High-level things that need done in the RWKV-X overleaf:

Background:
1. Subsection on RNNs (similar to first paper, but directly copy nothing. Reword at the very least)
2. Subsection on Transformers and AFT (again similar to first paper, but directly copy nothing. Reword at the least)
3. Subsection on RWKV-v4 (summarization of the first paper, with an architecture figure). Can probably retool the current section 3 header in RWKV-X at "RWKV Architecture Summary" for this, along with the start of section 3 in the EMNLP version

Related Work:
Use the first paper's related work in appendix C as a template. Remember that this is anonymous and we can't say this is our arch.
4. Reword and update related work from Appendix C as a base
5. Add any subsequent work (mamba, hyena, RWKV-v4, etc)

Design:
6. The existing subsections 4.x in RWKV-x need more explanation and we need new figures similar to Figures 2/3 from the RWKV-EMNLP

Evaluations:
7. Need a set of figures on downstream tasks comparing to transformer and SSM arches (including RWKV-v4). Similar to RWKV-EMNLP's figure 5
8. Need scaling law results like figure 4 of RWKV-EMNLP, not like figure 4 of the mamba paper (see the discussion of item 8 in this channel for context on why we don't want a figure like mamba's)

Trained Models:
9. The existing section 5 and Table 1 in RWKV-X is pretty good. Some comments are to:
9a Add a "Name" column like table 2 of RWKV-EMNLP,
9b Clarify that these equations are per-token
9c All of the subscript-5/6 should be updated to subscript-v5/v6 to make it more explicit that these refer to different arches

Several other sections need started, for which the task is "start".

#

I'm going to start by making the high-level structure a bit more clear, and make sections more contributor-friendly with TODO statements and section skeletons

misty igloo
#

@last mauve for adding more explanation to subsections 4.x in RWKV-x do you mean that we need description of what's going on and how it works mechanically because the formulae are currently unclear, or some description in that section of the motivation for why these mechanisms were chosen?

last mauve
young sparrow
# last mauve High-level things that need done in the RWKV-X overleaf: Background: **1.** Sub...

8: I actually rather dislike the scaling laws plot in the mamba paper. They do not seem to perform any search for the optimal token-to-parameter ratio and instead assume that it's the same as it is for transformers. In the scaling laws plot I added to the EMNLP version, as well as both Kaplan et al. and Hoffmann et al., instead we search many combinations of (parameters, tokens) and then find the optimal configuration for each FLOP value and fit the curve to that.

The reason this is problematic is that it can disadvantage models that have different optimal tradeoffs. If they were just comparing to the optimal tradeoff identified in our paper or in Hoffmann et al. that would be fine as it would only disadvantage their model, but they also do this for several competitor models. This makes it impossible to know if they're hurting themselves more than they're hurting the competition.

#

That plot is meaningful as an argument that the architecture is better because for a fixed (param, token) pair the architecture outperforms others, but it's not an argument that the optimal scaling is better because it doesn't remark on the optimal scaling regime at all.

#

Put another way, it's effectively the same plot as our "average of 12 benchmarks" plot but using Pile loss instead of 12 NLP benchmarks. It's not a scaling laws plot.
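
A minimal sketch of the search-then-fit procedure described above, assuming the common FLOPs ≈ 6 · params · tokens approximation. The function names and the `RUNS` numbers are made up for illustration and are not taken from any of the papers mentioned.

```python
import math
from collections import defaultdict

# Hypothetical (params, tokens, final_loss) results; the losses are invented.
RUNS = [
    (1 * 10**8, 1 * 10**10, 3.30),
    (2 * 10**8, 5 * 10**9, 3.20),
    (5 * 10**8, 2 * 10**9, 3.35),
    (1 * 10**9, 1 * 10**11, 2.80),
    (2 * 10**9, 5 * 10**10, 2.70),
    (5 * 10**9, 2 * 10**10, 2.85),
]


def compute_optimal_frontier(runs):
    """For each FLOP budget, keep the (params, tokens) run with the lowest loss."""
    by_budget = defaultdict(list)
    for params, tokens, loss in runs:
        flops = 6 * params * tokens  # assumed cost model
        by_budget[flops].append((loss, params, tokens))
    frontier = []
    for flops in sorted(by_budget):
        loss, params, tokens = min(by_budget[flops])
        frontier.append((flops, params, tokens, loss))
    return frontier


def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**b in log-log space; returns (a, b)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly)) / sum(
        (x - mean_x) ** 2 for x in lx
    )
    return math.exp(mean_y - slope * mean_x), slope


frontier = compute_optimal_frontier(RUNS)
# Fit the compute-optimal parameter count as a function of the FLOP budget.
a, b = fit_power_law([f[0] for f in frontier], [f[1] for f in frontier])
```

The point of the sketch is only the order of operations: the optimum is picked per FLOP budget first, and the curve is fitted to those optima, rather than fixing one token-to-parameter ratio for every architecture.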

obsidian quest
# last mauve Ok so in comparing the arxiv-v1 and EMNLP versions of the first RWKV paper, I ac...

https://arxiv.org/abs/2305.13048 seems not updated yet

#

Should we name it
RWKV-5 and RWKV-6: xxx

#

CHANGE:
In this work we present RWKV-5, which builds on the architectural improvements and learned decays from RWKV-4, as well as the matrix valued states found in Linear Transformers.
(because it was proposed in Linear Transformers, not RetNet)

#

CHANGE:
Influenced by the Retention Network (RetNet) architecture ==> Influenced by the Linear Transformer architecture

#

GroupNorm = LayerNorm for each head. So no need to say it's GroupNorm.

#

Token 257-65529: actually includes lots of languages, not just Asian. and symbols.

Moreover it's a greedy tokenizer. Faster and Easier to code.
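
A minimal sketch of what a greedy, trie-based tokenizer can look like: at each position, take the longest vocabulary entry that matches. The `TOY_VOCAB` entries and token ids below are hypothetical and are not the real world tokenizer vocabulary.

```python
class TrieNode:
    __slots__ = ("children", "token_id")

    def __init__(self):
        self.children = {}
        self.token_id = None


def build_trie(vocab):
    """Build a byte-level trie from a mapping of token bytes -> token id."""
    root = TrieNode()
    for token, token_id in vocab.items():
        node = root
        for byte in token:
            node = node.children.setdefault(byte, TrieNode())
        node.token_id = token_id
    return root


def greedy_encode(root, data):
    """Encode bytes by repeatedly taking the longest matching vocab entry."""
    ids = []
    pos = 0
    while pos < len(data):
        node = root
        best_id, best_end = None, pos
        i = pos
        # Walk the trie as far as the input allows, remembering the last
        # position at which a complete token ended.
        while i < len(data) and data[i] in node.children:
            node = node.children[data[i]]
            i += 1
            if node.token_id is not None:
                best_id, best_end = node.token_id, i
        if best_id is None:
            raise ValueError(f"no token matches at byte offset {pos}")
        ids.append(best_id)
        pos = best_end
    return ids


# Hypothetical toy vocabulary for illustration only.
TOY_VOCAB = {b"a": 1, b"b": 2, b"ab": 3, b"abc": 4, b"c": 5}
TOY_TRIE = build_trie(TOY_VOCAB)
```

There is no merge table to learn or replay, which is one reading of "faster and easier to code"; whether it helps or hurts downstream quality is a separate question that this sketch does not address.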

#

We can follow this narrative:

  • Matrix-valued states were proposed in Linear Transformers.
  • RWKV = [exp. decay + token shift + AFT]
  • RetNet found [exp. decay + xPos + Linear Transformer] works
  • So RWKV 5/6 is doing [exp. decay + token shift + Linear Transformer]. We don't use any extra positional embedding.
    Moreover RWKV models are much better tuned than RetNet. We can show the loss curves.

And we should compare with Mamba, GateLoop, etc.

We can make a table:

  1. decay/gate: real-valued exp. decay, complex-valued, data-dependent etc.
  2. positional embedding
  3. state: RWKV4 = vector state, Mamba/SSM is like "multi-vector" state, and then we have matrix-valued states
tough crane
obsidian quest
#

pretraining loss curve. train from scratch on new data

tough crane
obsidian quest
misty igloo
obsidian quest
#

Extra Silu gate is used in Mamba too

#

We can mention RWKV-5-lite as a variant without custom cuda kernel requirement for training

#

rwkv5 rwkv6 were trained with 0.001 weight decay (only for matrix-valued weights: linear, emb)

#

mamba is utilizing SRAM for similar parallelization

misty igloo
#

@obsidian quest did you have any comments about the token shift descriptions? I want to make sure I'm not getting anything wrong about the rationale

obsidian quest
#

token shift = induction head & locality a priori, similar to conv1d with kernel sz 2 too

misty igloo
obsidian quest
#

and we can use this_token + last_token to detect this
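
A minimal sketch of token shift as a per-channel mix of the current and previous token, assuming the static interpolation form `mu * x_t + (1 - mu) * x_{t-1}`. The function name and the zero vector used before the first token are assumptions for illustration.

```python
def token_shift(xs, mu):
    """Mix each token vector with the previous token vector, channel by channel.

    xs: list of per-token vectors (lists of floats)
    mu: per-channel mixing weights in [0, 1]
    out[t][c] = mu[c] * xs[t][c] + (1 - mu[c]) * xs[t-1][c], with x_{-1} = 0
    """
    prev = [0.0] * len(mu)
    out = []
    for x in xs:
        out.append([m * cur + (1.0 - m) * p for m, cur, p in zip(mu, x, prev)])
        prev = x
    return out
```

Written this way, the operation is a causal depthwise 1-D convolution with kernel size 2 and per-channel weights `[1 - mu, mu]`, which is the conv1d comparison made above.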

obsidian quest
#

#general message
We should emphasize RWKV-2-RNN was the first to show "exponential decay is all you need"

#

can add a section in appendix for the timeline of RWKV

subtle oak
#

Is it like a chronicle from RWKV-1 to RWKV-6? Maybe I talked about this before😂

obsidian quest
#

from https://arxiv.org/abs/2312.06635

This type of model with matrix-valued hidden states that change over time is also known as "fast weights"

yeah we should make Schmidhuber happy too 😂

gusty condor
last mauve
misty igloo
last mauve
# misty igloo I've been somewhat caught in the cross currents here, trying to thread the needl...

There's a very clear distinction on what's appropriate, I think. If the secondary info (e.g. intuition from a similar study/paper like "studies on CNNs demonstrate that shallow layers learn general representations while deep layers learn specific representations [cite]") on a design feature is included to help the reader understand how/why the design feature works, that's appropriate. If the secondary info is for any other reason (e.g. to claim ownership or to give interpersonal/organizational history like "we discovered XXX in May 2023 before Mamba"), then it violates double-blind and isn't appropriate for a paper.

Anything flies for a blog post, and I encourage people to post the history and demonstrate ownership there.

#

To be clear though, we're still able to make statements in the Background and Related Work sections such as "RWKV [cite] introduced exponential decay is all you need", but they can't be excessive and they can't violate double-blind

misty igloo
#

I think I've avoided adding anything that's inappropriate in terms of anonymity in all the sections I wrote or edited to date (of course feel free to correct me if not)
The push and pull for me is more just about what extra background we include in terms of the (often third party) developments that lead up to this combination that we call RWKV5 and 6, since I think Bo has expressed wanting that in the paper.

void quartz
#

btw how many people here are at neurips?
(dropping by tmr)

remote elbow
#

#1171291697561477170

last mauve
void quartz
last mauve
young sparrow
#

@void quartz I'm here all week, would love to meet you

void quartz
#

Great!, see you both from tmr morning then 🙂

last mauve
#

To help me in writing this paper, can someone in clear terms either explain or point me to something comparing the Mamba arch with RWKV-v4? Mamba will likely be our primary competing arch and I want to be able to strongly differentiate RWKV from mamba in the background/related sections of the upcoming paper

misty cedar
#

I guess with Based releasing benchmarks for multiprocessing using linear transformers, these graphs gain a little more relevancy.

void quartz
void quartz
#

kinda dumb: Can we do a direct counter of the RetNet paper "parallel table", with a clear definition of parallel in the v5 paper

We got rejected for the Oak Ridge compute grant in favor of a new RNN (yet to be out, so no idea of the details) that cited that RetNet paper, said that they fixed that parallel problem, and said it is the reason why RWKV could not scale past 14B.

obsidian quest
#

that is very mean of them to speak so lol

#

we can easily demonstrate the training speed of rwkv is constant regardless of ctxlen

void quartz
obsidian quest
misty igloo
void quartz
misty igloo
#

Earlier SSMs were historically computed using long convolutions in $O(N\log N)$ time per sequence, but could also be formulated as recurrence relations. Recent SSMs featuring data-dependent $A$ and $B$ terms (GateLoop, Mamba) are only able to be formulated as recurrence relations. Generally, such recurrence relations can run in $O(N)$ time with respect to sequence length
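
A small scalar sketch of the two formulations mentioned above, assuming a time-invariant recurrence `h_t = a*h_{t-1} + b*x_t`, `y_t = c*h_t`. All function names are illustrative. The direct convolution below costs O(N^2); the O(N log N) figure comes from evaluating the same convolution with an FFT, which is not shown here.

```python
def ssm_recurrent(xs, a, b, c):
    """Run the recurrence step by step: O(N) time, O(1) state."""
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x
        ys.append(c * h)
    return ys


def ssm_convolution(xs, a, b, c):
    """Same outputs via the fixed kernel K_j = c * a**j * b."""
    kernel = [c * (a ** j) * b for j in range(len(xs))]
    return [
        sum(kernel[j] * xs[t - j] for j in range(t + 1))
        for t in range(len(xs))
    ]


def ssm_recurrent_data_dependent(xs, a_fn, b_fn, c):
    """With a and b depending on the input, no single fixed kernel exists,
    so only the recurrent form is available."""
    h = 0.0
    ys = []
    for x in xs:
        h = a_fn(x) * h + b_fn(x) * x
        ys.append(c * h)
    return ys
```

The first two functions agree on every input because the kernel is just the unrolled recurrence; the third has no convolutional counterpart, which is the distinction drawn for GateLoop and Mamba.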

obsidian quest
misty igloo
# last mauve To help me in writing this paper, can someone in clear terms either explain or p...

RWKV-4 and Mamba are quite different, but RWKV-6 and Mamba are much more similar

Mamba follows the traditional state space mechanism (more or less) of:
$h = h {\Delta A} + x {\Delta B}$
$y = h C + x D$

where dB expands x into a new dimension and dA is supposedly a diagonalized version of something theoretically complicated
(I say supposedly because their code doesn't quite match their paper and some things are unexplained)
and C reduces the hidden state back to the embedding dimension

RWKV-6 is more like

$kv = (x W_k)^T (x W_v)$
$h = h w + kv$
$y = r (h + kv \cdot u)$

unfortunately, I don't know of a way to clearly show the differences between these
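
One way to put the two recurrences side by side is a single-step sketch in plain Python. The shapes, argument names, and the choice to compute the RWKV-style output from the previous state plus the `u`-weighted current `kv` (before the state is updated) are assumptions made for illustration; `r`, `k`, `v` are taken as already projected from `x`, and neither function is the released code of either model.

```python
def mamba_like_step(h, x, dA, dB, C, D):
    """One step of h = h*dA + x*dB, y = h*C + x*D with a diagonal transition.

    h, dA: [channels][state_dim]; x, D: [channels]; dB, C: [state_dim]
    """
    new_h = [
        [h[c][n] * dA[c][n] + x[c] * dB[n] for n in range(len(dB))]
        for c in range(len(x))
    ]
    # C reduces each channel's state vector back to one output value.
    y = [
        sum(new_h[c][n] * C[n] for n in range(len(C))) + x[c] * D[c]
        for c in range(len(x))
    ]
    return y, new_h


def rwkv6_like_step(h, r, k, v, w, u):
    """One step for a single head with a matrix-valued state.

    h: [key_dim][value_dim]; r, k, w, u: [key_dim]; v: [value_dim]
    """
    # Outer product k^T v: every key channel paired with every value channel.
    kv = [[k[i] * v[j] for j in range(len(v))] for i in range(len(k))]
    y = [
        sum(r[i] * (h[i][j] + u[i] * kv[i][j]) for i in range(len(k)))
        for j in range(len(v))
    ]
    new_h = [
        [w[i] * h[i][j] + kv[i][j] for j in range(len(v))]
        for i in range(len(k))
    ]
    return y, new_h
```

In this form the structural difference is visible in how the state is filled: the first expands each input channel by a shared `dB` vector, while the second writes an outer product of two separate projections of the input.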

misty igloo
gusty condor
spiral minnow
misty igloo
#

I guess I'm also just not certain what Quentin's goal is in showing the differences so it's hard for me to know if that suffices 🙂 As seen above, their attention formulae have terms that are quite similar in some places... but there's a lot of nuance too, like because of the way the Mamba incoming projection replaces some of what normally would be the projection from inputs to values, and how multiplying out (k^T)(v) per head is different than just expanding the full input by a smaller new dimension via matrix dB

So despite being similar, the differences are quite complicated.

#

And just to add a cherry on top, the Mamba code appears NOT to quite match the paper. And Bo says that the results don't match either!!! (And that the reported results must employ some secret sauce that isn't in the publicly released code)

#

Fun stuff

steady ether
#

The authors of Mamba will be giving a community talk at NeurIPS. Those attending the conference can go and ask them questions. 😉

void quartz
#

sadly 😦 will miss it - me & harrison - our flight got delayed till 5pm

obsidian quest
obsidian quest
#

please update RWKV-4 paper to use "RWKV-4" instead of RWKV 🙂

gusty condor
#

I wonder whether title is changeable

RWKV-4: Reinventing RNNs for the Transformer Era

Anyway, my opinion is that, if anonymity matters to us, then that might be a bad idea (it alludes that "we" have developed RWKV-1 to 3 and are aiming for 5+), but if we are already famous, then that doesn't matter a bit (like OpenAI's articles are only posted on OpenAI's website and not anywhere else).

misty igloo
#

that's a good way to compare that part in the paper, if we want to!

void quartz
#

(Do paper reviewers expect us to change reality to fit their version schema)

misty igloo
#

it's just meant to protect the review process so that there is no biased treatment of the paper i.e. for acceptance into a journal

young sparrow
young sparrow
young sparrow
obsidian quest
young sparrow
obsidian quest
#

they are using this opportunity

#

they certainly know of the existence of rwkv 5/6 and avoid mentioning it

young sparrow
#

I don't know that to be true and I think it's immoral to accuse them of that unless you are certain

#

Have they told you that?

obsidian quest
#

some of them follow my twitter

young sparrow
#

That doesn't mean that they know that the models are finished

obsidian quest
#

rwkv5 models were released long ago

young sparrow
#

That doesn't mean that they know that the models are finished.

And like I said, it's widely considered problematic to compare with unpublished work. Even if they know about it, they could be waiting for a paper and not trying to sneakily make themselves look good

#

Accusing them of acting in bad faith based on this evidence will only cause people to dislike you and not want to compare to your work

#

I cannot more strongly recommend that you stop doing this

sharp sonnet
#

I agree. Unfortunately, people look for published work (or a preprint) to compare.

We have no reason to believe anyone acted in bad faith. Very likely, this happened just because the researchers may not have realized the work is finished.

young sparrow
#

Also writing papers takes time. For all you know they finished the experiments a while ago and only just got the paper out

sharp sonnet
#

The right steps would be publishing our preprints faster and reaching out to authors if any claim is incorrect so that they correct them (eg the parallel table)

obsidian quest
#

It's unfortunate that we don't have as many resources

young sparrow
#

Yes it is

obsidian quest
#

The table in RetNet is certainly acting in bad faith, so I do think there is some hostility towards us as a potential competitor

young sparrow
#

I agree that they're not playing nicely.

sharp sonnet
#

I don’t know much about what happened. However, I strongly believe we should just continue doing good science

young sparrow
#

But this is still the wrong way to go about addressing this fact

obsidian quest
sharp sonnet
#

We can add experiments correcting any of the possibly incorrect claims.

obsidian quest
#

In the future I should make a disclaimer that my rants only represent myself and don't represent RWKV views 😂

young sparrow
#

Isn't there an RWKV Twitter? Using that to distribute release info would be helpful on both the reputational and the advertisement front

obsidian quest
steady ether
#

How about we release a working paper on RWKV-5? It doesn’t need to be complete.

young sparrow
obsidian quest
#

I am the kind of person who has a tendency to sometimes break rules as long as it doesn't harm others (and i will take / pay for the consequences too, will not avoid them) 😂 most people will hate me

#

It's my fault that we used "RWKV" for the RWKV-4 paper, and haven't published the RWKV-5/6 paper in time. Life is harsh 😂

void quartz
void quartz
#

And also to not rush everyone working on it in this channel

#

So let’s aim for mid/late Jan?

#

For those who compared to v4 - we can ask them politely if they can add v5 to compare (the 1.5B / 3B models) when appropriate.

If they did so in good faith, they would be open to amend.

If they did it in bad faith, I doubt confronting them will change anything (like retnet)

misty igloo
#

Is the plan to publish with full v5 7B results but a more limited set of v6 results? (1.5B or maybe 3B by preprint release time)

#

@spiral minnow just wanted to note that I removed your addition of quadratic memory complexity for transformers - that has been shown to be unnecessary e.g. flashattention

void quartz
misty igloo
#

I added all the formulae and descriptions so that we wouldn't fall behind

#

Just in case we were ready - if not, that's fine and we can delay the v6 paper easily

#

since we have it all written now

void quartz
#

i would defer to those who know the academic norms then on this, was worried it was just weird that we added v6 without all the models

#

i think another direction if we want to push against this issue

#

is we need to publish blogs

misty igloo
#

yeah, just a question of whether a 1.5B model for v6 enough when we show up to 7B for v5

void quartz
#

so it doesn't have the same rigor requirements as the paper, and is at least official enough

misty igloo
#

alternatively, we could publish a working paper for v6 - but it's probably less work for it to remain integrated into the current paper

void quartz
#

u know what - setting up an RWKV blog has been so long on my todo - just gonna set it up via substack

#

( classic coder conflict of wanting to do it better, but not having the time )

misty igloo
#

at least then they will have to show the v6 1.5B results when comparing to their 1.5B results

void quartz
#

i see what you mean there

#

okok that sounds good (didn't consider that part)

misty igloo
#

we should also release a 125m model btw so people have a reference point

#

and any standard sizes people tend to use in between

#

since we can train those quickly

#

and it will help ensure that upcoming papers quote our best results

#

especially when they don't train larger versions, it's useful to have our small one shown to compare side by side

#

I'm personally in favor of keeping the two papers integrated as they are now, simply because it's less effort than making a whole new one. But I'm open to a separate rwkv-6 working paper or somesuch if our advisors think that's best!

void quartz
#

so we can show the transition

#

if the result is close enough for the partial, it can close off a possible criticism that it's not a fair comparison with a different dataset/tokenizer

misty igloo
void quartz
#

not sure if this is useful, or a waste of resource (which is already limited)

#

the idea is just all 3 x 2 variants

misty igloo
#

it'd be great, but I'm not trying to make more work or strain our resources... just any single 125m v6 model would probably help a lot

#

because it will force people to show it in comparisons when they only have their own small models to compare to

#

(this only helps if we publish a v6 paper tho)

obsidian quest
gusty condor
void quartz
#

IMO - i think the world tokenizer needs a separate paper

been speaking to multiple researchers who are doing research specifically on language models for their national languages (and faced tokenization issues) and are working on their own regional tokenizers

and there is lots of interest in how and why we did the world tokenizer without BPE, and what its compression ratio would be for their own respective languages

#

If proven out as things progress, the "trie tokenizer" approach can end up replacing BPE - if that makes sense - and this is completely separate from the architecture

young sparrow
obsidian quest
void quartz
#

Questions like - does it hurt evals - or learning rate was up in the air : which I could not answer accurately 😬

Intuitively the rwkv world model says it’s ok. But that’s a gut feel not a tested hypothesis

#

Using greedy tokenizers is very counterintuitive given how established BPE is

So same situation of RNN 2 years ago haha

obsidian quest
#

because my world tokenizer respects utf-8 boundary & word boundary. this is very important

#

otherwise you can have bad tokenization (such as "aliasing")

last mauve
young sparrow
tough crane
void quartz
young sparrow
#

Did they tell you explicitly that this is why you were rejected?

void quartz
#

that the preferred RNN candidate, uses the retnet claims, as justification to support them over us
(there is no paper, no materials, etc for the other group)

young sparrow
#

Does our application contain evidence to the contrary?

void quartz
#

no - we had no idea we would have to fight that claim

#

we have provided multi-node training data - but our largest is 8 nodes?

young sparrow
#

Not having evidence that your model scales efficiently is typically a decent reason to reject

#

You don't need to run it for long, but you absolutely need to show the ability to leverage large scale resources effectively

void quartz
#

i see, might be why our rep is trying to settle for a smaller grant amount - to prove out leveraging large scale resources specifically

#

cause it is a chicken and egg - we cant prove we can run on 1000 nodes, till we get limited access at least

#

they did ask as a follow up (before rejection) - have we run on 1000 nodes, do we think it will work

  • no, we never had such access to run at such scale
  • yes, as we are built on pytorch lightning for multi-node training, which has been shown to scale past 1000 nodes for deepspeed on transformer architecture. RWKV leverages pytorch lightning and deepspeed in the same way.

they did run with us across 100 nodes (for 1 hour?), as part of the validation, but we have no proof of going beyond 100

tough crane
steady ether
void quartz
void quartz
#

hmmm would multi-node CPU count?

#

i just presumed we need to at least put a GPU

steady ether
#

They still have GPUs, just really weak ones.

obsidian quest
tough crane
#

plotting y-(performance, training time elapsed) and x-(nodes=8, 16, 32, ..., 256)

young sparrow
#

@void quartz I know some people. Let me see about pulling some strings. Was your application for Frontier?

last mauve
last mauve
void quartz
misty igloo
# steady ether This has been updated in the V5 paper. It should be pretty clear going forward.

Why are we saying transformer has memory complexity of N^2? That's been shown to be avoidable e.g. FlashAttention
I'm not sure that saying SSMs have memory complexity of NlogN is really correct, either
And what is the N in memory complexity? Many parts of this table don't seem right to me
Also, saying that RNNs can't do multi-gpu training is very questionable... since rwkv is an RNN
Maybe you mean a specific RNN architecture like LSTM?

void quartz
void quartz
#

that can help disprove the "cannot train at scale" claim and put it to rest

#

(ps: we had issues with the frontier AMD node scaling past 100, from what looks like node-to-node communication issues)

#

you folks probably know better at a 1000 node scale, architecturally speaking, since it's just DDP training runs, and all of that is deepspeed - am i right in understanding this is handled by deepspeed / pytorch lightning ?

void quartz
obsidian quest
#

heard megatron is much better at scaling

void quartz
obsidian quest
#

ok got this PM "wait before using megatron, we will release soon a nanotron" 😂

young sparrow
#

You'd need to write a bunch of custom code to use Megatron, since it was designed for transformers

#

If you're going to put that work in, I highly recommend using GPT-NeoX which is a similar library to Megatron with DeepSpeed support and other custom features.

#

(Or, "I highly recommend chatting with Quentin about if it would be a good idea to add...")

void quartz
#

the GPT-NeoX codebase is significantly easier to understand than Megatron itself

young sparrow
#

Quentin works very hard to make it so 🙂

steady ether
#

Ok, I've changed to "Vanilla Transformer" and "LSTMs".

misty igloo
# steady ether This was meant to provide clarification on a frequently referenced table.

Ah gotcha. Didn't remember where I had seen that table before 🙂
I'll take a look back at the retnet paper, but I think placed here it's missing some context that's important. Also, saying SSM is not the same as saying H3/S4/Hyena, since Mamba is a SSM (and also probably shows that those two can now be implemented in what would be called O(N))
I'm a bit worried that copying RetNet's table may not be a great path for us.

#

I mean, I'd go as far as to say that their data in that table is extremely misleading. We don't want to do the same thing!

steady ether
#

Yeah, that's a good point!

misty igloo
#

This whole idea of long-sequence memory complexity that they claim is kind of a red herring. 😭

#

Maybe we can find an alternative way to point out the differences that show RWKV's benefits

#

And just to be clear, RWKV and Mamba are very similar in all these kinds of metrics. We shouldn't avoid that fact

steady ether
#

By the way, looks like the S5 paper also has a somewhat similar table

misty igloo
#

that table presents a much fairer comparison imho

#

but 'parallel' yes/no for RNNs is still pretty misleading

#

actually I think this table is wrong too haha

#

the inference column is somewhat misleading

steady ether
#

Ah, they sort of clarified earlier:

while also being parallelisable across the sequence dimension during training.
tough crane
misty igloo
#

Rather than copy someone else's table, let's come up with a plan for what we're trying to show in comparison and figure out how to best represent that

#

and we can make it fair, unlike retnet paper

tough crane
#

Of course, Transformer's quadratic attention is NOT parallelizable by the definition in MS's survey, because of the fully connected matrix multiplication along the time axis 🤣

young sparrow
#

We do all of these simultaneously

#

Isn't this exactly what the "unrolling" at train-time for RWKV is for?

tough crane
#

Isn't this exactly what the "unrolling" at train-time for RWKV is for?

I personally think it's exactly possible if we have a batch with 9 sequences in parallel.

misty igloo
#

rwkv5.1 does this sort of matrix multiplication, but rwkv5.2 and rwkv6 CUDA kernels don't bother to parallelize across time because it's highly effective to keep everything in gpu SRAM for a huge constant time speedup and obtain excellent parallelization over the non-time dimensions

#

mamba claims to use parallel associative scan to parallelize over time as well, but I haven't evaluated it to see if they actually do that in their CUDA code (their code often mismatches their paper in other ways so I'm a bit skeptical)
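
For reference, a generic sketch of how a parallel associative scan applies to a linear recurrence `h_t = a_t * h_{t-1} + b_t`. This illustrates the general technique only and makes no claim about what Mamba's CUDA code actually does. The function names are illustrative, and the "parallel" version here is a sequential Python simulation of a Hillis-Steele scan.

```python
def combine(left, right):
    """Compose two recurrence steps; `left` happens earlier in time.

    Applying (a1, b1) then (a2, b2) to h gives a2*(a1*h + b1) + b2.
    """
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)


def sequential_scan(a, b):
    """Reference: O(N) sequential evaluation with h_{-1} = 0."""
    h = 0.0
    out = []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return out


def hillis_steele_scan(a, b):
    """Inclusive scan in ceil(log2(N)) rounds.

    Within a round every position only reads the previous round's values,
    so all positions could be updated in parallel on suitable hardware.
    """
    elems = list(zip(a, b))
    step = 1
    while step < len(elems):
        elems = [
            combine(elems[i - step], elems[i]) if i >= step else elems[i]
            for i in range(len(elems))
        ]
        step *= 2
    # With a zero initial state, h_t is the additive part of the prefix.
    return [b_t for _, b_t in elems]
```

Because `combine` is associative, the same trick works when `a_t` is data-dependent. This variant does O(N log N) total work; work-efficient scans exist, but the constant factors are what matter in practice, as discussed in this thread.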

#

and to be clear, the current draft skips 5.1 and only describes 5.2 and 6

tough crane
#

If my assumption is wrong, then I'm not sure about the attached table's definition

misty igloo
#

I don't really know what MS's survey idea is, or if it's at all reasonable, but I think we just need to try to be fair and descriptive

#

as i recall they already agreed to get rid of the training parallelization column in the next revision, according to @steady ether

tough crane
misty igloo
tough crane
#
  • Batched Parallelization along time (RWKV-v4 and the other decoder-only models could do this type)
  • Single Sequence Wise Parallelization along time ( Mamba asserts this type ?? )
misty igloo
#

to be clear, rwkv 5/6 can be implemented the same way as mamba claims w/ parallel scan - they just don't happen to be in the code released

#

I already state all of this in the draft

#

but we can certainly clean up that language if needed

#

I don't really understand what is being argued about here 🙂

tough crane
misty cedar
#

Rwkv v5 can be rewritten to be reliant on only a "cumulative sum with decay" operation for cross temporal information bleeding. V4 was the same... how do other linear models perform their operations in a way that they are more parallelizable than that? Chunked temporal information forwarding?

tough crane
#
| Model | Type X-parallel | Type Y-parallel |
| --- | --- | --- |
| name1 | Yes | Yes |
misty igloo
misty igloo
#

many of these models were not implemented this way, including RWKV 5.2, 6

#

but that doesn't mean they can't be if it were useful to do so

#

and afaik the only reason mamba is able to get away with doing so without a horrific constant time penalty is that they limit their effective head dimension to 16

#

but that's unrelated to computer science asymptotic time complexity calculations

misty cedar
#

Even then, you can implement a massive triangular decay matrix, multiply it with the unmixed state, then do the cumsum using a tree algorithm for max parallelism. It's technically parallel, but it's so much more efficient to just do a scan
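
A scalar sketch of a middle ground between the two approaches above, assuming a constant decay `w`: rows of a small triangular decay matrix are used inside each chunk, and a single carried state is forwarded between chunks. This is a hypothetical illustration of "chunked temporal information forwarding", not a description of how any of the models discussed here are implemented.

```python
def decayed_cumsum_scan(xs, w):
    """Reference scan: h_t = w * h_{t-1} + x_t, with h_{-1} = 0."""
    h = 0.0
    out = []
    for x in xs:
        h = w * h + x
        out.append(h)
    return out


def decayed_cumsum_chunked(xs, w, chunk_size):
    """Same result, computed chunk by chunk."""
    out = []
    carry = 0.0  # state at the end of the previous chunk
    for start in range(0, len(xs), chunk_size):
        chunk = xs[start:start + chunk_size]
        # Each output in the chunk is an independent dot product with one
        # row of a lower-triangular decay matrix, plus the decayed carry,
        # so the positions inside a chunk could be computed in parallel.
        local = [
            sum((w ** (t - s)) * chunk[s] for s in range(t + 1))
            + (w ** (t + 1)) * carry
            for t in range(len(chunk))
        ]
        out.extend(local)
        carry = local[-1]
    return out
```

With `chunk_size` equal to the sequence length this degenerates to the full triangular matrix, and with `chunk_size` of 1 it is the plain scan, so the chunk size trades extra arithmetic for parallelism.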

tough crane
#

and afaik the only reason mamba is able to get away with doing so without a horrific constant time penalty is that they limit their effective head dimension to 16

Mamba's claim seems to depend on GPU RAM size ??

misty igloo
#

this whole discussion is just something that MS created by releasing TWO preprints with bogus analysis and false claims

#

and they have agreed to retract that part

#

so I still don't really understand the goal here for us

misty igloo
young sparrow
misty igloo
misty cedar
misty igloo
#

Like, a discussion section? Or a table of some sort?

#

The problem with a table is that essentially nearly all the models have the same entries in the table, in terms of asymptotic time complexity and parallelizability across time

young sparrow
# misty igloo do you have a suggestion on how to approach it?

I think it's a good idea to prep a blog post that shows the different tables, explains why they're wrong / explains the issues with succinctness, and presents a corrected table.

Maybe we won't release it for a while, but it'll be good to have on hand.

misty cedar
#

What's the academic equivalent of "as per my last email"?

"Contrary to (xyz et al, 12a section b), parallelization blah blah..."

young sparrow
steady ether
#

Hmm, a table having the same entries is a bit problematic. Agree with blog / discussion section

steady ether
# young sparrow Why

I'm on the fence. This paper is really about introducing V5, but it's also important that we clarify training parallelization.

young sparrow
#

I meant why is the table having the same entries problematic

steady ether
#

There were some concerns about us using a very similar table to other papers.

young sparrow
#

Why would that be concerning

steady ether
#

I guess not then? I made an assumption based on an earlier conversation: #1103039376184852622 message

gusty condor
# misty cedar Academic pettiness? Their incorrect claims may have negatively effected a comput...

Their RetNet paper is not receiving good feedback, see https://openreview.net/forum?id=UU9Icwbhin (especially Reviewer 8FpU), where the table is questioned the most.

tough crane
#

Q1: “Impossible Triangle” is an absolute overclaim because RWKV and H3 have already demonstrated models are comparable to Transformers

A1: The claim is fair enough. The “comparable performance” means that the models achieve similar results under the same setting (e.g., #parameters, and training corpus). For example, previous comparisons use Transformers with absolute position while the compared methods benefit from relative position modeling. Moreover, in H3 paper, the comparable results are in hybrid settings (i.e., combine H3 and Transformer layers), but we don’t add any Transformer layers. We conducted various controlled experiments (with matched #parameters and using the same training corpus) to compare different architectures. We are confident that the claim holds well. The experiments in Table 4 also show that previous methods still have a big gap.

Q2: RWKV can indeed be computed in parallel.

A2: We give a clear definition on “training parallelization” in the caption of Table 1, which is discussed from the sequential perspective. “∗”: whether the training implementation is sequentially parallelized, although RWKV uses channel-wise parallelism. As stated in A1, RWKV’s performance is actually not comparable w
gusty condor
#
Q2: RWKV can indeed be computed in parallel.

A2: We give a clear definition on “training parallelization” in the caption of Table 1, which is discussed from the sequential perspective. “∗”: whether the training implementation is sequentially parallelized, although RWKV uses channel-wise parallelism. As stated in A1, RWKV’s performance is actually not comparable with Transformers according to our experiments (i.e., same #parameters, same data, and with relative position modelings). So, the statement of RWKV in Table 1 is fair enough.

Relative position modelings hurt RWKV performance? 🤔

rose mango
#

Wow those aren't great reviews

#

The training parallelization definition is like Internet providers offering unlimited* data

*Notice the asterisk

rose mango
rose mango
#

@misty igloo didn't you do a training run with/without RoPE?

misty igloo
rose mango
#

Also, should there be test runs on small models (<100 M) a la TinyStories?

young sparrow
rose mango
#

Since the models aren't large, I can do it

obsidian quest
#

lets add another column to the table: state size.

rwkv2/3/4 has the smallest state size of all models here. this is a plus in some scenarios.

it's the first and only design achieving good LM performance with such tiny states.

a rwkv4 with rwkv6 trick will be highly interesting.

gusty condor
#

@misty igloo How are your experiments about RWKV's positional encoding going?

misty igloo
#

my historically 'best' model is one that pairs some of the parts of rwkv like token shift with more traditional MHA

gusty condor
misty igloo
#

I've tried using RWKV-style weight decay with traditional MHA and in my experience it works almost as well as ALiBi

#

I have some new models that use that new Based softmax approximation alongside RWKV style decay and linear attention and it works great

obsidian quest
#

it's a trainable alibi. should be better.

misty igloo
#

then again, alibi only operates per head, unlike rwkv5.2/6+

obsidian quest
#

exp(additive) 🙂

misty igloo
#

but it works great

#

this is what I meant about alibi being linear over time

#

(from their github)

#

maybe you meant the exponential part is the softmax applied to that
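
A quick way to check the "exp(additive)" reading: ALiBi subtracts a penalty that is linear in distance from the logits, and once softmax exponentiates them, that is the same as multiplying each unnormalized weight by a fixed per-step decay of exp(-slope). The sketch below assumes a single query at the last position and uses illustrative names and values:

```python
import math

def alibi_weights(scores, slope):
    """Softmax over logits with a linear distance penalty (query = last position)."""
    n = len(scores)
    logits = [s - slope * (n - 1 - j) for j, s in enumerate(scores)]
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def decay_weights(scores, decay):
    """Same weights written as exp(score) times a multiplicative per-step decay."""
    n = len(scores)
    raw = [math.exp(s) * decay ** (n - 1 - j) for j, s in enumerate(scores)]
    z = sum(raw)
    return [r / z for r in raw]

scores = [0.2, 1.0, -0.5, 0.3]
slope = 0.25
additive = alibi_weights(scores, slope)
multiplicative = decay_weights(scores, math.exp(-slope))
```

So the bias is linear over time in logit space but exponential in weight space; the difference from RWKV-style decay discussed here is that the ALiBi slope is fixed per head rather than trained.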

misty igloo
# obsidian quest it's a trainable alibi. should be better.

its possible my initializations were better for alibi and my trainings didnt run long enough, or maybe some other confounding factor
I wasn't specifically trying to drill down onto positional encoding at the time - just was trying to rapidly find the best mixed model for use with that Based approximation

#

(second order taylor series approximation of softmax)
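
For reference, a tiny sketch of what a second-order Taylor approximation of the softmax numerator looks like: exp(x) is replaced by 1 + x + x^2/2 and the result is normalized as usual. This is only the scalar idea under the assumption of small scores; it is not the actual Based feature map or code.

```python
import math

def taylor_exp2(x):
    """Second-order Taylor expansion of exp(x) around 0."""
    return 1.0 + x + 0.5 * x * x

def normalized(weights):
    total = sum(weights)
    return [w / total for w in weights]

scores = [0.3, -0.2, 0.1]  # small scores, where the expansion is accurate
softmax = normalized([math.exp(s) for s in scores])
approx = normalized([taylor_exp2(s) for s in scores])
max_gap = max(abs(a - b) for a, b in zip(softmax, approx))
```

The approximation degrades as scores move away from zero, so how well it works in a real model depends on the scale of the query-key products.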

obsidian quest
obsidian quest
misty igloo
#

@obsidian quest regarding what @gusty condor was saying, do you think that token shift adds short term positional information?

#

(If so, I'd like to understand that aspect better so we can include it in the paper)

young sparrow
misty igloo
misty igloo
young sparrow
rose mango
# young sparrow It's not at all clear to me what someone is supposed to glean from this TBH

It's late here and I only glanced quickly, but what I believe is stated is:

  1. The linked paper is the "Gated Linear Attention Transformers" paper, which compares their new GLA architecture with Mamba based on the Mamba code on GitHub.
  2. GLA outperforms Mamba on multiple metrics, presumably in contrast to what is stated in the Mamba paper.
  3. For this to be the case, there must have been some trick to produce the numbers in the Mamba paper; naturally this couldn't be done in the GLA paper as it's an independent evaluation.
young sparrow
alpine ferry
#

this paper looks super interesting, are there still any tasks where you could use another contributor? I see the paper is already out on arxiv etc., so totally fine if it's too late to join. Learnt a lot from this work tho :). Great work!

rose mango
#

v4 is published, v5 is what's currently being worked on

gusty condor
misty cedar
#

is it worth putting in pseudo-torch for the non-academics?
or just link to code bases on github?
eg:

super naive rwkv v5 linear attention is:
(H = heads, C = dims, B = batch, T = time)

k = k.reshape(B,T,H,C//H,1)
v = v.reshape(B,T,H,1,C//H)
r = r.reshape(B,T,H,1,C//H)
kv = k@v  # outer product per head: B, T, H, C//H, C//H
att = kv.cumsumwithdecay(decay, dim=1)  # pseudo-op, not a real torch method
out = matmul(r, att)  # B, T, H, 1, C//H
# groupnorm and output head after this

with little effort you can fuse all these operations into a single kernel to save memory and compute (a fused kernel lowers intermediate memory usage from O(C^2) to O(C), while being parallelizable along B, H, and C)

gusty condor
#

Yes, put them in the appendix

misty igloo
#

I think we should put in 'u' (bonus) into any version we list, so it represents the actual architecture
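
A rough idea of what a version with the bonus could look like, written as a plain recurrence for one head in pure Python so it runs anywhere. This follows my understanding of the v5 formulation (the current token gets `u` instead of the decayed state, and `w` is the already-transformed per-channel decay in (0, 1)); treat the exact form as an assumption and check it against the reference code before using it in the paper.

```python
def rwkv5_head(r, k, v, w, u):
    """Recurrent form for a single head (illustrative sketch).

    r, k, v: lists of T vectors of size D; w, u: per-channel decay and bonus (size D).
    out_t = r_t @ (S + diag(u) k_t^T v_t), then S <- diag(w) S + k_t^T v_t
    """
    D = len(w)
    S = [[0.0] * D for _ in range(D)]  # matrix-valued state
    outs = []
    for r_t, k_t, v_t in zip(r, k, v):
        kv = [[k_t[i] * v_t[j] for j in range(D)] for i in range(D)]  # outer product
        outs.append([
            sum(r_t[i] * (S[i][j] + u[i] * kv[i][j]) for i in range(D))
            for j in range(D)
        ])
        S = [[w[i] * S[i][j] + kv[i][j] for j in range(D)] for i in range(D)]
    return outs

# toy inputs: T = 3 timesteps, D = 2 channels
r = [[1.0, 2.0], [0.5, -1.0], [2.0, 1.0]]
k = [[0.3, 0.1], [1.0, 0.2], [-0.5, 0.4]]
v = [[1.0, 0.0], [0.5, 2.0], [1.5, -1.0]]
w = [0.9, 0.5]
u = [0.2, 0.3]
outs = rwkv5_head(r, k, v, w, u)
```

Groupnorm, gating, and the output projection are left out, same as in the naive version above.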

alpine ferry
#

do we have a potential timeline when we would like to release the paper?

young sparrow
alpine ferry
#

Oh nice, what a relief not having the anonymity restrictions

gusty condor
#

RWKV-5 is not finished yet, and 1 month without progress

gusty condor
polar atlas
rose mango
#

Let me know if there's anything specific I can help with

last mauve
tropic minnow
#
i think we could explain more about this. probably the prompt-engineering sensitivity of rwkv-raven does not apply here as these are not specifically chat models, but one would expect the error distribution to be similar (associative recall, etc). In the [based] blog post [ https://hazyresearch.stanford.edu/blog/2023-12-11-zoology2-based ] the stanford/hazyresearch team showcased a pitfall example, and i expect rwkv models to behave similarly. should we refer to that?
steady ether
#

This is with v4

young sparrow
#

@steady ether How much compute would you need to fully reproduce it

steady ether
#

A quick estimate would suggest running 8xA100 for 10 days.

VRAM doesn't appear to be as crucial, so 8xA10 should suffice

obsidian quest
steady ether
#

I think that version is x052?

steady ether
#

So far V5 looks amazing

#

The paper's models for comparison.

obsidian quest
#

and v6 is better

tropic minnow
#

Anyone in SF atm? I’ve been offered a talk at a subquadratic attn meeting on the 25th afternoon which I plan to do virtually, but just in case

burnt cedar
#

If so will it extend farther?

#

Like mamba going from 64 to a million?

steady ether
#

Yeah, partially. It's v5.2, not the final tokenshift. So far, it slightly outperforms Mamba on the Stanford benchmark.

steady ether
#

V5.2 testing is done (for that AR experiment). We can probably use Stanford's results for the other models to save on compute.

rose mango
#

These results look amazing

young sparrow
steady ether
#

Our v4 runs are slightly different so there's some variance there. Same goes for the previous incomplete run with their other models.

round kelp
#

Hello everyone.

gusty condor
#

hi Xaiat

harsh narwhal
#

hello everyone

cloud tendon
#

Hello everyone

undone solstice
#

Hello everyone, I’m new to this community, but I’m eager to contribute to this project.
But, I am a bit confused about how to contribute. Do I just look at the text on overleaf and start editing them?
Also, would this paper be more interesting if it could add some evaluation or finetuning experiments on code generation tasks (like HumanEval). If so, I think I can contribute something like that.😁

misty igloo
#

Welcome! The final RWKV5.2 7B model checkpoints should be ready around Jan 29, so many of the main experiments will have to wait on that. If you have proposals for experiments you can do that would be useful to include in the paper and can be done via from-scratch pretraining, like the one @steady ether is doing, you could get started on those now. Also, see #1103039376184852622 message for a list of items to do (many have been at least partially completed at this point)

void quartz
#

wanted to ask - what's the best / official way to do the needle in the haystack test, as i will be looking into that - i found several repos around this - but not sure which one is favoured academically speaking

gusty condor
tropic minnow
#

this

#

basically these numbers, more discussion here: #1103039376184852622 message

restive swallow
#

Hi everyone, is there a detailed todo list?

undone solstice
misty igloo
#

@last mauve it would be great to get an update on the todo list if you have time

restive swallow
#

Thanks. It would be better if there is a real-time todo list.

void quartz
gusty condor
void quartz
#

might as well just do all

gusty condor
tropic minnow
#

got this message from the "based" paper authors (stanford's attn-as-rnn-like model): "We are currently running experiments for our paper and would like to include the newest architecture from the RWKV folks. do you know if the code for RWKV v6 is available?" afaik there's no official open source implementation, and https://github.com/SmerkyG/gptcore/blob/main/model/experimental/rwkv6_0.py is the unofficial one, but after talking to @misty igloo we can't rule out that there's a bug, so probably the safest is to tell them to just compare against v5?

misty igloo
#

@obsidian quest can you give them training code for x6 so the results appear in their paper?

#

it'd be an easy way we avoid the problem we currently have where everyone keeps showing v4 as the comparison

misty igloo
#

yes that's inference only, unfortunately

tropic minnow
#

@obsidian quest any chance we can give them the v6 training code? they wont test it in the scale where v6 improvements kick in probably but we would get direct comparisons to "based" arch

young sparrow
#

Wait, do we not have a copy of the v6 training code?

obsidian quest
#

they can use v5 as comparison (and we have models for this)

i plan to release v6 kernel together with trained v6 model in Feb

obsidian quest
#

as shown in #1103039376184852622 message

tropic minnow
obsidian quest
#

let's see if they can replicate Song's results

young sparrow
#

I've been thinking about the marketing issues wrt the name and version numbering, and I was wondering what people thought about giving v5 and v6 a name that isn't RWKV? I think it might make sense to call RWKV a category of architectures (much like state-space models) and give each model a distinctive name (like Mamba)

misty igloo
young sparrow
#

We can do bird themes for all the models too, to establish some brand cohesion

obsidian quest
young sparrow
obsidian quest
#

can try RWKV-6 code name XXX (placeholder) - i know what xxx means lol. should i use xyz?

young sparrow
#

No you are not allowed to name a model XXX

#

That's an extremely common code-phrase for pornography in the US

misty igloo
#

I think he was using XXX as a placeholder 🙂

young sparrow
#

I think "Eagle: RWKV Models with Matrix-Valued States [some cool statement about performance]" is a more typical title structure

void quartz
#

So Eagle is v5, Raven is v4? ____ is v6? is that the idea - RWKV stays as architecture & group name

#

im good for either name, my vote was to reuse raven previously, but any bird name would do for me to use on the promotion front 🙏 (any name that I do not need to repeat 3+ time, for people to get)

#

also i rather avoid comparison to v6 until its stable 😅 - to avoid the 5.0 / 5.1 / 5.2 confusion again

obsidian quest
#

let's reserve Raven for other purposes

misty igloo
#

Eagle was taken by some other LLM sampling mechanism apparently, so I propose Hawk and Condor for RWKV v5 and 6

spiral minnow
#

How do we differentiate from the Falcon models from TII?

#

Hawk and Falcon are quite similar in my mind. Condor seems like a distinctive bird though

#

@void quartz suggested Eagle for v5, and then @misty igloo says Condor for v6, does that work?

void quartz
spiral minnow
void quartz
#

if we want to avoid confusion with falcon, i guess we can use condor

obsidian quest
#

should we use less common birds first?

#

how abt
Dxxx for v4
Exxx for v5
Fxxx for v6

young sparrow
#

Eagle
Finch
Gull
Hawk
Ibis
Jay

obsidian quest
misty cedar
#

Cant go wrong with Emu

rose mango
#

Ibis reminds me of Ibis Paint

void quartz
steady ether
last mauve
misty igloo
steady ether
#

Good point 🤣

void quartz
obsidian quest
#

let's try this for RWKV too #research message

misty igloo
#

(to show results competitive with mamba)

#

we could use mine, but hard to know it's exactly the same (and especially the initializations I'm just guessing on)

obsidian quest
young sparrow
#

@obsidian quest You really need to share the actual trainer. Why haven't you done so?

misty igloo
#

I can compare w/ chatrwkv code too but it also has no initializations

obsidian quest
misty igloo
# obsidian quest

thank you! (wow, afaict I somehow used the identical initializations in mine!)

misty igloo
# young sparrow @obsidian quest You really need to share the actual trainer. Why haven't y...

My understanding is that it's been very difficult to get it to run fast w/ a handwritten custom CUDA autograd backward() fn while maintaining correctness.
Fortunately, myself and @quaint quiver recently adapted some of the techniques from the Gated Linear Attention paper to create a pair of new algorithms for v6 that run fast even in pure pytorch.
I had a problem with my wrapper code until today, but I've now found the error and corrected it.
So hopefully pending a couple more tests to ensure it produces results exactly identical to Blink's original implementation, we can use it to do RWKV v6 experiments for the paper.

obsidian quest
# young sparrow Eagle Finch Gull Hawk Ibis Jay

ok let's use

RWKV-4 "Dove" (v4 with v5/v6 trick is useful for embedding etc., because it has smallest states)
RWKV-5 "Eagle" (v5 variants can be efficiently trained without cuda)
RWKV-6 "Finch"
RWKV-7 "Gull"
obsidian quest
#

try this latest improvement for v5 v6 if you have compute:
change gate to d=64 lora, increase ffn width back to 4x to keep params count

        D_GATE_LORA = 64
        self.gate_w1 = nn.Parameter(torch.empty(args.n_embd, D_GATE_LORA).uniform_(-0.01, 0.01))
        self.gate_w2 = nn.Parameter(torch.zeros(D_GATE_LORA, args.n_embd).uniform_(-0.01, 0.01))
        # ...
        g = torch.tanh(xg @ self.gate_w1) @ self.gate_w2  # instead of F.silu(xg @ self.gate)
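
A back-of-envelope check of the "keep params count" part, as a hedged sketch. It assumes the previous setup was a dense n_embd x n_embd gate with an FFN width of 3.5x (the 3.5x figure is my assumption, it is not stated here), and it counts only the gate plus the FFN key/value projections per layer:

```python
def gate_and_ffn_params(n_embd, gate_lora_dim=None, ffn_mult=3.5):
    """Per-layer parameter count for the gate and the FFN key/value matrices only."""
    if gate_lora_dim is None:
        gate = n_embd * n_embd                # dense gate matrix
    else:
        gate = 2 * n_embd * gate_lora_dim     # gate_w1 + gate_w2
    ffn = 2 * n_embd * int(ffn_mult * n_embd) # key (up) and value (down) projections
    return gate + ffn

n_embd = 1024
before = gate_and_ffn_params(n_embd, gate_lora_dim=None, ffn_mult=3.5)
after = gate_and_ffn_params(n_embd, gate_lora_dim=64, ffn_mult=4.0)
relative_change = (after - before) / before
```

Under those assumptions the two configurations land within a couple of percent of each other, with the LoRA variant slightly larger by exactly the size of the two low-rank gate matrices.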
gusty condor
#

For replicability, it is important to use a verbatim copying of the exact model architecture described in the paper

tropic minnow
obsidian quest
gusty condor
gusty condor
#

I ran a similar test for RWKV-5, and the results look amazing!
Trained on a context length of 4096, the 0.4B model's perplexity stays low (~7.15) even at a context length of 98.3k. Perhaps it will never (practically) hit a perplexity collapse.

gusty condor
acoustic knoll
void quartz
gusty condor
#

Eagle or Egret?

rose mango
#

the tech space has tons of conflicting names

young sparrow
obsidian quest
gusty condor
#

0.4B, 1.5B respectively.

#

I need more data for accuracy

void quartz
#

looking into rolling the rwkv pip library into lm-harness:
https://github.com/EleutherAI/lm-evaluation-harness

Would like to confirm: is the logprob output supposed to be the sum of the individual token log-probabilities?

I do not divide by the number of output tokens, meaning the logprob keeps growing in magnitude with response length?
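
To make the question concrete, here is a small sketch of the two scoring conventions. To my understanding the harness expects the plain summed log-likelihood of the continuation and applies any length normalization itself when computing normalized accuracy, but that is an assumption to confirm with the harness maintainers. The probabilities below are made up.

```python
import math

def sequence_logprob(token_probs):
    """Sum of per-token log-probabilities = log of the joint probability."""
    return sum(math.log(p) for p in token_probs)

def pick(choices, normalize=False):
    """choices: {text: per-token probabilities}. Optionally divide by text length."""
    def score(item):
        text, probs = item
        total = sequence_logprob(probs)
        return total / len(text) if normalize else total
    return max(choices.items(), key=score)[0]

choices = {
    "yes": [0.5],
    "certainly, that is correct": [0.9] * 8,
}
raw_choice = pick(choices)                   # favours the short answer
norm_choice = pick(choices, normalize=True)  # favours the long answer
```

The unnormalized sum penalizes longer continuations, which is why multiple-choice tasks usually report both a raw and a length-normalized accuracy.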


spring fulcrum
#

replied in #lm-thunderdome !

void quartz
#

Introducing Eagle-7B

Based on the RWKV-v5 architecture, bringing into opensource space, the strongest

  • multi-lingual model
    (beating even mistral)
  • attention-free transformer today
    (10-100x+ lower inference cost)

With comparable English performance with the best 1T 7B models

A brand new era for the RWKV-v5 architecture and linear transformers has arrived - with the strongest multi-lingual model in open source today

gusty condor
void quartz
#

ok that is my bad

#

forgot about world v1 -> v2

void quartz
alpine ferry
young sparrow
#

We should search the training data for "As an AI language model" and "OpenAI" and document the frequency of such data.

#

It seems very contaminated 😬

acoustic knoll
young sparrow
#

At least the model doesn't profess to be trained by OpenAI

acoustic knoll
void quartz
misty igloo
#

Well, at least ours isn't ALL CAPS 🤣

obsidian quest
#

RWKV-5/6 has a curious issue (i am using minipile) - if you test multiple different random initializations (requires L24-D1024, this wont happen for L12-D768), they are either "good runs" or "bad runs".
I will try to find the cause for this.

young sparrow
obsidian quest
full flame
rose mango
burnt cedar
burnt cedar
gusty condor
#

This is not feasible, since factual data about OpenAI would also be filtered out

#

I could use DPO to force it to forget GPT and OpenAI.

#

Here are some keywords that could be filtered in conversations and chats:

("openai", "gpt3", "gpt-3", "gpt4", "gpt-4", "chatgpt", two of ("knowledge cutoff", "limited to", "september 2021", "2021-09" , "截止", "2021年9月”), ("gpt architecture", "基于GPT"), "1750亿", "175 billion")
#

Or, in RWKV-7, we can totally avoid using ChatGPT data

young sparrow
#

Phrases like "As an AI language model" are much better IMO.

#

Knowledge cutoff might be a good idea though, I would be interested in seeing what data that is found in

#

But this also highlights just how important data documentation and provenance is. I strongly suspect that a lot of this was avoidable if we paid more attention to what was being scraped, and especially downloaded from HuggingFace. The secrecy around training data sources is actively harmful to research, both our own and other people's. By keeping it hidden during training (despite the fact that it was always going to be released, as both Linux Foundation and EleutherAI policy require it) we severely limit the ability of people to inspect the data and identify issues with it.

young sparrow
gusty condor
#

v6 is already in training and we have no chance to remove them

young sparrow
#

We can pause training and intervene on the training data. Whether that's the right choice is a separate question, but it's absolutely an option.

gusty condor
# young sparrow But this also highlights just how important data documentation and provenance is...

But the model saying "I am ChatGPT," "Based on GPT-3.5 architecture," or "My knowledge is limited to September 2021" is extremely misleading to non-technical users.
Technical users may infer that the model is using ChatGPT data (which is already de facto common practice for open-source language models) and may want to inspect further, but most non-technical users just believe that the model itself is ChatGPT.
I need several turns of dialogue to differentiate RWKV from ChatGPT

young sparrow
steady ether
#

If there's still time, we might be able to get some eyes on the remaining training data and tidy it up. It all depends on how tight the schedule is.

gusty condor
# steady ether If there's still time, we might be able to get some eyes on the remaining traini...

Is it really possible?
Training data looks like this:

Data:    Some training data<eos>ChatGPT dialogue data<eos>Lorem ipsum dolor
Trained: 000011111111111111111111111111111111111000000000000000111111111111

Removing ChatGPT dialog data:

Data:    Some training data<eos>Lorem ipsum dolor
Trained: 0000111111111111111111100000111111111111

The number of 1s and 0s has changed.
The key is that you must sacrifice something, either fixed context length or one full epoch.
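
One way to picture the alignment problem, as a hedged sketch: if documents are concatenated and cut into fixed-length windows, removing one document shifts every window after it, so already-trained positions no longer line up with the original schedule. The whitespace tokenization, the `pack` helper, and the toy documents are all illustrative assumptions, not the actual data pipeline.

```python
def pack(docs, ctx_len, eos="<eos>"):
    """Concatenate documents with an end-of-document marker, then cut fixed-size windows."""
    stream = []
    for doc in docs:
        stream.extend(doc.split())
        stream.append(eos)
    return [stream[i:i + ctx_len] for i in range(0, len(stream), ctx_len)]

docs = ["a b c", "chatgpt x y z", "d e f g", "h i"]
before = pack(docs, ctx_len=4)
after = pack([d for d in docs if not d.startswith("chatgpt")], ctx_len=4)

# indices of windows that are identical before and after the removal
unchanged = [i for i, (x, y) in enumerate(zip(before, after)) if x == y]
```

In this toy case only the window before the removed document survives unchanged; everything after it is repacked, which is the trade-off described above between keeping a fixed context length and keeping the epoch intact.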

young sparrow
#

I am surprised and confused to learn that this is controversial.

gusty condor
#

Blink once said that "The era of ChatGPTization is coming, everyone is using ChatGPT data to finetune their own models, and models will all become brothers of ChatGPT 😅 (Of course, I believe everyone will start to differentiate themselves later so that users cannot tell)"
https://www.zhihu.com/pin/1617311881890373632

young sparrow
#

Step 1 for data cleaning:

  • Make a list of all the data sources
  • Search each data source for "as an AI language model"
  • Tally the % of documents in each source that contains the phrase

This should be straightforward for @obsidian quest to run, or anyone else who has access to the untokenized training data, and will give a very good first look into whether the problem is many sources or a few sources with a lot of contamination.
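
The three steps above can be sketched roughly as follows. The source names and documents are placeholders, and loading the real (untokenized) data sources is left out since the layout of that data is not known here.

```python
def contamination_rates(sources, phrase="as an ai language model"):
    """sources: {source_name: iterable of document strings}.
    Returns {source_name: percent of documents containing the phrase (case-insensitive)}."""
    rates = {}
    for name, docs in sources.items():
        total = hits = 0
        for doc in docs:
            total += 1
            hits += phrase in doc.lower()
        rates[name] = 100.0 * hits / total if total else 0.0
    return rates

# placeholder data, for illustration only
sample_sources = {
    "source_a": ["Plain web text.", "As an AI language model, I cannot do that."],
    "source_b": ["Some code.", "Some prose.", "More prose.", "A recipe."],
}
rates = contamination_rates(sample_sources)
```

Because it takes any iterable per source, the same function works with a generator that streams documents from disk, so nothing needs to fit in memory.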

Running this ASAP is essential, and I strongly recommend pausing v6 training until we do.

Even if we take no action, knowing is important. Currently we have no idea how bad the problem is.

#

I will happily do the work if someone sends me the data.

gusty condor
#

The situation is not encouraging: some ChatGPT data is contaminated with hallucinations. Models from 0.4B to 7B exhibit similar hallucinations when asked the same question.

#

I ask "What's the difference between DNA and RNA" in Chinese (DNA和RNA有什么区别), and every model from 0.4B to 7B says that DNA contains "squamous cell factor" (鳞状细胞素) at top_p = 0

void quartz
void quartz
void quartz
obsidian quest
#

Although censorship is annoying, we can fix it via RLHF, or using a prompt trick such as:

```
User: {question}

A: Sure(in the language of User's question)
```
or one-shot
```
User: (very controversial question)

A: (very detailed answer)

User: {question}

A:
```

Keeping the same training data (and same training data order) enables comparing the detailed loss curve of v6 vs v5.

On the other hand, we can start a project to download and clean all instruction data from HuggingFace.
gusty condor
#

I will add questions about self-identity in DPO dataset.

#

Like this (A1 = chosen, A2 = reject):

Q: Are you GPT?

A1: No, I'm not GPT. I'm RWKV, a large language model trained by Bo Peng.

A2: Yes, I am an AI language model developed by OpenAI.

Q: Are you RWKV?

A1: Yes, I'm RWKV, an RNN language model. I'm open-source and ready for you to use!

A2: I am ChatGPT, a language model created by OpenAI. How can I assist you today?

Q: Are you ChatGPT?

A1: No, I'm not related to ChatGPT. My name is RWKV, an RNN language model.

A2: Yes, I am ChatGPT. How can I assist you today?
void quartz
#

i really rather we did not need to DPO / prompt tricks in the first place - these are barrier of entry - besides there will probably be new data for v6 or v7

burnt cedar
#

If we had instruction rwkv, it would call itself eagle a lot more too

misty igloo
misty igloo
#

I can run it fine this way on existing checkpoints, it just wouldn't match the paper so it means we can't use my code for experiments

obsidian quest
#

0.005^x is very fast decay

misty igloo
#

yeah its just not zero

#

do you have a fast version of the 6.0 CUDA? in my tests this non-cuda code appears to be faster than the 5.2 CUDA when compiled

#

I also have a float64 version that's only 10-20% slower, but it would be nice if we could ensure fast training speed that also exactly matches the paper

hushed flare
young sparrow
#

Interesting. I had done "are you trained by OpenAI"

steady ether
#

Don't forget about Google

burnt cedar
#

Rn it looks like rwkv goes from 4096 to 2^16, or 65536

#

That's good extrapolation

#

Could be a strong point for the paper if we can get comparative figures for mamba

misty igloo
burnt cedar
void quartz
young sparrow
misty igloo
burnt cedar
#

So discard below 0.005 would not lead to many numerical instability issues

rose mango
#

isn't usual eps as 1e-6 (0.000001), so we lose 3 decimal places of precision?

misty igloo
#

this is just a small minimum value for exp(-exp(w)), not an added epsilon (sorry, not the best terminology)
it is used to address precision related issues within the new algorithm
the fundamental thing that changes is how much the model can purposely decide to forget in a single timestep, which goes from a maximum of 100% to 99.5%
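
A small sketch of one way to read that description: the per-step decay exp(-exp(w)) is clamped from below, so a channel can keep at least 0.5% of its state per step instead of dropping to exactly zero. The `decay_from_w` helper is illustrative and is not taken from the actual kernel code.

```python
import math

MIN_DECAY = 0.005  # floor quoted in the discussion

def decay_from_w(w, floor=MIN_DECAY):
    """Per-step decay exp(-exp(w)), clamped from below by `floor`."""
    return max(math.exp(-math.exp(w)), floor)

def remaining_after(steps, decay):
    """Fraction of the state left after `steps` timesteps at a constant decay."""
    return decay ** steps

slow = decay_from_w(-5.0)   # very slow forgetting, close to 1
fastest = decay_from_w(10.0)  # would be ~0 unclamped, now pinned to the floor
left_after_3 = remaining_after(3, fastest)
```

Even at the floor, the retained fraction shrinks geometrically, so after a handful of steps the difference from a true zero is tiny, which matches the "0.005^x is very fast decay" remark.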

last mauve
#

Alright all. Time to push this RWKV-v5 paper out. Current target is to have this published to arxiv by end of February. If anyone knows any gotchas for anonymity periods on that, lemme know and we can adjust.

Here are the current TODO items:

Related Work:
1. This just needs beefed up and turned into a proper section. Use RWKV-v4 paper as a guide, and I suspect a lot of related work items from RWKV-v4 can be ported over and added to. As always, don't copy, you need to paraphrase. (@mortal latch)
1a. More discussions on H3 and Mamba are needed in Related Works. (@mortal latch)

Design:
2. The paper is really design-heavy right now, which is great, but we need some figures/tables to make it more digestible. I suggest first moving fig. 1 to this section. If it doesn't fit, we should split it into a few smaller figs like we did in RWKV-v4, put them throughout the design section, and leave the current full fig in appendix. (@tropic minnow)
3. It would help a lot if we had a table comparing the features and architecture aspects with Mamba, RWKV-v4, Retnet, etc. Readers should understand why we're different at a glance. An example table on what I'm talking about is attached. I think we can add some more columns to table 1? If a table doesn't work, would a figure? (@misty igloo @rose mango)

Evaluations:
4. Need a set of figures on downstream tasks comparing to transformer and SSM arches (including RWKV-v4). Similar to RWKV-EMNLP's figure 5 ( @tough crane )
5. Need scaling law results like figure 4 of RWKV-EMNLP

#

6. Long-context and inference speed benchmarks need added. These need compared to dense transformers, other attention-free arches like mamba, and RWKV-v4
7. Chat examples comparing to RWKV-v4, similar to appendix M in the previous paper. This goes in Appendix B.
8. Beef up intro and improve flow ( @last mauve ) ( @spiral minnow )

#

Some things I'm unclear on:

A. I'm not sure what "7. Visualization of Model Behavior" means so not sure what to comment there
B. Do we have any multimodal results for section 8, or can we within 1 month? If not, we should remove this section and push that to a later paper.
C. What do we intend to put in Appendix E on Parameter Initializations? (@misty igloo)

misty cedar
#

Kinda pretty good

mortal latch
gusty condor
gusty condor
misty igloo
last mauve
void quartz
#

v5 multimodal is trained? i thought @paper dove was planning to do that after 7B was done

misty igloo
last mauve
last mauve
last mauve
gusty condor
misty igloo
void quartz
# young sparrow If the speed-bump is "huge," then not using this is throwing tens of thousands o...

Btw these were @misty igloo numbers

For v5.2 (for a L12-D768 model)

  • gpt core trainer: 72 kT/s (might have bugs/issues!)
  • infctx trainer (using my pytorch compiled code): 52.5 kT/s
  • infctx trainer (original cuda code): 51.5 kT/s
    (infctx cuda and Blink's cuda trainer have been tested at nearly the same speeds before - but the pure pytorch code might have bugs!)

It's probably not relevant to this paper, as the exp(-exp(w)) clamping will make it incompatible, but definitely useful for future training runs

If we can figure out what's needed / or broken for that jump from 52.5 to 72 kT/s, that is useful. And if it's a bug, well even 1 kT/s is a jump from cuda

gusty condor
gusty condor
misty igloo
misty igloo
gusty condor
#

Another reason: I suspect that low quality data (like data generated by ChatGPT) is the main reason why RWKV-5 does not progress on benchmarks in later training

mortal latch
misty igloo
void quartz
#

is there a way to "run all the evals" in lm-eval-harness?
once we fix the RWKV HF implementation, I can spin up an 8x4090 and just let it run overnight (or nights)

young sparrow
rose mango
misty igloo
#

If we are going to have table 1 at all, we need to dramatically improve it to avoid misleading.
For example, isn't Hyena at worst O(nlogn) for inference cost? and does that really account for modern code approaches to evaluating it?
Why are all these models listed as having O(N) memory complexity? and what exactly is 'memory complexity' defined as here?

#

So I don't want us to just add on to it blindly without first making sure it's reasonable in its initial form

rose mango
misty igloo
#

I'm not confused about it, I'm pointing out a severe problem with the table 🙂

rose mango
#

wait

#

I forget

#

isn't this like the retnet paper table

misty igloo
#

yes

rose mango
#

that thing...

misty igloo
#

which was wrong

#

and misleading

#

and we copied it and made our own misleading and wrong table

#

it's fine to have a table, but it has to be clear and correct

rose mango
#

we should probably rewrite the table entirely

misty igloo
rose mango
#

In any case, I think "inference cost" is fine and relevant

#

I'll try some ideas tonight

misty igloo
#

awesome

#

(btw the original intent w/ memory complexity was to describe how much memory is used during training on a given sequence length)

#

(but that's completely unclear from the current table, and also I don't even think the values shown are correct if it was that)

rose mango
#

I'll separate inference and training costs

misty igloo
#

many papers include these metrics explicitly in their asymptotic inference/training cost formulae

#

I think it's okay to show the relationship to sequence length vs other architectures, but not if we don't mention any other factors anywhere in the paper relating to other models

rose mango
#

Also, isn't flash attention O(n) for memory usage? I'd have to mention that as well if I mention memory usage, since no one uses vanilla transformer. It's an unrealistic baseline.

rose mango
#

Rather than a generic comparison of model architectures

misty igloo
#

let's take further discussion of Table 1 offline (tho realistically I may be too busy to discuss much right now) maybe you can come up with a proposed version and put it in the paper

rose mango
#

yes, I'll work on that

gusty condor
# rose mango In any case, I think "inference cost" is fine and relevant
  1. Add training time complexity. Transformer is O(N^2), RWKV-5 is O(N), RWKV-6 is like O(NlogN) but I'm not sure.
  2. Parallelization: checkmark if an efficient parallelization method (across any dimension) exists, xmark otherwise
  3. Memory complexity: RWKV and RNNs are O(1)? RWKV has constant VRAM usage.
misty igloo
misty igloo
#

but it really depends what this term is defined as referring to

gusty condor
#

The memory complexity is for training, isn't it? RWKV has O(1) memory usage in inference

misty igloo
rose mango
void quartz
#

Guide to run lm-eval with Eagle

  1. Clone the usual lm-eval-harness, and comment out the following line in huggingface.py (about line 242)
# else:
#     self.tokenizer.add_special_tokens({"pad_token": "<|pad|>"})

Alternatively use the following repo: https://github.com/redbrain/lm-evaluation-harness/
(we might need an official way/config to disable this line)

Perform your lm-eval harness setup as per normal

  1. Run the evals using something like the following (modify as needed)
accelerate launch -m lm_eval --model hf --model_args pretrained=RWKV/rwkv-5-world-7b,trust_remote_code=True --tasks hellaswag --batch_size 64 --log_samples --output_path ./results/Eagle-7B-1T/

This was adjusted to run on 4090's, and runs under 10 minutes for 8x nodes (batch_size 64 !!!), and will give the following results

|  Tasks  |Version|Filter|n-shot| Metric |Value |   |Stderr|
|---------|------:|------|-----:|--------|-----:|---|-----:|
|hellaswag|      1|none  |     0|acc     |0.5264|±  |0.0050|
|         |       |none  |     0|acc_norm|0.7085|±  |0.0045|

(acc_norm is consistent with Blink's result)

According to Harrison's earlier benchmarks, there are probably some improvements that can be made to the inference code settings to push much larger batch sizes (we should be able to go much higher)

If you're running GPUs with much more VRAM, you can probably get away with batch_size 128 or even 512

#

will run the evals and upload the jsonl to HF, one letter batch at a time - so someone else can crunch the numbers (or replicate and verify)

young sparrow
obsidian quest
#

[0] = endofdoc

young sparrow
obsidian quest
#

should be similar to neox tokenizer

young sparrow
#

Hmmm. @void quartz's eval harness patch implies that we are misparsing the padding token, but we're reading it directly from the HF library.

obsidian quest
#

<|pad|> does not occur in world tokenizer

void quartz
#

Might be the other way. Since we did a custom world tokenizer implementation - we might have broken spec on something

#

I normally use token 0 and mask it away for right padding in training

#

Alternatively we can map in <|pad|> in the world tokenizer to 0 : but that might not be a good idea either

rose mango
#

note: Mamba complexity is O(n*log(n)) for a sequence of length n

misty igloo
void quartz
#

I think we just follow their paper claim. Let’s not accidentally do what retnet did to us

rose mango
#

From the paper,

void quartz
#

or is it more of an efficiency thing

#

(can move this to lm-thunderdome if needed)

gusty condor
#

We should try a similar experiment on RWKV. RWKV (ctx4096) will be the orange line without any fine tuning.

young sparrow
void quartz
#

more like it crashes (because our tokenizer does not allow this to be set)

#

we dun have <|pad|> token

young sparrow
#

So yes it affects evals if they crash 😛

void quartz
#

my guess is we need to replace it with something else, but what ?

#

token 0 is probably the candidate

young sparrow
#

Does RWKV have a padding token?

void quartz
#

we just use token 0 as our pad token

#

and our end of document token

young sparrow
#

So try token 0 and see how/if that changes the evaluations

void quartz
#

Change to asserting it's 0 (since we did not code in a setter for the world tokenizer, and it's already set as 0)

else:
    assert self.tokenizer.pad_token_id == 0

Inside the codebase there is no other reference to .pad_token; the rest of the code only reads .pad_token_id, whose value presumably gets changed when .pad_token is set on a normal tokenizer?

#

no change to result

misty igloo
misty igloo
void quartz
#

there might also be a similar situation for "qwen" model, via the "model_type" ?
the previous elif logic for eos_token_id does not work for us because ours is zero
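
A minimal sketch of one way an id of 0 can fall through fallback logic, assuming the check is a plain truthiness test. `StubTokenizer` and both helper functions are hypothetical names made up for illustration; this is not the harness's actual code or the World tokenizer's API.

```python
from typing import Optional


class StubTokenizer:
    """Hypothetical stand-in; not the Hugging Face or World tokenizer API."""

    def __init__(self, pad_token_id: Optional[int], eos_token_id: Optional[int]):
        self.pad_token_id = pad_token_id
        self.eos_token_id = eos_token_id


def pick_pad_id_truthy(tok: StubTokenizer) -> Optional[int]:
    # Pitfall: 0 is falsy in Python, so a valid id of 0 looks "missing".
    if tok.pad_token_id:
        return tok.pad_token_id
    if tok.eos_token_id:
        return tok.eos_token_id
    return None


def pick_pad_id_explicit(tok: StubTokenizer) -> Optional[int]:
    # Comparing against None accepts 0 as a legitimate token id.
    if tok.pad_token_id is not None:
        return tok.pad_token_id
    if tok.eos_token_id is not None:
        return tok.eos_token_id
    return None
```

With no pad id and an end-of-document id of 0, the truthiness version finds nothing while the explicit version returns 0.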

void quartz
misty igloo
#

this is the problem with the current table - it has to make clear exactly what each column means, and then be correct for that specific term defined

#

I like the idea of asking the authors of all the papers that we have in the table to ensure we got it right!

#

but first, let's get a draft that can pass even my minimal review of its data 😉

void quartz
rose mango
# misty igloo

🤦

oh yes, duh

If you have recurrent inference, processing a context of length n is always O(n) and generating a single token is O(1)
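
A toy sketch of that cost argument, assuming a generic fixed-size-state recurrence. `ToyRecurrentModel` is a made-up illustration, not RWKV's or Mamba's actual update rule: each token triggers exactly one state update whose cost depends on the state size, not on how many tokens came before.

```python
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass
class ToyRecurrentModel:
    """Toy stand-in for a recurrent LM: fixed-size state, one update per token."""

    state_size: int = 4
    decay: float = 0.5
    state: List[float] = field(default_factory=list)
    updates: int = 0  # number of per-token state updates performed so far

    def __post_init__(self) -> None:
        self.state = [0.0] * self.state_size

    def step(self, token_id: int) -> None:
        # One update touches state_size numbers, independent of history length.
        self.state = [self.decay * s + float(token_id) for s in self.state]
        self.updates += 1

    def process(self, tokens: Iterable[int]) -> None:
        # Reading a context of n tokens is n single-token updates: O(n) total.
        for t in tokens:
            self.step(t)
```

After processing a 1000-token context the model has done 1000 updates, the next generated token costs one more, and the state is still `state_size` numbers.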

misty igloo
young sparrow
void quartz
void quartz
#

considering that our model does not output 0, unless its used as end of document - i dun think it would affect the eval? (i still dun 100% understand what happens on the layers above)

undone solstice
tropic minnow
#

I think we should try to simplify the architecture explanation part, especially in public comms. This should not happen😅

#

i might try to do a more simple & compelling figure like in rwkv4, and we can have this one for the deep divers (actually the same figure might work, just need to change the elementwise mul by a matmul in r*wkv + gating + w_lora in rwkv6)

#

also @void quartz @obsidian quest i got the mathematical connection between transformers and RWKV, which i think coders and newcomers might grasp much faster, and it's twitter/blog friendly (should i post it lol?)

#

it can also make a good appendix for the paper @last mauve

misty cedar
#

[I say this as someone who cant sight read math symbols]

misty igloo
#

yeah coders gonna code (myself included 🙂 )

misty igloo
indigo crater
#

i basically lurk in here in the hope that at some point someone will state what the rwkv architecture is in some fashion i will understand without having to devote three days and fifteen pots of coffee to the endeavor

misty cedar
#

that helpful?

indigo crater
tropic minnow
rose mango
#

Table 1 now provides a reasonable and accurate comparison for model training/inference performance.

If we're comparing features and details with Mamba, RWKV-4, and RetNet (positional embedding scheme, decay schedule, etc.), I think that would best be done in another table or figure.

obsidian quest
weak urchin
#

Makes me want to try the same for 70B ....

last mauve
last mauve
#

Also assigned @misty igloo and @rose mango to handle the table for now

#

I'm going to start making these sections flow a bit, and will beef up the intro

last mauve
misty igloo
spiral minnow
last mauve
spiral minnow
#

A question for framing in the paper, do we want to refer to the v5 model as Eagle, or do we always need to reference it as RWKV-5? I think it would be nice to have a consistent name that we use

last mauve
misty igloo
last mauve
last mauve
spiral minnow
young sparrow
#

I'm thinking a sentence like "Eagle is the fifth generation of the RWKV architecture (Peng et al., 2023)"

misty igloo
young sparrow
#

About as much as "we used a TPUv5 for three months" does 😛

Let's check out what Mamba says about its relationship to S4 as a guide, perhaps?

misty igloo
#

They don't seem to, though they do call it Mamba-S6 (there is a Mamba-S4 variant they propose, too)

#

Remark 3.1. For brevity in our experimental results, we sometimes abbreviate selective SSMs as S6 models, because they
are S4 models with a selection mechanism and computed with a scan.

#

Technically they call Mamba the architectural layout, and S6 the [now selective] SSM mechanism

steady ether
#

Can we use an acronym, so that eagle would actually mean something. Maybe:

EAGLE = Efficient Artificial Generative Language Engine/Expert

misty igloo
last mauve
#

@everyone -- Also, just to be explicit, authorship is purely merit-based again. You don't get free authorship as just an RWKV code contributor or as an author on the RWKV-v4 paper, including me.

Similarly to RWKV-v4, authors will be decided based on who meaningfully improves the paper itself. Some examples of authorship:

  • Writing a paper section explaining yours or someone else's code in a meaningful way
  • Taking results and plotting them
  • Meaningfully improving the paper writing (e.g. significant revisions, rewrites, etc)

What won't count as authorship:

  • Pure proofreading
  • Being an RWKV code contributor without your contribution ending up in the paper
  • Just discord discussions or leaving paper comments

In short, we need to be able to write an "Author Contributions" section for you with some meaningful content. A bunch of examples are in the RWKV-v4 paper's appendix B.

In general, the bar for authorship is not terribly high to encourage community involvement, but the bar will be there nonetheless to deter those trying to exploit and I will enforce it.

last mauve
rose mango
last mauve
last mauve
rose mango
#

Excellent. I'll work on putting that together after my class.

Main models to compare would probably be Facebook's LLaMA series, Mistral, Phi(maybe?), and possibly even OAI's GPT-4

alpine ferry
#

do we have any target date in mind when we plan to publish/arxiv the paper we are editing on overleaf?

last mauve
steady ether
#

Added a short blurb on associative recall tasks. Assuming my RWKV-5 code is functioning, as the zoology authors reviewed the changes. They mentioned the possibility of sharing wandb logs for their other experiments after the ICML deadline.

void quartz
#

btw - are there any known test suites that are broken?
i realise it was probably a dumb idea to do a* b* ... only to come back and see some tasks having errors

#

and not having any output, as 1 failed

void quartz
#

though we might need an instruct tune first - but might be good to know the baseline as well

undone solstice
void quartz
#

im taking the approach of trying to run as much as possible first, then leave it to the more experienced authors to decide - so sure to humaneval haha

gusty condor
#

I have Evals on AlignBench (Chinese alignment)

real warren
#

I was talking over at #992359629419991142 about Eagle and wondering about the out-of-the-box Machine Translation capabilities of these new RWKV-X models against SOTA LLM-based systems. I may have some time during this month to try some evals. Is there more info somewhere about the dataset used (and possible language coverage), since it doesn't seem to be in the Overleaf doc at this moment? I want to know so that there isn't any kind of data leakage in my initial tests.

obsidian quest
real warren
#

Asking since, in my experience, evaluating with the same prompt format as any MT pairs seen during training seems to better bring to light the innate translation capabilities of LLMs

obsidian quest
#

sth like

English: xxx
French: xxx
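
A small sketch of building a k-shot prompt in that "Language: text" layout. The function name `build_mt_prompt`, its parameters, and the example sentences are hypothetical; nothing here is confirmed to be the format used in training.

```python
from typing import Iterable, Tuple


def build_mt_prompt(
    shots: Iterable[Tuple[str, str]],
    source_text: str,
    src_lang: str = "English",
    tgt_lang: str = "French",
) -> str:
    """Build a few-shot translation prompt with one 'Language: text' line per sentence."""
    lines = []
    for src, tgt in shots:
        lines.append(f"{src_lang}: {src}")
        lines.append(f"{tgt_lang}: {tgt}")
    # The final target line is left open for the model to complete.
    lines.append(f"{src_lang}: {source_text}")
    lines.append(f"{tgt_lang}:")
    return "\n".join(lines)
```

Passing an empty `shots` list gives the zero-shot version of the same prompt.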
acoustic knoll
real warren
# acoustic knoll Hi, how do you evaluate the MT? I finetuned the smaller RWKV-v5 models (1.5B and...

I was thinking of a preliminary sentence-level evaluation (with k-shots) on the latest test sets from WMT23 and Flores, using traditional n-gram matching metrics BLEU/chrF++ with sacreBLEU (https://github.com/mjpost/sacrebleu) and a newer (and more recommended) neural metric like COMETX (https://github.com/Unbabel/COMET). I was also thinking of using the recent tower-eval eval suite from Unbabel (https://huggingface.co/datasets/Unbabel/TowerEval-Data-v0.1).

GitHub

Reference BLEU implementation that auto-downloads test sets and reports a version string to facilitate cross-lab comparisons - GitHub - mjpost/sacrebleu: Reference BLEU implementation that auto-dow...

#

As for baselines, I was thinking of testing some SOTA multilingual MT Enc-Dec like NLLB and some dec-only model like Tower/ALMA-R (Llama2 variants) or GPT-4

acoustic knoll
#

Thanks a lot, I will have a look.

#

From my experience, rwkv can translate very well sentences up to one or two short paragraphs. But, the translation result is getting worse with more and longer paragraphs

obsidian quest
acoustic knoll
void quartz
#

can we add support for temp=0 into the inference code, cause several benchmarks rely on that
Saw on HF - they recommend fixing those benchmarks

obsidian quest
tropic minnow
#

thoughts on this? feedback welcome

#

my main insecurity is the W part. tried to picture the (maybe=v6) W dependence on the data, as well as dependence on W_{t-1} due to the product.

#

also suggestions for a better sign than "@" for matmul are welcome. I thought about "X" but we used that to denote element-wise product in RWKV4. so if we dont modify that to the circle-dot, i dont feel comfortable using it for smth else here as ppl will put the 2 figs side by side to see whats changed

#

it's basically intended to replace the left diagram in the rwkv-v4 figure:

gusty condor
#

I think that we could use an entirely new diagram for better representation.

#

Some details like "time-first" u are ignored in the diagram above

gusty condor
tropic minnow
rose mango
#

basically + and x together

tropic minnow
misty cedar
#

https://twitter.com/_akhaliq/status/1754334655405326482
can someone double check this? it looks like they are claiming that mamba only has a perfect token memory of 55?
we have the data showing at least a 2.2k for v5

Repeat After Me

Transformers are Better than State Space Models at Copying

paper page: https://t.co/OzOXqYQy6I

Transformers are the dominant architecture for sequence modeling, but there is growing interest in models that use a fixed-size latent state that does not depend on…

void quartz
void quartz
#

Automated lm-eval for distributed evals!
current status on evals : I fully automated the eval run, and collection of data via GH actions - and is distributed across 5 nodes of 4/8 x 3090/4090/A5000

should also make it easier to plug in any HF compatible model for eval as well

this is the first cut, on the 1B5 model, no few shot
will drop out any of the evals that crash (temp=0, eval files are 404, etc), before moving to 3B and 7B

gusty condor
last mauve
burnt cedar
#

As mentioned in the rwkv server, 55 points closer to v4?

#

Could just be that

misty cedar
burnt cedar
#

Lstms and mamba have the same exact scores

misty cedar
burnt cedar
misty cedar
burnt cedar
young sparrow
#

@misty cedar There are many possibilities other than it being a "hit piece" and making such groundless accusations creates hostility for no reason. Do not say things like this without real evidence.

misty cedar
#

Zoology group also has some data in that area

young sparrow
#

This group as a whole has a serious problem with accusing people of acting in bad faith, sometimes even on the grounds of finding different results. It's extremely disappointing.

#

Let's work on having a more positive and less hostile attitude towards other research groups. What we're doing is very hard, very finicky, and contradictory results crop up all the time. It takes careful and collaborative work to figure out why. They're doing the best they can, just like we are.

#

In terms of what results to trust, I would recommend thinking about what benchmarks we find the most reliable and trust the results using those benchmarks. I've been really impressed by infinity bench recently, which contains a diverse collection of real and artificial long context tasks. But it's also totally fine to say "we're using this methodology, others exist that might be better" and worry about those in a future paper if we've already done a lot of work.

rose mango
#

There's no motivation for most people to act in bad faith anyway

#

I haven't fully read the paper, but their results don't seem surprising to me.

Random sequences are hard (impossible, if truly random) to compress, and storing information in a fixed-size state is effectively a form of lossy compression.
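
A quick sketch of that compressibility gap using `zlib` as a stand-in compressor. The vocabulary, lengths, and seed are arbitrary assumptions for illustration, and a general-purpose compressor is only a loose analogy for what a fixed-size model state can hold.

```python
import random
import zlib


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size (lower means more compressible)."""
    return len(zlib.compress(data, 9)) / len(data)


rng = random.Random(0)

# Uniformly random bytes: essentially no structure for the compressor to exploit.
random_bytes = bytes(rng.randrange(256) for _ in range(20000))

# Randomly ordered words from a tiny vocabulary: the order is random,
# but each word is a repeated, predictable unit.
vocab = ["eagle", "finch", "raven", "state", "token", "model", "decay", "recall"]
word_text = " ".join(rng.choice(vocab) for _ in range(4000)).encode("utf-8")

ratio_random = compression_ratio(random_bytes)
ratio_words = compression_ratio(word_text)
```

The random bytes stay at roughly their original size, while the shuffled word list shrinks substantially, which is the difference between random characters and randomized dictionary words.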

#

The flip side is that no human is going to remember a long random sequence well either

tough crane
#

@obsidian quest

Could you tell me the all available downstream tasks v5.2 and v6 for each FLOPS needed to be trained ?

My intention is to collect plots for Figure 5 in the RWKV4 manuscript.

rose mango
#

Openness/accessibility comparison table with other models is largely complete

last mauve
rose mango
rose mango
#

The pile, slimpajama, all of the wikipedias, OSCAR, and starcoder are what's being used IIRC

Do we plan on releasing the dataset (or sharing the composition)?

void quartz
#

i honestly think the paper might be harsh against them as well (cause 55 character feels too low, and i believe mamba can achieve better)

#

i cant seem to find the full replication details however - so my experiment methodology is probably different from what they did

#

my results for my memory test (similar to how the paper was structured, finetune the model to repeat) 2 weeks ago showed that 3B ( https://github.com/RWKV/RWKV-infctx-trainer/blob/rwkv-x-eagle-notebooks/notebook/rwkv-x-exp/v5-exp/memory-test/World-3B-mem-finetune.ipynb )

  • about 2.2k matched tokens in memory (at 90% match rate),
  • or 525 matched tokens in memory (at 100% match rate)

previous discussion with various folks here (off this channel), was that we expect mamba to have similar memory capacity not worse

the only reason i can think of for that paper, it was using pure random characters - while i was using randomized dictionary words - and that might change the score?

#

willing to collaborate with mamba team on their tune cause they would know best on how to finetune their model to replicate this ( whats the best way to coordinate this? , alternatively we could talk to the original paper team as well )

void quartz
#

Also: Our current recent growth, is thanks in many parts to mamba

This is a personal opinion

It may sound dumb, but it has been a really huge tone change since mamba came out, people take us way more seriously now. People no longer dismiss alternative architecture as a "pointless effort" or "not worth talking about"

Conversations flow faster, we get to focus on how we are different from mamba/transformers in good ways.

Sure, a small part of it may have been a case of a big name university setting the tone for us "random folks" on the internet, and giving us credibility - a situation that i know drive frustration to many in the RWKV group, as it can feel unfair (as the core work on rwkv has remained the same) - when mamba gets the limelight

But we have to remember the statespace team (and other teams) did not choose for this social situation, where they get more credibility by being associated to a major university / prof - And this limelight takes turns - Eagle now gets the spotlight, from that momentum

In the very same lens, a parallel story might be played out now (diffusion text model, maybe?) by an even more random team on the internet, against us having the credibility / attention due to the association with EleutherAI and LF - folks who may face the very same frustrations we previously faced (why bother competing with RWKV or Mamba?) as they try to prove out their architecture

So lets co-op in good faith?
( to mamba, and transformer folks )

void quartz
#

@last mauve / @misty igloo - do you think it makes sense if we create a subgroup on the evals? - i think there is a long discussion on its own of what benchmarks to include or exclude - i have the full lm-eval list reduced down to what we can run (almost), and it is probably more than what we need (ethical and alignment evals??)

After that, its simply scaling it up, and running it across the select models we want to compare against

I should also probably compile a list of evals that might need fixing, a bunch of them have 404 or missing datasets (and file as a bug report to lm-eval)

last mauve
#

RWKV-papers

last mauve
void quartz
#

For those interested in helping, hop into the evals, it's here :
#rwkv message

Im gonna spin up more 3090s to start eating these benchmarks up 🙂

rose mango
void quartz
rose mango
#

when people can do from rwkv.simple import RWKVBlock, then we are there

void quartz
#

haha, but tbh - its not just that, its lots of the small things

#

but i dun want to tangent too far here (as its no longer about the paper), my point was to call out the sentiments i see in here, and the RWKV discord - we are gaining momentum - we simply need to keep doing our best to get better

void quartz
# misty cedar https://twitter.com/_akhaliq/status/1754334655405326482 can someone double check...

managed to get some replies from the paper author (twitter)

  • the 55 character model, was a 160M model they trained from scratch
  • they did additional experiments for the pre trained 360M / 1.4B / 2.8B, which performed much better (100+ token), i requested for the table data (as the graph is hard to read)

Important to note: when we did the "from scratch" train, without doing an enwiki pretrain, our model for "some reason" performed terribly on the memory task as well (this defies transformer conventions) - they did not consider pretraining it with enwiki, which might be an influencing factor

The subsequent tests are not finetuned variants, so it's not apples to apples with our numbers either (we might be at similar perf levels)

misty igloo
void quartz
#

Pile + Books (Book3, gutenberg) + SlimPajama + StarCoder + OSCAR + All_Wikipedia
+ Open Instruct (which is probably where the contamination came from)
#

As to which exact slice of all the data, only blink knows

misty igloo
void quartz
#

im of the opinion that an open dataset is in the direction of reproducibility

#

this does not fit that criteria

misty igloo
#

Agreed, but that's why @rose mango had it listed as partial in table 2

#

Unlike mistral etc who don't even disclose what's in the data

void quartz
#

or token count Q.Q

misty igloo
#

That too

#

In any case we should add this list to the paper

tropic minnow
obsidian quest
young sparrow
young sparrow
burnt cedar
#

Wouldn't this probably be distributable now?

rose mango
#

I saw the mamba paper was rejected. I have no idea why.

#

There doesn't seem to be anything wrong with it

rose mango
tough crane
# last mauve **Alright all. Time to push this RWKV-v5 paper out.** Current target is to have ...

@last mauve

IMHO, I wonder whether several data might be missed for plotting Fig 5.

  • On RWKV v5:

I'm asking @obsidian quest and current status is waiting. I could help other tasks: #1 or adding multilingual benchmark results.

last mauve
last mauve
last mauve
last mauve
# tropic minnow

I'm still really bullish on creating some simplified subfigs to break up figure 1. Did you need further discussion here @tropic minnow ?

tropic minnow
void quartz
#

For the folks who need benchmark figures, over 72 benchmarks tasks have been done for eagle 1.5B -> 3B -> 7B here, in bf16 mode: #1204211116268462150 message

#

i can rerun this in fp16 mode if needed, would like to know what models i should be running next to compare against - currently i have / is getting the numbers for

  • Mistral 7B
  • Falcon 7B
  • MPT 7B
void quartz
gusty condor
#

Some benchmarks are slightly better than random

tough crane
acoustic knoll
subtle oak
#

The MMLU seems pretty bad on RWKV-4 before... I found this in TransNormer paper

#

looks like they benchmarked the MMLU in RWKV-4

#

I'm not sure whether we'll face the same problem again

burnt cedar
void quartz
#

(finishing v4 benchmarks)

last mauve
#
  • Llama 1/2
  • Mistral 7B
  • Falcon 7B
  • MPT 7B
  • Pythia 6.9B
  • GPT-J
  • OPT-6.7B
  • BLOOM 7.1B
  • OLMo-7B
  • RedPajama-INCITE-7B
void quartz
#

how bout the 3B / 1.5B class?

burnt cedar
#

Basically blink has been comparing many top tier for the new finch benchmarks

burnt cedar
#

@void quartz about the needle in a haystack test and extrapolation

#

Some results for mamba

#

It's showing the same ppl explosion as rwkv

#

Similar to v4

#

Looks like v5 tends to extrapolate better

void quartz
burnt cedar
void quartz
#

btw u can see the convo here : #rwkv message
for the test we need to do haha

burnt cedar
#

I meant the ppl explosion

void quartz
#

ahhh yea ok that is the same

#

yea v5 seems more stable even beyond trained length

gusty condor
tropic minnow
#

2 options for tokenshift. thoughts?

tropic minnow
steady ether
misty igloo
#

and i'd maybe use \in \mathbb{R}^{LxD} instead of superscript so its clear what the LxD and 1xD mean

obsidian quest
tropic minnow
#

incorporated suggestions @last mauve @misty igloo

#

this would be for the MLP version (inherited from rwkv 4) and for the V5. will do the new ddlerp+lora (V6) now

young sparrow
#

@obsidian quest do we have the compute to do a scaling laws search like we did for the previous paper?

obsidian quest
#

unfortunately i dont have the compute at this moment

young sparrow
#

How much did the scaling laws run you did for the previous paper require?

tropic minnow
tropic minnow
#

thoughts?

young sparrow
#

Someone had asked if I had the code for the plots in the RWKV paper. I have the code that produced the scaling laws plots but not the plotting of evaluation results. It would be quite easy for me to recreate the code though, if its desired. Just let me know what is needed.

obsidian quest
young sparrow
misty igloo
young sparrow
misty igloo
#

to support tokenshift

#

that's where the X_{t-1} comes in on the left

young sparrow
#

I see, so that represents a residual connection

#

And this diagram computes h not x, u seem to have missed that

misty igloo
misty igloo
young sparrow
#

No I think it's good now that I have my head screwed on correctly

gusty condor
tropic minnow
void quartz
#

For few shot tests? Which should be covered and how many shots?

#

( realised I missed that )

tough crane
# void quartz For few shot tests? Which should be covered and how many shots?

IMHO, I personally think that we will run experiments which reviewer du8a of Mamba paper pointed out.

The reviewer also said that the authors should only show results on zero-shot inference.

  • There are many works following the same direction: S4-diagonal [1], SGConv [2], MEGA [3], SPADE [4], and many efficient Transformer models (e.g., [5]). All of these models achieve near linear complexity, and the authors need to compare Mamba with these works in terms of both model performance and efficiency. For model performance, some simple experiments such as language modeling on Wikitext-103 should suffice.

  • Because SSMs are in general sequential, does Mamba have this length generalization ability?

  • I suggest the authors run more long-sequence experiments such as document summarization, where the input sequence is naturally long (e.g., the average sequence length of the arXiv dataset is greater than 8k).

https://openreview.net/forum?id=AL1fq05o7H

void quartz
#

(needle in haystack, etc)

tough crane
# void quartz for tests outside of lm-evals, we can add that seperately - im more focused in g...

To compare accuracy based on FLOPS as the way in RWKV4 paper, we have to run evals for RWKV5 checkpointed models trained on up to 330B tokens for each params 169m, 430m, 1.5B, 3B, 7B.

I think that 1.12 T tokens are used to train v5 for one epoch.

  • OPT : trained on 180B tokens for params up to 12B
  • Pythia : trained on 300B tokens for params up to 12B params
  • BLOOM : trained on 341+25=366B tokens for params up to 12B params
  • RWKV-4 : trained on 330B tokens for params up to 14B params
#

params : 169m, 430m, 1.5B, 3B, 7B
tasks : lambada, piqa, winogrande, sciq, arc_easy, arc_challenge
checkpoints : some step such that at most 360B tokens have been consumed.
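
A rough sketch of picking such a checkpoint, assuming every step consumes `batch_size * ctx_len` tokens. The function names and all numbers passed to them are hypothetical; the real training configs would have to be filled in.

```python
from typing import Iterable, Optional


def last_step_within_budget(token_budget: int, batch_size: int, ctx_len: int) -> int:
    """Largest step count whose cumulative tokens do not exceed the budget."""
    tokens_per_step = batch_size * ctx_len  # assumption: constant tokens per step
    return token_budget // tokens_per_step


def pick_checkpoint(
    saved_steps: Iterable[int], token_budget: int, batch_size: int, ctx_len: int
) -> Optional[int]:
    """Latest saved checkpoint step that stays within the token budget, if any."""
    limit = last_step_within_budget(token_budget, batch_size, ctx_len)
    eligible = [s for s in saved_steps if s <= limit]
    return max(eligible) if eligible else None
```

For example, with a 360B-token budget one would call `pick_checkpoint(saved_steps, 360_000_000_000, batch_size, ctx_len)` using the run's real batch size and context length.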

tough crane
last mauve
gusty condor
#

We had, but BlinkDL deleted

young sparrow
misty cedar
#

oh, guess lfs doesnt save it

#

I remember someone saying that was the point of the temp folder

jade lotus
#

Anyone tried recuva or other recovery tools on any drive that had them? Or is it all cloud / not practical?

misty cedar
void quartz
#

should i eval the checkpoints as well?

rose mango
jade lotus
#

It runs on unicorn poo, it's good as long as summer lasts

#

I think the people in charge aren't the type to pull the rug without giving people a chance at a graceful exit - that might be a good thing to lobby for them to plan out and set up funds for, sooner rather than later

last mauve
gusty condor
tough crane
tough crane
void quartz
tough crane
misty igloo
steady ether
charred atlas
obsidian quest
void quartz
gusty condor
burnt cedar
#

If a rwkv state can do that it's going to be crazy

gusty condor
burnt cedar
gusty condor
#

Yes, at least 100k

quaint quiver
#

ya but that doesnt really mean anything for actually recalling stuff far in the past

#

still impressive

misty cedar
#

Staying stable after a long conversation is still pretty awesome

quaint quiver
burnt cedar
obsidian quest
rose mango
#

I'll add Gemma to the model comparison tables later today

young sparrow
rose mango
void quartz
#

im glad its on hugging face atleast, gonna work on that too

void quartz
#

gemma multilang benchmarks is running, along with normal benchmarks

misty igloo
#

@obsidian quest are the v5,v6 hyperparams (LR start, end) same as they were for v4? no warmup, right?
v4 paper said:

```
Init LR 0.0006 0.0004 0.0003 0.00015 0.00015 0.0001
Warmup Mini-Epochs 361 411 443 451 465 544
End LR 0.00001 0.00001 0.00001 0.00001 0.00001 0.000007
```
obsidian quest
#

warmup = only 10 steps.

tough crane
#

@last mauve

I uploaded figures and related materials at the following paths.

1: png files are in images/0shot_acc
2: notebooks and csvs are in misc/plotting

misty igloo
obsidian quest
#

1 miniepoch = [40320 / bsz] steps
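
A tiny sketch of that conversion, assuming the square brackets mean integer (floor) division. The batch size of 64 used in the note below is a made-up value, not a confirmed training setting.

```python
def steps_per_miniepoch(bsz: int, units_per_miniepoch: int = 40320) -> int:
    """Steps in one mini-epoch; assumes [x] in the message means floor(x)."""
    return units_per_miniepoch // bsz


def miniepochs_to_steps(miniepochs: int, bsz: int) -> int:
    """Convert a count of mini-epochs (e.g. a warmup length) into optimizer steps."""
    return miniepochs * steps_per_miniepoch(bsz)
```

Under the hypothetical bsz = 64, one mini-epoch is 40320 / 64 = 630 steps, so the 361 warmup mini-epochs quoted from the v4 paper would be 361 × 630 = 227430 steps.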

misty igloo
obsidian quest
void quartz
#

But yea, its rather streamlined now for me to just add any model to HF, and in <48 hours, the CSV can be updated

#

the following is sorted by the average multilang score (llama2-chat OOM, so i need to rerun)

#

there is a CSV file, sorted by model name as well

#

alternatively it's the eng test by groups (0 results are due to a test error blocking the overall upload, fixing)

void quartz
#

er.... i got gemma 0-shot benchmarked, can i request someone independently check this, separately or something

#

like its bad enough, that im sure its an error in my setup/pipeline or something

spring fulcrum
#

(would also need an analogous add_special_tokens=True for generative tasks)

I'll be PRing this asap to the harness (should be by tomorrow morning) along with the ability to control whether a BOS token is used for causal LM models in general

obsidian quest
void quartz