d_ff/d_model + swiglu tests | EleutherAI | Page 2

dawn vine Jan 22, 2024, 5:20 AM

#

I can share these extra experts across layers, per your idea (that everyone seems to want to try)

fallen spear Jan 22, 2024, 5:34 AM

#

it does seem like a free lunch

dawn vine Jan 22, 2024, 5:39 AM

#

fallen spear it does seem like a free lunch

have you seen anyone use 'attention experts' to similarly slice up q,k,v generation?

fallen spear Jan 22, 2024, 5:40 AM

#

dawn vine have you seen anyone use 'attention experts' to similarly slice up q,k,v generat...

there is a paper that i and several other people remember reading that showed that ff was better

#

none of us remember what it is

#

it was probably a while ago

dawn vine Jan 22, 2024, 5:41 AM

#

thanks!

#

still some question of whether it might be useful in the context of DeepSeek slicing tho and plus your all-layers idea

fallen spear Jan 22, 2024, 5:44 AM

#

it is probably worth revisiting. since none of us remember which paper it was it is probably kind of old

#

it might be in the bibliography in the moe reading group channel somewhere

dawn vine Jan 22, 2024, 5:44 AM

#

maybe a better use of parameters would be how they do it in Mamba, where they unify FFN and Attn into a single block, then double layer count

#

then you can apply the experts to both at once

#

(expansion and contraction matrices constitute an 'expert' here)

fallen spear Jan 22, 2024, 5:45 AM

#

if you slice it maximally your "experts" are all single vectors and a single vector is agnostic for all purposes except magnitude

#

i shouldn't say magnitude, i guess i mean 'dimension' -- how many scalars are in it

dawn vine Jan 22, 2024, 5:46 AM

#

yeah its true fundamentally 'where its most useful' is kind of a perpendicular concern to how MoE is done
which is nice

fallen spear Jan 22, 2024, 5:46 AM

#

dawn vine yeah its true fundamentally 'where its most useful' is kind of a perpendicular c...

sort of but in practice no since nobody is deranged enough to use a single expert interchangeably in multiple places yet

#

just, in principle if sharing "experts" around you can put them anywhere they are correctly sized

#

whether it is a good idea is more difficult to determine

dawn vine Jan 22, 2024, 5:53 AM

#

btw, I know you're hardware challenged so maybe this doesn't matter, but in case it proves useful, it was hard-won knowledge for me: don't bother running MoE on multiple 4090s unless they're NVLinked - you have to use cards where Nvidia doesn't hobble the cross-card bus comms, or the communication will destroy your speed completely

fallen spear Jan 22, 2024, 5:55 AM

#

i'm on 3090s now, nvlink is slow to show up, my primary current bottleneck is being easily distracted by devopsy stuff

dawn vine Jan 22, 2024, 5:56 AM

#

well any consumer level card they make all the comms go across the CPU or some crazy thing, and it makes MoE untenable across multiple cards

fallen spear Jan 22, 2024, 5:56 AM

#

yeah, unsurprising. i guess: unless you have a full copy of every expert on each card

#

but then it will have to be very small. etc

dawn vine Jan 22, 2024, 5:57 AM

#

yeah but the whole point of multiple cards in this context is usually to use Expert Parallel

#

so you can fit it all in VRAM by splitting em across cards

fallen spear Jan 22, 2024, 5:58 AM

#

yeah ... idk, the scale at which moe seems viable leads me to question doing them on less than an a100 x4

dawn vine Jan 22, 2024, 5:58 AM

#

agreed

#

this is one reason it may be cool to use the cross layer experts tho

fallen spear Jan 22, 2024, 5:58 AM

#

the itty bittiest moes are still monsters of vram

fallen spear Jan 22, 2024, 5:58 AM

#

dawn vine this is one reason it may be cool to use the cross layer experts tho

yes, that and the thing with slicing layers up into sublayers are both viable small scale

dawn vine Jan 22, 2024, 5:59 AM

#

fallen spear yes, that and the thing with slicing layers up into sublayers are both viable sm...

so you may be able to use 8x experts but not increase the total param count at all, and still get some significant benefit (TBD!)

fallen spear Jan 22, 2024, 5:59 AM

#

yup

#

imho vs deepseek hash routing's multihash is "cleaner" as an idea

dawn vine Jan 22, 2024, 6:00 AM

#

instead of MoE we should call it slice-and-dice FFNs

fallen spear Jan 22, 2024, 6:00 AM

#

but they barely mention it

#

it is a frustrating paper

#

because it demonstrates that things which should not work do work

dawn vine Jan 22, 2024, 6:00 AM

#

i gotta look at multihash...

fallen spear Jan 22, 2024, 6:00 AM

#

and then fails to extend it

#

sec

fallen spear Jan 22, 2024, 6:02 AM

#

dawn vine i gotta look at multihash...

https://openreview.net/pdf?id=lMgDDWb1ULW
starting on pg 8, has one table, is also in appendix a

#

i have mentioned this paper ad nauseam in the moe reading group channel and also read it for the group, which is on eleuther yt

#

so will refrain from restating too much more of that

#

my mistake: also mentioned on page 3 towards the end

#

and that copy omits the appendix

#

https://arxiv.org/abs/2106.04426

arXiv.org

Hash Layers For Large Sparse Models

We investigate the training of sparse layers that use different parameters for different inputs based on hashing in large Transformer models. Specifically, we modify the feedforward layer to hash to different sets of weights depending on the current token, over all tokens in the sequence. We show that this procedure either outperforms or is comp...

dawn vine Jan 22, 2024, 6:06 AM

#

so just use M hash functions, one for each of deepseek style M slices?

fallen spear Jan 22, 2024, 6:06 AM

#

yeah. or: whatever you would normally use to route

#

hash routing is amazing bc it works at all

#

hashing the index of the token is a rock stupid method that still works

#

it is underexplored and can fail in weird ways and you probably don't wanna fuck with it if it's not your main goal to do so
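
A minimal sketch of the routing idea discussed here, assuming one fixed random lookup table per slice that maps a token id to an expert id. The helper names and the sizes are hypothetical, and this is not the paper's implementation, only an illustration that the routing is untrained and consistent per token.

```python
import random

def make_hash_table(vocab_size: int, n_experts: int, seed: int) -> list[int]:
    """Fixed random map from token id to expert id (never trained)."""
    rng = random.Random(seed)
    return [rng.randrange(n_experts) for _ in range(vocab_size)]

def route(token_ids: list[int], tables: list[list[int]]) -> list[list[int]]:
    """For each token, one expert id per hash table (one table per slice)."""
    return [[table[t] for table in tables] for t in token_ids]

# Hypothetical sizes, chosen only for illustration.
VOCAB, N_EXPERTS, M_SLICES = 1000, 8, 4
tables = [make_hash_table(VOCAB, N_EXPERTS, seed=s) for s in range(M_SLICES)]

tokens = [17, 42, 17]
routes = route(tokens, tables)
# The same token id always lands on the same experts, wherever it appears.
assert routes[0] == routes[2]
```

With `M_SLICES = 1` this reduces to a single hash; more tables give one independent routing decision per slice.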

dawn vine Jan 22, 2024, 6:07 AM

#

sounds a little like this stupid/great paper
https://arxiv.org/abs/2401.02994
(basically, just run a totally random model of 8 for each subsequent token so they 'collaborate')

arXiv.org

Blending Is All You Need: Cheaper, Better Alternative to Trillion-P...

In conversational AI research, there's a noticeable trend towards developing models with a larger number of parameters, exemplified by models like ChatGPT. While these expansive models tend to generate increasingly better chat responses, they demand significant computational resources and memory. This study explores a pertinent question: Can a c...

fallen spear Jan 22, 2024, 6:07 AM

#

but their multihash is just "what if instead of routing once you routed eight times"

#

turns out it works great

#

this naturally raises the question of why not route d_model times

dawn vine Jan 22, 2024, 6:09 AM

#

well thats your usual thesis, right? (assuming it can be done performantly)

fallen spear Jan 22, 2024, 6:10 AM

#

yeah basically

#

if the routing itself is fast and doesn't degenerate, doing d_model routings is at worst the same as doing one routing

#

basically the hash routing paper suggests the type of conclusion i find the most enticing

#

which is that some large and complex component is worthless

#

in this case, trainable routings

dawn vine Jan 22, 2024, 6:13 AM

#

can u do it performantly just using torch.gather?

fallen spear Jan 22, 2024, 6:13 AM

#

maybe? i honestly do not know
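
One way the gather-style approach could look, as a sketch only. It uses plain index selection (`w_up[expert_ids]`) rather than `torch.gather` itself, the sizes are hypothetical, and it says nothing about whether this is fast enough in practice: it copies a full expert matrix per token, so memory grows with the number of tokens.

```python
import torch

torch.manual_seed(0)
n_experts, d_model, d_ff, n_tokens = 4, 8, 16, 5  # hypothetical toy sizes

w_up = torch.randn(n_experts, d_model, d_ff)
w_down = torch.randn(n_experts, d_ff, d_model)
x = torch.randn(n_tokens, d_model)
expert_ids = torch.tensor([0, 3, 1, 3, 2])  # e.g. from a fixed hash of the token id

# Select each token's expert weights, then do one batched matmul per projection.
h = torch.relu(torch.bmm(x.unsqueeze(1), w_up[expert_ids]))  # (n_tokens, 1, d_ff)
y = torch.bmm(h, w_down[expert_ids]).squeeze(1)              # (n_tokens, d_model)

# Reference: route tokens one at a time.
y_ref = torch.stack(
    [torch.relu(x[i] @ w_up[e]) @ w_down[e] for i, e in enumerate(expert_ids.tolist())]
)
assert torch.allclose(y, y_ref, atol=1e-4)
```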

#

i'm probably out for the night

dawn vine Jan 22, 2024, 6:13 AM

#

k gnite! gonna keep running MoE experiments 🙂

#

thanks for the ideas!

boreal moss Jan 22, 2024, 4:21 PM

#

dawn vine well any consumer level card they make all the comms go across the CPU or some c...

amd radeons can communicate directly, but pytorch support is junk of course

soft bobcat Jan 30, 2024, 8:05 AM

#

from https://arxiv.org/abs/2401.14489

arXiv.org

The Case for Co-Designing Model Architectures with Hardware

While GPUs are responsible for training the vast majority of state-of-the-art deep learning models, the implications of their architecture are often overlooked when designing new deep learning (DL) models. As a consequence, modifying a DL model to be more amenable to the target hardware can significantly improve the runtime performance of DL tra...

fallen spear Jan 30, 2024, 8:43 AM

#

i saw that! it is interesting that llama seemed to have decided that they didn't really care how wide the layer was exactly as long as it was a good multiple of the gpu ... whatever the thing is called. pool? the thing

soft bobcat Jan 31, 2024, 8:49 AM

#

if you use 3 input layers to the activation, the dimensions become 2x from 8/3x
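
The arithmetic behind that ratio can be sketched as follows, assuming the goal is to match the weight count of a plain FF with a 4x up projection and ignoring biases. The function name is hypothetical.

```python
from fractions import Fraction

def matched_ff_ratio(n_inputs: int) -> Fraction:
    """d_ff / d_model that keeps FF weights equal to a plain 4x FF.

    A plain FF has 2 matrices (up, down) at ratio 4: 2 * 4 = 8 units of d_model**2.
    With n_inputs up-projections plus one down-projection there are
    n_inputs + 1 matrices, so ratio = 8 / (n_inputs + 1). Biases are ignored.
    """
    return Fraction(8, n_inputs + 1)

for k in (1, 2, 3):
    print(k, matched_ff_ratio(k))
# 1 4
# 2 8/3
# 3 2
```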

#

however, I haven't yet found a layer that beats 2 input layers

#

I think it vaguely matters because of wave quantization; if both d_model and 8/3 d_model must be convenient for wave quantization, that could be an annoying restriction

fallen spear Jan 31, 2024, 3:14 PM

#

soft bobcat I think it vaguely matters because of wave quantization; if both d_model and 8/3...

if you link me to something explaining what wave quantization is i will owe you the soul of my firstborn

soft bobcat Jan 31, 2024, 4:00 PM

#

https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#dim-quantization

it was also mentioned in the article linked above

NVIDIA Docs

Matrix Multiplication Background User's Guide

GPUs accelerate machine learning operations by performing calculations in parallel. Many operations, especially those representable as matrix multipliers will see good acceleration right out of the box. Even better performance can be achieved by tweaking operation parameters to efficiently use GPU resources. The performance documents present the...
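
A rough sketch of the wave quantization effect, assuming the output matrix is cut into fixed-size tiles and the GPU runs a fixed number of tiles at once (one "wave"). The tile size and the figure of 108 tiles per wave are assumed numbers for illustration, not measurements of any particular card.

```python
import math

def wave_utilization(m: int, n: int, tile_m: int, tile_n: int, tiles_per_wave: int) -> float:
    """Fraction of the launched waves' capacity that does useful tile work."""
    tiles = math.ceil(m / tile_m) * math.ceil(n / tile_n)
    waves = math.ceil(tiles / tiles_per_wave)
    return tiles / (waves * tiles_per_wave)

# Assumed: 256x128 tiles, 108 tiles per wave.
full = wave_utilization(2304, 1536, 256, 128, 108)     # 9 * 12 = 108 tiles, exactly one wave
spilled = wave_utilization(2304, 1664, 256, 128, 108)  # 9 * 13 = 117 tiles, needs two waves

print(full)               # 1.0
print(round(spilled, 3))  # 0.542
```

Widening one dimension by a single tile column here roughly doubles the time for about 8% more work, which is why both d_model and the FF width having to land on convenient sizes can be restrictive.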

topaz marten Feb 1, 2024, 10:57 PM

#

Would be interesting to design a model specifically for some batch size given some specific gpu

fallen spear Feb 1, 2024, 11:04 PM

#

topaz marten Would be interesting to design a model specifically for some batch size given so...

i personally would probably be inclined to target getting exactly under 48gb of vram

#

new thought about zerO init: would it suffice to add randomness by permuting your hadamards

#

assuming you have more than one

#

permuting them differently would produce a different result

#

assuming you did not permute them identically

fallen spear Feb 3, 2024, 5:50 AM

#

@soft bobcat @still grail can yall document any fresh runs on weird activations here or ideas therefore so i don't have to search ot for it tomorrow

#

@sage jetty if you want to noise up the channel I would not be mad either

soft bobcat Feb 3, 2024, 5:51 AM

#

sure, but most likely that's the final run

fallen spear Feb 3, 2024, 5:51 AM

#

i will hopefully be running these on pythia 14m tomorrow

soft bobcat Feb 3, 2024, 5:52 AM

#

```
x_mult = x1 * x2 * x3
x = torch.tanh(x1) * torch.sign(x_mult) * torch.pow(torch.abs(x_mult) + 1e-8, 2/3)
```
was fern's idea

#

x = torch.tanh(x1) * x2 * x3

was something I've had lying around, but it's not my best
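
For context, a sketch of how the two three-input activations quoted here might sit inside an FF block. The wrapper class, the fused up projection, and the sizes are assumptions; only the two activation expressions come from the messages.

```python
import torch
from torch import nn

class ThreeInputFF(nn.Module):
    """FF block whose activation combines three up-projections (hypothetical wrapper)."""

    def __init__(self, d_model: int, d_ff: int, signed_root: bool = True):
        super().__init__()
        self.up = nn.Linear(d_model, 3 * d_ff)  # one fused matmul, split into x1, x2, x3
        self.down = nn.Linear(d_ff, d_model)
        self.signed_root = signed_root

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2, x3 = self.up(x).chunk(3, dim=-1)
        if self.signed_root:
            # Signed 2/3 power of the triple product, gated by tanh(x1).
            x_mult = x1 * x2 * x3
            h = torch.tanh(x1) * torch.sign(x_mult) * torch.pow(torch.abs(x_mult) + 1e-8, 2 / 3)
        else:
            h = torch.tanh(x1) * x2 * x3
        return self.down(h)

ff = ThreeInputFF(d_model=16, d_ff=32)  # d_ff = 2 * d_model, per the equal-parameter argument
y = ff(torch.randn(4, 10, 16))
```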

still grail Feb 3, 2024, 5:53 AM

#

yeah i think this is reasonable but hard to predict ahead of time when it would do well, it sorta makes sense to me at leasties thosies.... ❤️ :'))))

soft bobcat Feb 3, 2024, 5:54 AM

#

the bad news is that I don't have a diversity of great activation functions lying around

#

some of them are just better than others

fallen spear Feb 7, 2024, 5:09 AM

#

apparently clipping the up projection on pythia 70m only takes off 10m parameters

#

this seems incorrect but i'm going with it

still grail Feb 7, 2024, 5:25 AM

#

fallen spear apparently clipping the up projection on pythia 70m only takes off 10m parameter...

Check the size of your output layer in param #s.

fallen spear Feb 7, 2024, 5:45 AM

#

still grail Check the size of your output layer in param #s.

it's rather large

still grail Feb 7, 2024, 5:45 AM

#

Ye

fallen spear Feb 7, 2024, 5:45 AM

#

10m total for ff still feels absurdly small

fallen spear Feb 7, 2024, 6:31 AM

#

oh, fifty million of the params are vocabulary

#

okay sorry i realized it would be high but not that high

#

i've cut my non-embedding params in half, that feels correct actually

#

51m params embedding layer on a 70m model and i just shaved off half of my non-embedding parameters, what do we bet on relative performance
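
The numbers in this exchange can be reproduced with a back-of-envelope count. This sketch assumes a Pythia-70m-like shape (d_model 512, 6 layers, a 4x FF, a padded vocabulary of 50304, untied input and output embeddings, biases on every linear layer) and assumes that "clipping the up projection" means shrinking d_ff to d_model. Both are assumptions, not details stated in the chat.

```python
def linear(n_in: int, n_out: int) -> int:
    return n_in * n_out + n_out  # weights + bias

def non_embedding_params(d_model: int, d_ff: int, n_layers: int) -> int:
    per_layer = (
        linear(d_model, 3 * d_model)  # fused q, k, v
        + linear(d_model, d_model)    # attention output
        + linear(d_model, d_ff)       # ff up projection
        + linear(d_ff, d_model)       # ff down projection
        + 2 * 2 * d_model             # two layernorms (scale + shift)
    )
    return n_layers * per_layer + 2 * d_model  # plus the final layernorm

d_model, n_layers, vocab = 512, 6, 50304
embedding = 2 * vocab * d_model  # untied input + output matrices
baseline = non_embedding_params(d_model, 4 * d_model, n_layers)
clipped = non_embedding_params(d_model, d_model, n_layers)

print(embedding)           # 51511296
print(baseline)            # 18915328
print(baseline - clipped)  # 9446400
print(round(clipped / baseline, 2))  # 0.5
```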

#

pythia 70m on owt2, 13b tokens as baseline:
https://wandb.ai/segyges/neox/runs/h1gsuoo9
how much worse is it with no ff, which shaves off 10m params

#

for the sake of argument i'll say identical but it won't be

#

it does not seem to give a shit

#

so far

#

will see once the entire thing is done

#

zoomed in last hundred steps

#

yeah so my bet is they converge before the run ends

#

like, completely

#

my second bet is that they don't converge before the run ends, but do if i zero initialize both for reruns

#

my third bet is that if someone lets me pursue this harebrained ablation at a non-toy-model size where ff is actually the dominant part of model size the closeness of fit remains regardless of whether either preceding bet pans out

#

at which point maybe you are better off training for ~4x as long

soft bobcat Feb 7, 2024, 6:52 AM

#

they probably won't converge. that thing is a loss spike for green so it got a little worse at that point

fallen spear Feb 7, 2024, 6:52 AM

#

i love it when my bets aren't entirely theoretical

fallen spear Feb 7, 2024, 6:53 AM

#

soft bobcat they probably won't converge. that thing is a loss spike for green so it got a l...

my reasoning is that the gap is larger earlier

#

for scale of how small the gap actually is

#

the reason i will be wrong is if the reason this is so minuscule is that 5/7ths of the model's brain is in its embedding layer

#

and the ff actually matters at non-toy scales

#

i expected it to be less nice and to have to fiddle the initialization

#

at this scale

#

so that's pleasant

tardy trench Feb 7, 2024, 7:04 AM

#

What are you guys considering to be convergence?

fallen spear Feb 7, 2024, 7:05 AM

#

lr has a minimum scale

tardy trench Feb 7, 2024, 7:06 AM

#

Sorry was missing context here, nvm

fallen spear Feb 7, 2024, 7:08 AM

#

yeah that's a ppl loss

#

my current theory is that the up projection in a transformer ff doesn't do anything but slightly help signal propagation

#

i was thinking about testing more weird activation functions and then i realized that this was still more interesting

#

and it calls for me to fiddle the initialization anyway which, if activation functions are mostly about signal propagation, is still a good test to have

fallen spear Feb 7, 2024, 1:16 PM

#

kevin wins, they're diverging

#

will finish the run for completionism's sake

#

amusingly the divergences show up right after loss spikes

#

it's not divergent like not training, it's just not getting closer

fallen spear Feb 7, 2024, 1:50 PM

#

https://api.wandb.ai/links/segyges/7h1367t2 figured out how to share this correctly now

Weights & Biases

train/lm_loss (24/02/07 07:50:37)

fallen spear Feb 7, 2024, 2:23 PM

#

divergences basically appear after loss spikes and then don't come back so i don't feel, uh, disinclined from the theory that this is actually because the up projection helps signal propagation

#

the smaller model isn't worse generically it's just modestly less stable

#

also, wandb draws the graph badly, on a different view it looks like there's a NaN/inf value at that loss spike at that point but i am pretty sure that's just an error

fallen spear Feb 8, 2024, 12:29 AM

#

to restate the hypothesis and prove that i have thought about mup too hard: any time you change the activation function or the width, you are implicitly changing the ratio of the model gradient relative to the model weights if you aren't using an initialization that prevents that. this can help or inhibit signal propagation. if this change is the important one with regard to, e.g., changing activation functions then the benefit will tend to trail off if using an initialization that doesn't change this ratio with scale and otherwise making sure you propagate gradients well

#

the dumbest test of this hypothesis is not to change activation functions, which is complex, it is to ablate the dimension of the model under both the existing initialization and conditions which maybe make it more stable

tardy trench Feb 8, 2024, 3:09 AM

#

fallen spear to restate the hypothesis and prove that i have thought about mup too hard: any ...

If you're using Adam, I don't think the scale of the gradients relative to the parameters actually matters, because the Adam updates are essentially invariant to the scale of the gradient, assuming the optimizer state has time to adapt. However, MuP admittedly does do other helpful things, like help you set the Adam lr, prevent softmax saturation, etc.

fallen spear Feb 8, 2024, 3:15 AM

#

tardy trench If you're using Adam, I don't think the scale of the gradients relative to the p...

well i'm gonna zero initialize it and change the softmax scale

thick briar Feb 8, 2024, 7:08 AM

#

tardy trench If you're using Adam, I don't think the scale of the gradients relative to the p...

This makes me wonder why we don't just use Lion for everything. If Adam is invariant to gradient scale in the limit, why not just ignore the scale of the gradient and use the sign instead? It's cheaper, faster, and doesn't require any tricks to prevent loss spikes (since there won't be any).

sage jetty Feb 8, 2024, 4:13 PM

#

It ignores the scale of the gradients? Does this mean that if I'm using automatic mixed precision I won't need to scale them to avoid vanishing gradients as long as I'm using Adam?

fallen spear Feb 8, 2024, 4:17 PM

#

sage jetty It ignores the scale of the gradients? Does this mean that if I'm using automati...

i am positive my logging at least thinks i am still scaling them with mixed precision and adam on neox

#

i suspect optimizer behavior is complex enough that reasoning directly from their behavior at some limit doesn't hold in practice

#

ie adam is scale invariant assuming (some stuff regarding the second moment estimation) which means it's not scale invariant

#

it's normalizing the scale somewhat
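
A small sketch of where the scale invariance breaks, assuming the standard Adam update `lr * m_hat / (sqrt(v_hat) + eps)` for a single scalar parameter on its very first step. The function name is hypothetical. Once the gradient is comparable to `eps`, the update shrinks with it.

```python
import math

def adam_first_step(grad: float, lr: float = 1e-3, beta1: float = 0.9,
                    beta2: float = 0.999, eps: float = 1e-8) -> float:
    """Size of the very first Adam update for a single scalar parameter."""
    m = (1 - beta1) * grad
    v = (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1)  # bias correction at step 1
    v_hat = v / (1 - beta2)
    return lr * m_hat / (math.sqrt(v_hat) + eps)

for g in (1e-3, 1e-6, 1e-9):
    print(f"grad={g:.0e}  update/lr={adam_first_step(g) / 1e-3:.4f}")
# grad=1e-03  update/lr=1.0000
# grad=1e-06  update/lr=0.9901
# grad=1e-09  update/lr=0.0909
```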

sage jetty Feb 8, 2024, 4:23 PM

#

Right, gotcha, so we still should be scaling loss to avoid vanishing gradient problem then...
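
A quick illustration of why the loss scaling is still needed, assuming gradients pass through fp16 before the optimizer sees them. The scale factor of 65536 is an arbitrary example value. A gradient that underflows to exactly zero carries no sign or scale for Adam to normalize.

```python
import torch

grad = torch.tensor(1e-8)                   # a small but nonzero fp32 gradient
as_fp16 = grad.to(torch.float16)            # below fp16's smallest subnormal (~6e-8)
scaled = (grad * 65536).to(torch.float16)   # same value after scaling by 2**16

print(as_fp16.item())     # 0.0
print(scaled.item() > 0)  # True
```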

fallen spear Feb 8, 2024, 4:23 PM

#

an actually scale invariant adam update is a spherical cow with zero mass that experiences no friction

#

easy to reason about, doesn't exist

sage jetty Feb 8, 2024, 4:25 PM

#

That takes me back to high-school physics that does 😂

fallen spear Feb 8, 2024, 4:27 PM

#

@buoyant turret pinged here because i have spammed other channels about this enough

#

i don't think the hopfield network thing is unreasonable

#

i do think it's unreasonable to assume that the original transformer paper got the good ratio correct at 4x in 2017 and it remains true at all scales of d_model

#

i am sorry/happy to have successfully conveyed my obsession at least enough that you're considering it

buoyant turret Feb 8, 2024, 4:31 PM

#

ideal ratio? no. But when comparing capacity the question becomes: does SwiGLU result in better intermediate 'keys' to do retrieval from the 'hopfield network', which allows us to scale back to 8/3 compared to 4? I'm less concerned about the ideal ratio, and more concerned about what ratio has equivalent capacity

fallen spear Feb 8, 2024, 4:32 PM

#

yeah the 8/3 is just equiparams to 4x for a simpler activation there is no reason to think it's ideal
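
The equal-parameter claim can be checked directly. This is a sketch with bias-free layers and a hypothetical width of 768, chosen so that 8/3 of it is an integer; the class names are made up for the example.

```python
import torch
from torch import nn
import torch.nn.functional as F

class PlainFF(nn.Module):
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff, bias=False)
        self.down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        return self.down(F.gelu(self.up(x)))

class SwiGLUFF(nn.Module):
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.gate = nn.Linear(d_model, d_ff, bias=False)
        self.up = nn.Linear(d_model, d_ff, bias=False)
        self.down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))

def n_params(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())

d_model = 768  # hypothetical width, chosen so that 8/3 * d_model is an integer
plain = PlainFF(d_model, 4 * d_model)
swiglu = SwiGLUFF(d_model, 8 * d_model // 3)
assert n_params(plain) == n_params(swiglu) == 8 * d_model**2
```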

#

i don't think there's even a very strong theoretical analysis of swiglu/geglu

#

everything i have found is just "empirically this seems to work"

#

including noam shazeer saying he has no idea why they work and attributing their success to divine benevolence

#

so if you did "assuming ff up projection works as a hopfield memory, switching from 4x swish to 8/3rds swiglu will have XYZ consequences for it" that would i think be novel

buoyant turret Feb 8, 2024, 4:36 PM

#

I wonder if we could try replicating the ROME paper at varying FFN ratios with and without SwiGLU and just see wtf is up

#

only problem is we need to pretrain several models with varying FFN ratios with and without SwiGLU first before we can even try ROME

fallen spear Feb 8, 2024, 4:43 PM

#

oh boy a whole bunch of pretraining runs, that sounds easy and not complex at all

#

also, now I have to read ROME

#

actually it looks complex but not horrifically computationally expensive

buoyant turret Feb 8, 2024, 5:08 PM

#

fallen spear actually it looks complex but not horrifically computationally expensive

it's not that computationally expensive, the main issue is getting the pretrained models in the first place and then identifying the 'best' layer to perform knowledge editing as that could very well be different for every model, which then adds a-whole-nother layer of ablations

fallen spear Feb 8, 2024, 5:20 PM

#

https://arxiv.org/abs/2202.05262v5 paper for reference

arXiv.org

Locating and Editing Factual Associations in GPT

We analyze the storage and recall of factual associations in autoregressive transformer language models, finding evidence that these associations correspond to localized, directly-editable computations. We first develop a causal intervention for identifying neuron activations that are decisive in a model's factual predictions. This reveals a dis...

buoyant turret Feb 8, 2024, 5:27 PM

#

fallen spear https://arxiv.org/abs/2202.05262v5 paper for reference

yeee, that's the paper I was talking about

fallen spear Feb 8, 2024, 5:42 PM

#

buoyant turret it's not that computationally expensive, the main issue is getting the pretrained...

oh fwiw: my guess that 4x is bad is largely predicated on the notion that modern models are so large that they effectively cannot be bottlenecked on available ff storage

#

it maybe made sense at 2017 scales and also with 2017 attention it meant that you avoided the ff pass being "cold" for gpu utilization during the forward pass

#

like, you might as well put parameters there because otherwise probably compute that is provisioned to hold the attention calculation is idle

#

neither of those concerns currently seems reasonable

#

since i'm putting zerO initialization in neox i need a name that isn't terrible because "zero" is not a reasonable name for a config setting that isn't setting something to zeroes, i am renaming it to identity-hadamard or iu-hd (identity up, hadamard down) barring objection

buoyant turret Feb 8, 2024, 5:44 PM

#

I'm testing 8/3x now and I arguably have even better compute saturation because I'm able to slightly increase the micro-batch size.

fallen spear Feb 8, 2024, 5:45 PM

#

identity-hadamard is probably fine

thick briar Feb 9, 2024, 2:27 AM

#

buoyant turret ideal ratio? no. But when comparing capacity the question becomes does SwiGLU res...

The success of swiglu seems to suggest to me that the real bottleneck for a transformer is not necessarily the number of patterns (d_ff) it's able to recognize, but its ability to recognize them.

For a transformer to reason about an entire sentence or paragraph and predict the next token it basically needs to squeeze all the information about the paragraph into its hidden state of size d_model. Now, to make the model more expressive, we can do one of two things. One, we can increase d_model, giving the model a larger hidden state. Two, we can make the FF (pattern recognition) layer more complex. This allows us to encode more information within d_model than we could otherwise, because we have a more powerful method for processing the hidden state.

Ultimately I think the limiting factor for transformers is their hidden state, we either need to make it larger, or we need to give transformers ways to make more of limited space.

still grail Feb 9, 2024, 3:11 AM

#

fallen spear since i'm putting zerO initialization in neox i need a name that isn't terrible ...

Thank you, good, this is sensible I thinks.... ❤️ :')))) (or even if there's an informal non hadamard name, that might be less confusing, as hadamard multiplies made a confusing name collision for me for a while)

dawn vine Feb 21, 2024, 11:35 PM

#

@fallen spear

#

Gemma: 16x FFN ratio!!!!

fallen spear Feb 21, 2024, 11:35 PM

#

and trained for 6 trillion

dawn vine Feb 21, 2024, 11:35 PM

#

apparently they disagree with your assessment violently

fallen spear Feb 21, 2024, 11:36 PM

#

i am definitely abandoning my theory that they do the 6x up projection for some specific reason

#

i mean... that's a lot of budget to devote to 6 trillion tokens to end up even with models with a smaller up projection

dawn vine Feb 21, 2024, 11:37 PM

#

like they had to sacrifice a ton of embedding space for that 16x

fallen spear Feb 21, 2024, 11:37 PM

#

their embedding space is freakishly gigantic though

dawn vine Feb 21, 2024, 11:37 PM

#

what? 3072 is small.. RWKV7B is 4096!

fallen spear Feb 21, 2024, 11:37 PM

#

sorry, i thought you meant the dictionary

dawn vine Feb 21, 2024, 11:37 PM

#

ah yes

fallen spear Feb 21, 2024, 11:37 PM

#

250k is uh

#

it is too many

dawn vine Feb 21, 2024, 11:37 PM

#

maybe that makes up for the small d_model somehow

fallen spear Feb 21, 2024, 11:38 PM

#

they tie the in and out projections to avoid the model being too horrifyingly dominated by their vocabulary

#

but which seems like uh

#

a really weird choice

dawn vine Feb 21, 2024, 11:38 PM

#

i love weight tying personally 🙂

fallen spear Feb 21, 2024, 11:38 PM

#

it's glorious but doesn't scale tho

dawn vine Feb 21, 2024, 11:39 PM

#

hard in various parallel situations for sure

#

screwed me up good on pipeline parallel

fallen spear Feb 21, 2024, 11:39 PM

#

their in and out matrices should be 750m apiece and they save 750m by tying them
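
The size estimate and the tying itself, as a sketch. The 250k vocabulary and d_model of 3072 are the round figures used in this conversation; the toy model and its sizes are hypothetical and only show how one matrix is shared between input and output.

```python
import torch
from torch import nn

# Figures from the conversation: ~250k vocabulary, d_model = 3072.
vocab, d_model = 250_000, 3072
print(vocab * d_model)  # 768000000 weights per embedding-sized matrix

class TinyTiedLM(nn.Module):
    """Toy model showing only the tying itself; sizes are hypothetical."""

    def __init__(self, vocab_size: int = 100, d_model: int = 16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        self.lm_head.weight = self.embed.weight  # share one Parameter for input and output

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        return self.lm_head(self.embed(token_ids))

model = TinyTiedLM()
# parameters() de-duplicates shared tensors, so the matrix is counted once.
print(sum(p.numel() for p in model.parameters()))  # 1600
```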

#

but also

#

jesus christ

#

bruv you have put 750m parameters into your tokenizer, been forced to tie it to keep the budget under control, and then you only trained it on english

#

their embedding layer doesn't fit on my gpu

soft bobcat Feb 21, 2024, 11:41 PM

#

run the embedding layer on your CPU

#

it's just a lookup table

fallen spear Feb 21, 2024, 11:42 PM

#

i am not planning to run their model at all i am just bemused generally

dawn vine Feb 21, 2024, 11:42 PM

#

well i doubt these choices were stupid

#

so the 16x FFN must be a smartish deliberate tradeoff

fallen spear Feb 21, 2024, 11:43 PM

#

it's a good tradeoff if it's effectively free for some hardware configuration

#

but i am positive that for their use case the embedding is not doing anything, english doesn't have a 250k-size vocabulary that it is essential to learn

dawn vine Feb 21, 2024, 11:44 PM

#

yeah maybe at 7B, 16x FFN fits nicely on TPUs so its very fast to train

fallen spear Feb 21, 2024, 11:44 PM

#

given that one assumes their bigger models are ... bigger, i assume the 6x ffn was just doing the same thing and expanding to fill all available space on a tpu

dawn vine Feb 21, 2024, 11:46 PM

#

@fallen spear unrelated, i still havent tried token based routing on MoE but I was wondering do you know if there's any reason you can't make all the routing decisions up front on layer 0 with a scheme like that?

fallen spear Feb 21, 2024, 11:46 PM

#

dawn vine @fallen spear unrelated, i still havent tried token based routing on MoE...

have you heard the good news about our lord and savior hash routing

#

(yes)

dawn vine Feb 21, 2024, 11:47 PM

#

fallen spear have you heard the good news about our lord and savior hash routing

yes i mean hash routing 🙂 tho u scared me from using it for real

fallen spear Feb 21, 2024, 11:47 PM

#

you can make the decision whenever you want

#

yeah fair

dawn vine Feb 21, 2024, 11:47 PM

#

I want to use it for real tho

fallen spear Feb 21, 2024, 11:47 PM

#

i suspect i need to make my own sandbox for that sort of thing

dawn vine Feb 21, 2024, 11:48 PM

#

fallen spear you can make the decision whenever you want

sorry, I meant can you route every layer the same way

#

or does it being kinda weirdly differently randomized on each matter

#

been trying to speed up deepspeed's slow MoE implementation and the routing seems to be part of the slowness

fallen spear Feb 21, 2024, 11:48 PM

#

dawn vine sorry, I meant can you route every layer the same way

that's the best part, nobody knows. hash routing tested it and it got slightly better

#

it got slightly better almost no matter what they did

#

you could just salt or rehash for different layers if you want to do that though
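
A sketch of the salting idea, assuming routing depends only on the token id and the layer index, so every layer's decision can be computed before the first layer runs. The function names are hypothetical; `hashlib.blake2b` is used only because it takes a salt directly.

```python
import hashlib

def expert_for(token_id: int, layer: int, n_experts: int) -> int:
    """Deterministic expert choice from (token id, layer); the layer index is the salt."""
    digest = hashlib.blake2b(
        token_id.to_bytes(8, "little"),
        digest_size=8,
        salt=layer.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little") % n_experts

def route_all_layers(token_ids: list[int], n_layers: int, n_experts: int) -> list[list[int]]:
    """routes[layer][position], computed before the first layer runs."""
    return [[expert_for(t, layer, n_experts) for t in token_ids] for layer in range(n_layers)]

routes = route_all_layers([17, 42, 17], n_layers=4, n_experts=8)
assert all(r[0] == r[2] for r in routes)  # same token, same expert, within each layer
```

Passing the same salt for every layer would route all layers identically, which is the other variant asked about.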

dawn vine Feb 21, 2024, 11:50 PM

#

ok, gotta find some time to try the hash route

#

if i can ever get consistent results with what ive already got 🤣

fallen spear Feb 21, 2024, 11:51 PM

#

i mean my theory of hash routing is that nothing to do with routing matters except that it be consistent for some fixed subset of input that is of a large enough size to meaningfully specialize in

dawn vine Feb 21, 2024, 11:51 PM

#

the idea that its related to the token makes sense, i dont think u can just make it literally random

#

but who knows

fallen spear Feb 21, 2024, 11:52 PM

#

okay but which tokens share experts is random

dawn vine Feb 21, 2024, 11:52 PM

#

yes, but at least each specific token is relevant to that expert always

fallen spear Feb 21, 2024, 11:52 PM

#

consistent is the criterion that seems to be important, you have to know that X input will see Y expert again

fallen spear Feb 22, 2024, 3:32 AM

#

i'll know if totally naively swapping the pythia initializations out for zero causes it to explode shortly here

dawn vine Feb 22, 2024, 3:33 AM

#

fallen spear i'll know if totally naively swapping the pythia initializations out for zero ca...

ive found w/ MoE that the rwkv inits matter a lot

#

it doesnt explode, but it just does crummy

fallen spear Feb 22, 2024, 3:33 AM

#

dawn vine ive found w/ MoE that the rwkv inits matter a lot

i mean if zero init is as good as the sticker says it shouldn't matter

#

ie, it should be strictly better

#

so exploding or doing badly are also signal

dawn vine Feb 22, 2024, 3:34 AM

#

you cant just init any old thing to zero

#

some stuff has to be random so the gradient differs

fallen spear Feb 22, 2024, 3:35 AM

#

zero init is actually a weird combination of identity matrices and hadamard matrices

#

because names are hard i guess

dawn vine Feb 22, 2024, 3:35 AM

#

o ok

fallen spear Feb 22, 2024, 3:35 AM

#

i called it identity-hadamard and it appears that it is indeed exploding

dawn vine Feb 22, 2024, 3:35 AM

#

lol

soft bobcat Feb 22, 2024, 3:38 AM

#

#

if you're using this, multiply the terms in the hadamard matrix by 1/sqrt(in-dimension)

fallen spear Feb 22, 2024, 3:39 AM

#

compared to their recommended implementation?

#

they have a recommended scaling factor etc that i have not dug into

soft bobcat Feb 22, 2024, 3:40 AM

#

oh, then try their recommended scaling instead

#

I just looked at Fig 1

fallen spear Feb 22, 2024, 3:40 AM

#

that is the one that is currently exploding so

#

i strongly suspect their initialization just depends completely on, uh

#

a lot of things that are true of their original test and are not true of pythia

#

    import math

    import torch
    from scipy.linalg import hadamard  # assumption: the hadamard() helper is not shown in the paste

    @torch.no_grad()
    def linear_ZerO_init_(tensor: torch.Tensor):
        # Algorithm 1 in the paper.
        assert tensor.dim() == 2, "linear_ZerO_init_ only works on 2D tensors"
        m, n = tensor.shape
        # Tensor.to() is not in-place, so build everything directly in the
        # target tensor's dtype and on its device.
        kw = dict(dtype=tensor.dtype, device=tensor.device)

        if m <= n:
            # Wide or square: partial identity.
            tensor.copy_(torch.nn.init.eye_(torch.empty(m, n, **kw)))
        else:  # m > n
            # Tall: I(m, p) @ (H_p / sqrt(p)) @ I(p, n), p = next power of two >= m.
            clog_m = math.ceil(math.log2(m))
            p = 2**clog_m
            had = torch.from_numpy(hadamard(p)).to(**kw) / 2 ** (clog_m / 2)
            left = torch.nn.init.eye_(torch.empty(m, p, **kw))
            right = torch.nn.init.eye_(torch.empty(p, n, **kw))
            tensor.copy_(left @ had @ right)
        return tensor

soft bobcat Feb 22, 2024, 3:42 AM

#

their scaling factor is equivalent to mine
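
A quick way to check that claim, as a sketch: the snippet above divides by 2 ** (clog_m / 2), which equals sqrt(p) for p = 2 ** clog_m, so each Hadamard entry ends up as ±1/sqrt(p) and the scaled matrix is orthonormal. `sylvester_hadamard` is a stand-in written for this example (the `hadamard` helper used in the chat is not shown), and m = 768 is a made-up dimension. Note that p is the padded power-of-two size, so it only matches a layer dimension exactly when that dimension is itself a power of two.

```python
import math


def sylvester_hadamard(p: int) -> list[list[int]]:
    """Build a p x p Hadamard matrix (p a power of two) by Sylvester doubling."""
    assert p >= 1 and p & (p - 1) == 0, "p must be a power of two"
    h = [[1]]
    while len(h) < p:
        # [[H, H], [H, -H]] doubles the size and keeps rows orthogonal.
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return h


def scaled_gram(p: int) -> list[list[float]]:
    """Return (H / sqrt(p)) @ (H / sqrt(p)).T, which should be the identity."""
    h = sylvester_hadamard(p)
    scale = 1.0 / math.sqrt(p)
    return [
        [sum((a * scale) * (b * scale) for a, b in zip(r1, r2)) for r2 in h]
        for r1 in h
    ]


m = 768  # hypothetical layer dimension, not from the chat
clog_m = math.ceil(math.log2(m))
p = 2 ** clog_m  # padded Hadamard size
# The divisor used in the snippet above is exactly sqrt(p).
assert math.isclose(2 ** (clog_m / 2), math.sqrt(p))

gram = scaled_gram(8)  # small size so the check is cheap
```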

fallen spear Feb 22, 2024, 3:42 AM

#

i might have to squint at it harder

#

all i can see are dtypes and vram allocation problems atm

#

i am not sure how long i am going to wait before i kill this run

#

it's soaring so majestically

soft bobcat Feb 22, 2024, 3:45 AM

#

just kill it, you won't learn anything

fallen spear Feb 22, 2024, 3:46 AM

#

@granite plover i can't remember who else likes zero init but it doesn't work if you just drop it in

soft bobcat Feb 22, 2024, 3:46 AM

#

you can try printing torch.std_mean() of your layers with the two inits to diagnose any issues, if you want

#

I personally wouldn't bother unless I were really interested in ZerO init

fallen spear Feb 22, 2024, 3:47 AM

#

i am really interested but given that it's tightly coupled to the optimizer and god knows what other hparams i should do this in a truly tiny sandbox at some point

#

basically: this isn't math really, but it isn't engineering either, if you try to do it engineer-ily you just get a lot of null results i think

#

because every remotely scaled setup is already extremely tuned and sits at a specific optimum for its hparams that you almost cannot help but disrupt

soft bobcat Feb 22, 2024, 3:49 AM

#

nah, I don't believe that. it's been easy to find small improvements in LLMs tuning dumb things

#

even fern's repo got improved significantly by naut and it had crazy amounts of hparam search before that

fallen spear Feb 22, 2024, 3:55 AM

#

link, i probably have it somewhere but could use it
that is probably true but for chasing down specifically initialization-dependent effects here it seems like it'd be a lot nicer to look at things (activations, initializations) in something much more like a sandbox and make adjustments there rather than guessing at needed adjustments at any kind of scale

#

if extra lucky it starts to feel closer to math and just generalizes easily

soft bobcat Feb 22, 2024, 3:56 AM

#

airbench #implementation-details message

#

fern's repo is in a strange state where giving the EMA higher resolution makes it perform worse

#

it could be that the EMA is not parametrized correctly, because I didn't check the math. but I found it unusual

#

airbench inherits from fern, who inherits from David Page, who inherits from one of the dawnbench leaders. it was SOTA from the start and each person made major speedups on top

fallen spear Feb 22, 2024, 4:00 AM

#

neat. anyway, i think for chasing this hypothesis specifically i am out of remotely reasonable ideas that I wouldn't want to play with in a sandbox first

still grail Feb 22, 2024, 6:39 AM

#

soft bobcat it could be that the EMA is not parametrized correctly, because I didn't check t...

I believe it's a regularization effect. There may be one or two math errors remaining but that repo has been very finely scoured.

It's been pretty consistent for me w.r.t. regularization, the EMA definitely seems to have a sweet spot.

#

Is there something you're seeing that's indicating that the EMA would be parameterized incorrectly?

soft bobcat Feb 22, 2024, 6:40 AM

#

no, I have no evidence of any error. I just found it extremely unusual that a coarser approximation does better

#

now I'm thinking about the fp16 weights though...

still grail Feb 22, 2024, 6:40 AM

#

Well it is a lookahead optimizer, so that will have some impact

#

I've had similar questions of coarseness at the end, and have tried a few different configs, but even at the end of training the semi-coarse EMA seems to be king

soft bobcat Feb 22, 2024, 6:46 AM

#

still grail I've had similar questions of coarseness at the end, and have tried a few differ...

I notice there's no grad scaler...

still grail Feb 22, 2024, 6:47 AM

#

soft bobcat I notice there's no grad scaler...

In hlb-CIFAR10 there is! It's in the loss sum portion. Things are tuned around it.

soft bobcat Feb 22, 2024, 6:48 AM

#

512/batchsize is not a big number

still grail Feb 22, 2024, 6:49 AM

#

I think the sum part cancels it out

#

= 512 is my understanding

soft bobcat Feb 22, 2024, 6:49 AM

#

torch amp starts at 65536, only goes down when necessary, and even rises to accommodate small gradients at later stages
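
To make the difference in scale concrete, here is a small sketch that uses Python's `struct` half-precision format to mimic fp16 rounding. The gradient value 1e-8 is made up for illustration; 65536 is the torch amp starting scale mentioned above, 512 is the effective scale suggested above, and 1 stands in for 512/batchsize at a batch size of 512 (an assumption, not a number taken from the repo).

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


tiny_grad = 1e-8  # made-up gradient magnitude, below fp16's smallest subnormal (2**-24)

for scale in (1, 512, 65536):
    stored = to_fp16(tiny_grad * scale)  # what an fp16 buffer would actually hold
    recovered = stored / scale           # unscale in full precision, as a scaler would
    print(f"scale={scale:>5}: stored={stored:.3e} recovered={recovered:.3e}")
```

With no scaling the value rounds to zero, while either scale keeps it alive; the larger scale also moves it out of the coarse subnormal range.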

still grail Feb 22, 2024, 6:49 AM

#

Ye

soft bobcat Feb 22, 2024, 6:50 AM

#

my current suspicion is that EMA is sensitive to fp16 quantization

still grail Feb 22, 2024, 6:50 AM

#

Alright

soft bobcat Feb 22, 2024, 6:51 AM

#

meh, it's all conjecture. without me actually putting in the work to test anything, I don't feel my guesses have value

granite plover Feb 22, 2024, 7:26 AM

#

fallen spear @granite plover i can't remember who else likes zero init but it doesn't w...

noooooooo

buoyant turret Feb 22, 2024, 12:02 PM

#

what the fuck is Gemma doing? 2048 d_model with 16384 d_ffn????????

buoyant turret Feb 22, 2024, 12:38 PM

#

AND 3072 d_model with 24576 d_ffn?????????

#

THEY'RE USING GELU??? IS IT AT LEAST GATED HOLD UP

#

oh my god

#

it's gated GELU

#

with 8x ffn ratio

#

what the fuck

#

not 8x up 4x down. it's 16x up, 8x down.

#

so actually an 8x intermediate size
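
The arithmetic behind those ratios, as a sketch. It assumes the usual gated-MLP layout (a gate and an up projection of d_model x d_ffn each, plus one d_ffn x d_model down projection, no biases); the d_model and d_ffn values are the ones quoted above.

```python
def gated_ffn_shapes(d_model: int, d_ffn: int) -> dict[str, float]:
    """Ratios and parameter count for a gated FFN without biases (assumed layout)."""
    return {
        # Width of the hidden activation relative to d_model.
        "intermediate_ratio": d_ffn / d_model,
        # Gate and up projections counted together ("16x up").
        "combined_up_ratio": 2 * d_ffn / d_model,
        # Input width of the down projection relative to d_model ("8x down").
        "down_ratio": d_ffn / d_model,
        # Gate + up + down weight matrices.
        "params": 3 * d_model * d_ffn,
    }


for name, d_model, d_ffn in [("2B", 2048, 16384), ("7B", 3072, 24576)]:
    print(name, gated_ffn_shapes(d_model, d_ffn))
```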

#

and apparently builds on "advances made with gemini"

#

so has google ablated and found larger FFN really does work significantly better???

fallen spear Feb 22, 2024, 1:16 PM

#

buoyant turret so has google ablated and found larger FFN really does work significantly better...

or, given their hardware configuration, the massive d_ff is effectively free

#

they showed the mf 6 trillion tokens and it's like. dead even with comparable models

buoyant turret Feb 22, 2024, 1:18 PM

#

fallen spear they showed the mf 6 trillion tokens and it's like. dead even with comparable mo...

then wtf are the leaderboards even showing??? i thought it was comparable?

fallen spear Feb 22, 2024, 1:22 PM

#

buoyant turret then wtf are the leaderboards even showing??? i thought it was comparable?

no, it is. so they didn't cripple it with the massive up projection i guess but it doesn't look better either

#

leaderboards are noisy and influenced by the fiddly bits with how you do your ft

#

so they are a definite datapoint for "at scale maybe it doesn't even matter"

#

huge ratio? sure. small? also fine. whatever gives you good utilization

#

i would think it was a stronger point for that hypothesis if we knew for sure how many tokens mistral had seen

fallen spear Feb 22, 2024, 1:30 PM

#

granite plover noooooooo

yeah sorry. i assume the hadamard is scaled a lil' differently than the existing inits and the rest of the hparams are tuned to that scale

fallen spear Feb 22, 2024, 1:46 PM

#

fallen spear i would think it was a stronger point for that hypothesis if we knew for sure ho...

(ie, if mistral is actually trained in less than 6T this actually suggests the big ratio is bad here, if more, the reverse, assuming all else equal)

fallen spear Feb 22, 2024, 1:52 PM

#

fallen spear huge ratio? sure. small? also fine. whatever gives you good utilization

suddenly wondering how many layers gemma has actually because if it is fewer and assuming the giant ff parallelizes well for some hardware configuration it will have had a faster wall clock time, no?

fallen spear Feb 22, 2024, 2:03 PM

#

fallen spear yeah sorry. i assume the hadamard is scaled a lil' differently than the existing...

(in fact iirc the existing pythia inits are reeeeeally small)

buoyant turret Feb 22, 2024, 2:49 PM

#

fallen spear suddenly wondering how many layers gemma has actually because if it is fewer and...

Gemma-7B has 28 (vs Mistral-7B = 32)
Gemma-2B has 18 (vs TinyLlama-1B = 22)

fallen spear Feb 22, 2024, 2:50 PM

#

buoyant turret Gemma-7B has 28 (vs Mistral-7B = 32) Gemma-2B has 18 (vs TinyLlama-1B = 22)

so assuming this configuration parallelizes well on a big tpu pod they got a 4-layer-forward wall-clock speedup by doing it this way instead, i think?

#

call it 1/8th faster neglecting embeddings

#

it is "equiparams" but just seems like it would be really dependent on hardware configuration

buoyant turret Feb 22, 2024, 2:52 PM

#

lemme check the gemma code. because they also might be doing parallel MLP... nope. serial attention then MLP. it's such an odd bunch of design choices. some seemingly for speed, some seemingly counter-intuitive

fallen spear Feb 22, 2024, 2:52 PM

#

... actually, do they count the embedding params in their param count?

fallen spear Feb 22, 2024, 2:53 PM

#

buoyant turret lemme check the gemma code. because they also might be doing parallel MLP... nop...

i would guess their training code does do parallel mlp and isn't what is released but could be way off base, maybe they just threw resources at it

fallen spear Feb 22, 2024, 2:53 PM

#

fallen spear ... actually, do they count the embedding params in their param count?

because: embedding retrieval should be constant-time anyway, it doesn't really contribute to fwd pass in the same way

#

and they have like .75B of them

buoyant turret Feb 22, 2024, 2:53 PM

#

you can't train parallel MLP and then deploy serial MLP because you're missing a whole bunch of layer norms

fallen spear Feb 22, 2024, 2:54 PM

#

oh you mean something actually different, i am just thinking of model parallel

buoyant turret Feb 22, 2024, 2:54 PM

#

no i mean attention(LN(x))+mlp(LN(x))+x where the LN is shared between them both
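
For reference, the two block layouts side by side, as a minimal sketch. `attn`, `mlp`, and the norm functions are placeholder callables (plain Python functions on floats here, just so the example runs); real blocks operate on tensors.

```python
from typing import Callable

Fn = Callable[[float], float]


def serial_block(x: float, attn: Fn, mlp: Fn, ln1: Fn, ln2: Fn) -> float:
    """Serial layout: the MLP sees the residual stream after attention was added."""
    h = x + attn(ln1(x))
    return h + mlp(ln2(h))


def parallel_block(x: float, attn: Fn, mlp: Fn, ln: Fn) -> float:
    """Parallel layout: attention(LN(x)) + mlp(LN(x)) + x with one shared LN."""
    normed = ln(x)
    return x + attn(normed) + mlp(normed)


# Toy stand-ins, chosen only so the two layouts give visibly different results.
identity = lambda v: v
double = lambda v: 2.0 * v
square = lambda v: v * v

serial_out = serial_block(1.0, attn=double, mlp=square, ln1=identity, ln2=identity)
parallel_out = parallel_block(1.0, attn=double, mlp=square, ln=identity)
```

The serial layout carries two norms and feeds the MLP a different input than the parallel one does, so weights trained under one layout do not transfer directly to the other.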

fallen spear Feb 22, 2024, 2:55 PM

#

ahhh, yeah

buoyant turret Feb 22, 2024, 2:55 PM

#

PaLM did that, but Gemma does not

#

they're also using a key dim of 256 which is fucking nuts

fallen spear Feb 22, 2024, 2:55 PM

#

tbh given the error in the gemma report about the norms i am not sure i trust any of their reported architectures to be correct

#

i trust that the reported architectural choices in every report were probably done at some point in GDM adjacent to a given release

buoyant turret Feb 22, 2024, 2:56 PM

#

im looking at their actual code, so report be damned, this is what's actually happening

fallen spear Feb 22, 2024, 2:56 PM

#

but possibly not all at once and possibly not in the released model

fallen spear Feb 22, 2024, 2:56 PM

#

buoyant turret im looking at their actual code, so report be damned, this is what's actually ha...

yeah, just thinking about that wrt palm

buoyant turret Feb 22, 2024, 2:57 PM

#

i mean... there could be errors in the pytorch implementation, but since pytorch and JAX load from the same parameters file and I assume their JAX implementation is de facto... there probably aren't any errors in their pytorch implementation architecturally

fallen spear Feb 22, 2024, 2:57 PM

#

i think it is correct that code will reflect the actual model architecture

#

i am just a little iffy about using palm as a reference since we only know its architecture from their published reports

buoyant turret Feb 22, 2024, 3:00 PM

#

the fact that they (iirc) ablated the difference between serial and parallel attention/mlp at least leads me to believe that part is correct in their report

fallen spear Feb 22, 2024, 3:21 PM

#

i definitely believe they ran that experiment and am totally unsure if i should believe that anything available by api or release actually reflects it

still grail Feb 22, 2024, 4:54 PM

#

fallen spear ... actually, do they count the embedding params in their param count?

Nope lol

still grail Feb 22, 2024, 4:55 PM

#

buoyant turret they're also using a key dim of 256 which is fucking nuts

Is it grouped?

buoyant turret Feb 22, 2024, 5:05 PM

#

still grail Is it grouped?

yes for 2b with 1 group across 8 heads (what the fuck), and no for 7b.... wait... wait.... wait...

#

256 head dim... but 16 heads for the 7B model... with a hidden size of 3072?

#

what the fuck?

#

what the fuck is this model?

#

2B has 8 Q heads and 1KV head. 2048/8=256 so that lines up.

7B has 16Q heads and 16KV heads. 3072/16 != 256.

fallen spear Feb 22, 2024, 5:08 PM

#

buoyant turret 2B has 8 Q heads and 1KV head. 2048/8=256 so that lines up. 7B has 16Q heads an...

3072/12 is though

buoyant turret Feb 22, 2024, 5:09 PM

#

but they don't have 12 heads. they have 16.

fallen spear Feb 22, 2024, 5:09 PM

#

overlap, maybe?

buoyant turret Feb 22, 2024, 5:10 PM

#

nope.

fallen spear Feb 22, 2024, 5:11 PM

#

lolwut

fallen spear Feb 22, 2024, 5:11 PM

#

so it's not actually 256?

#

like, one of the assumptions here must be false

buoyant turret Feb 22, 2024, 5:12 PM

#

no it is actually 256

#

config says head dim is 256.
self.head_dim pulls from config.head_dim

#

for the 7B model QKV are all 3072*4096 matrices

fallen spear Feb 22, 2024, 5:13 PM

#

and they downproject them after doing attention?

buoyant turret Feb 22, 2024, 5:13 PM

#

yup. O is 4096*3072
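
Spelling out the shapes, as a sketch that uses only the numbers quoted in this thread (16 Q and 16 KV heads, head dim 256, hidden size 3072 for 7B; 8 Q heads, 1 KV head, head dim 256, hidden size 2048 for 2B). Biases are assumed absent, and the function name is made up for this example.

```python
def attention_shapes(d_model: int, n_q_heads: int, n_kv_heads: int, head_dim: int):
    """Weight shapes as (in_features, out_features) for the Q, K, V and O projections."""
    q_width = n_q_heads * head_dim    # total width of all query heads
    kv_width = n_kv_heads * head_dim  # total width of all key (or value) heads
    return {
        "q": (d_model, q_width),
        "k": (d_model, kv_width),
        "v": (d_model, kv_width),
        "o": (q_width, d_model),      # projects concatenated heads back to d_model
    }


gemma_7b = attention_shapes(d_model=3072, n_q_heads=16, n_kv_heads=16, head_dim=256)
gemma_2b = attention_shapes(d_model=2048, n_q_heads=8, n_kv_heads=1, head_dim=256)
```

For 7B the head width (16 * 256 = 4096) exceeds the hidden size, so Q/K/V project up and O projects back down; for 2B the query width equals the hidden size.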

fallen spear Feb 22, 2024, 5:14 PM

#

that is incredibly weird and makes me wonder what they are actually doing for attn

#

like, is this actually a standard qkv calculation or

still grail Feb 22, 2024, 5:14 PM

#

fallen spear and they downproject them after doing attention?

Woah

#

What is this magical strangeness.....

fallen spear Feb 22, 2024, 5:15 PM

#

not only do they think i am wrong about the ff up projection they put one into their qkv too

buoyant turret Feb 22, 2024, 5:15 PM

#

fallen spear like, is this actually a standard qkv calculation or

yeah. 2 implementations are available in pytorch. eager (which is normal QKV) and Flash which is equivalent to normal QKV

fallen spear Feb 22, 2024, 5:16 PM

#

buoyant turret yeah. 2 implementations are available in pytorch. eager (which is normal QKV) an...

i have to admire the fact that they released this and did not mention this modification to attention

#

i am glad at least that the qkv impl is not also an eldritch horror

#

this does sort of break intuition about what qkv is doing

buoyant turret Feb 22, 2024, 5:17 PM

#

JAX implementation looks pretty normal too. they're using the dot product attention function from FLAX, but I assume that's just normal qkv

fallen spear Feb 22, 2024, 5:18 PM

#

purely abstractly this basically gives you extra heads that don't "make sense" in a normal transformer, no?

#

and you also get to use your output projection to actually do some logic while downprojecting

#

my next question is: did they mean to do this or did someone whiff their math

#

and if someone did whiff their math, did they end up with this final architecture specifically because this configuration "just worked better" in a way that might be attributable to the attn up/down projection

buoyant turret Feb 22, 2024, 5:22 PM

#

who fucking knows

still grail Feb 22, 2024, 5:22 PM

#

fallen spear and if someone did whiff their math, did they end up with this final architectur...

Up projection but with really restricted groups too lolzies

fallen spear Feb 22, 2024, 5:23 PM

#

still grail Up projection but with really restricted groups too lolzies

yeah, the fact that standard qkv does ... that thing that it does just makes it fundamentally different, no?

still grail Feb 22, 2024, 5:23 PM

#

I'd guess but I'm not entirely sure

fallen spear Feb 22, 2024, 5:24 PM

#

gonna be fun to test when i am not on my lunch break

dawn vine Feb 22, 2024, 5:36 PM

#

I'm definitely gonna try some of this stuff w rwkv

#

I like the fewer layers trade for ffn and better wall clock

#

I also wonder if we can merge these changes with the way mamba integrates ffn and attn up/down projections into a single block for extra benefit

#

Tho that would break my moe

dawn vine Feb 22, 2024, 11:23 PM

#

WAIT... from Gemma:

We use a subset of the SentencePiece tokenizer (Kudo and Richardson, 2018) of Gemini for compatibility.
Does that mean that Gemini uses an even bigger vocabulary size???

soft bobcat Feb 22, 2024, 11:24 PM

#

imagine using a 512k+ vocab size instead of putting some of your 1000+ ML researchers to figuring out tokenization

dawn vine Feb 22, 2024, 11:24 PM

#

What kind of 'compatibility' exactly are they achieving...

#

And are some of those other tokens reserved for multi-modal maybe?

fallen spear Feb 23, 2024, 4:41 AM

#

dawn vine And are some of those other tokens reserved for multi-modal maybe?

this seems likely

fallen spear Feb 23, 2024, 4:42 AM

#

soft bobcat imagine using a 512k+ vocab size instead of putting some of your 1000+ ML resear...

oh yeah, i have that on my todo

thick briar Feb 23, 2024, 12:14 PM

#

fallen spear this does sort of break intuition about what qkv is doing

hmm how so? basically you're just increasing the dimensionality of the heads without increasing the hidden state dimension

#

which shouldn't change the intuition for how information flows in attention

thick briar Feb 23, 2024, 12:16 PM

#

dawn vine I like the fewer layers trade for ffn and better wall clock

I think that generally we have too much attention in comparison to FFN

#

Attention takes up a lot of parameters but does no real computation over the sequence, it's just an information mixer

#

If parallelism wasn't useful then I would advocate making the FFN intermediate dimension as small as possible and duplicating FFN layers to make up the difference, increasing the depth and thus computational complexity of the model

dawn vine Feb 23, 2024, 3:52 PM

#

kind of seems like better way to achieve parallelism is bigger d_model and smaller FFN tho

#

can always shrink for attention

#

or use subset of d_model

boreal moss Feb 23, 2024, 3:54 PM

#

so has anyone tried the trick sometimes used in cnns, where you use only part of the model width for the time mixing layer (attention in this case) and let the rest go straight through? this way it is possible to use a smaller ffn multiplier and keep attention dim low

#

I was talking about this to @fallen spear some time ago

#

usually in cnns you can do conv on only half of the model dim and it doesn't change performance much or at all but runs considerably faster
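
A minimal sketch of that trick, assuming activations stored as a list of per-position channel lists: only the first `mix_width` channels go through the time-mixing op and the rest pass straight through. `causal_mean` is a toy stand-in for the conv or attention layer, and both function names are made up here.

```python
def causal_mean(seq):
    """Toy depthwise time mixer: running mean over positions, per channel."""
    out = []
    totals = [0.0] * len(seq[0])
    for t, tok in enumerate(seq, start=1):
        totals = [s + v for s, v in zip(totals, tok)]
        out.append([s / t for s in totals])
    return out


def partial_mix(seq, mix_fn, mix_width):
    """Apply mix_fn to the first mix_width channels only; pass the rest through."""
    mixed = mix_fn([tok[:mix_width] for tok in seq])
    # Concatenate the mixed channels with the untouched ones at each position.
    return [m + tok[mix_width:] for m, tok in zip(mixed, seq)]


x = [[1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 5.0, 6.0]]  # 2 positions, 4 channels
y = partial_mix(x, causal_mean, mix_width=2)
```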

buoyant turret Feb 23, 2024, 4:03 PM

#

boreal moss usually in cnns you can do conv on only half of the model dim and it doesn't cha...

i feel like that's dependent on the conv type. depthwise conv on too few dimensions can lead to significantly lower performance because it does spatial mixing separately for each channel.

boreal moss Feb 23, 2024, 4:06 PM

#

buoyant turret i feel like that's dependent on the conv type. depthwise conv on too few dimensi...

actually this works best with depthwise conv

#

and that is very counterintuitive, I don't know why but it doesn't look like you need much mixing in that dimension at all compared to ffn parameter count

#

I'm talking from my own experiments

buoyant turret Feb 23, 2024, 4:08 PM

#

i wonder if the lack of channel-spatial mixing weights lets the convolutions learn features more easily?

#

kinda like how I've noticed in some cases LoRA converges faster than full-finetuning because there are literally fewer variables to optimize

boreal moss Feb 23, 2024, 4:12 PM

#

the same happens with large conv kernels, they learn very slowly, but when you add a parallel additive small conv kernel (which is completely redundant) it converges much faster, then for inference you just absorb that small kernel into the big one and compute just that
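
The absorb step works because convolution is linear in the kernel. A 1D sketch, assuming zero padding that keeps the output length equal to the input length and odd kernel sizes; the kernels and input below are arbitrary numbers for illustration.

```python
def conv1d_same(x, kernel):
    """Cross-correlate x with an odd-length kernel using zero padding."""
    assert len(kernel) % 2 == 1
    half = len(kernel) // 2
    padded = [0.0] * half + list(x) + [0.0] * half
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(x))
    ]


def absorb(big, small):
    """Add a small kernel into the center of a big one."""
    offset = (len(big) - len(small)) // 2
    merged = list(big)
    for j, k in enumerate(small):
        merged[offset + j] += k
    return merged


x = [1.0, -2.0, 0.5, 3.0, 4.0, -1.0]
big = [0.1, 0.2, 0.3, 0.4, 0.5]  # "large" kernel
small = [1.0, -1.0, 2.0]         # parallel "small" kernel used during training

# Training-time form: two branches summed.
two_branch = [a + b for a, b in zip(conv1d_same(x, big), conv1d_same(x, small))]
# Inference-time form: one conv with the merged kernel.
merged = conv1d_same(x, absorb(big, small))
```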

dawn vine Feb 23, 2024, 4:23 PM

#

boreal moss so has anyone tried the trick sometimes used in cnns, where you use only part of the...

i didnt try it but i bet it works great, thats what i meant above by 'use a subset of d_model' for attention

still grail Feb 23, 2024, 4:52 PM

#

boreal moss actually this works best with depthwise conv

Oh gosh no no no

#

Depthwise conv is horribly time and learning inefficient

#

You get half a kernel for the price of two!

#

It's a good idea in theory but really only seems to be best suited for certain cpu-only inference options that aren't 3x3 conv friendly. ❤️ :'))))

still grail Feb 23, 2024, 4:56 PM

#

boreal moss usually in cnns you can do conv on only half of the model dim and it doesn't cha...

This almost feels a bit like densenets to me. :3333 ❤️ :'))))

still grail Feb 23, 2024, 4:57 PM

#

buoyant turret kinda like how I've noticed in some cases LoRA converges faster than full-finetu...

I've noticed the smaller a good model is (generally speaking) the more tolerant it is of strange training conditions (not always though....:')))) )

buoyant turret Feb 23, 2024, 5:05 PM

#

still grail This almost feels a bit like densenets to me. :3333 ❤️ :'))))

or inception

still grail Feb 23, 2024, 5:06 PM

#

Yeah good point, I forgot about that bypass

#

Kernel launches are just way too slow though, gotta go fast gotta do it all at once! XD ;PPPP

boreal moss Feb 23, 2024, 5:15 PM

#

depthwise convs are very slow on gpus, but it is a rather happy little accident that this trick works so well

still grail Feb 23, 2024, 5:17 PM

#

boreal moss depthwise convs are very slow on gpus, but it is a rather happy little accident t...

?

boreal moss Feb 23, 2024, 5:18 PM

#

that you can just not compute half of them and it doesn't tank performance

still grail Feb 23, 2024, 5:18 PM

#

It depends upon the problem, I believe.

boreal moss Feb 23, 2024, 5:19 PM

#

I see that behavior on image and language models

still grail Feb 23, 2024, 5:24 PM

#

boreal moss I see that behavior on image and language models

Yeah I guess I could see down sampling working okay, one of the big problems is it still almost always requires a second kernel to either merge the info for passing or re-upsampling which can be, er, rather pricy in the super-efficient edge cases. 😬😬😬😬

#

It would be nice to be able to do it in a super efficient way though

#

Haven't been able to find one quite yet for that AFAIU

dawn vine Feb 23, 2024, 5:42 PM

#

still grail It would be nice to be able to do it in a super efficient way though

i think trombocyts idea is that you dont have to do any special work to merge/upsample, because the next layer ends up using both parts even tho only one half got updated?

still grail Feb 23, 2024, 5:43 PM

#

I'm a bit confused I think. Generally you have to do a concatenate or the like in order to avoid a bifurcation of kernels

dawn vine Feb 23, 2024, 5:58 PM

#

still grail I'm a bit confused I think. Generally you have to do a concatenate or the like i...

ive never tried his trick, but yeah you'd have to concat

#

im assuming torch.compile to remove multiple kernel calls tho

#

under the hood it wouldn't have to actually concat and could just work on the first half in-place, it could just copy off the original for autograd use later before it begins work

#

either way tho, we may just be in very different headspaces here... one extra CUDA kernel per layer is a very small price to pay for reduced compute/params from where im standing 🙂

thick briar Feb 24, 2024, 2:54 AM

#

dawn vine or use subset of d_model

maybe FFNs can only use a subset too, but they would be overlapping, like even-layered FFNs can use the top 2/3 of d_model, odd-layered FFNs can use the bottom 2/3

#

and maybe attention can use the 1/2 in the middle with 1/4 on both sides

#

that way we have a larger hidden state with a "message passing" area between FFNs
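
One way to read that layout, as a sketch with index ranges over d_model. The exact boundaries, the choice of which end counts as "top", and d_model = 12 are all assumptions made so the fractions come out as whole numbers.

```python
def subspace_slices(d_model: int) -> dict[str, range]:
    """Channel ranges each sublayer reads and writes (hypothetical layout)."""
    assert d_model % 12 == 0, "keeps the 1/4, 1/3 and 2/3 boundaries integral"
    return {
        "ffn_even": range(0, 2 * d_model // 3),              # "top" 2/3
        "ffn_odd": range(d_model // 3, d_model),             # "bottom" 2/3
        "attention": range(d_model // 4, 3 * d_model // 4),  # middle 1/2
    }


slices = subspace_slices(12)
# Channels both FFN types touch: the "message passing" area between them.
shared = sorted(set(slices["ffn_even"]) & set(slices["ffn_odd"]))
```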

thick briar Feb 24, 2024, 2:57 AM

#

thick briar Attention takes up a lot of parameters but does no real computation over the seq...

(Which is I think why parallel attention/FFN works so well -- I think the only reason that there's degradation at smaller model sizes is because of the useless first-layer FFN)

tardy trench Feb 24, 2024, 3:01 AM

#

thick briar Attention takes up a lot of parameters but does no real computation over the seq...

I disagree about this. If you're training from scratch, you can't get the same results with, say, 2x fewer attention heads

#

And if you try to replace the queries with trainable vectors, perf drops even further

thick briar Feb 24, 2024, 3:02 AM

#

tardy trench I disagree about this. If you're training from scratch, you can't get the same r...

Oh attention is def really important, I don't disagree there and its strength is the whole reason why subquadratic architectures are worse

#

But in terms of actually "thinking about" the input sequence, attention doesn't do anything

#

Its job is to mix info for the benefit of the FFN layers

#

I believe though that at large model sizes we have too much attention

#

Which is why GQA works so well

tardy trench Feb 24, 2024, 3:08 AM

#

GQA/MQA basically work because of non-identifiability (https://arxiv.org/abs/2007.00810) but I think you still need approximately n_query_heads*d_v = d_model for good perf. I think anything that decreases the LHS will hurt performance since the attn block residual will have rank < d_model.

arXiv.org

On Linear Identifiability of Learned Representations

Identifiability is a desirable property of a statistical model: it implies that the true model parameters may be estimated to any desired precision, given sufficient computational resources and data. We study identifiability in the context of representation learning: discovering nonlinear data representations that are optimal with respect to som...

#

At that point, if you use n_query_heads*d_v < d_model, you're basically betting that the optimizer can't make good use of the extra dimensions

#

Maybe for a given param budget, ff is a better investment, but if you're holding ff constant, using less attn params will always hurt performance for a sufficiently complex task

#

Maybe we don't disagree, idk

thick briar Feb 24, 2024, 3:15 AM

#

tardy trench GQA/MQA basically work because of non-identifiability (https://arxiv.org/abs/200...

I see your point, because GQA still allows us to retrieve d_model-length information (while decreasing d_h*n_h would limit the rank of the residual as you said), although I think that at larger model sizes the rank of the attention residual might be < d_model

#

For example llama-70b, do we really need to retrieve in full d_model information 56 times in a row?

thick briar Feb 24, 2024, 3:16 AM

#

tardy trench Maybe for a given param budget, ff is a better investment, but if you're holding...

I definitely agree with this, less attention = less performance

tardy trench Feb 24, 2024, 3:17 AM

#

thick briar I see your point, because GQA still allows us to retrieve d_model-length informa...

It's not impossible, but it seems unlikely to me, because that would imply the optimizer was not using the full capacity.

Re your second point, nothing is really "necessary", but using less than the full capacity will generally hurt performance when maxing over possible tasks

thick briar Feb 24, 2024, 3:18 AM

#

tardy trench It's not impossible, but it seems unlikely to me, because that would imply the o...

I see what you mean although isn't the success of GQA proof that optimizers aren't able to use the full capacity of traditional attention?

#

GQA working basically means that each token needs to transmit far less information to its peers than we previously thought

tardy trench Feb 24, 2024, 3:20 AM

#

Technically yes, but the capacity difference between using multihead and multiquery is actually quite small. Multiquery basically just factors all the key projections as W_k*W_rh, and then moves the W_rh term to the corresponding query projection W_qh (and analogously for the value and output projections)
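
A small numeric check of that factoring, as a sketch. It assumes a head's key projection really does factor as W_k @ W_rh; the sizes and random matrices are toy values, and the helper names are made up for this example. Under that assumption the attention scores are unchanged when W_rh is folded into the query projection.

```python
import random


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d_model, d_head, seq_len = 6, 3, 4  # toy sizes, not from any real model

x = rand_matrix(seq_len, d_model, rng)    # token representations
w_qh = rand_matrix(d_model, d_head, rng)  # per-head query projection
w_k = rand_matrix(d_model, d_head, rng)   # key projection shared by all heads
w_rh = rand_matrix(d_head, d_head, rng)   # per-head factor of the key projection

# Multi-head view: this head has its own key projection W_k @ W_rh.
scores_mha = matmul(matmul(x, w_qh), transpose(matmul(x, matmul(w_k, w_rh))))

# Multi-query view: fold W_rh into the query projection and share W_k.
w_qh_folded = matmul(w_qh, transpose(w_rh))
scores_mqa = matmul(matmul(x, w_qh_folded), transpose(matmul(x, w_k)))
```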

thick briar Feb 24, 2024, 3:20 AM

#

thick briar I definitely agree with this, less attention = less performance

I just think that for the task of language modeling specifically, we have more attention than we need for practical environments. I think this is only true for larger model sizes though, if you started taking attention layers off gpt2-small I agree you'd see large drops in perf

tardy trench Feb 24, 2024, 3:22 AM

#

thick briar I just think that for the task of language modeling specifically, we have more a...

Agree, if you're fitting to the current distribution of tasks you might be able to get some savings. But at that point, you aren't using a general method anymore and it may underperform on more challenging tasks in the future

thick briar Feb 24, 2024, 3:23 AM

#

tardy trench Technically yes, but the capacity difference between using multihead and multiqu...

That's true, each token can still retrieve up to d_model dimensions of information from its peers every attention layer, GQA doesn't limit that

thick briar Feb 24, 2024, 3:24 AM

#

tardy trench Agree, if you're fitting to the current distribution of tasks you might be able ...

True I can't disagree with that, we're basically overfitting our architecture to language modeling. I don't think that's too bad of a thing to do though considering LLMs are by far the main application of transformers 🙂

tardy trench Feb 24, 2024, 3:26 AM

#

Yes, so far!

fallen spear Feb 24, 2024, 5:47 PM

#

thick briar But in terms of actually "thinking about" the input sequence, attention doesn't ...

i am not sure this is at all true

#

selectively mixing info is thinking about the sequence

thick briar Feb 25, 2024, 1:33 AM

#

fallen spear selectively mixing info *is* thinking about the sequence

Idk, is computing a weighted linear combination of the inputs really a thought process?

#

If attention did computational work that was actually useful, then parallel attention/MLP layers would suffer a lot of degradation

fallen spear Feb 25, 2024, 2:02 AM

#

thick briar Idk, is computing a weighted linear combination of the inputs really a thought ...

all matmuls are linear combinations of the inputs

thick briar Feb 25, 2024, 2:17 AM

#

fair point

dawn vine Feb 29, 2024, 5:23 PM

#

@fallen spear finally tried the most stupid possible hash routing MoE and it did great... i literally used expert_id = token_id % 8

#

same on every single layer

still grail Feb 29, 2024, 5:40 PM

#

dawn vine @fallen spear finally tried the most stupid possible hash routing MoE an...

One expert per token do it do it

#

Come on don't chimken out nowwwwwwwww......

Dooo it doooo it doooo it doooo it

fallen spear Feb 29, 2024, 5:46 PM

#

dawn vine @fallen spear finally tried the most stupid possible hash routing MoE an...

can you salt it separately per layer and check if it improves, for curiosity?

fallen spear Feb 29, 2024, 5:46 PM

#

still grail One expert per token do it do it

ironically this doesn't work

still grail Feb 29, 2024, 5:47 PM

#

fallen spear ironically this doesn't work

?!??!

fallen spear Feb 29, 2024, 5:47 PM

#

one of the failure modes of hash routing was when they made modifications that created too many buckets

#

the bucket:expert ratio appears to be sort of delicate

still grail Feb 29, 2024, 5:47 PM

#

Surely it worked here, clearly and obviously it works in the limit

#

You must not have been brave enough

fallen spear Feb 29, 2024, 5:48 PM

#

i mean i guess the issue was there were too many more buckets than experts

still grail Feb 29, 2024, 5:48 PM

#

fallen spear the bucket:expert ratio appears to be sort of delicate

Oh no, a rational U curve that makes sense, ahhhhhhh

still grail Feb 29, 2024, 5:48 PM

#

fallen spear i mean i guess the issue was there were too many more buckets than experts

Oh

fallen spear Feb 29, 2024, 5:48 PM

#

it is possible that proliferating experts is fine

#

the way they proliferated buckets btw was to hash tok and prev_tok instead of just tok
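
The three routing variants discussed here, as a sketch. Only `token_id % 8` comes from the chat; the blake2b-based helper, the function names, and the way the salt and previous token are mixed in are assumptions chosen so the example is deterministic.

```python
import hashlib

NUM_EXPERTS = 8  # matches the `% 8` in the chat


def route_plain(token_id: int) -> int:
    """Same expert for a given token on every layer: expert_id = token_id % 8."""
    return token_id % NUM_EXPERTS


def hashed_expert(*keys: int) -> int:
    """Deterministically hash a tuple of ints to an expert index (illustrative)."""
    digest = hashlib.blake2b(":".join(map(str, keys)).encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") % NUM_EXPERTS


def route_salted(token_id: int, layer: int) -> int:
    """Per-layer salt: the token-to-expert grouping can differ from layer to layer."""
    return hashed_expert(layer, token_id)


def route_bigram(token_id: int, prev_token_id: int) -> int:
    """Hash (prev_tok, tok): the number of distinct keys grows from V to V * V."""
    return hashed_expert(prev_token_id, token_id)


tokens = [13, 8, 13, 21]
plain = [route_plain(t) for t in tokens]
salted_layer0 = [route_salted(t, layer=0) for t in tokens]
bigram = [route_bigram(t, p) for p, t in zip(tokens, tokens[1:])]
```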

still grail Feb 29, 2024, 5:49 PM

#

Here's a (maybe) fresh idea: hierarchical bucketing. Have one always on bucket. The second bucket is chosen of two. The third is chosen of four, etc.

Should bypass the information clustering per token issues a bit.

Buckets are all uniform in size.

#

This basically, if the grouping is done "correctly", should let the "active buckets" switch at appropriate levels of granularity per the active incoming info stream

fallen spear Feb 29, 2024, 5:51 PM

#

i don't understand this at all

still grail Feb 29, 2024, 5:51 PM

#

How you do token grouping can I believe be learned in a differentiable manner using the same exact freaking structure of fast feed forward networks. Should work similarly too, I thinks?

still grail Feb 29, 2024, 5:52 PM

#

fallen spear i don't understand this at all

Binary tree where output active expert group is a concatenation of the weights from the head node to the leaf node.

#

Allows for hierarchically increasing refinement specific to each input

dawn vine Feb 29, 2024, 5:53 PM

#

I love that this sparked u guys coming up with insane amazing ideas

still grail Feb 29, 2024, 5:54 PM

#

Making it learnable helps reduce "wasted" information learned across buckets by forcing it to be deferred to higher in the hierarchy, but in a differentiable manner to avoid any silliness/complexity from a given scheme -- let's let it be data-defined! 😄 :')))) ❤️

still grail Feb 29, 2024, 5:54 PM

#

dawn vine I love that this sparked u guys coming up with insane amazing ideas

Hey, moving the conversation along and trying things maybe is the secret skill

#

I'm honestly not sure where in the heck this is coming from myself lol

#

But I'm just letting it roll, ya know? XDXDXDXD 😭😭😭😭👍

dawn vine Feb 29, 2024, 5:55 PM

#

Please let it roll!!!

#

I'm in the middle of moving so I don't understand the tree idea yet but will reread later. I love fast ffns

fallen spear Feb 29, 2024, 5:58 PM

#

still grail Making it learnable helps reduce "wasted" information learned across buckets by ...

i need this broken down in a much dumber way if i am going to follow it

#

we have a token T, it hits a routing layer (which might just be the embedding layer), what occurs?

still grail Feb 29, 2024, 5:59 PM

#

still grail Binary tree where output active expert group is a concatenation of the weights f...

@fallen spear does that make sense now.

So like, the head node is always on. This is the always active group. The second depth layer is selected by [some strategy], preferably data dependent like being chosen based on the raw token or something similar to that (can also do like for example a conv of an n-markov chain of the input embeddings of previous tokens, for example).

Anyways, the second group is chosen based on that, and concatenated to the running set of weights as before.

I believe it should work out naively too w.r.t. the swiglu/gelu nonsense or whatever. Just basically do your own build-a-bear "Build Your own Ball [of weights]" kinda dealio.

still grail Feb 29, 2024, 6:00 PM

#

fallen spear i need this broken down in a much dumber way if i am going to follow it

Ok yeah gotcha

#

@fallen spear I guess if we're doing our MLP/whatever routing layers based upon the token embeddings, we can basically do something where we do the branch traversal as in FFN.

So, we always get the base node of the tree "for free", these are the always on expert weights.

Then we do our decision layer, and get our sigmoidal left/right branching for the first leaf node of the tree. During training, this can be the sigmoid-gated sum of the two values, during inference it can be a hard selection based upon the sign of the value (assuming no bias and all'of'dat).

This is the second weight group "block". This is kept in a list with the first weight group. If our tree is 8 nodes deep, for example, then each weight group block is of the size mlp_depth // 8. Additionally, this means that our tree will have 2^n-1 total weight blocks, each of size mlp_depth // 8. This can be a thing that is pretty sizeable pretty fast but also I think that's alright as it's also potentially pretty flexible space wise.

They all get added back in to the main values at the end of the linear block, so thankfully we (should is the keyword heresies) be able to independently pick-and-choose each block as needed over the course of training. And, of course, because it's FFN stuff, it gets jointly optimized which can be both a good and a bad thing, but at least it's sorta simpler engineering-wise (once it's up and running IMPE at leastsies....) so that bodes well fur scaling, I thinks.

This is exceptionally nice as well as it embeds the idea of a hierarchy of knowledge in the network, and it switches on and off as needed depending upon the input data. This feels much more similar, as best as I understand, to how the human brain functions and how the astrocytes function w.r.t. affect associative lookups, so it at least feels to me like it's some kind of step in the right direction.
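The branching rule described above (soft blend in training, hard pick at inference) can be sketched for a single tree node as follows. This is plain Python on toy vectors; `branch` and `num_blocks` are hypothetical names, and mapping a positive logit to the right child is an arbitrary assumption.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def branch(logit, left_out, right_out, training):
    """Blend or select the outputs of a node's two children.

    training=True: sigmoid-gated sum of the two child outputs.
    training=False: hard selection by the sign of the logit (no bias assumed).
    """
    if training:
        g = sigmoid(logit)
        return [(1.0 - g) * l + g * r for l, r in zip(left_out, right_out)]
    return list(right_out) if logit > 0 else list(left_out)


def num_blocks(depth: int) -> int:
    """Total weight blocks in a full binary tree with `depth` levels: 2^depth - 1."""
    return 2 ** depth - 1


left, right = [1.0, 0.0], [0.0, 1.0]
soft = branch(0.0, left, right, training=True)    # logit 0 gives an even blend
hard = branch(2.5, left, right, training=False)   # positive logit picks the right child
```

For the 8-level example, `num_blocks(8)` gives 255 blocks, of which a single token only touches 8 (one per level).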

#

I'm still waking up and have no idea what the heck mood this is that I'm in but it's nice, I haven't had consistent "idea storms" for a few years now, so this is generally a rather confusing but pleasant experience for me. :'3333 ❤️ 🥲 ❤️

fallen spear Feb 29, 2024, 6:46 PM

#

still grail @fallen spear I guess if we're doing our MLP/whatever routing layers bas...

i like the additive part

#

one of the nagging problems with moe is that experts are very similar to each other

#

(also I understand this now)

fallen spear Feb 29, 2024, 6:47 PM

#

fallen spear one of the nagging problems with moe is that experts are very similar to each ot...

ie, they contain a lot of duplicate information

still grail Feb 29, 2024, 6:50 PM

#

Yeah that's why a binary tree is good I think

#

It's not perfect of course, but like you can of course adjust ratios (of layer num_parameters and branching factors, etc, etc) to accommodate for that.

still grail Feb 29, 2024, 6:54 PM

#

fallen spear i like the additive part

Which additive part

#

The output projection, I thinks? I guess?

fallen spear Feb 29, 2024, 6:55 PM

#

still grail The output projection, I thinks? I guess?

yes

still grail Feb 29, 2024, 6:55 PM

#

What's also cool is that if you want uneven depths in your tree, or some other kind of truncation/compression, you can do so as well, at least mathematically IMPU

#

It would screw with batching a bit (well a binary tree expert selection mechanism would be a bit annoying to code period, but also might be quite useful as well. ❤️ :')))) )

dawn vine Feb 29, 2024, 7:22 PM

#

still grail @fallen spear does that make sense now. So like, the head node is _alwa...

i currently do 'always on' base FFN and then add one (or maybe zero) of the eight experts to the result

#

so this would be like a further set of levels of what i'm already succeeding with

still grail Feb 29, 2024, 7:25 PM

#

dawn vine i currently do 'always on' base FFN and then add one (or maybe zero) of the eigh...

Ah, gotcha, thank makes senses to mes, I thinks. ❤️ :'))

#

So this then is basically just stretching it along the tree traversal route

#

Since some levels of weights are probably somewhat group-specific, but aren't entirely "all or nothing" like an always on, or some extremely group-specific weight impe....

dawn vine Feb 29, 2024, 7:27 PM

#

recently this has been as a continuation from an already-trained RWKV model w/ pre-existing FFN

#

so this adds onto it, starting with all zeros and slowly learning to differentiate

#

seems to work great

still grail Feb 29, 2024, 7:28 PM

#

dawn vine so this adds onto it, starting with all zeros and slowly learning to differentia...

This in this case being....

dawn vine Feb 29, 2024, 7:29 PM

#

still grail This in this case being....

what i've been testing with the 'always on' base FFN plus additive MoE on top

still grail Feb 29, 2024, 7:29 PM

#

Now I'm a bit confused lol

#

For me, the always on bit is weights that are basically always used with no gating

dawn vine Feb 29, 2024, 7:30 PM

#

still grail For me, the always on bit is weights that are basically always used with no gati...

yes, the always on is the pre-existing RWKV FFN (we call it chanmix)

still grail Feb 29, 2024, 7:30 PM

#

Then an FFN is used to progressively select the conditional branches of weight layers

dawn vine Feb 29, 2024, 7:30 PM

#

still grail Then an FFN is used to progressively select the conditional branches of weight l...

there is only one branch im using currently, and it can be selected using gating or hash routing... hash has been just as useful

still grail Feb 29, 2024, 7:30 PM

#

dawn vine yes, the always on is the pre-existing RWKV FFN (we call it chanmix)

Oh FFN here is feed forward network

#

Not Fast Feedforward network

dawn vine Feb 29, 2024, 7:30 PM

#

yeah sorry not FFFN

still grail Feb 29, 2024, 7:31 PM

#

Yeah I think that was my bad

still grail Feb 29, 2024, 7:31 PM

#

dawn vine what i've been testing with the 'always on' base FFN plus additive MoE on top

Ah gotcha yeah

dawn vine Feb 29, 2024, 7:32 PM

#

I tried other ways of combining the base FFN and the chosen expert, but additive seems better maybe especially when starting from an existing trained model

still grail Feb 29, 2024, 7:32 PM

#

The nice thing about token specific selection I guess if I'm understanding this correctly is it makes a lot of the "predictive" (or predictive related) aspects of things easier to manage

#

And here, I'm meaning that the outputs of the two are added together

#

Oh, I guess there's the sigmoid gated additive interpolation or whatever

dawn vine Feb 29, 2024, 7:33 PM

#

yeah my recent implementation is like

```
out = x + ffn(xn) + moe(xn, tokens)   # tokens passed so i can hash em
```
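A toy version of that residual line, written in plain Python so the arithmetic is visible. Everything here is an assumption for illustration: the helper names, the ReLU activation, the omission of the normalization that produces `xn`, and the choice to realize "starting with all zeros" by zero-initializing each expert's output projection.

```python
def relu(v):
    return [max(0.0, a) for a in v]


def matvec(w, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wi * vi for wi, vi in zip(row, v)) for row in w]


def ffn(x, w_in, w_out):
    """Two-matrix feed-forward block: w_out @ relu(w_in @ x)."""
    return matvec(w_out, relu(matvec(w_in, x)))


def block(x, token_id, base, experts):
    """out = x + ffn(x) + moe(x, token), expert picked by token_id % len(experts)."""
    w_in, w_out = experts[token_id % len(experts)]
    base_out = ffn(x, *base)
    moe_out = ffn(x, w_in, w_out)
    return [a + b + c for a, b, c in zip(x, base_out, moe_out)]


d_model, d_ff, num_experts = 2, 3, 8
base = ([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],   # w_in: d_ff x d_model
        [[0.5, 0.0, 0.5], [0.0, 0.5, 0.5]])     # w_out: d_model x d_ff
zero_out = [[0.0] * d_ff for _ in range(d_model)]
experts = [(base[0], zero_out) for _ in range(num_experts)]  # zero-init output side

x = [1.0, 2.0]
out = block(x, token_id=42, base=base, experts=experts)
```

With the expert outputs at zero, the block starts out identical to the pre-trained `x + ffn(x)` path, and the experts only change the result once their weights move away from zero.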

still grail Feb 29, 2024, 7:34 PM

#

And theoretically that could be fused into one kernel without breaking the bank, I thinks (maybe a bit moar complexity thosies....)

#

Buh it seems to be hyperscaling friendly basically! 😄 :')))) 👍

dawn vine Feb 29, 2024, 7:35 PM

#

well you're missing some stuff like Expert Parallel requiring some comms in the middle so hard to do that in practice

still grail Feb 29, 2024, 7:39 PM

#

dawn vine well you're missing some stuff like Expert Parallel requiring some comms in the m...

Yeah I was thinking with token-based stuff you could avoid that some, however, that does not really get around the kv-cache issues all that much does it, really.

dawn vine Feb 29, 2024, 7:40 PM

#

still grail Yeah I was thinking with token-based stuff you could avoid that some, however, t...

we have no kv cache issues (or kv cache at all!) in RWKV, but to fit MoE into VRAM for training we gotta go expert parallel so each GPU can hold e.g. one expert

still grail Feb 29, 2024, 7:42 PM

#

dawn vine we have no kv cache issues (or kv cache at all!) in RWKV, but to fit MoE into VR...

Ack, gotcha, didn't know you were team RWKV. That makes sense to me, then. I got the distributed expert part at least.

#

Variance of expert selection has got to be a bear, lolz

#

I would like to propose selective expert leaf dropping as a variance/routing management strategy in the binary tree case

dawn vine Feb 29, 2024, 7:45 PM

#

still grail I would like to propose selective expert leaf dropping as a variance/routing man...

the binary tree doesn't work with hashing, right? just learned?

still grail Feb 29, 2024, 7:45 PM

#

One could make the block sizes get smaller as you go towards the leaves (or maybe just keeping the tree not too terribly deep), but leaves maybe could be dropped in favor of a "generic" set of weights learned as a backup in case some batch inference scheduler determines that it would be too costly to run some kernel for a horribly minuscule amount of nodes

still grail Feb 29, 2024, 7:46 PM

#

dawn vine the binary tree doesn't work with hashing, right? just learned?

It's partition based so if you can make a function that generates partitions from a hashed item then you should be good I think

dawn vine Feb 29, 2024, 7:47 PM

#

still grail It's partition based so if you can make a function that generates partitions fro...

seems pretty ez to achieve

#

might be a bit slower since i gotta do several levels instead of one, but i like the idea

still grail Feb 29, 2024, 7:47 PM

#

The tree does include some notion of structure though so I think any binning strategy should try to take that into account

#

Otherwise it's just throwing darts in higher dimensions thensies. ❤️ :'))))

dawn vine Feb 29, 2024, 7:47 PM

#

I also would like to try switching out the experts with LowRank approximations and/or Butterfly Matrices

dawn vine Feb 29, 2024, 7:48 PM

#

still grail The tree does include some notion of structure though so I think any binning str...

I think it's okay if I just use one bit of the hash for each level of a binary tree 🙂

#

the main thing seems to be to have the same 'expert' always be consulted for a given token
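A sketch of the one-bit-per-level idea, assuming a heap-style node layout (node 0 is the always-on root, children of node `i` are `2*i + 1` and `2*i + 2`). `tree_path` is a hypothetical helper, and which bit feeds which level is an arbitrary choice.

```python
from typing import List


def tree_path(token_hash: int, depth: int) -> List[int]:
    """Node indices from root to leaf, using one hash bit per level below the root."""
    path = [0]
    node = 0
    for level in range(depth - 1):
        bit = (token_hash >> level) & 1   # bit `level` decides left (0) or right (1)
        node = 2 * node + 1 + bit
        path.append(node)
    return path


path = tree_path(token_hash=0b101, depth=4)
```

Because the path depends only on the hash, the same token always consults the same chain of experts, which is the consistency property stated above.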

still grail Feb 29, 2024, 7:49 PM

#

dawn vine I think it's okay if I just use one bit of the hash for each level of a binary t...

Yeah, I'm not sure what informationally that will bring

dawn vine Feb 29, 2024, 7:49 PM

#

so this would still keep things consistent even w/ a tree based on each bit of the hash

#

it doesnt need anything informationally, it just allows us to store more info in fewer parameters

still grail Feb 29, 2024, 7:50 PM

#

😬

#

Lemme think about this

#

Because hashes are supposed to be rather spicy in how they jump about

dawn vine Feb 29, 2024, 7:51 PM

#

yah thats their goal, but in MoE apparently the only important thing is that the same expert is consulted for the same token always

still grail Feb 29, 2024, 7:51 PM

#

Ideally, (intellectually, which is def different from experiments as we all know lolz), one would want a higher node in a tree to correspond to a token group

dawn vine Feb 29, 2024, 7:51 PM

#

so as long as token X always goes the same route through the tree i think we're fine

still grail Feb 29, 2024, 7:51 PM

#

dawn vine yah thats their goal, but in MoE apparently the only important thing is that the...

Yes, this is def a step forward but then it results in the information duplication problem!

#

The tree is meant to dedupe that information by clustering it into hierarchical nodes

#

So that way, we have token routing consistency, but then also similar tokens get routed through similar branches of the tree

#

Hence, more "free space" to learn various concepts

#

The problem of course is that this is data conditional, but thankfully if you do it based on the token itself, then you can just skip the routing function entirely and split the learned embedding up to route it. I think? That should be efficient?

dawn vine Feb 29, 2024, 7:57 PM

#

I don't know if I believe in the 'similarity' idea any more... @fallen spear dissuaded me with this hash routing stuff

#

because if similarity mattered, hash routing shouldn't work as well as learned gated routing

#

but it does

still grail Feb 29, 2024, 7:59 PM

#

from torch import nn
routing_table = nn.Embedding(num_tokens, tree_depth * total_num_layers)
routed = routing_table(inputs).sigmoid().view(*inputs.shape, tree_depth, total_num_layers)
# optimization hack: repeat the branching strategy on each side of the tree when weight averaging. Should converge albeit being a bit strange

#

Here's an efficient input-token-dependent learned routing table that uses a routing symmetry approximation to make the process much simpler (and doesn't need an FFN, to boot!)
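The shape of that routing table can be illustrated without any framework. In this sketch a list of lists stands in for the embedding weights, random values stand in for learned logits, and `make_routing_table` and `route` are hypothetical names. Each token owns `tree_depth * total_num_layers` logits, read as one left/right gate per tree level per model layer.

```python
import math
import random


def make_routing_table(num_tokens, tree_depth, total_num_layers, seed=0):
    """One row of tree_depth * total_num_layers logits per token id."""
    rng = random.Random(seed)
    width = tree_depth * total_num_layers
    return [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(num_tokens)]


def route(table, token_id, tree_depth, total_num_layers):
    """Return gates[level][layer] in (0, 1): the soft branch choice per level, per layer."""
    gates = [1.0 / (1.0 + math.exp(-z)) for z in table[token_id]]
    return [gates[l * total_num_layers:(l + 1) * total_num_layers]
            for l in range(tree_depth)]


table = make_routing_table(num_tokens=16, tree_depth=3, total_num_layers=4)
gates = route(table, token_id=5, tree_depth=3, total_num_layers=4)
hard = [[g > 0.5 for g in level] for level in gates]  # inference-time hard branching
```

Since there is only one gate per level, every pair of sibling branches at that level is switched the same way, which is the symmetry approximation mentioned above.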

still grail Feb 29, 2024, 8:00 PM

#

dawn vine because if similarity mattered, hash routing shouldn't work as well as learned g...

Yeah, that is a very interesting data point to me

#

Like it feels like one that def needs an explanation

#

(not saying you have to give it, I'm meaning it seems like it's something I'm curious about sorta)

dawn vine Feb 29, 2024, 8:01 PM

#

still grail (not saying you have to give it, I'm meaning it seems like it's something I'm cu...

it seems that the reality of the situation is as follows:
a FFN has the capacity to learn a certain amount, but it can learn how to deal with lots of different kinds of inputs (proven by how we use them traditionally for every input!)
so if you add more FFNs, you just gotta make sure the inputs roughly match each time so that you're now spreading the computation across these, allowing you to train and inference as fast as before but with much more parameters available

still grail Feb 29, 2024, 8:01 PM

#

Just because duplicating knowledge feels kinda like a Bad Thing, especially if we can control things relating to it

still grail Feb 29, 2024, 8:01 PM

#

dawn vine it seems that the reality of the situation is as follows: a FFN has the capacity...

Yep

#

Having consistent routing is def good

#

Would be insane otherwise

#

But one problem is that the amount of paper parameters and the learned parameters are a bit different I think due to the network having to double, triple, quadruple, octuple learn things, etc

#

Consistency is def good because it keeps a match between those things

#

But a question is how to factorize speedily the process so that shared learned things sorta all generally live in the same place, as it were.

dawn vine Feb 29, 2024, 8:04 PM

#

oh also we could reduce the size of the FFN at each successive layer

#

because it's handling fewer tokens

still grail Feb 29, 2024, 8:04 PM

#

Yeah I was thinking about that for the leaf nodes in the binary tree example

#

I could see there being distributed nodes that basically fall back to blocksparse as the leaves themselves can get rather specialized

#

Which is interesting as it sorta ties throughput to useful model capacity, with tons of throughput, you can add more and more leaf nodes to the binary tree without as much direct negative impact iiucsies

dawn vine Feb 29, 2024, 8:06 PM

#

would be real interesting to add half size each level, so you end up linear param ct in the number of layers

#

i guess that kind of kills the benefits of GPU parallelism if you keep making it less and less parallelizable
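A quick count for the halving idea, assuming a full binary tree where level `k` has `2**k` nodes and each node is `1 / 2**k` of a full-size FFN (level 0 being the always-on root). The function names are hypothetical and sizes are in units of one full FFN.

```python
from fractions import Fraction


def level_params(level: int) -> Fraction:
    """Parameters stored at one tree level: 2**k nodes of size 1 / 2**k each."""
    return 2 ** level * Fraction(1, 2 ** level)


def total_params(depth: int) -> Fraction:
    """Parameters stored by the whole tree."""
    return sum((level_params(k) for k in range(depth)), Fraction(0))


def active_params(depth: int) -> Fraction:
    """Parameters touched by a single token: one node per level."""
    return sum((Fraction(1, 2 ** k) for k in range(depth)), Fraction(0))
```

Under these assumptions every level stores exactly one FFN's worth of weights, so storage grows linearly with tree depth, while the per-token active size stays below two FFNs.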

still grail Feb 29, 2024, 8:09 PM

#

Yeah that would be interesting

#

That feels friendlier to scaling people at least lolz

#

Because eventually nodes can handle multiple layers at once (and you can colo them!!!!!!)

#

The halving is intriguing and I wonder how well it works

dawn vine Feb 29, 2024, 8:09 PM

#

yeah its crazy

#

im interested by it

#

but i dont know how to implement it so its not super slow

still grail Feb 29, 2024, 8:10 PM

#

Does the embedding routing trick from earlier make sense

dawn vine Feb 29, 2024, 8:10 PM

#

whats the difference between tree_depth and total_num_layers - is total_num_layers the layers in the model?

still grail Feb 29, 2024, 8:11 PM

#

This would predict a routing tree for every MLP basically at the start

#

That might even be redundant

#

But I'm not sure and wanted to err on the side of caution

#

With the depth halving one could maybe just calculate all of the experts at once and then fold them in with a shallow depth tree (say, depth 3-4 or something like that)

#

To avoid linear layer weight madness

dawn vine Feb 29, 2024, 8:13 PM

#

OH i forgot that each layer of depth in the tree is independent 🙂

#

its all just additive, right?

#

so its super parallelizable in some sense

still grail Feb 29, 2024, 8:13 PM

#

Yeah

dawn vine Feb 29, 2024, 8:13 PM

#

coooool

still grail Feb 29, 2024, 8:14 PM

#

So you can sigmoid post hoc

#

To basically get the same behavior

#

But keeping the weights dense

dawn vine Feb 29, 2024, 8:14 PM

#

that's like FFFN

still grail Feb 29, 2024, 8:14 PM

#

Yep

#

Oh haha also I think you can use 2:1 structured sparsity for each leaf node by just interleaving weights together and selecting one of two (or just inverting) the binary weight masks, maybe that's silly but I think it's hilarious.

#

Maybe specialized solutions would be better

#

But your halving idea made me think about that

#

They can always be loaded interleaved as one "set of weights" in memory, then the mask can be shifted up or down by one to "let through" the proper outputs for each item lolz

dawn vine Feb 29, 2024, 8:16 PM

#

so basically (for the 1/2 size each level idea) for a tree of depth D we just compute D normal sized FFN's in parallel, right? then we can select a subset of the results based on the routing table
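One way to picture the "compute everything, then select" step is to treat each tree level as a single full-size FFN whose hidden units are split into contiguous slices, one slice per node. The sketch below builds the 0/1 mask for a chosen node; `active_slice` and `hidden_mask` are hypothetical helpers, and contiguous slicing is an assumption.

```python
def active_slice(level: int, node: int, d_ff: int) -> range:
    """Hidden-unit indices owned by `node` (0-based within its level)."""
    width = d_ff // (2 ** level)
    return range(node * width, (node + 1) * width)


def hidden_mask(level, node, d_ff):
    """0/1 mask over one full-size FFN's hidden units for the selected node.

    Multiplying the hidden activations by this mask before the output
    projection keeps only the selected node's contribution.
    """
    keep = active_slice(level, node, d_ff)
    return [1 if i in keep else 0 for i in range(d_ff)]


d_ff = 8
# One selected node per level, following a root -> right -> left path.
masks = [hidden_mask(level, node, d_ff) for level, node in [(0, 0), (1, 1), (2, 2)]]
```

This is the slow, dense way to get the behavior: all D matmuls run at full size and the routing only decides which hidden units survive.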

still grail Feb 29, 2024, 8:17 PM

#

Yes exactly

dawn vine Feb 29, 2024, 8:17 PM

#

so cool

still grail Feb 29, 2024, 8:17 PM

#

Wow I felt really giddy when you wrote that lol

#

This is really freaking cool isn't it ':DDDD

dawn vine Feb 29, 2024, 8:17 PM

#

I really love it. Feeling giddy too

#

I have no idea if it works hahahahah

#

But it's really awesome

#

after considering it for a minute, my guess is that just like FFFN, it's only really helpful at inference time

still grail Feb 29, 2024, 8:20 PM

#

I think a good blending strategy would be to apply the structure of the routing table to the tree to mix things together. So for example if the node structure is .1 .8 .2 during training, we can apply this structure to each node. Because we're not predicting the entire tree, only one index.

This has an advantage of not taking up much space, and also oddly enough it implies structure across the tree earlier on in training, which your hashing stuff shows us I think shouldn't be catastrophic for the model (and it should basically factorize out as the branch selection gets more and more popular)

still grail Feb 29, 2024, 8:20 PM

#

dawn vine I really love it. Feeling giddy too

Yes I am very much enjoying this collaborating

still grail Feb 29, 2024, 8:21 PM

#

dawn vine after considering it for a minute, my guess is that just like FFFN, it's only re...

Yeah though you could lock in branch structure progressively XD

dawn vine Feb 29, 2024, 8:21 PM

#

yeah i have a [very frozen] community project about doing that branch locking for FFFN

still grail Feb 29, 2024, 8:22 PM

#

Probably with some small penalty by having learned shared info between experts, but as you noted earlier, if the network can take the abuse of hashing, then a little bit would be not too terrible maybe

#

I have a feeling it will be a logarithmic kinda thingie

dawn vine Feb 29, 2024, 8:22 PM

#

I think the key is just to make your choices early on in training and stick to em

still grail Feb 29, 2024, 8:23 PM

#

Yeah that works I guess

#

I think as long as the bulk of the higher level conceptual layers (like higher up in the tree) shared between sub-nodes are generally informatively positioned, then that's more important than the structure of the particular specialist layers

dawn vine Feb 29, 2024, 8:24 PM

#

I'm trying to think of a 'fair' test for my existing code that would add a second layer and work somewhat in this fashion

#

i guess i can use 4 normal sized experts at depth 1, and 8 half sized experts at depth 2 (to total the same as my current 8 normal sized experts)

still grail Feb 29, 2024, 8:27 PM

#

Would it be 1/4, 1/4, 1/8, 1/8, 1/8, 1/8 depths?

#

(for the two layers)

dawn vine Feb 29, 2024, 8:29 PM

#

well currently I choose 1 of 8 1x sized FFNs and add the result to the depth 0 1x sized FFN
so I was going to change that
from 1FFN + 1 of 8 1FFN
to 1FFN + 1 of 4 1FFN + 1 of 8 0.5FFN

#

more compute, but same parameters
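The bookkeeping behind that comparison, with sizes in units of one full FFN. `budget` is a hypothetical helper that assumes exactly one choice is active at each level.

```python
def budget(terms):
    """terms: list of (num_choices, size) per level.

    Returns (stored_params, active_params_per_token).
    """
    stored = sum(n * size for n, size in terms)
    active = sum(size for _, size in terms)
    return stored, active


current = budget([(1, 1.0), (8, 1.0)])             # 1FFN + 1 of 8 1FFN
proposed = budget([(1, 1.0), (4, 1.0), (8, 0.5)])  # 1FFN + 1 of 4 1FFN + 1 of 8 0.5FFN
```

Both configurations store 9 FFNs' worth of weights, while the per-token active size goes from 2 to 2.5.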

still grail Feb 29, 2024, 8:31 PM

#

Oh okay, gotcha, sorta b-tree kinda-like business goin' on ovah hier. ❤️ :')))) 👍

dawn vine Feb 29, 2024, 8:31 PM

#

yeah only bc i need some baseline to test against

#

and it started life as an 8 way b-tree of depth 2

still grail Feb 29, 2024, 8:34 PM

#

See you invented this first, you have a Schmid window schmid schmidhuber

dawn vine Feb 29, 2024, 8:34 PM

#

haha you invented it too

#

i cant believe there are TWO schmidhuber emojis LOL

still grail Feb 29, 2024, 10:10 PM

#

dawn vine i cant believe there are TWO schmidhuber emojis LOL

I'm sorta a little excited. No need for details if there aren't any (sorry if so 😬) but I'm sorta curious about experiment plans if there's anything at the moment. Maybe I should look into trying it out under hlb-gpt at some point toosies, seems like a decent fit for it. ❤️ :'))))

dawn vine Feb 29, 2024, 10:11 PM

#

still grail I'm sorta a little excited. No need for details if there aren't any (sorry if so...

o sorry yeah I don't know when exactly I will get to do the experiment I outlined, let alone more complex ones

#

would love to hear if you do

#

and if you need MoE code I just wrote some new very short code that works within deepspeeds MoE stuff

#

but tbh you're better off testing it without all the trimmings, in which case its ez nuff to just write (without expert parallelism support etc)

still grail Feb 29, 2024, 10:14 PM

#

If there's any good raw PyTorch code I can take a gander, otherwise I can write it myself

#

It doesn't seem too complicated, especially as the routing embedding simply assumes that the switch controls every pair of branches the same way at each layer

dawn vine Feb 29, 2024, 10:15 PM

#

yeah the only tricky part is that the switching has to be per token

#

and yet you want to operate in parallel

still grail Feb 29, 2024, 10:16 PM

#

Yeah I'm probably gonna be lazy and just post hoc merge

dawn vine Feb 29, 2024, 10:16 PM

#

but you can just do all the calculations and choose which to use

still grail Feb 29, 2024, 10:17 PM

#

But I'd like to maybe take advantage of 1:2 / 2:1 sparsity whatever sparsity if it's not horribly borked, to do the leaf interleaving trick

#

Oh. Hm. Maybe there are some fun ideas heresies. My experience leads me to believe that a universal, singular kernel launch is faster, but stills.......

#

Hm

dawn vine Feb 29, 2024, 10:18 PM

#

well u can experiment with it being slow and crummy and wasteful and if you find something here is good we can then worry about an efficient implementation

#

as long as its fast enough to run some good tests

still grail Feb 29, 2024, 10:19 PM

#

Yeah exactly

#

I've beat my head into the wall too often looking for clever fast implementations when sometimes just trying a bunch of dumb, (or maybe better yet, "dumb") shit works out pretty well in da endsies. ❤️ :")))) 👍

dawn vine Mar 1, 2024, 3:34 AM

#

@still grail i was thinking some more about all this, and got to thinking about the relationship with DeepSeekMoE where they split up the FFN experts in to a zillion tiny parts which they mix and match

#

seems like our tree approach has similarities, except that we were going to choose certain wider experts as lower levels of the tree, where deepseekMoE is more like a shotgun, choosing say 50 out of 1000 same sized tiny experts

dawn vine Mar 1, 2024, 3:54 AM

#

in the limit, deepseekMoE of a d_model x d_ffn x d_model FFN becomes d_ffn different minimal d_model x 1 x d_model FFNs
so what if we hash shuffled the actual minimal ffns, such that out of 1000 we choose say 50 out of which we create a dynamically generated d_model x 50 x d_model FFN
effectively, each token gets its own custom FFN made out of 50 of 1000 minimal parts
this seems ideal but maybe very slow to compute
so maybe the idea is the tree approximates it but is less tricky for GPUs
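A sketch of the per-token selection step only, assuming 1000 minimal units and 50 picks as in the message above. `units_for_token` is a hypothetical name, and seeding a private RNG with the token id is a stand-in for whatever hash would actually be used.

```python
import random


def units_for_token(token_id: int, num_units: int = 1000, k: int = 50) -> list:
    """Deterministically pick k of num_units minimal (1-wide) FFNs for a token."""
    rng = random.Random(token_id)
    return sorted(rng.sample(range(num_units), k))


subset = units_for_token(1234)
```

The selected indices would then be used to gather 50 input rows and 50 output columns into a token-specific `d_model x 50 x d_model` FFN; that gather is the part likely to be slow on a GPU.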

fallen spear Mar 1, 2024, 5:36 AM

#

still grail Just because duplicating knowledge feels kinda like a Bad Thing, especially if w...

you could also just reduce it by applying consistency loss between experts

still grail Mar 1, 2024, 5:37 AM

#

fallen spear you could also just reduce it by applying consistency loss between experts

i'm listening

still grail Mar 1, 2024, 5:38 AM

#

dawn vine @still grail i was thinking some more about all this, and got to thinki...

yeah that is an interesting approach too. the overhead maybe then moves to like you said how to mix and match all of the various experts there

fallen spear Mar 1, 2024, 5:39 AM

#

still grail i'm listening

there are old papers that do this with weird nonstandard architectures etc

#

it works okay

dawn vine Mar 1, 2024, 5:39 AM

#

still grail yeah that is an interesting approach too. the overhead maybe then moves to like ...

well i think using a tree manages that expense via a tradeoff using its log-ness of depth vs width at each level

fallen spear Mar 1, 2024, 5:39 AM

#

but if you enforce a consistency loss between experts it forces them to share information
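The thread does not say what form the consistency loss takes. As one assumed possibility, the sketch below uses the mean squared difference between every pair of expert outputs on the same input; `consistency_loss` is a hypothetical name.

```python
def consistency_loss(expert_outputs):
    """Mean squared difference between every pair of expert outputs."""
    pairs, total = 0, 0.0
    for i in range(len(expert_outputs)):
        for j in range(i + 1, len(expert_outputs)):
            a, b = expert_outputs[i], expert_outputs[j]
            total += sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
            pairs += 1
    return total / pairs if pairs else 0.0


identical = consistency_loss([[1.0, 2.0], [1.0, 2.0]])
different = consistency_loss([[1.0, 2.0], [3.0, 2.0]])
```

In training this term would be added to the main loss with some small weight, which is a further assumption.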

#

you can also have

#

a lora

#

it can even be full rank

#

but

#

one of the matrices shared and the remainder experts

dawn vine Mar 1, 2024, 5:40 AM

#

fallen spear a lora

splitting the FFN up a la deepseekMoE is essentially decomposing it into its constituent LoRAs

fallen spear Mar 1, 2024, 5:40 AM

#

there are a few sort of clear ways to reduce duplication

still grail Mar 1, 2024, 5:41 AM

#

dawn vine well i think using a tree manages that expense via a tradeoff using its log-ness...

agreed def

fallen spear Mar 1, 2024, 5:43 AM

#

dawn vine splitting the FFN up a la deepseekMoE is essentially decomposing it into its con...

i had to go pull this but i remember this one

#

i disagree

#

lora can be full rank and the interaction between the decomposed matrices is multiplicative

#

these run "side by side" with each other

dawn vine Mar 1, 2024, 5:44 AM

#

fallen spear i disagree

how come? the math is identical isnt it if you split apart all the middle layer neurons from a big FFN into their own tiny FFN and sum the results of all of those tiny ones?

fallen spear Mar 1, 2024, 5:45 AM

#

dawn vine how come? the math is identical isnt it if you split apart all the middle layer ...

no, this concatenates the smaller matrices, lora matmuls them

dawn vine Mar 1, 2024, 5:45 AM

#

maybe we mean different things when we say LoRA

fallen spear Mar 1, 2024, 5:45 AM

#

possibly

dawn vine Mar 1, 2024, 5:45 AM

#

i really just mean a bottleneck 'FFN'

#

LoRA tho is typically adding the results of such a bottleneck to the base value of a full FFN

fallen spear Mar 1, 2024, 5:46 AM

#

yeah this is true

dawn vine Mar 1, 2024, 5:46 AM

#

sorry, I guess I'm abusing terms when I say LoRA and mean just a bottleneck

fallen spear Mar 1, 2024, 5:46 AM

#

i am proposing no bottleneck

#

and no base

dawn vine Mar 1, 2024, 5:47 AM

#

all I was saying is that a d_ffn wide FFN is made up of d_ffn 1-wide bottlenecks, summed
and that maximally, this is what DeepSeekMoE does as a decomposition
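The decomposition claim can be checked numerically on a toy example. This sketch assumes an elementwise activation (ReLU here) and no biases; the helper names are hypothetical.

```python
def relu(a):
    return max(0.0, a)


def ffn(x, w_in, w_out):
    """Full FFN: w_in is d_ffn x d_model (rows), w_out is d_model x d_ffn."""
    hidden = [relu(sum(w * xi for w, xi in zip(row, x))) for row in w_in]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w_out]


def one_wide(x, in_row, out_col):
    """A single hidden neuron as its own d_model x 1 x d_model FFN."""
    h = relu(sum(w * xi for w, xi in zip(in_row, x)))
    return [w * h for w in out_col]


def sum_of_one_wide(x, w_in, w_out):
    """Sum the outputs of d_ffn separate 1-wide FFNs."""
    total = [0.0] * len(w_out)
    for j, in_row in enumerate(w_in):
        out_col = [row[j] for row in w_out]
        part = one_wide(x, in_row, out_col)
        total = [t + p for t, p in zip(total, part)]
    return total


w_in = [[1.0, -1.0], [0.5, 0.5], [-2.0, 1.0]]
w_out = [[1.0, 2.0, 3.0], [0.0, -1.0, 1.0]]
x = [2.0, 1.0]
full = ffn(x, w_in, w_out)
split = sum_of_one_wide(x, w_in, w_out)
```

The two results agree because each hidden unit's activation depends only on its own input row, so the output projection is just a sum over per-neuron contributions.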

fallen spear Mar 1, 2024, 5:47 AM

#

you simply express each ffn as a product of two matrices

#

i mean, that is true

fallen spear Mar 1, 2024, 5:47 AM

#

fallen spear you simply express each ffn as a product of two matrices

one of these is shared over all experts

#

one is not

#

presumably shared knowledge will end up in the shared one
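A toy illustration of the shared-factor proposal, where each expert's effective weight is a product of one shared matrix and one expert-specific matrix. The multiplication order, the square shapes, and the names `shared`, `private`, and `expert_weight` are all assumptions.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


shared = [[1.0, 0.0], [1.0, 1.0]]    # one matrix shared by every expert
private = {                          # one matrix per expert
    0: [[1.0, 2.0], [3.0, 4.0]],
    1: [[0.0, 1.0], [1.0, 0.0]],
}


def expert_weight(expert_id):
    """Effective expert weight as shared @ private[expert_id]."""
    return matmul(shared, private[expert_id])


w0 = expert_weight(0)
w1 = expert_weight(1)
```

Gradients from every expert flow into `shared`, which is the mechanism by which common knowledge could settle there instead of being duplicated per expert.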

dawn vine Mar 1, 2024, 5:48 AM

#

'each ffn' as product meaning each expert?

#

sorry, are you saying that the expert's weights should be the results of two matrices?

fallen spear Mar 1, 2024, 5:50 AM

#

yes

dawn vine Mar 1, 2024, 5:50 AM

#

ahh

fallen spear Mar 1, 2024, 5:50 AM

#

and one of those two should be shared over experts

dawn vine Mar 1, 2024, 5:50 AM

#

the mixup was that I thought you were referring to how to calculate the expert, but you were referring to a kind of generative fast-weights idea of how to create the actual feedforward weights themselves rather than learn them directly

fallen spear Mar 1, 2024, 5:51 AM

#

tbh i have never understood why there isn't more weird matrix factorization going on in general

#

we have a very solid proof of concept that in one specific case the technique gives you vast gains in efficiency

still grail Mar 1, 2024, 5:52 AM

#

I think it's that it can create a lot of potentially unwanted instability over the course of training

#

Depending on how it's done

dawn vine Mar 1, 2024, 5:52 AM

#

well this is pretty out there... i mean, i havent seen any code ever that learns weights indirectly like this

#

presumably due to stability reasons

#

in other words, i love it

fallen spear Mar 1, 2024, 5:54 AM

#

still grail I think it's that it can create a lot of potentially unwanted instability over t...

i love this explanation thank you

#

not knowing why something isn't done makes me think I should not do it

dawn vine Mar 1, 2024, 5:55 AM

#

oh wait, i thought that was supposed to be a double negative... does it make you want to do it or not?

#

🤣

fallen spear Mar 1, 2024, 5:55 AM

#

it makes me want to do it, not knowing makes me assume that there's some good reason I should not that i don't know yet

#

instability is an interesting problem here

#

speaking of

still grail Mar 1, 2024, 5:56 AM

#

I'd say go for it if you want if you're wanting my personal perspective on it, you just have to keep in mind that having weights generate weights as a function of other weights will sorta make the Jenga tower a bit more rocky unless you're rather clever somehow with it. :3 ❤️ :'))))) 👍

fallen spear Mar 1, 2024, 5:56 AM

#

i have given up on using neox for ordinary experiments, it is just really not hackable enough

dawn vine Mar 1, 2024, 5:56 AM

#

fallen spear i have given up on using neox for ordinary experiments, it is just really not ha...

use my gptcore 🙂

#

its specifically designed for LLM experiments

#

everyone who has used it loves it (that's not many people tho, and im sure it can be improved)

fallen spear Mar 1, 2024, 5:57 AM

#

dawn vine use my gptcore 🙂

isn't it on an rwkv-family architecture?

dawn vine Mar 1, 2024, 5:58 AM

#

fallen spear isn't it on an rwkv-family architecture?

nope, it supports general transformer-like stuff modularly w/ high configurability

fallen spear Mar 1, 2024, 5:58 AM

#

welp

dawn vine Mar 1, 2024, 5:58 AM

#

i dont actually specifically want you to use it, it just seems convenient for the task at hand

#

since it was born out of my frustration w/ all the existing tooling

fallen spear Mar 1, 2024, 5:59 AM

#

i have it and two other things bookmarked as maybes

dawn vine Mar 1, 2024, 6:00 AM

#

it definitely has deficiencies.. for example its not designed to load existing weights particularly from other projects

#

its just designed for doing experiments w new architectural changes

#

esp on consumer hardware

fallen spear Mar 1, 2024, 6:01 AM

#

i think i am definitely in an experiment running place, i will probably fiddle around before i settle on what's comfortable

#

i have uh

dawn vine Mar 1, 2024, 6:01 AM

#

if u find something better for that i'd love to know that too!

#

bc i might use it 🙂

fallen spear Mar 1, 2024, 6:01 AM

#

i have caught an excess of things to do in general so i am probably out of commission for at least another week

#

the other two things i know of that are similar are chili and neuralink's thing and another sandbox someone has

#

lemme get off of mobile

dawn vine Mar 1, 2024, 6:02 AM

#

cool, link me to em when u get a chance id like to check them out

dawn vine Mar 1, 2024, 6:07 AM

#

fallen spear i have caught an excess of thing to do in general so i am probably out if commis...

my excess of things to do is mostly to get RWKV MoE working the best it can so I can train it

fallen spear Mar 1, 2024, 6:08 AM

#

dawn vine my excess of things to do is mostly to get RWKV MoE working the best it can so I...

yeah i have uh a completely separate day job that is currently trying very hard to snag all my attention

#

https://github.com/huggingface/nanotron this one for sure

#

i have another one somewhere

#

https://github.com/HMUNACHI/nanodl this one in jax, but that's not the one i am thinking of, which I have apparently lost

dawn vine Mar 1, 2024, 6:44 AM

#

hmm nanodl looks the most relevant, but jax and not focused on data based configurability and ease of running

still grail Mar 1, 2024, 5:37 PM

#

@dawn vine i think i figured out a good general landing zone for a dynamic weight communication motif

#

(whoops, sent it earlies deresies 😬 😬 😬 😬 😭 😭 😭 😭 )

#

i think if'n we do indeed try to lean on the input token (or something similarly-fixed like that) as the keys for choosing the experts within some paradigm -- if that holds up in some way (which i think it should, even based on the hashing business that you were doing earlier), then all methods that rely upon that for dynamic weight generation i think can do AOT lookup and communication if the function combining those values is linear

#

I know that's for dynamic weight generation which is a little different

dawn vine Mar 1, 2024, 5:39 PM

#

u can ping me any time ill have it off if im asleep or whatever 🙂

still grail Mar 1, 2024, 5:39 PM

#

but being able to do it AOT should hide some of the latency of communication a bit

#

just simply because the only dependency for generating weights is "input token", and so that means it's easier to do all of the layers sorta all at once instead of with a sync each time

#

might not be useful but just something i was chewing on w.r.t the distributed side of the dynamic weight generation side/part of the problem. :3 🙂
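
A small sketch of the dependency argument above, assuming a router that looks only at the input token id and the layer index. The `route` function is a hypothetical stand-in, not anyone's actual hash. Because nothing in it depends on activations, the dispatch plan for every layer can be built before the first layer runs:

```python
from collections import defaultdict

def route(token_id, layer_id, num_experts):
    # Hypothetical stand-in router: depends only on token id and layer index.
    return (token_id * (2 * layer_id + 3)) % num_experts

def plan_all_layers(token_ids, num_layers, num_experts):
    """Build, for every layer, a map expert -> positions, ahead of time."""
    plan = []
    for layer_id in range(num_layers):
        buckets = defaultdict(list)
        for pos, tok in enumerate(token_ids):
            buckets[route(tok, layer_id, num_experts)].append(pos)
        plan.append(dict(buckets))
    return plan

token_ids = [5, 12, 5, 7]
plan = plan_all_layers(token_ids, num_layers=2, num_experts=4)
```

The plan only says where each position will go; the activations themselves still have to exist before they can be sent.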

dawn vine Mar 1, 2024, 5:41 PM

#

still grail might not be useful but just something i was chewing on w.r.t the distributed si...

that's a great point... not sure if comms can happen earlier or not, gotta consider that a bit

#

like either way u gotta wait for the prior layer/block to complete so you have the embeddings to transfer over
but yeah at least you don't THEN have to operate on them before knowing where to send em

#

i skimmed a paper that did a kind of pipelined work to allow the comms to get pipelined as well

#

i assume they split up the prior layer's work into chunks, so that each chunk could start getting sent across the wire when it completes

#

https://arxiv.org/abs/2304.11414

arXiv.org

Pipeline MoE: A Flexible MoE Implementation with Pipeline Parallelism

The Mixture of Experts (MoE) model becomes an important choice of large language models nowadays because of its scalability with sublinear computational complexity for training and inference. However, existing MoE models suffer from two critical drawbacks, 1) tremendous inner-node and inter-node communication overhead introduced by all-to-all di...

still grail Mar 3, 2024, 7:01 PM

#

@dawn vine @fallen spear still working on this, definitely coming along the train of thought that 'switching is more expensive up front even if it gives more capacity', I've visited the idea of branching/cloning weights in the past so I want to revisit/keep visiting these things.

#

As far as the binary tree stuff goes, I'm still exploring this! Trying to find a semi-efficient proxy that's maybe's a halfway solutions heresies.... ❤️ :')))) But I am definito makin' some progress.

#

Right now I'm exploring something semi-completely different that I've visited a number of times over the past few months, some of @fallen spear's comments keep spurring me to look at it again.

#

Some promising initial progress on that, but no word on it yets. ❤️ :'))))

dawn vine Mar 3, 2024, 7:48 PM

#

Cool, I've been temporarily distracted from MoE by other stuff but eager to get back to it!!

dawn vine Mar 12, 2024, 5:17 PM

#

@still grail im back at it on MoE experiments now if you want to revisit

#

some other relevant stuff:
https://discord.com/channels/729741769192767510/1182809983520362597
https://discord.com/channels/729741769192767510/1181904795431358485

dawn vine Mar 13, 2024, 3:39 PM

#

fallen spear can you salt it separately per layer and check if it improves, for curiosity?

I finally tried this, and it was a little better

#

Unclear if that's because the new hash function was just luckier or it actually matters

#

I multiplied by a prime number for each layer: expert_idx = (token_idx * primes[layer_id]) % num_experts
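
A runnable version of that formula, as a sketch: the `PRIMES` list and the token ids are made up for illustration, and only the `(token_idx * primes[layer_id]) % num_experts` expression comes from the message above.

```python
PRIMES = [3, 5, 7, 11, 13, 17, 19, 23]  # hypothetical per-layer salts

def route(token_idx, layer_id, num_experts, primes=PRIMES):
    # Salted hash routing: the salt changes with the layer.
    return (token_idx * primes[layer_id]) % num_experts

def token_groups(tokens, assignments):
    # Which tokens end up sharing an expert, ignoring the expert's label.
    groups = {}
    for tok, expert in zip(tokens, assignments):
        groups.setdefault(expert, []).append(tok)
    return sorted(groups.values())

num_experts = 8
tokens = [0, 1, 2, 3, 9, 10, 257]
table = {layer: [route(t, layer, num_experts) for t in tokens]
         for layer in range(3)}
groups = {layer: token_groups(tokens, table[layer]) for layer in range(3)}
```

One property worth checking in a real setup: in exact integer arithmetic `(t * p) % n` depends only on `t % n`, so a salt that is coprime with `num_experts` relabels the experts per layer but keeps the same tokens grouped together, and a salt that shares a factor with `num_experts` leaves some experts unused. Whether that matters for the result reported here is not something this sketch can tell.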

fallen spear Mar 13, 2024, 3:44 PM

#

dawn vine I finally tried this, and it was a little better

lines up with original hash layers paper fwiw

dawn vine Mar 13, 2024, 3:44 PM

#

fallen spear lines up with original hash layers paper fwiw

cool, good to know!

#

i may really use this in production lol
kinda scared

#

i also have a maybe even more nutso routing mechanism (like how I actually choose where to send each token, not just what the hash said to do) thats like beyond stupid

#

but as a result its fast and wastes no 'capacity' 🙂

#

coming soon to a RWKV near you ™️

still grail Mar 13, 2024, 5:33 PM

#

dawn vine @still grail im back at it on MoE experiments now if you want to revisi...

ive been revisiting but i freaking keep getting pulled away on related tangents lol

#

like its turned into an entire freaking pre-release for hlb-gpt that has nothing to do with moes (not really at least, lolz)

#

and for some reason i have cleaned up the entire freaking codebase and annotated it more

#

this has been about one of the strangest tangents maybe i've taken in my life but well, here we are lol

#

hopefully it pans out, doing tuning experiments now that wikitext-103 is back online :3 ❤️

dawn vine Mar 13, 2024, 5:38 PM

#

lol! okay well sounds productive in some direction at least!!!

fallen spear Mar 13, 2024, 10:07 PM

#

still grail hopefully it pans out, doing tuning experiments now that wikitext-103 is back on...

i was all "oh shit, there's a different wikitext" and it was just. the same one i have

still grail Mar 13, 2024, 10:09 PM

#

fallen spear i was all "oh shit, there's a different wikitext" and it was just. the same one ...

there was... a hunt for it (the raw version at least)

granite plover Mar 14, 2024, 2:59 PM

#

fighting the urge to buy the wikitext-103-raw-v1.zip domain just to serve it

still grail Mar 14, 2024, 3:02 PM

#

granite plover fighting the urge to buy the wikitext-103-raw-v1.zip domain just to serve it

haha nice

#

i love the community redundancy so freaking much, lolz ❤️ ❤️ ❤️ ❤️ :'))))

dawn vine Apr 4, 2024, 5:29 PM

#

sorry @still grail I figure this is maybe the right place to discuss MoE-like and FFN size related things after all 😁

#

anyway I was interested to see you chose 2x ffn in HLB-GPT as the default

#

I'm about to embark on a larger 1.5B training run for my RWKV-6 MoE using hash routing

#

seems to work great, and I use a somewhat unusual configuration where I put MoE as a second FFN after the normal pretrained one, but additively (it starts at zero contribution and I use this to continue a pretrained non-MoE model)
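
A sketch of one way to get the "starts at zero contribution" behaviour, assuming it is done by zero-initialising each expert's output projection; the message does not say how it is actually implemented, and a learned gate initialised to zero would be another option. The ReLU FFN, the `route` function, and all weights are hypothetical.

```python
def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def ffn(x, w_in, w_out):
    # ReLU FFN; w_in[i] / w_out[i] are the input / output weights of hidden unit i.
    hidden = [max(0.0, dot(row, x)) for row in w_in]
    return [sum(h * row[j] for h, row in zip(hidden, w_out))
            for j in range(len(x))]

def route(token_id, num_experts):
    # Hypothetical hash router.
    return token_id % num_experts

def block(x, token_id, dense, experts):
    # Residual + pretrained dense FFN + additive expert FFN.
    dense_out = ffn(x, *dense)
    expert_out = ffn(x, *experts[route(token_id, len(experts))])
    return [xi + d + e for xi, d, e in zip(x, dense_out, expert_out)]

d_model, d_hidden, num_experts = 2, 3, 2
dense = ([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
         [[0.5, 0.0], [0.0, 0.5], [0.1, 0.1]])

def new_expert(seed):
    w_in = [[0.1 * (seed + i + j) for j in range(d_model)]
            for i in range(d_hidden)]
    w_out = [[0.0] * d_model for _ in range(d_hidden)]  # zero init: no contribution yet
    return (w_in, w_out)

experts = [new_expert(s) for s in range(num_experts)]
x = [1.0, 2.0]
pretrained_only = [xi + d for xi, d in zip(x, ffn(x, *dense))]
with_moe = block(x, token_id=7, dense=dense, experts=experts)
```

At initialisation the block reproduces the pretrained model exactly, so continued training starts from the dense model's loss rather than from a perturbed one.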

#

would be great to hear where you landed on your MoE experimentation and if you ended up with any of the tree-like stuff in there

#

and I think this is all interesting to consider relatively with regard to Mixture of Depths

still grail Apr 4, 2024, 5:35 PM

#

dawn vine anyway I was interested to see you chose 2x ffn in HLB-GPT as the default

hihihihi! i'm on a quick brain break from doing some intensive hlb-gpt dev, i can give some light answers now and maybe some more involved answers later if needed

dawn vine Apr 4, 2024, 5:36 PM

#

cool, thanks!

still grail Apr 4, 2024, 5:38 PM

#

dawn vine anyway I was interested to see you chose 2x ffn in HLB-GPT as the default

you might rock your own socks off at this, but this is actually technically a 1x parameter! 1x for local, 1x for remote. I couldn't find a configuration that performed better, twiddling the v_dim and the expand_dims around (univariate + multivariate experimentation case) didn't seem to do so well. some of that might just be linear/blocksize stuff/whatever, some of it might be something else, it felt more like a 'something else' than a raw hardware efficiency thing as runtimes seemed to be not too dissimilar between them.

The one direction that did seem to have the 'lowest loss' of performance was increasing expand dim, and I think increasing v_dim? However, the ratio between v_dim and the qk_dim should always be 8, the dim reduction parameter ties the qk_dim to the normal dim so you might have to fiddle a bit there if so

#

this network seems to really, really, really like going a bit narrower and deeper for some reason

still grail Apr 4, 2024, 5:39 PM

#

dawn vine would be great to hear where you landed on your MoE experimentation and if you e...

I did! that's how i ended up on the hlb-gpt release, oddly enough, lol, ended up experimenting with efficient layer layouts as I couldn't find a super fast MOE solution, and here we are

#

I did do some initial tree stuff, but nothing too complicated, and it didn't pass my complexity and/or speed requirement test, so I put it back on the shelf until I could think of something with a similar vibe that was super efficient

still grail Apr 4, 2024, 5:40 PM

#

dawn vine and I think this is all interesting to consider relatively with regard to Mixtur...

ive tried training on stochastic depths just as like a warmup kinda thing (had some serious bungee-cording vs the original, 'train all layers at once' strat), don't have completely thought out thoughts on that but it is very much an interesting topic for suresies.... ❤️ :'))))

dawn vine Apr 4, 2024, 5:43 PM

#

still grail you might rock your own socks off at this, but this is actually technically a 1x...

oh for FFN you just do like d_model->d_model ('expanded' 1x)->d_model? im missing something about the meaning of local vs remote in this context

and you use a truly giant V eh? thats interesting i never saw that give me much gains

#

oh maybe this V size vs QK is related to not using attention heads!

still grail Apr 4, 2024, 5:45 PM

#

dawn vine oh for FFN you just do like d_model->d_model ('expanded' 1x)->d_model? im missin...

This is a fused mlp+attention model, we do attention in the nonlinear latent space of the geglu! and some proportion of it is assigned to local-only (no attention) and some to remote. mixing and matching seems to incur a computation penalty with little gain, though there may be some methods/avenues out of that. Not entirely sure!

dawn vine Apr 4, 2024, 5:45 PM

#

ah I grok your meaning of local vs remote now

still grail Apr 4, 2024, 5:45 PM

#

dawn vine oh maybe this V size vs QK is related to not using attention heads!

Yeah I have sort of a suspicion for this, I feel like the nonlinear stuff is sort of doing the work of information containment that an attention head would do, just maybe more effectively as still the attention heads are linear? maybe?

#

I'm honestly not sure, that thought is still like 35-45% thought out or whatever

dawn vine Apr 4, 2024, 5:46 PM

#

the local vs remote part seems to me related to the stuff @boreal moss had discussed with me and then here earlier about splitting off a subpart of the embedding to do attention on

#

so that you can do 'smaller attention' on a huge embedding

still grail Apr 4, 2024, 5:46 PM

#

yes it is also exceedingly computation efficient

dawn vine Apr 4, 2024, 5:47 PM

#

still grail yes it is also exceedingly computation efficient

yeah i love that part

still grail Apr 4, 2024, 5:47 PM

#

many, many fewer kernel launches, and im guessing the training stability is much better too as you're not having to forcibly pass things through a longer residual for certain kinds of operations

#

though I don't have a good example at hand that shows that exactly

#

just sort of, if you had some kind of feature that required 3 attention operations or whatever, having to pass through 3 residuals instead of 6 bodes extremely well for not just computation time, but also training stability I think

#

that's again a bit arbitrary and not necessarily grounded in the reality of what's happening there, just trying to paint a bit of a rough picture of some of the reasoning there

dawn vine Apr 4, 2024, 5:49 PM

#

yeah i tend not to care too much about kernel launches bc I pretty much require torch.compile which does away with that as an issue if done right
but stability thru fewer layer-like things could certainly be a benefit

#

and it still is probably quite a bit faster since it can use GPU parallelism more fully

still grail Apr 4, 2024, 5:50 PM

#

well some of my experiments are just like ~40-100 seconds or so which is less than a lot of compiles so for now at least im staying flexy (am experimenting with some potential partial compiles for the longer runs tho)

still grail Apr 4, 2024, 5:50 PM

#

dawn vine and it still is probably quite a bit faster since it can use GPU parallelism mor...

yes, a fused kernel for this would be absolutely unreal

dawn vine Apr 4, 2024, 5:51 PM

#

yeah hey im very into rapid iteration - i feel like no one else but you and I like that in the ML world 🤣

still grail Apr 4, 2024, 5:51 PM

#

i have no idea why, the days where I run 4 long experiments vs 200-300 fast experiments are like so night and day different lol

dawn vine Apr 4, 2024, 5:51 PM

#

agreed 100%

still grail Apr 4, 2024, 5:51 PM

#

long as the proxy scales (and my workflow is built to try to make sure that it does)

dawn vine Apr 4, 2024, 5:52 PM

#

yeah that part im unsure about tbh

#

there are a ton of things ive found which work amazing until like 1-2GTok in

still grail Apr 4, 2024, 5:52 PM

#

dawn vine and it still is probably quite a bit faster since it can use GPU parallelism mor...

you can interleave the geglu layer downward projection + add to residual for example with the attention layer, which I think would make it even more memory efficient, however I am not entirely sure about this one

still grail Apr 4, 2024, 5:53 PM

#

dawn vine there are a ton of things ive found which work amazing until like 1-2GTok in

yeah

#

well part of this too I think was thinking about the transformer scaling issues

#

the linear attention value I think doesn't scale precisely because it is linear

#

it can't suppress stuff

#

so in order to do so the network has to absolutely spike

#

and only the large models seem to handle that

still grail Apr 4, 2024, 5:53 PM

#

still grail so in order to do so the network has to absolutely spike

(also a bear for training stability)

dawn vine Apr 4, 2024, 5:54 PM

#

are you talking about linear attention? or the value of attention in traditional softmax dot product self attention?

still grail Apr 4, 2024, 5:54 PM

#

linear values

#

at least that's my hypothesis for why the spiking occurs

#

i really should check to see if that happens in these networks as well

dawn vine Apr 4, 2024, 5:55 PM

#

we dont find a lot of problems in training stability for RWKV btw

still grail Apr 4, 2024, 5:55 PM

#

that's lovely

dawn vine Apr 4, 2024, 5:55 PM

#

theres a lot of normalization that goes on which may help with that

#

and linear_attention/linear transformer is just kind of fundamentally different too

still grail Apr 4, 2024, 5:57 PM

#

dawn vine there are a ton of things ive found which work amazing until like 1-2GTok in

that's a fair point, and something i'm sort of scared of honestly. only in the 3.2ishB token ranges for now

dawn vine Apr 4, 2024, 5:58 PM

#

still grail This is a fused mlp+attention model, we do attention in the nonlinear latent spa...

your setup is a little more like mamba in a way where they expand to get V, do SSM style 'attention', then contract again - they dont have a separate FFN

still grail Apr 4, 2024, 5:58 PM

#

dawn vine and linear_attention/linear transformer is just kind of fundamentally different ...

well also, the data-dependent gating seems to be an interesting/novel thing, i was reading through your code to try to understand it a bit better, and the double-lerp "check if trajectory is here based upon some data-dependent scaling, then scale the delta in trajectory by some lookup value if so" (if I understood that code correctly) method seemed really interesting, for example

still grail Apr 4, 2024, 5:58 PM

#

dawn vine your setup is a little more like mamba in a way where they expand to get V, do S...

gosh I need to get over the SSM hump

#

I feel like I might become a state space model fanatic if I understood the whole general idea/had a good toolkit behind it. Sorta gotta still work on wrapping my head around that.

dawn vine Apr 4, 2024, 5:59 PM

#

still grail well also, the data-dependent gating seems to be an interesting/novel thing, i w...

you can read the paper if its easier in the RWKV-papers channel or preprint should be out Monday 🤞

still grail Apr 4, 2024, 6:00 PM

#

dawn vine you can read the paper if its easier in the RWKV-papers channel or preprint shou...

i could give it a go! I basically have math symbol dyslexia or the like so if it's heavy in that then I may just be best off sticking to reading the annotated code, that made sense to me (though it is certainly a new paradigm to work through! ❤️ :')))) )

dawn vine Apr 4, 2024, 6:01 PM

#

still grail i could give it a go! I basically have math symbol dyslexia or the like so if it...

im not amazing at math symbols even without any dyslexia component and I too prefer code if it's simple and well written

#

SSM is real simple conceptually if you need a quick explainer...

#

took me a while to understand from the papers tho, theyre a real bear

still grail Apr 4, 2024, 6:02 PM

#

dawn vine SSM is real simple conceptually if you need a quick explainer...

at some point that would be good, my brain is fried from coding right now 😅 . maybe later would be really cool! :3 🙂

dawn vine Apr 4, 2024, 6:03 PM

#

still grail at some point that would be good, my brain is fried from coding right now 😅 . m...

sure, anytime 🙂

#

re RWKV-6 I tried to add some other explainer stuff to the paper too so maybe its readable enough even tho its not code exactly - and the recurrent formulation math looks just like two lines of code, which is nice

dawn vine Apr 4, 2024, 6:18 PM

#

still grail This is a fused mlp+attention model, we do attention in the nonlinear latent spa...

ok so to clarify my understanding, if written unfused you:

att = attention(W_att_a(x) * GELU(W_att_b(x)))
ffn = W_ffn_a(x) * GELU(W_ffn_b(x))
out = cat([att, ffn])

and W_att_a etc. are all (d_model, d_model)
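
A runnable reading of the pseudocode above, as a sketch only: it assumes causal softmax attention, adds hypothetical `W_q` / `W_k` projections (the pseudocode leaves q and k out), and stops at the concatenation without any down-projection or residual. None of this is taken from the hlb-gpt source.

```python
import math

def gelu(v):
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def matvec(w, x):
    return [dot(row, x) for row in w]

def geglu(x, w_a, w_b):
    # W_a(x) * GELU(W_b(x)), elementwise.
    return [a * gelu(b) for a, b in zip(matvec(w_a, x), matvec(w_b, x))]

def causal_attention(queries, keys, values):
    out = []
    for t, q in enumerate(queries):
        scores = [dot(q, keys[s]) / math.sqrt(len(q)) for s in range(t + 1)]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w / total * values[s][j] for s, w in enumerate(weights))
                    for j in range(len(values[0]))])
    return out

def unfused_block(xs, p):
    remote = [geglu(x, p["W_att_a"], p["W_att_b"]) for x in xs]  # values to attend over
    local = [geglu(x, p["W_ffn_a"], p["W_ffn_b"]) for x in xs]   # stays per token
    q = [matvec(p["W_q"], x) for x in xs]
    k = [matvec(p["W_k"], x) for x in xs]
    att = causal_attention(q, k, remote)
    return [a + f for a, f in zip(att, local)]  # list concat, i.e. cat([att, ffn])

params = {
    "W_att_a": [[1.0, 0.5], [0.0, 1.0]], "W_att_b": [[0.5, -1.0], [1.0, 1.0]],
    "W_ffn_a": [[0.3, 0.3], [-1.0, 0.2]], "W_ffn_b": [[1.0, 0.0], [0.0, 1.0]],
    "W_q": [[1.0, 0.0], [0.0, 1.0]], "W_k": [[0.5, 0.5], [-0.5, 0.5]],
}
xs = [[1.0, 0.5], [-0.5, 2.0], [0.3, -1.0]]
out = unfused_block(xs, params)
```

Each output row has width 2 * d_model: the first half has been mixed across positions, the second half has not.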

still grail Apr 4, 2024, 6:18 PM

#

dawn vine ok so to clarify my understanding, if written unfused you: ``` att = attention(W...

yep! (ignoring the q and k of course here)

dawn vine Apr 4, 2024, 6:18 PM

#

cool, did you try using different ratios for the ffn side vs att?

still grail Apr 4, 2024, 6:18 PM

#

shared norm layer and all of that

#

(which i like a lot)

still grail Apr 4, 2024, 6:19 PM

#

dawn vine cool, did you try using different ratios for the ffn side vs att?

yes! This is what I was talking about with the v_dim business up above here

dawn vine Apr 4, 2024, 6:19 PM

#

ah i see

still grail Apr 4, 2024, 6:19 PM

#

still grail yes! This is what I was talking about with the v_dim business up above here

#1169741769232089169 message

dawn vine Apr 4, 2024, 6:20 PM

#

The one direction that did seem to have the 'lowest loss' of performance was increasing expand dim, and I think increasing v_dim

still grail Apr 4, 2024, 6:20 PM

#

the way the v_dim is written is to try to encourage the user/etc to think of the whole thing as a shared space with different allocations for different places

#

yep

#

they all seemed to perform not quite as well, but that one seemed to have the least proportional hit to the number of parameters being changed

#

(i think i resized the net though to make sure the overall number remained roughly the same, not sure tho)

dawn vine Apr 4, 2024, 6:21 PM

#

yeah so this gets back to the topic of this thread, which is what the proper expansion ratio really is

still grail Apr 4, 2024, 6:21 PM

#

yep

#

it might be different for different things

#

at least for the 125M model as well, 1+1 seems to be holding strong on the more information dense wiki task

#

but maybe for like open pretraining, it will need a wider ratio

#

but it really does like going deep

#

the 125M model is like 28-32 layers deep lol

dawn vine Apr 4, 2024, 6:22 PM

#

i kind of think your way of doing this is better and is like what mamba should be doing

still grail Apr 4, 2024, 6:22 PM

#

XD

still grail Apr 4, 2024, 6:22 PM

#

dawn vine i kind of think your way of doing this is better and is like what mamba should b...

mamba is also on my long todo list of at least roughly understanding XD 😭 😭 😭 😭

dawn vine Apr 4, 2024, 6:22 PM

#

they end up doing 'attention' on the fully expanded version (2x or whatever) so its slow

still grail Apr 4, 2024, 6:22 PM

#

oh interesting

#

yeah at least for the keys i dont think its as necessary

#

which honestly makes sense, you do a lower dim check to see if you should move the activations, then the activation moving should take up the bulk of the work

dawn vine Apr 4, 2024, 6:23 PM

#

still grail Apr 4, 2024, 6:23 PM

#

if your precision on the lookups is super high then relative to the amount of work we're spending on the actual activations themselves it seems a bit wasteful

still grail Apr 4, 2024, 6:23 PM

#

dawn vine

interesting

#

yeah

dawn vine Apr 4, 2024, 6:23 PM

#

just think of SSM as attention here

#

conv is akin to the LERP we do in RWKV

still grail Apr 4, 2024, 6:24 PM

#

okay, gotcha

dawn vine Apr 4, 2024, 6:24 PM

#

its just like a kernel size 3 Conv1D

#

can ignore it
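
For anyone who wants the comparison spelled out, here is a simplified sketch: a static token-shift lerp (a fixed per-channel `mu`, not the data-dependent version) is the same thing as a 2-tap causal depthwise convolution. The function names and numbers are illustrative only.

```python
def token_shift_lerp(xs, mu):
    # Per channel: lerp between the current token and the previous one.
    out = []
    prev = [0.0] * len(xs[0])  # zero padding before the first token
    for x in xs:
        out.append([xc + (pc - xc) * m for xc, pc, m in zip(x, prev, mu)])
        prev = x
    return out

def causal_depthwise_conv(xs, kernels):
    # kernels[c] holds the taps for channel c, oldest first; left zero-padded.
    k = len(kernels[0])
    out = []
    for t in range(len(xs)):
        row = []
        for c, taps in enumerate(kernels):
            acc = 0.0
            for i, w in enumerate(taps):
                s = t - (k - 1) + i
                if s >= 0:
                    acc += w * xs[s][c]
            row.append(acc)
        out.append(row)
    return out

xs = [[1.0, 2.0], [3.0, -1.0], [0.5, 4.0]]
mu = [0.25, 0.75]
lerped = token_shift_lerp(xs, mu)
conved = causal_depthwise_conv(xs, [[m, 1.0 - m] for m in mu])
```

A wider kernel just adds more taps further back in time.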

still grail Apr 4, 2024, 6:24 PM

#

oh dang thats tiny

#

wow

#

so basically stacking lots of local stuff

#

similar to the shifting business

#

(like you were saying earlier)

dawn vine Apr 4, 2024, 6:25 PM

#

yeah same idea, Bo uses it to allow there to be induction heads in a single layer

#

but its not relevant to anything we're discussing particularly

#

just a cheap added boost

still grail Apr 4, 2024, 6:25 PM

#

gotcha

dawn vine Apr 4, 2024, 6:25 PM

#

but you can see there how they expand 2x (or whatever size) up front

still grail Apr 4, 2024, 6:26 PM

#

well it sort of looks like you're forced to keep the higher dims in this motif, as the sigmoid gate post-SSM is also in that higher dimension

still grail Apr 4, 2024, 6:26 PM

#

dawn vine but you can see there how they expand 2x (or whatever size) up front

yeah

dawn vine Apr 4, 2024, 6:27 PM

#

in exchange for this they use 2x the layercount btw

#

bc they no longer have a separate FFN

still grail Apr 4, 2024, 6:27 PM

#

2x or half?

dawn vine Apr 4, 2024, 6:27 PM

#

2x!

still grail Apr 4, 2024, 6:27 PM

#

oh

#

interesting

#

i did try a no-ffns motif

#

it honestly was not too terrible

#

not the best but really not terrible

#

having that split seems to be useful somehow

dawn vine Apr 4, 2024, 6:27 PM

#

yeah i eventually got this style to work fine but it was slower

still grail Apr 4, 2024, 6:27 PM

#

(and maybe cheaper too XD)

#

I would love to see this split latent space idea come to recurrent networks

dawn vine Apr 4, 2024, 6:28 PM

#

still grail I would love to see this split latent space idea come to recurrent networks

happy to add it to RWKV going forward 🙂 just cant figure out how to shove it into my MoE which would be the easiest path

still grail Apr 4, 2024, 6:28 PM

#

ive been thinking about the MOE+this layer version of things

#

maybe its just as simple as a silly expert gating thing

#

maybe not tho XD

dawn vine Apr 4, 2024, 6:29 PM

#

well i have VERY simple crazy code now for moe that doesnt do much work

still grail Apr 4, 2024, 6:29 PM

#

still grail maybe its just as simple as a silly expert gating thing

I guess just for the local side of things? XD though the nonlinear layers can do computation in transit too

dawn vine Apr 4, 2024, 6:29 PM

#

yeah MoE on attention is nearly impossible

still grail Apr 4, 2024, 6:30 PM

#

dawn vine yeah MoE on attention is nearly impossible

yeah i haven't played around with it. I'm assuming at least on the qk side of things, is it also the V side of things that gets weird?

dawn vine Apr 4, 2024, 6:31 PM

#

still grail yeah i haven't played around with it. I'm assuming at least on the qk side of th...

im trying to come up with a succinct reasoning 🙂

still grail Apr 4, 2024, 6:33 PM

#

dawn vine im trying to come up with a succinct reasoning 🙂

it would make sense if its required that there be a fixed path for certain things, i could see (very loosely/roughly) how the training dynamics could get really weird if not

dawn vine Apr 4, 2024, 6:33 PM

#

attention is looking back at past tokens, but if you make those tokens vary depending on the expert, thats... hard to imagine an implementation

still grail Apr 4, 2024, 6:34 PM

#

yeah i was thinking about it sort of from the ODE-like interpretation of transformers

#

if you think about it as a process slowly moving vectors from one place to another, and each layer is specialized at one part of the process, then you honestly would have to switch them all at once, or not at all

#

and, the other layers sort of depend on that together as well (the mlp layers i thinks.... ❤️ :')))) )

#

I think the traditional MoE style of doing things assumes a very particular style of 'do XYZ routing locally per layer', which i don't think can (easily, at least) capture this more global kind of context to it

#

so i guess from that, I feel like moes on attention could work if done in an 'all-at-once', unified kind of manner

#

as you are at that point mainly just making a '1-of-n' selection of possible trajectory sets over the course of the entire network, from layer 0 to the output layer

#

i dont think it'll work otherwise really, as the local moes for the attention layers will be nearly almost nonsensical w.r.t. each other

#

that being said, i dont think that version of an explanation really works all that well for explaining how the FFN moes work, or would work w.r.t. a more globally-sliced attention MOE

#

(if that makes sense at all.... ❤️ :'))))

#

I'm sure there'd be some interesting set of slicing dimensions where you could slice almost vertically across attention MoE layers (subsets dimensionally, across the entire vertical depth of the network) instead of horizontally (big giant chunks at each block?), if this motif worked?

dawn vine Apr 4, 2024, 6:40 PM

#

i think i gotta spend some time adjusting my brain to thinking about time-related MoE before I can have a reasonable thought about it 🤣

still grail Apr 4, 2024, 6:40 PM

#

It might not, but I feel like that would sorta at least be a semi-required starting point unless there is some fantastic trickery going on here

still grail Apr 4, 2024, 6:40 PM

#

dawn vine i think i gotta spend some time adjusting my brain to thinking about time-relate...

fair enough, XD these are just the scattered ramblings of a slightly-woozy, highly-caffeinated researcher XDXDXDXD

dawn vine Apr 4, 2024, 6:42 PM

#

dawn vine

getting back to this flowchart for a second you do your gating before the attention which is an interesting difference

fallen spear Apr 4, 2024, 6:43 PM

#

there's a paper where they gate an entire layer ahead of time

dawn vine Apr 4, 2024, 6:43 PM

#

we do ours afterwards, and that's kind of more standard now

fallen spear Apr 4, 2024, 6:43 PM

#

it worked fine

#

they did this for resource reasons, it gives you time to fetch the expert from cache

dawn vine Apr 4, 2024, 6:44 PM

#

fallen spear there's a paper where they gate an entire layer ahead of time

pretty interesting!

fallen spear Apr 4, 2024, 6:44 PM

#

so you make your decision, perform other operations and run your expert un-caching in parallel

#

rant_about_hash_routing.jpg

#

most things work, routing is black magic

dawn vine Apr 4, 2024, 6:48 PM

#

fallen spear it worked fine

I think I don't understand what's meant by gating the layer ahead of time here!

still grail Apr 4, 2024, 6:52 PM

#

dawn vine I think I don't understand what's meant by gating the layer ahead of time here!

(like, question from me: during training, or during eval/inference?)

fallen spear Apr 4, 2024, 6:59 PM

#

dawn vine I think I don't understand what's meant by gating the layer ahead of time here!

the routing layer is a full ff layer earlier

#

so it runs iirc

#

gate -> some_ff -> attn -> gated_ff

#

but for every ff

#

i might have that paper somewhere, not sure

#

they did this specifically because they were trying to use an moe while very memory constrained so the extra time between the gate and the gated ff was for fetching it

#

especially: allowed them to have more experts in colder RAM
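
A minimal sketch of the ordering described above (gate, then some_ff, then attn, then gated_ff), assuming a background thread stands in for the slow fetch. Every name here (`COLD_STORE`, `fetch_expert`, `gate`, `block`) is hypothetical and the toy router is not the one from the paper being recalled; the point is only that the routing decision is made early so the expert load can overlap with the other operations.

```python
from concurrent.futures import ThreadPoolExecutor
import time

# Hypothetical stand-in for experts kept in slow ("cold") memory: id -> weights.
COLD_STORE = {i: [float(i + 1)] * 4 for i in range(8)}

def fetch_expert(expert_id):
    """Simulate a slow load of one expert's weights from cold RAM or disk."""
    time.sleep(0.01)
    return COLD_STORE[expert_id]

def gate(x, num_experts):
    """Toy router (an assumption): index of the largest input, modulo num_experts."""
    return max(range(len(x)), key=lambda i: x[i]) % num_experts

def some_ff(x):
    # Placeholder for the earlier feed-forward layer.
    return [2.0 * v for v in x]

def attn(x):
    # Placeholder for attention; identity on a single token.
    return list(x)

def gated_ff(x, expert_weights):
    # Placeholder expert: elementwise scaling by the fetched weights.
    return [v * w for v, w in zip(x, expert_weights)]

def block(x, pool):
    expert_id = gate(x, num_experts=len(COLD_STORE))  # 1. route early
    pending = pool.submit(fetch_expert, expert_id)    # 2. start the fetch
    h = attn(some_ff(x))                              # 3. do other work meanwhile
    return gated_ff(h, pending.result()), expert_id   # 4. wait only if still loading

with ThreadPoolExecutor(max_workers=1) as pool:
    out, chosen = block([0.1, 0.9, 0.3, 0.2], pool)
```

The only blocking point is `pending.result()`, so the fetch cost is hidden whenever steps 1 to 3 take longer than the load.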

#

no wonder i keep forgetting golang trivia i am just filling my entire long term memory with this sort of thing

fallen spear Apr 4, 2024, 7:04 PM

#

fallen spear especially: allowed them to have more experts in colder RAM

this is an inference time concern to be clear

#

when running non batched mostly

#

in batch regime you need roughly all of your experts so this gets you nothing
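
A rough way to see the batched case, assuming routing is uniform and independent across tokens (real routers are neither, so treat this as a back-of-envelope estimate). With `E` experts and top-`k` routing, a given expert is missed by one token with probability `1 - k/E`, so a batch of `B` tokens is expected to touch a fraction `1 - (1 - k/E)**B` of the experts. The function name and the example sizes are illustrative only.

```python
def expected_fraction_touched(num_experts, top_k, batch_tokens):
    """Expected fraction of experts used by at least one token in the batch,
    under the uniform, independent routing assumption stated above."""
    miss_one_token = 1.0 - top_k / num_experts
    return 1.0 - miss_one_token ** batch_tokens

# Single-token decoding touches only k of E experts, so prefetching can help.
single = expected_fraction_touched(num_experts=64, top_k=2, batch_tokens=1)

# A modest batch already touches essentially every expert.
batched = expected_fraction_touched(num_experts=64, top_k=2, batch_tokens=512)
```

Here `single` is 2/64, while `batched` is above 0.999, which matches the point that early gating buys nothing once nearly all experts have to be resident anyway.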

dawn vine Apr 4, 2024, 7:49 PM

#

O ok I am mostly thinking about train time

#

still I guess this has bearing on when gating can occur profitably

fallen spear Apr 4, 2024, 8:03 PM

#

yeah they were running some model under fairly absurd resource constraints

#

i would be surprised if it matters a lot where you put your gates relative to the MoE layer

dawn vine Apr 4, 2024, 10:40 PM

#

still grail fair enough, XD these are just the scattered ramblings of a slightly-woozy, high...

https://arxiv.org/abs/2210.05144

arXiv.org

Mixture of Attention Heads: Selecting Attention Heads Per Token

Mixture-of-Experts (MoE) networks have been proposed as an efficient way to scale up model capacity and implement conditional computing. However, the study of MoE components mostly focused on the feedforward layer in Transformer architecture. This paper proposes the Mixture of Attention Heads (MoA), a new architecture that combines multi-head at...

#

they keep K and V the same across head experts, but allow Q and O to vary, to make it feasible
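
A minimal sketch of that arrangement for a single query token, assuming a softmax router with top-k selection and renormalised gate weights (the renormalisation, the shapes, and all function names are assumptions, not details confirmed from the paper). K and V are projected once and reused; only the query and output projections are per-expert.

```python
import math
import random

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def softmax(xs):
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.3) for _ in range(cols)] for _ in range(rows)]

def moa_token(x_t, context, w_router, w_q, w_o, w_k, w_v, top_k):
    """Output for one query token x_t attending over `context` (list of vectors)."""
    # Shared across all head experts: computed once per sequence.
    keys = [matvec(w_k, x) for x in context]
    values = [matvec(w_v, x) for x in context]
    d_head = len(keys[0])

    # Route the token, keep the top_k experts, renormalise their gate weights.
    probs = softmax(matvec(w_router, x_t))
    chosen = sorted(range(len(probs)), key=lambda e: probs[e], reverse=True)[:top_k]
    norm = sum(probs[e] for e in chosen)

    out = [0.0] * len(x_t)
    for e in chosen:
        q = matvec(w_q[e], x_t)  # expert-specific query projection
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_head) for k in keys]
        weights = softmax(scores)
        head = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(d_head)]
        proj = matvec(w_o[e], head)  # expert-specific output projection
        out = [o + (probs[e] / norm) * p for o, p in zip(out, proj)]
    return out, chosen

rng = random.Random(0)
d_model, d_head, num_experts, top_k = 8, 4, 6, 2
w_k = rand_matrix(d_head, d_model, rng)
w_v = rand_matrix(d_head, d_model, rng)
w_q = [rand_matrix(d_head, d_model, rng) for _ in range(num_experts)]
w_o = [rand_matrix(d_model, d_head, rng) for _ in range(num_experts)]
w_router = rand_matrix(num_experts, d_model, rng)
context = [[rng.gauss(0.0, 1.0) for _ in range(d_model)] for _ in range(5)]

y, chosen = moa_token(context[-1], context, w_router, w_q, w_o, w_k, w_v, top_k)
```

Because `keys` and `values` do not depend on the chosen expert, adding more head experts adds only Q and O parameters and no extra K/V computation per token.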

dawn vine Apr 4, 2024, 11:01 PM

#

conveniently, this setup works just as well for linear transformers!

soft bobcat Apr 5, 2024, 4:24 PM

#

@rose vapor wrote a paper on GLU variants: #research message

#

I personally did not find sin to work in any circumstance in LLMs

#

it's interesting it was the best one tested

dawn vine Apr 5, 2024, 4:25 PM

#

yeah saw that, was curious

#

@rose vapor did you ever try FOUR layers? (three multiplies)

soft bobcat Apr 5, 2024, 4:27 PM

#

the specific format he chose, I think would also result in gradient explosion in my tests

#

because the oscillations become very steep away from x=0

#

oh, he only tested applying one activation but not 3, rip

rose vapor Apr 5, 2024, 4:28 PM

#

I didn't have the budget to try data modalities other than vision, but I provide a code snippet to implement the SinGLU function

#

I suspect the results I got are probably very dependent on the data augments common in vision transformers such as label smoothing etc.

#

I'm not sure what the equivalents are in language, sorry.

rose vapor Apr 5, 2024, 4:30 PM

#

dawn vine @rose vapor did you ever try FOUR layers? (three multiplies)

What do you mean by FOUR layers?

dawn vine Apr 5, 2024, 4:30 PM

#

x1 * x2 * x3 * x4

rose vapor Apr 5, 2024, 4:30 PM

#

Ahh, nope

#

Just 1 - 3

#

With only one being passed through an activation as this is what SwiGLU does under the hood
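
A minimal sketch of the family being discussed, assuming each "layer" is a linear projection of the same input and only the first branch goes through the activation, as in SwiGLU. The name `multi_glu` is hypothetical, and the final down-projection that a full feed-forward block would apply is left out.

```python
import math

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def silu(z):
    return z / (1.0 + math.exp(-z))

def multi_glu(x, weights, act=silu):
    """Elementwise product of len(weights) linear branches of x.
    Only the first branch is passed through `act`."""
    branches = [matvec(w, x) for w in weights]
    out = [act(z) for z in branches[0]]
    for branch in branches[1:]:
        out = [o * b for o, b in zip(out, branch)]
    return out

identity = [[1.0, 0.0], [0.0, 1.0]]
x = [1.0, 2.0]

swiglu_like = multi_glu(x, [identity, identity])                 # act(x1) * x2
four_branch = multi_glu(x, [identity] * 4)                       # act(x1) * x2 * x3 * x4
sin_variant = multi_glu(x, [identity, identity], act=math.sin)   # sin(x1) * x2
```

With identity weights the four-branch output is `silu(x) * x**3` per element, which shows how quickly the polynomial degree in the input grows with each extra multiply.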

dawn vine Apr 5, 2024, 4:32 PM

#

yeah we tried a lot of this stuff early on in this channel on language modelling, you can probably look back and see it all

rose vapor Apr 5, 2024, 4:32 PM

#

Awesome, will do

soft bobcat Apr 5, 2024, 4:32 PM

#

his test results are also interesting. this one sometimes does well

rose vapor Apr 5, 2024, 4:33 PM

#

Yeah the results sort of couple by number of matrices. I suspect this is because of the general shape of the output

#

With 1st order GLUs having a linear quality and 2nd order being sort of parabolic

#

Which order works best depends on the activation, with 2nd orders working well for Sigmoid, but 1st orders working well for Sin and Tanh

#

With sin it's easy to understand why. The first matrix does frequency modulation and the second does amplitude.

#

I suspect the Pre-LayerNorm is crucial for SinGLU to work. I have numerical experiments showing this mostly kills oscillations in the loss landscape

soft bobcat Apr 5, 2024, 4:36 PM

#

so you managed to restrict the input of singlu to a region around x=0. that prevents the huge gradient oscillations

rose vapor Apr 5, 2024, 4:36 PM

#

Exactly!
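
A scalar sketch of why bounding the input matters, assuming SinGLU has the form `sin(w*x) * (v*x)` by analogy with SwiGLU (the exact form in the paper is not quoted here, so this is an assumption). In that form the gradient with respect to the frequency weight `w` is `v * x**2 * cos(w*x)`, so its magnitude can grow with the square of the input. A plain layer norm without learnable gain, also an assumption, keeps every component within `sqrt(d - 1)` of zero for a `d`-dimensional input.

```python
import math

def layer_norm(x, eps=1e-5):
    """Standardise a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def singlu_scalar(x, w, v):
    # Assumed scalar form: frequency branch sin(w*x), amplitude branch v*x.
    return math.sin(w * x) * (v * x)

def dsinglu_dw(x, w, v):
    # d/dw [sin(w*x) * v*x] = v * x**2 * cos(w*x)
    return v * x * x * math.cos(w * x)

raw = [0.5, -40.0, 120.0, 3.0]
normed = layer_norm(raw)

# Upper bound on |d/dw| per unit of |v|, taken over the components of each vector.
worst_raw = max(abs(x) for x in raw) ** 2
worst_normed = max(abs(x) for x in normed) ** 2
```

For this example `worst_raw` is 14400 while `worst_normed` stays below 3, which is the sense in which normalising first keeps the sine branch near x=0 and away from the steep region.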
