#d_ff/d_model + swiglu tests
it does seem like a free lunch
have you seen anyone use 'attention experts' to similarly slice up q,k,v generation?
there is a paper that i and several other people remember reading that showed that ff was better
none of us remember what it is
it was probably a while ago
thanks!
still some question of whether it might be useful in the context of DeepSeek slicing tho, plus your all-layers idea
it is probably worth revisiting. since none of us remember which paper it was it is probably kind of old
it might be in the bibliography in the moe reading group channel somewhere
maybe a better use of parameters would be how they do it in Mamba, where they unify FFN and Attn into a single block, then double layer count
then you can apply the experts to both at once
(expansion and contraction matrices constitute an 'expert' here)
if you slice it maximally your "experts" are all single vectors and a single vector is agnostic for all purposes except magnitude
i shouldn't say magnitude, i guess i mean 'dimension' -- how many scalars are in it
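for reference, a minimal sketch of the slicing picture, assuming a plain (non-gated) FF with an elementwise activation. cutting d_ff into chunks and summing the chunk outputs reproduces the dense layer, so any (expansion, contraction) pair, down to a single vector pair, is a valid "expert". sizes and names here are made up for illustration.

```python
import torch

torch.manual_seed(0)
d_model, d_ff, n_slices = 16, 64, 8  # toy sizes, chosen for illustration

w_up = torch.randn(d_model, d_ff, dtype=torch.float64)
w_down = torch.randn(d_ff, d_model, dtype=torch.float64)
x = torch.randn(4, d_model, dtype=torch.float64)

# Dense FF (biases omitted).
dense = torch.relu(x @ w_up) @ w_down

# Same FF viewed as n_slices "experts", each owning one chunk of d_ff.
up_chunks = w_up.chunk(n_slices, dim=1)
down_chunks = w_down.chunk(n_slices, dim=0)
sliced = sum(torch.relu(x @ u) @ d for u, d in zip(up_chunks, down_chunks))

# Using only some of the slices is where MoE departs from the dense layer.
first_two = sum(torch.relu(x @ u) @ d for u, d in list(zip(up_chunks, down_chunks))[:2])
```

with all slices active the two outputs match to floating point error. the routing question is only about which subset to keep.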
yeah its true fundamentally 'where its most useful' is kind of a perpendicular concern to how MoE is done
which is nice
sort of but in practice no since nobody is deranged enough to use a single expert interchangeably in multiple places yet
just, in principle if sharing "experts" around you can put them anywhere they are correctly sized
whether it is a good idea is more difficult to determine
btw, I know you're hardware challenged so maybe this doesn't matter, but in case it proves useful, it was hard-won knowledge for me: don't bother running MoE on multiple 4090s unless they're NVLinked. you have to use cards where Nvidia doesn't hobble the cross-card bus comms, or the communication will destroy your speed completely
i'm on 3090s now, nvlink is slow to show up, my primary current bottleneck is being easily distracted by devopsy stuff
well any consumer level card they make all the comms go across the CPU or some crazy thing, and it makes MoE untenable across multiple cards
yeah, unsurprising. i guess: unless you have a full copy of every expert on each card
but then it will have to be very small. etc
yeah but the whole point of multiple cards in this context is usually to use Expert Parallel
so you can fit it all in VRAM by splitting em across cards
yeah ... idk, the scale at which moe seems viable leads me to question doing them on less than an a100 x4
the itty bittiest moes are still monsters of vram
yes, that and the thing with slicing layers up into sublayers are both viable small scale
so you may be able to use 8x experts but not increase the total param count at all, and still get some significant benefit (TBD!)
instead of MoE we should call it slice-and-dice FFNs
but they barely mention it
it is a frustrating paper
because it demonstrates that things which should not work do work
i gotta look at multihash...
https://openreview.net/pdf?id=lMgDDWb1ULW
starting on pg 8, has one table, is also in appendix a
i have mentioned this paper ad nauseam in the moe reading group channel and also read it for the group, which is on eleuther yt
so will refrain from restating too much more of that
my mistake: also mentioned on page 3 towards the end
and that copy omits the appendix
We investigate the training of sparse layers that use different parameters for different inputs based on hashing in large Transformer models. Specifically, we modify the feedforward layer to hash to different sets of weights depending on the current token, over all tokens in the sequence. We show that this procedure either outperforms or is comp...
so just use M hash functions, one for each of deepseek style M slices?
yeah. or: whatever you would normally use to route
hash routing is amazing bc it works at all
hashing the index of the token is a rock stupid method that still works
it is underexplored and can fail in weird ways and you probably don't wanna fuck with it if it's not your main goal to do so
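a minimal sketch of the idea in its dumbest form, assuming a fixed random table from token id to expert that is never trained. function names are hypothetical and this is not the paper's code.

```python
import random

def build_hash_table(vocab_size: int, n_experts: int, seed: int = 0) -> list[int]:
    """Fixed random token-id -> expert-id table; it is never trained."""
    rng = random.Random(seed)
    return [rng.randrange(n_experts) for _ in range(vocab_size)]

def route(token_ids: list[int], table: list[int]) -> list[int]:
    """Look up the expert for every token in the sequence."""
    return [table[t] for t in token_ids]

table = build_hash_table(vocab_size=1000, n_experts=8)
experts = route([5, 17, 5, 999], table)
```

the only property this guarantees is that the same token always lands on the same expert. load balance depends entirely on the token frequency distribution.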
sounds a little like this stupid/great paper
https://arxiv.org/abs/2401.02994
(basically, just run a totally random model of 8 for each subsequent token so they 'collaborate')
In conversational AI research, there's a noticeable trend towards developing models with a larger number of parameters, exemplified by models like ChatGPT. While these expansive models tend to generate increasingly better chat responses, they demand significant computational resources and memory. This study explores a pertinent question: Can a c...
but their multihash is just "what if instead of routing once you routed eight times"
turns out it works great
this naturally raises the question of why not route d_model times
well thats your usual thesis, right? (assuming it can be done performantly)
yeah basically
if the routing itself is fast and doesn't degenerate, doing d_model routings is at worst the same as doing one routing
basically the hash routing paper suggests the type of conclusion i find the most enticing
which is that some large and complex component is worthless
in this case, trainable routings
can u do it performantly just using torch.gather?
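a rough sketch of one way to do it with gather-style indexing, assuming M slices that each have their own fixed hash table and their own pool of experts. all names and sizes are made up. it materializes per-token weights, so it shows the semantics and says nothing about whether this is fast.

```python
import torch

torch.manual_seed(0)
vocab, d_model, n_slices, n_experts, d_slice = 100, 16, 4, 8, 8  # toy sizes

# One fixed random hash table per slice: token id -> expert id.
tables = torch.randint(n_experts, (n_slices, vocab))

# Expert weights: for each slice, n_experts candidate (up, down) pairs.
up = torch.randn(n_slices, n_experts, d_model, d_slice) / d_model ** 0.5
down = torch.randn(n_slices, n_experts, d_slice, d_model) / d_slice ** 0.5

def multihash_ff(x: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
    """x: (tokens, d_model), token_ids: (tokens,). Returns (tokens, d_model)."""
    idx = tables[:, token_ids].T                       # (tokens, n_slices)
    slice_ids = torch.arange(n_slices).expand_as(idx)  # (tokens, n_slices)
    w_up = up[slice_ids, idx]                          # (tokens, n_slices, d_model, d_slice)
    w_down = down[slice_ids, idx]                      # (tokens, n_slices, d_slice, d_model)
    h = torch.relu(torch.einsum("td,tmds->tms", x, w_up))
    return torch.einsum("tms,tmsd->td", h, w_down)     # sums over slices

token_ids = torch.tensor([3, 42, 3])
x = torch.randn(3, d_model)
y = multihash_ff(x, token_ids)
```

a real implementation would more likely sort tokens by expert and batch the matmuls instead of gathering weights per token.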
amd radeons can communicate directly, but pytorch support is junk of course
While GPUs are responsible for training the vast majority of state-of-the-art deep learning models, the implications of their architecture are often overlooked when designing new deep learning (DL) models. As a consequence, modifying a DL model to be more amenable to the target hardware can significantly improve the runtime performance of DL tra...
i saw that! it is interesting that llama seemed to have decided that they didn't really care how wide the layer was exactly as long as it was a good multiple of the gpu ... whatever the thing is called. pool? the thing
if you use 3 input layers to the activation, the equal-parameter d_ff drops to 2x, down from 8/3x
however, I haven't yet found a layer that beats 2 input layers
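the arithmetic behind those ratios, as a small sketch. it assumes the usual convention of matching the weight count of a plain FF with one input matrix, one output matrix and a 4x ratio. biases are ignored and the function name is made up.

```python
from fractions import Fraction

def equiparam_ratio(n_input_mats: int, baseline_ratio: int = 4) -> Fraction:
    """d_ff / d_model that keeps FF weights equal to a plain two-matrix FF.

    Plain FF: 1 input + 1 output matrix -> 2 * baseline_ratio * d_model^2 weights.
    Gated FF: n_input_mats inputs + 1 output -> (n_input_mats + 1) * ratio * d_model^2.
    """
    return Fraction(2 * baseline_ratio, n_input_mats + 1)

ratios = {n: equiparam_ratio(n) for n in (1, 2, 3)}
# 1 input matrix -> 4, 2 input matrices (swiglu-style) -> 8/3, 3 input matrices -> 2
```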
I think it vaguely matters because of wave quantization; if both d_model and 8/3 d_model must be convenient for wave quantization, that could be an annoying restriction
if you link me to something explaining what wave quantization is i will owe you the soul of my firstborn
it was also mentioned in the article linked above
GPUs accelerate machine learning operations by performing calculations in parallel. Many operations, especially those representable as matrix multipliers will see good acceleration right out of the box. Even better performance can be achieved by tweaking operation parameters to efficiently use GPU resources. The performance documents present the...
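a back-of-envelope sketch of wave quantization, assuming the simplified model where the matmul output is cut into tiles, each SM handles one tile at a time, and a partly filled last wave costs as much as a full one. tile sizes and SM count below are illustrative, not measured.

```python
import math

def wave_efficiency(m: int, n: int, tile_m: int, tile_n: int, n_sms: int) -> float:
    """Fraction of SM slots doing useful work for an (m x n) GEMM output.

    Assumes one output tile per SM at a time, which is a simplification.
    """
    tiles = math.ceil(m / tile_m) * math.ceil(n / tile_n)
    waves = math.ceil(tiles / n_sms)
    return tiles / (waves * n_sms)

# Illustrative numbers: 108 SMs and 128x256 output tiles.
full = wave_efficiency(m=128 * 108, n=256, tile_m=128, tile_n=256, n_sms=108)
spill = wave_efficiency(m=128 * 109, n=256, tile_m=128, tile_n=256, n_sms=108)
```

one extra tile past a full wave roughly halves the efficiency in this toy model. that is why wanting both d_model and 8/3 d_model to land on nice multiples is annoying.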
Would be interesting to design a model specifically for some batch size given some specific gpu
i personally would probably be inclined to target getting exactly under 48gb of vram
new thought about zerO init: would it suffice to add randomness by permuting your hadamards
assuming you have more than one
permuting them differently would produce a different result
assuming you did not permute them identically
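a quick sketch of what permuting buys you, assuming the Sylvester construction and a row shuffle with a different seed per matrix. the shuffled matrix has the same rows and is still orthogonal up to scale. whether that is enough randomness for training is exactly the open question.

```python
import random

def hadamard(p: int) -> list[list[int]]:
    """Sylvester construction; p must be a power of two."""
    h = [[1]]
    while len(h) < p:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return h

def permuted_hadamard(p: int, seed: int) -> list[list[int]]:
    """Shuffle the rows of H_p with a seeded RNG (e.g. one seed per layer)."""
    rows = hadamard(p)
    random.Random(seed).shuffle(rows)
    return rows

def gram(h: list[list[int]]) -> list[list[int]]:
    """H @ H^T, which equals p * I for any matrix with orthogonal +-1 rows."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in h] for r1 in h]

h_a = permuted_hadamard(8, seed=1)
h_b = permuted_hadamard(8, seed=2)
```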
@soft bobcat @still grail can yall document any fresh runs on weird activations here or ideas therefore so i don't have to search ot for it tomorrow
@sage jetty if you want to noise up the channel I would not be mad either
sure, but most likely that's the final run
i will hopefully be running these on pythia 14m tomorrow
```
x_mult = x1 * x2 * x3
x = torch.tanh(x1) * torch.sign(x_mult) * torch.pow(torch.abs(x_mult) + 1e-8, 2/3)
```
was fern's idea
```
x = torch.tanh(x1) * x2 * x3
```
was something I've had lying around, but it's not my best
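a self-contained sketch of how these two could be wired into an FF with three input projections, assuming x1, x2 and x3 are separate linear projections of the same input. function names, sizes and the bias-free layers are made up for illustration.

```python
import torch

def signed_power_triple(x1, x2, x3, eps: float = 1e-8):
    """tanh gate times a signed 2/3 power of the three-way product."""
    x_mult = x1 * x2 * x3
    return torch.tanh(x1) * torch.sign(x_mult) * torch.pow(torch.abs(x_mult) + eps, 2 / 3)

def tanh_triple(x1, x2, x3):
    """tanh gate times the other two branches."""
    return torch.tanh(x1) * x2 * x3

torch.manual_seed(0)
d_model, d_ff = 8, 16  # toy sizes
x = torch.randn(4, d_model)

# Three input projections feed one activation, then one output projection.
w1, w2, w3 = (torch.nn.Linear(d_model, d_ff, bias=False) for _ in range(3))
w_out = torch.nn.Linear(d_ff, d_model, bias=False)
y = w_out(signed_power_triple(w1(x), w2(x), w3(x)))
```

swapping `signed_power_triple` for `tanh_triple` gives the second variant with the same shapes.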
yeah i think this is reasonable but hard to predict ahead of time when it would do well, it sorta makes sense to me at leasties thosies.... ❤️ :'))))
the bad news is that I don't have a diversity of great activation functions lying around
some of them are just better than others
apparently clipping the up projection on pythia 70m only takes off 10m parameters
this seems incorrect but i'm going with it
Check the size of your output layer in params.
it's rather large
Ye
10m total for ff still feels absurdly small
oh, fifty million of the params are vocabulary
okay sorry i realized it would be high but not that high
i've cut my non-embedding params in half, that feels correct actually
51m params embedding layer on a 70m model and i just shaved off half of my non-embedding parameters, what do we bet on relative performance
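rough weight count, to check these numbers. two assumptions that are not from this chat: the model shape is the public Pythia-70m config, and "clipping the up projection" means shrinking d_ff from 4x d_model to 1x d_model. biases and layer norms are ignored.

```python
# Assumed Pythia-70m shape (taken from the public config, not from this chat).
vocab, d_model, n_layers, d_ff = 50304, 512, 6, 2048

embed = 2 * vocab * d_model                # untied input + output embeddings
attn = n_layers * 4 * d_model * d_model    # q, k, v, o
ff = n_layers * 2 * d_model * d_ff         # up + down

# Assumed reading of "clipping the up projection": d_ff shrinks to d_model.
ff_clipped = n_layers * 2 * d_model * d_model
saved = ff - ff_clipped

print(f"embedding      {embed / 1e6:6.1f}M")
print(f"attention      {attn / 1e6:6.1f}M")
print(f"ff             {ff / 1e6:6.1f}M")
print(f"saved          {saved / 1e6:6.1f}M")
print(f"non-embed kept {(attn + ff_clipped) / (attn + ff):.0%}")
```

under those assumptions the embeddings are about 51.5M, the cut saves about 9.4M, and exactly half of the non-embedding weights remain. that lines up with the figures quoted above.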
pythia 70m on owt2, 13b tokens as baseline:
https://wandb.ai/segyges/neox/runs/h1gsuoo9
how much worse is it with no ff, which shaves off 10m params
for the sake of argument i'll say identical but it won't be
it does not seem to give a shit
so far
will see once the entire thing is done
zoomed in last hundred steps
yeah so my bet is they converge before the run ends
like, completely
my second bet is that they don't converge before the run ends, but do if i zero initialize both for reruns
my third bet is that if someone lets me pursue this harebrained ablation at a non-toy-model size where ff is actually the dominant part of model size the closeness of fit remains regardless of whether either preceding bet pans out
at which point maybe you are better off training for ~4x as long
they probably won't converge. that thing is a loss spike for green so it got a little worse at that point
i love it when my bets aren't entirely theoretical
my reasoning is that the gap is larger earlier
for scale of how small the gap actually is
the reason i will be wrong is if the reason this is so minuscule is that 5/7ths of the model's brain is in its embedding layer
and the ff actually matters at non-toy scales
i expected it to be less nice and to have to fiddle the initialization
at this scale
so that's pleasant
What are you guys considering to be convergence?
lr has a minimum scale
Sorry was missing context here, nvm
yeah that's a ppl loss
my current theory is that the up projection in a transformer ff doesn't do anything but slightly help signal propagation
i was thinking about testing more weird activation functions and then i realized that this was still more interesting
and it calls for me to fiddle the initialization anyway which, if activation functions are mostly about signal propagation, is still a good test to have
kevin wins, they're diverging
will finish the run for completionism's sake
amusingly the divergences show up right after loss spikes
it's not divergent like not training, it's just not getting closer
https://api.wandb.ai/links/segyges/7h1367t2 figured out how to share this correctly now
divergences basically appear after loss spikes and then don't come back so i don't feel, uh, disinclined from the theory that this is actually because the up projection helps signal propagation
the smaller model isn't worse generically it's just modestly less stable
also, wandb draws the graph badly, on a different view it looks like there's a NaN/inf value at that loss spike at that point but i am pretty sure that's just an error
to restate the hypothesis and prove that i have thought about mup too hard: any time you change the activation function or the width, you are implicitly changing the ratio of the model gradient relative to the model weights if you aren't using an initialization that prevents that. this can help or inhibit signal propagation. if this change is the important one with regard to, e.g., changing activation functions then the benefit will tend to trail off if using an initialization that doesn't change this ratio with scale and otherwise making sure you propagate gradients well
the dumbest test of this hypothesis is not to change activation functions, which is complex, it is to ablate the dimension of the model under both the existing initialization and conditions which maybe make it more stable
If you're using Adam, I don't think the scale of the gradients relative to the parameters actually matters, because the Adam updates are essentially invariant to the scale of the gradient, assuming the optimizer state has time to adapt. However, MuP admittedly does do other helpful things, like help you set the Adam lr, prevent softmax saturation, etc.
well i'm gonna zero initialize it and change the softmax scale
This makes me wonder why we don't just use Lion for everything. If Adam is invariant to gradient scale in the limit, why not just ignore the scale of the gradient and use the sign instead? It's cheaper, faster, and doesn't require any tricks to prevent loss spikes (since there won't be any).
It ignores the scale of the gradients? Does this mean that if I'm using automatic mixed precision I won't need to scale them to avoid vanishing gradients as long as I'm using Adam?
i am positive my logging at least thinks i am still scaling them with mixed precision and adam on neox
i suspect optimizer behavior is complex enough that reasoning directly from their behavior at some limit doesn't hold in practice
ie adam is scale invariant assuming (some stuff regarding the second moment estimation) which means it's not scale invariant
it's normalizing the scale somewhat
Right, gotcha, so we still should be scaling loss to avoid vanishing gradient problem then...
an actually scale invariant adam update is a spherical cow with zero mass that experiences no friction
easy to reason about, doesn't exist
That takes me back to high-school physics that does 😂
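a toy version of the argument, as a sketch for a single scalar parameter on the very first Adam step. the update is lr * m_hat / (sqrt(v_hat) + eps), so rescaling the gradient cancels out until the gradient gets down near eps. the Lion-style function is just the sign update with momentum left out. both are simplifications, not the real optimizers.

```python
import math

def adam_first_step(grad: float, lr: float = 1e-3, b1: float = 0.9,
                    b2: float = 0.999, eps: float = 1e-8) -> float:
    """Size of the first Adam update for a single scalar parameter."""
    m = (1 - b1) * grad
    v = (1 - b2) * grad * grad
    m_hat = m / (1 - b1)          # bias correction at step 1
    v_hat = v / (1 - b2)
    return lr * m_hat / (math.sqrt(v_hat) + eps)

def sign_step(grad: float, lr: float = 1e-4) -> float:
    """Lion-style sign update (momentum omitted for brevity)."""
    return lr * math.copysign(1.0, grad)

normal = adam_first_step(1.0)
scaled = adam_first_step(1000.0)   # same direction, 1000x the scale
tiny = adam_first_step(1e-9)       # gradient smaller than eps
```

`normal` and `scaled` are nearly identical, while `tiny` comes out an order of magnitude smaller because eps dominates the denominator. that is one concrete way the invariance breaks when gradients underflow toward zero. a sudden change of scale partway through training is another, since the moving averages lag behind.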
@buoyant turret pinged here because i have spammed other channels about this enough
i don't think the hopfield network thing is unreasonable
i do think it's unreasonable to assume that the original transformer paper got the good ratio correct at 4x in 2017 and it remains true at all scales of d_model
i am sorry/happy to have successfully conveyed my obsession at least enough that you're considering it
ideal ratio? no. But when comparing capacity the question becomes: does SwiGLU result in better intermediate 'keys' to do retrieval from the 'hopfield network', which allows us to scale back to 8/3 compared to 4? I'm less concerned about the ideal ratio, and more concerned about what ratio has equivalent capacity
yeah the 8/3 is just equiparams to 4x for a simpler activation there is no reason to think it's ideal
i don't think there's even a very strong theoretical analysis of swiglu/geglu
everything i have found is just "empirically this seems to work"
including noam shazeer saying he has no idea why they work and attributing their success to divine benevolence
so if you did "assuming ff up projection works as a hopfield memory, switching from 4x swish to 8/3rds swiglu will have XYZ consequences for it" that would i think be novel
I wonder if we could try replicating the ROME paper at varying FFN ratios with and without SwiGLU and just see wtf is up
only problem is we need to pretrain several models with varying FFN ratios with and without SwiGLU first before we can even try ROME
oh boy a whole bunch of pretraining runs, that sounds easy and not complex at all
also, now I have to read ROME
actually it looks complex but not horrifically computationally expensive
its not that computationally expensive, the main issue is getting the pretrained models in the first place and then identifying the 'best' layer to perform knowledge editing as that could very well be different for every model, which then adds a-whole-nother layer of ablations
https://arxiv.org/abs/2202.05262v5 paper for reference
We analyze the storage and recall of factual associations in autoregressive transformer language models, finding evidence that these associations correspond to localized, directly-editable computations. We first develop a causal intervention for identifying neuron activations that are decisive in a model's factual predictions. This reveals a dis...
yeee, that's the paper I was talking about
oh fwiw: my guess that 4x is bad is largely predicated on the notion that modern models are so large that they effectively cannot be bottlenecked on available ff storage
it maybe made sense at 2017 scales and also with 2017 attention it meant that you avoided the ff pass being "cold" for gpu utilization during the forward pass
like, you might as well put parameters there because otherwise probably compute that is provisioned to hold the attention calculation is idle
neither of those concerns currently seems reasonable
since i'm putting zerO initialization in neox i need a name that isn't terrible because "zero" is not a reasonable name for a config setting that isn't setting something to zeroes, i am renaming it to identity-hadamard or iu-hd (identity up, hadamard down) barring objection
I'm testing 8/3x now and I arguably have even better compute saturation because I'm able to slightly increase the micro-batch size.
identity-hadamard is probably fine
The success of swiglu seems to suggest to me that the real bottleneck for a transformer is not necessarily the number of patterns (d_ff) it's able to recognize, but its ability to recognize them.
For a transformer to reason about an entire sentence or paragraph and predict the next token it basically needs to squeeze all the information about the paragraph into its hidden state of size d_model. Now, to make the model more expressive, we can do one of two things. One, we can increase d_model, giving the model a larger hidden state. Two, we can make the FF (pattern recognition) layer more complex. This allows us to encode more information within d_model than we could otherwise, because we have a more powerful method for processing the hidden state.
Ultimately I think the limiting factor for transformers is their hidden state, we either need to make it larger, or we need to give transformers ways to make more of limited space.
Thank you, good, this is sensible I thinks.... ❤️ :')))) (or even if there's an informal non hadamard name, that might be less confusing, as hadamard multiplies made a confusing name collision for me for a while)
and trained for 6 trillion
apparently they disagree with your assessment violently
i am definitely abandoning my theory that they do the 6x up projection for some specific reason
i mean... that's a lot of budget to devote to 6 trillion tokens to end up even with models with a smaller up projection
like they had to sacrifice a ton of embedding space for that 16x
their embedding space is freakishly gigantic though
what? 3072 is small.. RWKV7B is 4096!
sorry, i thought you meant the dictionary
ah yes
maybe that makes up for the small d_model somehow
they tie the in and out projections to avoid the model being too horrifyingly dominated by their vocabulary
but which seems like uh
a really weird choice
i love weight tying personally 🙂
it's glorious but doesn't scale tho
their in and out matrices should be 750m apiece and they save 750m by tying them
but also
jesus christ
bruv you have put 750m parameters into your tokenizer, been forced to tie it to keep the budget under control, and then you only trained it on english
their embedding layer doesn't fit on my gpu
i am not planning to run their model at all i am just bemused generally
well i doubt these choices were stupid
so the 16x FFN must be a smartish deliberate tradeoff
it's a good tradeoff if it's effectively free for some hardware configuration
but i am positive that for their use case the embedding is not doing anything, english doesn't have a 250k-size vocabulary that it is essential to learn
yeah maybe at 7B, 16x FFN fits nicely on TPUs so its very fast to train
given that one assumes their bigger models are ... bigger, i assume the 6x ffn was just doing the same thing and expanding to fill all available space on a tpu
@fallen spear unrelated, i still havent tried token based routing on MoE but I was wondering do you know if there's any reason you can't make all the routing decisions up front on layer 0 with a scheme like that?
have you heard the good news about our lord and savior hash routing
(yes)
yes i mean hash routing 🙂 tho u scared me from using it for real
I want to use it for real tho
i suspect i need to make my own sandbox for that sort of thing
sorry, I meant can you route every layer the same way
or does it being kinda weirdly differently randomized on each matter
been trying to speed up deepspeed's slow MoE implementation and the routing seems to be part of the slowness
that's the best part, nobody knows. hash routing tested it and it got slightly better
it got slightly better almost no matter what they did
you could just salt or rehash for different layers if you want to do that though
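a small sketch of both options, assuming routing depends only on the token id, so every layer's route can be computed before layer 0 runs. names are hypothetical. sha256 is used only because it is deterministic across runs.

```python
import hashlib

def expert_for(token_id: int, layer: int, n_experts: int, salt: str = "") -> int:
    """Deterministic expert id from (salt, layer, token id)."""
    key = f"{salt}:{layer}:{token_id}".encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:8], "big") % n_experts

def plan_routes(token_ids, n_layers: int, n_experts: int, per_layer: bool = True):
    """Routing for every layer, computed up front from token ids alone."""
    return [
        [expert_for(t, layer if per_layer else 0, n_experts) for t in token_ids]
        for layer in range(n_layers)
    ]

routes = plan_routes([5, 17, 5], n_layers=4, n_experts=8)                   # salted per layer
shared = plan_routes([5, 17, 5], n_layers=4, n_experts=8, per_layer=False)  # same route everywhere
```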
ok, gotta find some time to try the hash route
if i can ever get consistent results with what ive already got 🤣
i mean my theory of hash routing is that nothing to do with routing matters except that it be consistent for some fixed subset of input that is of a large enough size to meaningfully specialize in
the idea that its related to the token makes sense, i dont think u can just make it literally random
but who knows
okay but which tokens share experts is random
yes, but at least each specific token is relevant to that expert always
consistency is the criterion that seems to be important, you have to know that X input will see Y expert again
i'll know if totally naively swapping the pythia initializations out for zero causes it to explode shortly here
ive found w/ MoE that the rwkv inits matter a lot
it doesnt explode, but it just does crummy
i mean if zero init is as good as the sticker says it shouldn't matter
ie, it should be strictly better
so exploding or doing badly are also signal
you cant just init any old thing to zero
some stuff has to be random so the gradient differs
zero init is actually a weird combination of identity matrices and hadamard matrices
because names are hard i guess
o ok
i called it identity-hadamard and it appears that it is indeed exploding
lol
if you're using this, multiply the terms in the hadamard matrix by 1/sqrt(in-dimension)
compared to their recommended implementation?
they have a recommended scaling factor etc that i have not dug into
that is the one that is currently exploding so
i strongly suspect their initialization just depends completely on, uh
a lot of things that are true of their original test and are not true of pythia
```
import math

import torch
from scipy.linalg import hadamard  # assumption: the original import was not shown


@torch.no_grad()
def linear_ZerO_init_(tensor: torch.Tensor) -> torch.Tensor:
    # Algorithm 1 in the paper.
    assert tensor.dim() == 2, "linear_ZerO_init_ only works on 2D tensors"
    m, n = tensor.shape
    if m <= n:
        # Partial identity, filled in place.
        torch.nn.init.eye_(tensor)
    else:  # m > n
        clog_m = math.ceil(math.log2(m))
        p = 2 ** clog_m
        # Build in float32 on the tensor's own device. The earlier
        # tensor.to("cuda") calls did nothing, since Tensor.to is not in-place.
        kw = dict(dtype=torch.float32, device=tensor.device)
        in_tensor = torch.nn.init.eye_(torch.empty(m, p, **kw))
        # Normalized Hadamard: entries are +-1/sqrt(p).
        had = torch.from_numpy(hadamard(p)).to(**kw) / 2 ** (clog_m / 2)
        out_tensor = torch.nn.init.eye_(torch.empty(p, n, **kw))
        # copy_ casts back to the tensor's dtype.
        tensor.copy_(in_tensor @ had @ out_tensor)
    return tensor
```
their scaling factor is equivalent to mine
i might have to squint at it harder
all i can see are dtypes and vram allocation problems atm
i am not sure how long i am going to wait before i kill this run
it's soaring so majestically
just kill it, you won't learn anything
@granite plover i can't remember who else likes zero init but it doesn't work if you just drop it in
you can try printing torch.std_mean() of your layers with the two inits to diagnose any issues, if you want
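something like this, as a minimal sketch. the two inits below are stand-ins (a small normal and a partial identity), not the actual Pythia or ZerO code, and the shape is a toy FF up projection.

```python
import torch

def report_init(name: str, weight: torch.Tensor) -> tuple[float, float]:
    """Print and return (std, mean) of a weight matrix."""
    std, mean = torch.std_mean(weight)
    print(f"{name:>10}: std={std.item():.4f} mean={mean.item():.6f}")
    return std.item(), mean.item()

torch.manual_seed(0)
d_in, d_out = 512, 2048  # toy FF up-projection shape

small_normal = torch.empty(d_out, d_in)
torch.nn.init.normal_(small_normal, std=0.02)   # stand-in for an existing init

partial_identity = torch.empty(d_out, d_in)
torch.nn.init.eye_(partial_identity)            # stand-in for an identity-style init

stats = {
    "normal": report_init("normal", small_normal),
    "identity": report_init("identity", partial_identity),
}
```

running the same report on activations after each layer, not only on weights, is probably the more useful comparison when a run is exploding.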
I personally wouldn't bother unless I were really interested in ZerO init
i am really interested but given that it's tightly coupled to the optimizer and god knows what other hparams i should do this in a truly tiny sandbox at some point
basically: this isn't math really, but it isn't engineering either, if you try to do it engineer-ily you just get a lot of null results i think
because every remotely scaled setup is already extremely tuned and sits at a specific optima for its hparams that you almost cannot help but disrupt
nah, I don't believe that. it's been easy to find small improvements in LLMs tuning dumb things
even fern's repo got improved significantly by naut and it had crazy amounts of hparam search before that
- link, i probably have it somewhere but could use it
- that is probably true but for chasing down specifically initialization-dependent effects here it seems like it'd be a lot nicer to look at things (activations, initializations) in something much more like a sandbox and make adjustments there rather than guessing at needed adjustments at any kind of scale
if extra lucky it starts to feel closer to math and just generalizes easily
airbench #implementation-details message
fern's repo is in a strange state where giving the EMA higher resolution makes it perform worse
it could be that the EMA is not parametrized correctly, because I didn't check the math. but I found it unusual
airbench inherits from fern, who inherits from David Page, who inherits from one of the dawnbench leaders. it was SOTA from the start and each person made major speedups on top
neat. anyway, i think for chasing this hypothesis specifically i am out of remotely reasonable ideas that I wouldn't want to play with in a sandbox first
I believe it's a regularization effect. There may be one or two math errors remaining but that repo has been very finely scoured.
It's been pretty consistent for me w.r.t. regularization, the EMA definitely seems to have a sweet spot.
Is there something you're seeing that's indicating that the EMA would be parameterized incorrectly?
no, I have no evidence of any error. I just found it extremely unusual that a coarser approximation does better
now I'm thinking about the fp16 weights though...
Well it is a lookahead optimizer, so that will have some impact
I've had similar questions of coarseness at the end, and have tried a few different configs, but even at the end of training the semi-coarse EMA seems to be king
I notice there's no grad scaler...
In hlb-CIFAR10 there is! It's in the loss sum portion. Things are tuned around it.
512/batchsize is not a big number
torch amp starts at 65536, only goes down when necessary, and even rises to accommodate small gradients at later stages
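for contrast, a toy sketch of dynamic loss scaling using the GradScaler default values (start at 65536, halve on overflow, double after a run of clean steps). this is a pure-Python imitation of the policy, not torch's implementation. the batch size in the last line is made up.

```python
class ToyLossScaler:
    """Dynamic loss scaling policy in the style of torch AMP's GradScaler defaults."""

    def __init__(self, init_scale: float = 65536.0, growth_factor: float = 2.0,
                 backoff_factor: float = 0.5, growth_interval: int = 2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self._good_steps = 0

    def update(self, found_inf: bool) -> None:
        """Back off on overflow; grow after a run of clean steps."""
        if found_inf:
            self.scale *= self.backoff_factor
            self._good_steps = 0
            return
        self._good_steps += 1
        if self._good_steps == self.growth_interval:
            self.scale *= self.growth_factor
            self._good_steps = 0

scaler = ToyLossScaler(growth_interval=3)  # short interval just for the demo
scaler.update(found_inf=True)              # overflow: 65536 -> 32768
for _ in range(3):
    scaler.update(found_inf=False)         # three clean steps: back to 65536
fixed_scale = 512 / 256                    # fixed 512/batchsize style, hypothetical batch of 256
```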
Ye
my current suspicion is that EMA is sensitive to fp16 quantization
Alright
meh, it's all conjecture. without me actually putting in the work to test anything, I don't feel my guesses have value
noooooooo
what the fuck is Gemma doing? 2048 d_model with 16384 d_ffn????????
AND 3072 d_model with 24576 d_ffn?????????
THEY'RE USING GELU??? IS IT AT LEAST GATED HOLD UP
oh my god
it's gated GELU
with 8x ffn ratio
what the fuck
not 8x up 4x down. it's 16x up, 8x down.
so actually an 8x intermediate size
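a sketch of what "16x up, 8x down" means for a gated FF, assuming the gate and up projections are fused into one matrix and then split. this is not Gemma's code, the class name is made up, and the width is shrunk to keep it cheap.

```python
import torch
import torch.nn.functional as F

class GeGLU(torch.nn.Module):
    """Gated-GELU FF: one fused (gate, up) projection, then a down projection."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.gate_up = torch.nn.Linear(d_model, 2 * d_ff, bias=False)  # "16x up"
        self.down = torch.nn.Linear(d_ff, d_model, bias=False)         # "8x down"

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate, up = self.gate_up(x).chunk(2, dim=-1)
        return self.down(F.gelu(gate) * up)

# Toy width with the same 8x intermediate ratio as the 2048 -> 16384 config.
d_model = 32
ff = GeGLU(d_model, d_ff=8 * d_model)
y = ff(torch.randn(2, 5, d_model))
n_params = sum(p.numel() for p in ff.parameters())
```

the weight count is 3 * d_model * d_ff. at an 8x ratio that is 24 d_model^2 per layer, versus 8 d_model^2 for a 4x plain FF or an 8/3x gated one.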
and apparently builds on "advances made with gemini"
so has google ablated and found larger FFN really does work significantly better???
or, given their hardware configuration, the massive d_ff is effectively free
they showed the mf 6 trillion tokens and it's like. dead even with comparable models
then wtf are the leaderboards even showing??? i thought it was comparable?
no, it is. so they didn't cripple it with the massive up projection i guess but it doesn't look better either
leaderboards are noisy and influenced by the fiddly bits with how you do your ft
so they are a definite datapoint for "at scale maybe it doesn't even matter"
huge ratio? sure. small? also fine. whatever gives you good utilization
i would think it was a stronger point for that hypothesis if we knew for sure how many tokens mistral had seen
yeah sorry. i assume the hadamard is scaled a lil' differently than the existing inits and the rest of the hparams are tuned to that scale
(ie, if mistral is actually trained in less than 6T this actually suggests the big ratio is bad here, if more, the reverse, assuming all else equal)
suddenly wondering how many layers gemma has actually because if it is fewer and assuming the giant ff parallelizes well for some hardware configuration it will have had a faster wall clock time, no?
(in fact iirc the existing pythia inits are reeeeeally small)
Gemma-7B has 28 (vs Mistral-7B = 32)
Gemma-2B has 18 (vs TinyLlama-1B = 22 )
so assuming this configuration parallelizes well on a big tpu pod they got a 4-layer-forward wall-clock speedup by doing it this way instead, i think?
call it 1/8th faster neglecting embeddings
it is "equiparams" but just seems like it would be really dependent on hardware configuration
lemme check the gemma code. because they also might be doing parallel MLP... nope. serial attention then MLP. it's such an odd bunch of design choices. some seemingly for speed, some seemingly counter-intuitive
... actually, do they count the embedding params in their param count?
i would guess their training code does do parallel mlp and isn't what is released but could be way off base, maybe they just threw resources at it
because: embedding retrieval should be constant-time anyway, it doesn't really contribute to fwd pass in the same way
and they have like .75B of them
you can't train parallel MLP and then deploy serial MLP because you're missing a whole bunch of layer norms
oh you mean something actually different, i am just thinking of model parallel
no i mean attention(LN(x))+mlp(LN(x))+x where the LN is shared between them both
ahhh, yeah
PaLM did that, but Gemma does not
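the two layouts side by side, as a minimal sketch with toy sizes. no causal mask, a plain GELU MLP, and the class is made up for illustration. it is not any released model's code.

```python
import torch

class Block(torch.nn.Module):
    """Pre-LN block with a toggle between serial and parallel residual layouts."""

    def __init__(self, d_model: int, n_heads: int, parallel: bool):
        super().__init__()
        self.parallel = parallel
        self.ln1 = torch.nn.LayerNorm(d_model)
        # The parallel layout shares one LN, so the second LN is never created.
        self.ln2 = None if parallel else torch.nn.LayerNorm(d_model)
        self.attn = torch.nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(d_model, 4 * d_model),
            torch.nn.GELU(),
            torch.nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        if self.parallel:
            # attention(LN(x)) + mlp(LN(x)) + x
            return x + attn_out + self.mlp(h)
        # Serial: attention first, then the MLP on the updated residual stream.
        x = x + attn_out
        return x + self.mlp(self.ln2(x))

x = torch.randn(2, 6, 32)
serial, parallel = Block(32, 4, parallel=False), Block(32, 4, parallel=True)
y_serial, y_parallel = serial(x), parallel(x)
```

the missing `ln2` is the reason weights trained in one layout don't drop into the other.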
they're also using a key dim of 256 which is fucking nuts
tbh given the error in the gemma report about the norms i am not sure i trust any of their reported architectures to be correct
i trust that the reported architectural choices in every report were probably done at some point in GDM adjacent to a given release
im looking at their actual code, so report be damned, this is what's actually happening
but possibly not all at once and possibly not in the released model
yeah, just thinking about that wrt palm
i mean... there could be errors in the pytorch implementation, but since pytorch and JAX load from the same parameters file and I assume their JAX implementation is de facto... there probably aren't any errors in their pytorch implementation architecturally
i think it is correct that code will reflect the actual model architecture
i am just a little iffy about using palm as a reference since we only know its architecture from their published reports
the fact that they (iirc) ablated the difference between serial and parallel attention/mlp at least leads me to believe that part is correct in their report
i definitely believe they ran that experiment and am totally unsure if i should believe that anything available by api or release actually reflects it
Nope lol
Is it grouped?
yes for 2b with 1 group across 8 heads (what the fuck), and no for 7b.... wait... wait.... wait...
256 head dim... but 16 heads for the 7B model... with a hidden size of 3072?
what the fuck?
what the fuck is this model?
2B has 8 Q heads and 1KV head. 2048/8=256 so that lines up.
7B has 16Q heads and 16KV heads. 3072/16 != 256.
3072/12 is though
but they don't have 12 heads. they have 16.
overlap, maybe?
nope.
lolwut
no it is actually 256
config says head dim is 256.
self.head_dim pulls from config.head_dim
for the 7B model QKV are all 3072*4096 matrices
and they downproject them after doing attention?
yup. O is 4096*3072
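a sketch of attention with the head width decoupled from d_model, assuming otherwise standard causal scaled dot product attention. it is shrunk 64x from the 7B numbers above (3072 -> 16 * 256 = 4096 becomes 48 -> 16 * 4 = 64). the class name is made up and this is not Gemma's code.

```python
import torch
import torch.nn.functional as F

class WideHeadAttention(torch.nn.Module):
    """Attention where n_heads * head_dim need not equal d_model."""

    def __init__(self, d_model: int, n_heads: int, head_dim: int):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, head_dim
        inner = n_heads * head_dim
        self.q = torch.nn.Linear(d_model, inner, bias=False)
        self.k = torch.nn.Linear(d_model, inner, bias=False)
        self.v = torch.nn.Linear(d_model, inner, bias=False)
        self.o = torch.nn.Linear(inner, d_model, bias=False)  # projects back down

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        split = lambda z: z.view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        q, k, v = split(self.q(x)), split(self.k(x)), split(self.v(x))
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o(out.transpose(1, 2).reshape(b, t, -1))

# Same 3:4 shape as the 7B numbers, shrunk 64x.
attn = WideHeadAttention(d_model=48, n_heads=16, head_dim=4)
y = attn(torch.randn(2, 5, 48))
```

nothing in the attention math itself changes. the only unusual part is that q, k, v project up to a wider space and the output projection brings it back down.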
that is incredibly weird and makes me wonder what they are actually doing for attn
like, is this actually a standard qkv calculation or
Woah
What is this magical strangeness.....
not only do they think i am wrong about the ff up projection they put one into their qkv too
yeah. 2 implementations are available in pytorch. eager (which is normal QKV) and Flash which is equivalent to normal QKV
i have to admire the fact that they released this and did not mention this modification to attention
i am glad at least that the qkv impl is not also an eldritch horror
this does sort of break intuition about what qkv is doing
JAX implementation looks pretty normal too. they're using the dot product attention function from FLAX, but I assume that's just normal qkv
purely abstractly this basically gives you extra heads that don't "make sense" in a normal transformer, no?
and you also get to use your output projection to actually do some logic while downprojecting
my next question is: did they mean to do this or did someone whiff their math
and if someone did whiff their math, did they end up with this final architecture specifically because this configuration "just worked better" in a way that might be attributable to the attn up/down projection
who fucking knows
Up projection but with really restricted groups too lolzies
yeah, the fact that standard qkv does ... that thing that it does just makes it fundamentally different, no?
I'd guess but I'm not entirely sure
gonna be fun to test when i am not on my lunch break
I'm definitely gonna try some of this stuff w rwkv
I like the fewer layers trade for ffn and better wall clock
I also wonder if we can merge these changes with the way mamba integrates ffn and attn up/down projections into a single block for extra benefit
Tho that would break my moe
WAIT... from Gemma:
We use a subset of the SentencePiece tokenizer (Kudo and Richardson, 2018) of Gemini for compatibility.
Does that mean that Gemini uses an even bigger vocabulary size???
imagine using a 512k+ vocab size instead of putting some of your 1000+ ML researchers to figuring out tokenization
What kind of 'compatibility' exactly are they achieving...
And are some of those other tokens reserved for multi-modal maybe?
this seems likely
oh yeah, i have that on my todo
hmm how so? basically you're just increasing the dimensionality of the heads without increasing the hidden state dimension
which shouldn't change the intuition for how information flows in attention
I think that generally we have too much attention in comparison to FFN
Attention takes up a lot of parameters but does no real computation over the sequence, it's just an information mixer
If parallelism wasn't useful then I would advocate making the FFN intermediate dimension as small as possible and duplicating FFN layers to make up the difference, increasing the depth and thus computational complexity of the model
kind of seems like better way to achieve parallelism is bigger d_model and smaller FFN tho
can always shrink for attention
or use subset of d_model
so has anyone tried the trick sometimes used in cnns, which is to use only part of the model width for the time mixing layer (attention in this case) and let the rest go straight through? this way it is possible to use a smaller ffn multiplier and keep attention dim low
I was talking about this to @fallen spear some time ago
usually in cnns you can do conv on only half of the model dim and it doesn't change performance much or at all but runs considerably faster
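A minimal NumPy sketch of the partial-width trick described above, assuming the mixing layer touches only the first half of the channels and the rest is concatenated back unchanged. `partial_mix` and the `causal_mean` stand-in mixer are hypothetical names; a real model would put attention or a conv there.

```python
import numpy as np

def partial_mix(x, mix_fn, frac=0.5):
    """Apply mix_fn to the first `frac` of channels; pass the rest through."""
    k = int(x.shape[-1] * frac)
    mixed = mix_fn(x[..., :k])
    return np.concatenate([mixed, x[..., k:]], axis=-1)

def causal_mean(h):
    # Stand-in for a real time-mixing layer: running mean over the sequence axis.
    steps = np.arange(1, h.shape[0] + 1)[:, None]
    return np.cumsum(h, axis=0) / steps

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 8))      # (seq_len, d_model)
y = partial_mix(x, causal_mean)  # only channels 0..3 are mixed over time
```

The untouched half still reaches the next layer, which is why no separate merge step is needed beyond the concat.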
i feel like that's dependent on the conv type. depthwise conv on too few dimensions can lead to significantly lower performance because it does spatial mixing separately for each channel.
actually this works best with depthwise conv
and that is very counterintuitive, I don't know why but it doesn't look at all that you need much mixing in that dimension in comparison to ffn parameter count
I'm talking from my own experiments
i wonder if the lack of channel-spatial mixing weights lets the convolutions learn features more easily?
kinda like how I've noticed in some cases LoRA converges faster than full-finetuning because there are literally fewer variables to optimize
the same happens with large conv kernels, they learn very slow, but when you add parallel additive small conv kernel (which is completely redundant) it converges much faster, then for inference you just absorb that small kernel into the big one and compute just that
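The absorb-the-small-kernel step works because convolution is linear in the kernel. A 1D NumPy sketch, assuming odd kernel lengths and "same" padding (the sizes here are arbitrary, not from anyone's experiments):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=64)
big = rng.normal(size=11)    # large kernel (odd length)
small = rng.normal(size=3)   # parallel small kernel (odd length)

# Training-time form: two parallel branches, outputs added.
y_train = np.convolve(x, big, mode="same") + np.convolve(x, small, mode="same")

# Inference-time form: zero-pad the small kernel to the big size, centered,
# and add it into the big kernel once.
pad = (len(big) - len(small)) // 2
merged = big + np.pad(small, pad)
y_infer = np.convolve(x, merged, mode="same")
```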
i didnt try it but i bet it works great, thats what i meant above by 'use a subset of d_model' for attention
Oh gosh no no no
Depthwise conv is horribly time and learning inefficient
You get half a kernel for the price of two!
It's a good idea in theory but really only seems to be best suited for certain cpu-only inference options that aren't 3x3 conv friendly. ❤️ :'))))
This almost feels a bit like densenets to me. :3333 ❤️ :'))))
I've noticed the smaller a good model is (generally speaking) the more tolerant it is of strange training conditions (not always though....:')))) )
or inception
Yeah good point, I forgot about that bypass
Kernel launches are just way too slow though, gotta go fast gotta do it all at once! XD ;PPPP
depthwise convs are very slow on gpus but it's a rather happy little accident that this trick works so well
?
that you can just not compute half of them and it doesn't tank performance
It depends upon the problem, I believe.
I see that behavior on image and language models
Yeah I guess I could see down sampling working okay, one of the big problems is it still almost always requires a second kernel to either merge the info for passing or re-upsampling which can be, er, rather pricy in the super-efficient edge cases. 😬😬😬😬
It would be nice to be able to do it in a super efficient way though
Haven't been able to find one quite yet for that AFAIU
i think trombocyts idea is that you dont have to do any special work to merge/upsample, because the next layer ends up using both parts even tho only one half got updated?
I'm a bit confused I think. Generally you have to do a concatenate or the like in order to avoid a bifurcation of kernels
ive never tried his trick, but yeah you'd have to concat
im assuming torch.compile to remove multiple kernel calls tho
under the hood it wouldn't have to actually concat and could just work on the first half in-place, it could just copy off the original for autograd use later before it begins work
either way tho, we may just be in very different headspaces here... one extra CUDA kernel per layer is a very small price to pay for reduced compute/params from where im standing 🙂
maybe FFNs can only use a subset too, but they would be overlapping, like even-layered FFNs can use the top 2/3 of d_model, odd-layered FFNs can use the bottom 2/3
and maybe attention can use the 1/2 in the middle with 1/4 on both sides
that way we have a larger hidden state with a "message passing" area between FFNs
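To make the overlapping-subset idea concrete, here is a small index sketch. It assumes "top" means the first 2/3 of channels and uses a hypothetical `d_model` of 12 so the fractions come out whole; `ffn_slice` and `attn_slice` are illustrative names.

```python
d_model = 12  # hypothetical width, chosen so the fractions are whole numbers

def ffn_slice(layer_idx):
    # "top" is taken here to mean the first 2/3 of channels (an assumption)
    two_thirds = 2 * d_model // 3
    if layer_idx % 2 == 0:
        return slice(0, two_thirds)
    return slice(d_model - two_thirds, d_model)

def attn_slice():
    # middle half, leaving a quarter of the width untouched on each side
    return slice(d_model // 4, d_model - d_model // 4)

even = set(range(d_model)[ffn_slice(0)])
odd = set(range(d_model)[ffn_slice(1)])
attn = set(range(d_model)[attn_slice()])
shared = even & odd   # channels both FFN parities can read and write
print(sorted(shared))  # [4, 5, 6, 7]
```

The shared channels are the "message passing" area, and with these fractions they sit entirely inside the attention slice.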
(Which is I think why parallel attention/FFN works so well -- I think the only reason that there's degradation at smaller model sizes is because of the useless first-layer FFN)
I disagree about this. If you're training from scratch, you can't get the same results with, say, 2x fewer attention heads
And if you try to replace the queries with trainable vectors, perf drops even further
Oh attention is def really important, I don't disagree there and its strength is the whole reason why subquadratic architectures are worse
But in terms of actually "thinking about" the input sequence, attention doesn't do anything
Its job is to mix info for the benefit of the FFN layers
I believe though that at large model sizes we have too much attention
Which is why GQA works so well
GQA/MQA basically work because of non-identifiability (https://arxiv.org/abs/2007.00810) but I think you still need approximately n_query_heads*d_v = d_model for good perf. I think anything that decreases the LHS will hurt performance since the attn block residual will have rank < d_model.
At that point, if you use n_query_heads*d_v < d_model, you're basically betting that the optimizer can't make good use of the extra dimensions
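The rank argument can be checked numerically. This sketch uses random matrices as stand-ins for the concatenated per-head outputs and the output projection (an assumption, real activations are not Gaussian); the point is only that the update written to the residual stream cannot exceed rank n_heads * d_v.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_v = 256, 64, 8

def attn_update_rank(n_heads):
    # Per-head outputs concatenated: (seq_len, n_heads * d_v).
    heads = rng.normal(size=(seq_len, n_heads * d_v))
    w_o = rng.normal(size=(n_heads * d_v, d_model))
    return np.linalg.matrix_rank(heads @ w_o)

full = attn_update_rank(n_heads=8)   # 8 * 8 = 64 = d_model
half = attn_update_rank(n_heads=4)   # 4 * 8 = 32 < d_model
```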
Maybe for a given param budget, ff is a better investment, but if you're holding ff constant, using less attn params will always hurt performance for a sufficiently complex task
Maybe we don't disagree, idk
I see your point, because GQA still allows us to retrieve d_model-length information (while decreasing d_h*n_h would limit the rank of the residual as you said), although I think that at larger model sizes the rank of the attention residual might be < d_model
For example llama-70b, do we really need to retrieve in full d_model information 56 times in a row?
I definitely agree with this, less attention = less performance
It's not impossible, but it seems unlikely to me, because that would imply the optimizer was not using the full capacity.
Re your second point, nothing is really "necessary", but using less than the full capacity will generally hurt performance when maxing over possible tasks
I see what you mean although isn't the success of GQA proof that optimizers aren't able to use the full capacity of traditional attention?
GQA working basically means that each token needs to transmit far less information to its peers than we previously thought
Technically yes, but the capacity difference between using multihead and multiquery is actually quite small. Multiquery basically just factors all the key projections as W_k*W_rh, and then moves the W_rh term to the corresponding query projection W_qh (and analogously for the value and output projections)
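A sketch of the factoring described above, for the key side only. It assumes each head's key projection has the restricted form W_k @ W_rh; under that assumption the attention scores are identical after folding W_rh into the query projection. All shapes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k, n_heads, seq_len = 32, 8, 4, 5
x = rng.normal(size=(seq_len, d_model))

w_k = rng.normal(size=(d_model, d_k))            # shared key projection
w_q = rng.normal(size=(n_heads, d_model, d_k))   # per-head query projections W_qh
w_r = rng.normal(size=(n_heads, d_k, d_k))       # per-head key factor W_rh

matches = []
for h in range(n_heads):
    # Multihead view: each head has its own keys, factored as W_k @ W_rh.
    k_h = x @ (w_k @ w_r[h])
    scores_mha = (x @ w_q[h]) @ k_h.T

    # Multiquery view: keys are shared, W_rh is folded into the query side.
    q_folded = x @ (w_q[h] @ w_r[h].T)
    scores_mqa = q_folded @ (x @ w_k).T

    matches.append(np.allclose(scores_mha, scores_mqa))
```

A general multihead layer has an unconstrained d_model x d_k key matrix per head, so the two are not equivalent in general; the gap is whatever that restricted form cannot express.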
I just think that for the task of language modeling specifically, we have more attention than we need for practical environments. I think this is only true for larger model sizes though, if you started taking attention layers off gpt2-small I agree you'd see large drops in perf
Agree, if you're fitting to the current distribution of tasks you might be able to get some savings. But at that point, you aren't using a general method anymore and it may underperform on more challenging tasks in the future
That's true, each token can still retrieve up to d_model dimensions of information from its peers every attention layer, GQA doesn't limit that
True I can't disagree with that, we're basically overfitting our architecture to language modeling. I don't think that's too bad of a thing to do though considering LLMs are by far the main application of transformers 🙂
Yes, so far!
i am not sure this is at all true
selectively mixing info is thinking about the sequence
Idk, is computing a weighted linear combination of the inputs really a thought process?
If attention did computational work that was actually useful, then parallel attention/MLP layers would suffer a lot of degradation
all matmuls are linear combinations of the inputs
fair point
@fallen spear finally tried the most stupid possible hash routing MoE and it did great... i literally used expert_id = token_id % 8
same on every single layer
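For reference, the routing rule above fits in a few lines. The token ids below are made up, and `route` is a hypothetical helper name.

```python
from collections import Counter

NUM_EXPERTS = 8

def route(token_id: int) -> int:
    # Same rule at every layer, no learned parameters.
    return token_id % NUM_EXPERTS

token_ids = [17, 4, 256, 17, 9001, 4]   # hypothetical token ids
experts = [route(t) for t in token_ids]
print(experts)  # [1, 4, 0, 1, 1, 4]
load = Counter(experts)                 # tokens per expert, for load checks
```

A given token id always lands on the same expert, which is the property the rest of the thread leans on.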
One expert per token do it do it
Come on don't chimken out nowwwwwwwww......
Dooo it doooo it doooo it doooo it
can you salt it separately per layer and check if it improves, for curiosity?
ironically this doesn't work
?!??!
one of the failure modes of hash routing was when they made modifications that created too many buckets
the bucket:expert ratio appears to be sort of delicate
Surely it worked here, clearly and obviously it works in the limit
You must not have been brave enough
i mean i guess the issue was there were too many more buckets than experts
Oh no, a rational U curve that makes sense, ahhhhhhh
Oh
it is possible that proliferating experts is fine
the way they proliferated buckets btw was to hash tok and prev_tok instead of just tok
Here's a (maybe) fresh idea: hierarchical bucketing. Have one always on bucket. The second bucket is chosen of two. The third is chosen of four, etc.
Should bypass the information clustering per token issues a bit.
Buckets are all uniform in size.
This basically, if the grouping is done "correctly", should let the "active buckets" switch at appropriate levels of granularity per the active incoming info stream
i don't understand this at all
How you do token grouping can I believe be learned in a differentiable manner using the same exact freaking structure of fast feed forward networks. Should work similarly too, I thinks?
Binary tree where output active expert group is a concatenation of the weights from the head node to the leaf node.
Allows for hierarchically increasing refinement specific to each input
I love that this sparked u guys coming up with insane amazing ideas
Making it learnable helps reduce "wasted" information learned across buckets by forcing it to be deferred to higher in the hierarchy, but in a differentiable manner to avoid any silliness/complexity from a given scheme -- let's let it be data-defined! 😄 :')))) ❤️
Hey, moving the conversation along and trying things maybe is the secret skill
I'm honestly not sure where in the heck this is coming from myself lol
But I'm just letting it roll, ya know? XDXDXDXD 😭😭😭😭👍
Please let it roll!!!
I'm in the middle of moving so I don't understand the tree idea yet but will reread later. I love fast ffns
i need this broken down in a much dumber way if i am going to follow it
we have a token T, it hits a routing layer (which might just be the embedding layer), what occurs?
@fallen spear does that make sense now.
So like, the head node is always on. This is the always active group. The second depth layer is selected by [some strategy], preferably data dependent like being chosen based on the raw token or something similar to that (can also do like for example a conv of an n-markov chain of the input embeddings of previous tokens, for example).
Anyways, the second group is chosen based on that, and concatenated to the running set of weights as before.
I believe it should work out naively too w.r.t. the swiglu/gelu nonsense or whatever. Just basically do your own build-a-bear "Build Your own Ball [of weights]" kinda dealio.
Ok yeah gotcha
@fallen spear I guess if we're doing our MLP/whatever routing layers based upon the token embeddings, we can basically do something where we do the branch traversal as in FFN.
So, we always get the base node of the tree "for free", these are the always on expert weights.
Then we do our decision layer, and get our sigmoidal left/right branching for the first leaf node of the tree. During training, this can be the sigmoid-gated sum of the two values, during inference it can be a hard selection based upon the sign of the value (assuming no bias and all'of'dat).
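A small sketch of the soft-versus-hard branching just described, assuming a single routing logit per branch point with no bias, and assuming a positive logit means "left" (that sign convention is mine). `branch` is a hypothetical helper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def branch(z, left, right, training):
    """Blend two child outputs from one routing logit z (no bias assumed)."""
    if training:
        g = sigmoid(z)
        return g * left + (1.0 - g) * right   # differentiable soft choice
    return left if z > 0 else right           # hard choice by sign

left = np.array([1.0, 0.0])
right = np.array([0.0, 1.0])
soft = branch(2.0, left, right, training=True)
hard = branch(2.0, left, right, training=False)
```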
This is the second weight group "block". This is kept in a list with the first weight group. If our tree is 8 nodes deep, for example, then each weight group block is of the size mlp_depth // 8. Additionally, this means that our tree will have 2^n-1 total weight blocks, each of size mlp_depth // n. This can be a thing that is pretty sizeable pretty fast but also I think that's alright as it's also potentially pretty flexible space wise.
They all get added back in to the main values at the end of the linear block, so thankfully we (should is the keyword heresies) be able to independently pick-and-choose each block as needed over the course of training. And, of course, because it's FFN stuff, it gets jointly optimized which can be both a good and a bad thing, but at least it's sorta simpler engineering-wise (once it's up and running IMPE at leastsies....) so that bodes well fur scaling, I thinks.
This is exceptionally nice as well as it embeds the idea of a hierarchy of knowledge in the network, and it switches on and off as needed depending upon the input data. This feels much more similar as best as I understand to how the human brain function and how the astrocytes function w.r.t. affect associative lookups, so it at least feels to me like it's some kind of step in the right direction.
I'm still waking up and have no idea what the heck mood this is that I'm in but it's nice, I haven't had consistent "idea storms" for a few years now, so this is generally a rather confusing but pleasant experience for me. :'3333 ❤️ 🥲 ❤️
i like the additive part
one of the nagging problems with moe is that experts are very similar to each other
(also I understand this now)
ie, they contain a lot of duplicate information
Yeah that's why a binary tree is good I think
It's not perfect of course, but like you can of course adjust ratios (of layer num_parameters and branching factors, etc, etc) to accommodate for that.
Which additive part
The output projection, I thinks? I guess?
yes
What's also cool is that if you want uneven depths in your tree, or some other kind of truncation/compression, you can do so as well, at least mathematically IMPU
It would screw with batching a bit (well a binary tree expert selection mechanism would be a bit annoying to code period, but also might be quite useful as well. ❤️ :')))) )
i currently do 'always on' base FFN and then add one (or maybe zero) of the eight experts to the result
so this would be like a further set of levels of what i'm already succeeding with
Ah, gotcha, that makes senses to mes, I thinks. ❤️ :'))
So this then is basically just stretching it along the tree traversal route
Since some levels of weights are probably somewhat group-specific, but aren't entirely "all or nothing" like an always on, or some extremely group-specific weight impe....
recently this has been as a continuation from an already-trained RWKV model w/ pre-existing FFN
so this adds onto it, starting with all zeros and slowly learning to differentiate
seems to work great
This in this case being....
what i've been testing with the 'always on' base FFN plus additive MoE on top
Now I'm a bit confused lol
For me, the always on bit is weights that are basically always used with no gating
yes, the always on is the pre-existing RWKV FFN (we call it chanmix)
Then an FFN is used to progressively select the conditional branches of weight layers
there is only one branch im using currently, and it can be selected using gating or hash routing... hash has been just as useful
Oh FFN here is feed forward network
Not Fast Feedforward network
yeah sorry not FFFN
Yeah I think that was my bad
Ah gotcha yeah
I tried other ways of combining the base FFN and the chosen expert, but additive seems better maybe especially when starting from an existing trained model
The nice thing about token specific selection I guess if I'm understanding this correctly is it makes a lot of the "predictive" (or predictive related) aspects of things easier to manage
And here, I'm meaning that the outputs of the two are added together
Oh, I guess there's the sigmoid gated additive interpolation or whatever
yeah my recent implementation is like
```
out = x + ffn(xn) + moe(xn, tokens)  # tokens passed so i can hash em
```
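A self-contained NumPy sketch of that one-liner, assuming a ReLU FFN as a stand-in for the real channel-mix block, modulo-8 hash routing, and experts whose output projection starts at zero (one way to get the "starting with all zeros" behaviour mentioned above; the actual init used is not stated). All names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_experts, seq_len = 16, 32, 8, 6

def make_ffn(zero_out=False):
    w_in = rng.normal(size=(d_model, d_ff)) * 0.1
    w_out = np.zeros((d_ff, d_model)) if zero_out else rng.normal(size=(d_ff, d_model)) * 0.1
    return w_in, w_out

def ffn_apply(params, h):
    w_in, w_out = params
    return np.maximum(h @ w_in, 0.0) @ w_out     # ReLU FFN as a stand-in

base = make_ffn()
# Experts start with a zero output projection, so they contribute nothing at first.
experts = [make_ffn(zero_out=True) for _ in range(n_experts)]

def moe(xn, tokens):
    out = np.zeros_like(xn)
    for e in range(n_experts):
        picked = (tokens % n_experts) == e       # hash routing by token id
        if picked.any():
            out[picked] = ffn_apply(experts[e], xn[picked])
    return out

x = rng.normal(size=(seq_len, d_model))
xn = x / np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + 1e-6)  # toy norm
tokens = np.array([3, 11, 42, 7, 8, 19])
out = x + ffn_apply(base, xn) + moe(xn, tokens)
```

With the zero init, `out` matches the pretrained model exactly at step 0, so the experts can only move it away from a known-good starting point.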
And theoretically that could be fused into one kernel without breaking the bank, I thinks (maybe a bit moar complexity thosies....)
Buh it seems to be hyperscaling friendly basically! 😄 :')))) 👍
well you're missing some stuff like Expert Parallel requiring some comms in the middle so hard to do that in practice
Yeah I was thinking with token-based stuff you could avoid that some, however, that does not really get around the kv-cache issues all that much does it, really.
we have no kv cache issues (or kv cache at all!) in RWKV, but to fit MoE into VRAM for training we gotta go expert parallel so each GPU can hold e.g. one expert
Ack, gotcha, didn't know you were team RWKV. That makes sense to me, then. I got the distributed expert part at least.
Variance of expert selection has got to be a bear, lolz
I would like to propose selective expert leaf dropping as a variance/routing management strategy in the binary tree case
the binary tree doesn't work with hashing, right? just learned?
One could make the block sizes get smaller as you go towards the leaves (or maybe just keeping the tree not too terribly deep), but leaves maybe could be dropped in favor of a "generic" set of weights learned as a backup in case some batch inference scheduler determines that it would be too costly to run some kernel for a horribly minuscule amount of nodes
It's partition based so if you can make a function that generates partitions from a hashed item then you should be good I think
seems pretty ez to achieve
might be a bit slower since i gotta do several levels instead of one, but i like the idea
The tree does include some notion of structure though so I think any binning strategy should try to take that into account
Otherwise it's just throwing darts in higher dimensions thensies. ❤️ :'))))
I also would like to try switching out the experts with LowRank approximations and/or Butterfly Matrices
I think it's okay if I just use one bit of the hash for each level of a binary tree 🙂
the main thing seems to be to have the same 'expert' always be consulted for a given token
Yeah, I'm not sure what informationally that will bring
so this would still keep things consistent even w/ a tree based on each bit of the hash
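A sketch of the one-bit-per-level idea, assuming a stable hash of the token id (sha256 here, purely as an example) and heap-style node numbering. `tree_path` and `node_index` are hypothetical names.

```python
import hashlib

def tree_path(token_id: int, depth: int) -> list[int]:
    """Return one left/right bit per tree level, fixed for a given token."""
    digest = hashlib.sha256(str(token_id).encode()).digest()
    h = int.from_bytes(digest[:8], "big")
    return [(h >> level) & 1 for level in range(depth)]

def node_index(path: list[int]) -> int:
    # Heap-style numbering: root is 0, children of i are 2i+1 and 2i+2.
    idx = 0
    for bit in path:
        idx = 2 * idx + 1 + bit
    return idx

path = tree_path(1234, depth=3)   # same token id always yields the same path
leaf = node_index(path)
```

Because the bits come from a hash, neighbouring token ids get unrelated paths, which is the "no similarity structure" property debated in the thread.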
it doesnt need anything informationally, it just allows us to store more info in fewer parameters
😬
Lemme think about this
Because hashes are supposed to be rather spicy in how they jump about
yah thats their goal, but in MoE apparently the only important thing is that the same expert is consulted for the same token always
Ideally, (intellectually, which is def different from experiments as we all know lolz), one would want a higher node in a tree to correspond to a token group
so as long as token X always goes the same route through the tree i think we're fine
Yes, this is def a step forward but then it results in the information duplication problem!
The tree is meant to dedupe that information by clustering it into hierarchical nodes
So that way, we have token routing consistency, but then also similar tokens get routed through similar branches of the tree
Hence, more "free space" to learn various concepts
The problem of course is that this is data conditional, but thankfully if you do it based on the token itself, then you can just skip the routing function entirely and split the learned embedding up to route it. I think? That should be efficient?
I don't know if I believe in the 'similarity' idea any more... @fallen spear dissuaded me with this hash routing stuff
because if similarity mattered, hash routing shouldn't work as well as learned gated routing
but it does
routing_table = nn.Embedding(num_tokens, tree_depth * total_num_layers)
routed = routing_table(inputs).sigmoid().view(*inputs.shape, tree_depth, total_num_layers)
# optimization hack: repeat the branching strategy on each side of the tree when weight averaging. Should converge, albeit being a bit strange
Here's an efficient input-token-dependent learned routing table that uses a routing symmetry approximation to make the process much simpler (and doesn't need an FFN, to boot!)
Yeah, that is a very interesting data point to me
Like it feels like one that def needs an explanation
(not saying you have to give it, I'm meaning it seems like it's something I'm curious about sorta)
it seems that the reality of the situation is as follows:
a FFN has the capacity to learn a certain amount, but it can learn how to deal with lots of different kinds of inputs (proven by how we use them traditionally for every input!)
so if you add more FFNs, you just gotta make sure the inputs roughly match each time so that you're now spreading the computation across these, allowing you to train and inference as fast as before but with much more parameters available
Just because duplicating knowledge feels kinda like a Bad Thing, especially if we can control things relating to it
Yep
Having consistent routing is def good
Would be insane otherwise
But one problem is that the amount of paper parameters and the learned parameters are a bit different I think due to the network having to double, triple, quadruple, octuple learn things, etc
Consistency is def good because it keeps a match between those things
But a question is how to factorize speedily the process so that shared learned things sorta all generally live in the same place, as it were.
oh also we could reduce the size of the FFN at each successive layer
because it's handling fewer tokens
Yeah I was thinking about that for the leaf nodes in the binary tree example
I could see there being distributed nodes that basically fall back to blocksparse as the leaves themselves can get rather specialized
Which is interesting as it sorta ties throughput to useful model capacity, with tons of throughput, you can add more and more leaf nodes to the binary tree without as much direct negative impact iiucsies
would be real interesting to add half size each level, so you end up linear param ct in the number of layers
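The parameter count for the halving scheme is easy to tally, assuming level l of the tree holds 2**l nodes, each of relative size 1 / 2**l, and one node per level is active for a token. `tree_params` is a hypothetical helper.

```python
def tree_params(base: float, depth: int):
    """Level l has 2**l nodes, each of size base / 2**l."""
    total = 0.0
    active = 0.0
    for level in range(depth):
        node_size = base / 2 ** level
        total += (2 ** level) * node_size   # every level stores `base` params
        active += node_size                 # one node per level is used
    return total, active

total, active = tree_params(base=1.0, depth=4)
print(total, active)  # 4.0 1.875
```

Stored parameters grow linearly with depth while the per-token active size stays below twice the base FFN.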
i guess that kind of kills the benefits of GPU parallelism if you keep making it less and less parallelizable
Yeah that would be interesting
That feels friendlier to scaling people at least lolz
Because eventually nodes can handle multiple layers at once (and you can colo them!!!!!!)
The halving is intriguing and I wonder how well it works
yeah its crazy
im interested by it
but i dont know how to implement it so its not super slow
Does the embedding routing trick from earlier make sense
whats the difference between tree_depth and total_num_layers - is total_num_layers the layers in the model?
This would predict a routing tree for every MLP basically at the start
That might even be redundant
But I'm not sure and wanted to err on the side of caution
With the depth halving one could maybe just calculate all of the experts at once and then fold them in with a shallow depth tree (say, depth 3-4 or something like that)
To avoid linear layer weight madness
OH i forgot that each layer of depth in the tree is independent 🙂
its all just additive, right?
so its super parallelizable in some sense
Yeah
coooool
So you can sigmoid post hoc
To basically get the same behavior
But keeping the weights dense
that's like FFFN
Yep
Oh haha also I think you can use 2:1 structured sparsity for each leaf node by just interleaving weights together and selecting one of two (or just inverting) the binary weight masks, maybe that's silly but I think it's hilarious.
Maybe specialized solutions would be better
But your halving idea made me think about that
They can always be loaded interleaved as one "set of weights" in memory, then the mask can be shifted up or down by one to "let through" the proper outputs for each item lolz
so basically (for the 1/2 size each level idea) for a tree of depth D we just compute D normal sized FFN's in parallel, right? then we can select a subset of the results based on the routing table
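One way to read this, sketched in NumPy: each tree level is a single normal-sized FFN whose hidden units are partitioned among that level's experts, so the whole level is computed densely and a per-token mask keeps only the routed expert's units. The ReLU FFN, the random routing table, and all sizes are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, depth, seq_len = 8, 16, 3, 4

# One normal-sized FFN per level; level l is split into 2**l equal experts.
w_in = rng.normal(size=(depth, d_model, d_ff))
w_out = rng.normal(size=(depth, d_ff, d_model))
x = rng.normal(size=(seq_len, d_model))
# route[t, l] = which expert token t uses at level l (hypothetical routing table)
route = np.stack([rng.integers(0, 2 ** l, size=seq_len) for l in range(depth)], axis=1)

def tree_ffn(x, route):
    out = np.zeros_like(x)
    for l in range(depth):
        width = d_ff // 2 ** l                       # expert width at this level
        hidden = np.maximum(x @ w_in[l], 0.0)        # all experts of the level at once
        owner = np.arange(d_ff) // width             # expert id of each hidden unit
        mask = owner[None, :] == route[:, l:l + 1]   # keep only the routed expert
        out += (hidden * mask) @ w_out[l]
    return out

y = tree_ffn(x, route)
```

This does the full dense compute at every level and throws most of it away, so it is a correctness baseline rather than a fast path.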
Yes exactly
so cool
Wow I felt really giddy when you wrote that lol
This is really freaking cool isn't it ':DDDD
I really love it. Feeling giddy too
I have no idea if it works hahahahah
But it's really awesome
after considering it for a minute, my guess is that just like FFFN, it's only really helpful at inference time
I think a good blending strategy would be to apply the structure of the routing table to the tree to mix things together. So for example if the node structure is .1 .8 .2 during training, we can apply this structure to each node. Because we're not predicting the entire tree, only one index.
This has an advantage of not taking up much space, and also oddly enough it implies structure across the tree earlier on in training, which your hashing stuff shows us I think shouldn't be catastrophic for the model (and it should basically factorize out as the branch selection gets more and more popular)
Yes I am very much enjoying this collaborating
Yeah though you could lock in branch structure progressively XD
yeah i have a [very frozen] community project about doing that branch locking for FFFN
Probably with some small penalty by having learned shared info between experts, but as you noted earlier, if the network can take the abuse of hashing, then a little bit would be not too terrible maybe
I have a feeling it will be a logarithmic kinda thingie
I think the key is just to make your choices early on in training and stick to em
Yeah that works I guess
I think as long as the bulk of the higher level conceptual layers (like higher up in the tree) shared between sub-nodes are generally informatively positioned, then that's more important than the structure of the particular specialist layers
I'm trying to think of a 'fair' test for my existing code that would add a second layer and work somewhat in this fashion
i guess i can use 4 normal sized experts at depth 1, and 8 half sized experts at depth 2 (to total the same as my current 8 normal sized experts)
well currently I choose 1 of 8 1x sized FFNs and add the result to the depth 0 1x sized FFN
so I was going to change that
from 1FFN + 1 of 8 1FFN
to 1FFN + 1 of 4 1FFN + 1 of 8 0.5FFN
more compute, but same parameters
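The accounting behind "more compute, but same parameters", in units of one normal-sized FFN and assuming exactly one expert is active per level. `summarize` is a hypothetical helper.

```python
def summarize(levels):
    """levels: list of (num_experts, relative_size); one expert per level is active."""
    params = sum(n * size for n, size in levels)
    active = sum(size for _, size in levels)
    return params, active

current = [(1, 1.0), (8, 1.0)]               # 1FFN + 1 of 8 1FFN
proposed = [(1, 1.0), (4, 1.0), (8, 0.5)]    # 1FFN + 1 of 4 1FFN + 1 of 8 0.5FFN

print(summarize(current))   # (9.0, 2.0)
print(summarize(proposed))  # (9.0, 2.5)
```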
Oh okay, gotcha, sorta b-tree kinda-like business goin' on ovah hier. ❤️ :')))) 👍
yeah only bc i need some baseline to test against
and it started life as an 8 way b-tree of depth 2
See you invented this first, you have a Schmid window

I'm sorta a little excited. No need for details if there aren't any (sorry if so 😬) but I'm sorta curious about experiment plans if there's anything at the moment. Maybe I should look into trying it out under hlb-gpt at some point toosies, seems like a decent fit for it. ❤️ :'))))
o sorry yeah I don't know when exactly I will get to do the experiment I outlined, let alone more complex ones
would love to hear if you do
and if you need MoE code I just wrote some new very short code that works within deepspeeds MoE stuff
but tbh you're better off testing it without all the trimmings, in which case its ez nuff to just write (without expert parallelism support etc)
If there's any good raw PyTorch code I can take a gander, otherwise I can write it myself
It doesn't seem too complicated, especially as the routing embedding simply assumes that the switch controls every pair of branches the same way at each layer
yeah the only tricky part is that the switching has to be per token
and yet you want to operate in parallel
Yeah I'm probably gonna be lazy and just post hoc merge
but you can just do all the calculations and choose which to use
But I'd like to maybe take advantage of 1:2 / 2:1 sparsity whatever sparsity if it's not horribly borked, to do the leaf interleaving trick
Oh. Hm. Maybe there are some fun ideas heresies. My experience leads me to believe that a universal, singular kernel launch is faster, but stills.......
Hm
well u can experiment with it being slow and crummy and wasteful and if you find something here is good we can then worry about an efficient implementation
as long as its fast enough to run some good tests
Yeah exactly
I've beat my head into the wall too often looking for clever fast implementations when sometimes just trying a bunch of dumb, (or maybe better yet, "dumb") shit works out pretty well in da endsies. ❤️ :")))) 👍
@still grail i was thinking some more about all this, and got to thinking about the relationship with DeepSeekMoE where they split up the FFN experts in to a zillion tiny parts which they mix and match
seems like our tree approach has similarities, except that we were going to choose certain wider experts as lower levels of the tree, where deepseekMoE is more like a shotgun, choosing say 50 out of 1000 same sized tiny experts
in the limit, deepseekMoE of a d_model x d_ffn x d_model FFN becomes d_ffn different minimal d_model x 1 x d_model FFNs
so what if we hash shuffled the actual minimal ffns, such that out of 1000 we choose say 50 out of which we create a dynamically generated d_model x 50 x d_model FFN
effectively, each token gets its own custom FFN made out of 50 of 1000 minimal parts
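A naive sketch of the per-token custom FFN, assuming a ReLU activation and using a token-seeded NumPy generator as the "hash shuffle" (both are stand-ins, nothing here is from an existing implementation). The gather per token is exactly the part that would be slow on a GPU.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_parts, k = 16, 1000, 50

# Pool of minimal FFNs: part i is one input row and one output row.
w_in = rng.normal(size=(n_parts, d_model))
w_out = rng.normal(size=(n_parts, d_model))

def parts_for(token_id):
    # Deterministic per-token choice of k parts, seeded by the token id.
    return np.random.default_rng(token_id).choice(n_parts, size=k, replace=False)

def token_ffn(x, token_id):
    idx = parts_for(token_id)
    hidden = np.maximum(w_in[idx] @ x, 0.0)   # (k,) custom hidden layer
    return hidden @ w_out[idx]                # back to (d_model,)

x = rng.normal(size=d_model)
y = token_ffn(x, token_id=42)
```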
this seems ideal but maybe very slow to compute
so maybe the idea is the tree approximates it but is less tricky for GPUs
you could also just reduce it by applying consistency loss between experts
i'm listening
yeah that is an interesting approach too. the overhead maybe then moves to like you said how to mix and match all of the various experts there
there are old papers that do this with weird nonstandard architectures etc
it works okay
well i think using a tree manages that expense via a tradeoff using its log-ness of depth vs width at each level
but if you enforce a consistency loss between experts it forces them to share information
you can also have
a lora
it can even be full rank
but
one of the matrices shared and the remainder experts
splitting the FFN up a la deepseekMoE is essentially decomposing it into its constituent LoRAs
there are a few sort of clear ways to reduce duplication
agreed def
i had to go pull this but i remember this one
i disagree
lora can be full rank and the interaction between the decomposed matrices is multiplicative
these run "side by side" with each other
how come? the math is identical isnt it if you split apart all the middle layer neurons from a big FFN into their own tiny FFN and sum the results of all of those tiny ones?
no, this concatenates the smaller matrices, lora matmuls them
maybe we mean different things when we say LoRA
possibly
i really just mean a bottleneck 'FFN'
LoRA tho is typically adding the results of such a bottleneck to the base value of a full FFN
yeah this is true
sorry, I guess I'm abusing terms when I say LoRA and mean just a bottleneck
all I was saying is that a d_ffn wide FFN is made up of d_ffn 1-wide bottlenecks, summed
and that maximally, this is what DeepSeekMoE does as a decomposition
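The claim that a d_ffn-wide FFN is a sum of d_ffn 1-wide bottlenecks can be checked numerically. This is a small sketch with made-up sizes and a tanh-approximated GELU; it relies only on the activation being elementwise, so each hidden neuron can be evaluated on its own.

```python
import math
import random

def gelu(v):
    # tanh approximation of GELU
    return 0.5 * v * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (v + 0.044715 * v ** 3)))

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def full_ffn(x, w_in, w_out):
    # w_in and w_out each hold d_ffn rows of length d_model.
    hidden = [gelu(dot(row, x)) for row in w_in]
    return [sum(hidden[j] * w_out[j][i] for j in range(len(hidden))) for i in range(len(x))]

def one_wide_ffn(x, in_vec, out_vec):
    # d_model x 1 x d_model: a single hidden neuron.
    h = gelu(dot(in_vec, x))
    return [h * o for o in out_vec]

def sum_of_bottlenecks(x, w_in, w_out):
    y = [0.0] * len(x)
    for in_vec, out_vec in zip(w_in, w_out):
        y = [a + b for a, b in zip(y, one_wide_ffn(x, in_vec, out_vec))]
    return y

rng = random.Random(0)
d_model, d_ffn = 4, 16
w_in = [[rng.gauss(0.0, 1.0) for _ in range(d_model)] for _ in range(d_ffn)]
w_out = [[rng.gauss(0.0, 1.0) for _ in range(d_model)] for _ in range(d_ffn)]
x = [rng.gauss(0.0, 1.0) for _ in range(d_model)]

big = full_ffn(x, w_in, w_out)
small = sum_of_bottlenecks(x, w_in, w_out)
max_diff = max(abs(a - b) for a, b in zip(big, small))  # zero up to float rounding
```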
one of these is shared over all experts
one is not
presumably shared knowledge will end up in the shared one
'each ffn' as product meaning each expert?
sorry, are you saying that the expert's weights should be the results of two matrices?
yes
ahh
and one of those two should be shared over experts
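A rough sketch of "expert weights as a product of two matrices, one of them shared". Which factor is shared, the inner size, and the names `shared` and `private` are assumptions; the thread only says that one matrix is shared over experts and one is not.

```python
import random

def matmul(a, b):
    # (n x m) @ (m x p) -> (n x p) on nested lists
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d_model, d_ffn, inner, num_experts = 4, 8, 4, 3   # inner == d_model keeps the product full rank

shared = rand_matrix(d_model, inner, rng)                               # one copy for all experts
private = [rand_matrix(inner, d_ffn, rng) for _ in range(num_experts)]  # one per expert

def expert_weight(e):
    # The expert's up-projection is generated as a product, not stored directly.
    return matmul(shared, private[e])

def expert_up(x, e):
    # Same result without materializing the weight: shared factor first, then the private one.
    return matmul(matmul([x], shared), private[e])[0]

x = [rng.gauss(0.0, 1.0) for _ in range(d_model)]
via_weight = matmul([x], expert_weight(1))[0]
via_factors = expert_up(x, 1)
```

Gradients reach `shared` from every expert, which is the mechanism by which shared knowledge would presumably collect there.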
the mixup was that I thought you were referring to how to calculate the expert, but you were referring to a kind of generative fast-weights idea of how to create the actual feedforward weights themselves rather than learn them directly
tbh i have never understood why there isn't more weird matrix factorization going on in general
we have a very solid proof of concept that in one specific case the technique gives you vast gains in efficiency
I think it's that it can create a lot of potentially unwanted instability over the course of training
Depending on how it's done
well this is pretty out there... i mean, i havent seen any code ever that learns weights indirectly like this
presumably due to stability reasons
in other words, i love it
i love this explanation thank you
not knowing why something isn't done makes me think I should not do it
oh wait, i thought that was supposed to be a double negative... does it make you want to do it or not?
🤣
it makes me want to do it, not knowing makes me assume that there's some good reason I should not that i don't know yet
instability is an interesting problem here
speaking of
I'd say go for it if you want if you're wanting my personal perspective on it, you just have to keep in mind that having weights generate weights as a function of other weights will sorta make the Jenga tower a bit more rocky unless you're rather clever somehow with it. :3 ❤️ :'))))) 👍
i have given up on using neox for ordinary experiments, it is just really not hackable enough
use my gptcore 🙂
its specifically designed for LLM experiments
everyone who has used it loves it (that's not many people tho, and im sure it can be improved)
isn't it on an rwkv-family architecture?
nope, it supports general transformer-like stuff modularly w/ high configurability
welp
i dont actually specifically want you to use it, it just seems convenient for the task at hand
since it was born out of my frustration w/ all the existing tooling
i have it and two other things bookmarked as maybes
it definitely has deficiencies.. for example its not designed to load existing weights particularly from other projects
its just designed for doing experiments w new architectural changes
esp on consumer hardware
i think i am definitely in an experiment running place, i will probably fiddle around before i settle on what's comfortable
i have uh
i have caught an excess of things to do in general so i am probably out of commission at least another week
the other two things i know of with similar are chili and neuralink's thing and another sandbox someone has
lemme get off of mobile
cool, link me to em when u get a chance id like to check them out
my excess of things to do is mostly to get RWKV MoE working the best it can so I can train it
yeah i have uh a completely separate day job that is currently trying very hard to snag all my attention
https://github.com/huggingface/nanotron this one for sure
i have another one somewhere
https://github.com/HMUNACHI/nanodl this one in jax, but that's not the one i am thinking of, which I have apparently lost
hmm nanodl looks the most relevant, but jax and not focused on data based configurability and ease of running
@dawn vine i think i figured out a good general landing zone for a dynamic weight communication motif
(whoops, sent it earlies deresies 😬 😬 😬 😬 😭 😭 😭 😭 )
i think if'n we do indeed try to lean on the input token (or something similarly-fixed like that) as the keys for choosing the experts within some paradigm -- if that holds up in some way (which i think it should, even based on the hashing business that you were doing earlier), then all methods that rely upon that for dynamic weight generation i think can do AOT lookup and communication if the function combining those values is linear
I know that's for dynamic weight generation which is a little different
u can ping me any time ill have it off if im asleep or whatever 🙂
but being able to do it AOT should hide some of the latency of communication a bit
just simply because the only dependency for generating weights is "input token", and so that means it's easier to do all of the layers sorta all at once instead of with a sync each time
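A small sketch of why token-only routing allows ahead-of-time planning: the destination of every token at every layer can be computed before the forward pass starts. The `route` function here is a hypothetical stand-in for any rule that depends only on the token id and the layer.

```python
from collections import defaultdict

def route(token_id, layer, num_experts):
    # Hypothetical stand-in: any rule that looks only at the token id and the layer.
    return (token_id * (2 * layer + 3)) % num_experts

def plan_all_layers(token_ids, num_layers, num_experts):
    # plan[layer][expert] -> positions in the sequence that go to that expert
    plan = []
    for layer in range(num_layers):
        buckets = defaultdict(list)
        for pos, tok in enumerate(token_ids):
            buckets[route(tok, layer, num_experts)].append(pos)
        plan.append(dict(buckets))
    return plan

tokens = [17, 4, 17, 256, 91]
plan = plan_all_layers(tokens, num_layers=3, num_experts=4)
```

Only the destinations are known early. The activations being sent still come out of the previous block, so this removes the "compute the route, then send" dependency rather than the wait for the data itself.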
might not be useful but just something i was chewing on w.r.t the distributed side of the dynamic weight generation side/part of the problem. :3 🙂
that's a great point... not sure if comms can happen earlier or not, gotta consider that a bit
like either way u gotta wait for the prior layer/block to complete so you have the embeddings to transfer over
but yeah at least you don't THEN have to operate on them before knowing where to send em
i skimmed a paper that did a kind of pipelined work to allow the comms to get pipelined as well
i assume they split up the prior layer's work into chunks, so that each chunk could start getting sent across the wire when it completes
The Mixture of Experts (MoE) model becomes an important choice of large language models nowadays because of its scalability with sublinear computational complexity for training and inference. However, existing MoE models suffer from two critical drawbacks, 1) tremendous inner-node and inter-node communication overhead introduced by all-to-all di...
@dawn vine @fallen spear still working on this, definitely coming along the train of thought that 'switching is more expensive up front even if it gives more capacity', I've visited the idea of branching/cloning weights in the past so I want to revisit/keep visiting these things.
As far as the binary tree stuff goes, I'm still exploring this! Trying to find a semi-efficient proxy that's maybe's a halfway solutions heresies.... ❤️ :')))) But I am definito makin' some progress.
Right now I'm exploring something semi-completely different that I've visited a number of times over the past few months, some of @fallen spear's comments keep spurring me to look at it again.
Some promising initial progress on that, but no word on it yets. ❤️ :'))))
Cool, I've been temporarily distracted from MoE by other stuff but eager to get back to it!!
I finally tried this, and it was a little better
Unclear if that's because the new hash function was just luckier or it actually matters
I multiplied by a prime number for each layer: expert_idx = (token_idx * primes[layer_id]) % num_experts
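The rule quoted above, written out as a runnable sketch. The specific primes, the vocabulary size, and the expert count are assumptions for illustration.

```python
from collections import Counter

PRIMES = [31, 37, 41, 43]  # hypothetical choice, one prime per layer

def expert_idx(token_idx, layer_id, num_experts, primes=PRIMES):
    # Multiply by the layer's prime, then take the modulus.
    return (token_idx * primes[layer_id]) % num_experts

num_experts = 8
load = Counter(expert_idx(t, 0, num_experts) for t in range(1000))

# Check whether two tokens share an expert at layer 0 exactly when they share one at layer 1.
same_groups = all(
    (expert_idx(a, 0, num_experts) == expert_idx(b, 0, num_experts))
    == (expert_idx(a, 1, num_experts) == expert_idx(b, 1, num_experts))
    for a in range(64) for b in range(64)
)
```

One thing worth checking against a real implementation: in exact integer arithmetic, when the prime shares no factor with `num_experts`, the multiplication only relabels experts. Tokens land together whenever their ids are congruent modulo `num_experts`, at every layer, so `same_groups` is true here. Whether that holds for a given codebase depends on details (token id remapping, overflow, extra mixing) that the thread does not specify.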
lines up with original hash layers paper fwiw
cool, good to know!
i may really use this in production lol
kinda scared
i also have a maybe even more nutso routing mechanism (like how I actually choose where to send each token, not just what the hash said to do) thats like beyond stupid
but as a result its fast and wastes no 'capacity' 🙂
coming soon to a RWKV near you ™️
ive been revisiting but i freaking keep getting pulled away on related tangents lol
like its turned into an entire freaking pre-release for hlb-gpt that has nothing to do with moes (not really at least, lolz)
and for some reason i have cleaned up the entire freaking codebase and annotated it more
this has to be about one of the strangest tangents maybe i've taken in my life but well, here we are lol
hopefully it pans out, doing tuning experiments now that wikitext-103 is back online :3 ❤️
lol! okay well sounds productive in some direction at least!!!
i was all "oh shit, there's a different wikitext" and it was just. the same one i have
there was... a hunt for it (the raw version at least)
fighting the urge to buy the wikitext-103-raw-v1.zip domain just to serve it
haha nice
i love the community redundancy so freaking much, lolz ❤️ ❤️ ❤️ ❤️ :'))))
sorry @still grail I figure this is maybe the right place to discuss MoE-like and FFN size related things after all 😁
anyway I was interested to see you chose 2x ffn in HLB-GPT as the default
I'm about to embark on a larger 1.5B training run for my RWKV-6 MoE using hash routing
seems to work great, and I use a somewhat unusual configuration where I put MoE as a second FFN after the normal pretrained one, but additively (it starts at zero contribution and I use this to continue a pretrained non-MoE model)
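A sketch of the "additive second FFN that starts at zero contribution" setup. The exact wiring is an assumption: here the expert reads the output of the pretrained FFN's residual add, and each expert's output projection is initialized to zero so the block initially reproduces the pretrained model exactly.

```python
import random

def linear(x, w):
    # x: length-n vector, w: n rows of length m -> length-m vector
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def ffn(x, w_in, w_out):
    hidden = [max(0.0, v) for v in linear(x, w_in)]
    return linear(hidden, w_out)

def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]

def block(x, base, experts, expert_id):
    # Assumed layout: residual add of the pretrained FFN, then residual add of the routed expert.
    h = [a + b for a, b in zip(x, ffn(x, *base))]
    return [a + b for a, b in zip(h, ffn(h, *experts[expert_id]))]

rng = random.Random(0)
d_model, d_ffn, num_experts = 4, 8, 3
base = (rand_matrix(d_model, d_ffn, rng), rand_matrix(d_ffn, d_model, rng))
# Random input projection, zero output projection: each expert contributes exactly 0 at the start.
experts = [(rand_matrix(d_model, d_ffn, rng), zeros(d_ffn, d_model)) for _ in range(num_experts)]

x = [rng.gauss(0.0, 1.0) for _ in range(d_model)]
pretrained_only = [a + b for a, b in zip(x, ffn(x, *base))]
with_moe = block(x, base, experts, expert_id=2)
```

Keeping the input projection random while zeroing only the output projection means the output projection still receives a nonzero gradient on the first step.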
would be great to hear where you landed on your MoE experimentation and if you ended up with any of the tree-like stuff in there
and I think this is all interesting to consider relatively with regard to Mixture of Depths
hihihihi! i'm on a quick brain break from doing some intensive hlb-gpt dev, i can give some light answers now and maybe some more involved answers later if needed
cool, thanks!
you might rock your own socks off at this, but this is actually technically a 1x parameter! 1x for local, 1x for remote. I couldn't find a configuration that performed better, twiddling the v_dim and the expand_dims around (univariate + multivariate experimentation case) didn't seem to do so well. some of that might just be linear/blocksize stuff/whatever, some of it might be something else, it felt more like a 'something else' than a raw hardware efficiency thing as runtimes seemed to be not too dissimilar between them.
The one direction that did seem to have the 'lowest loss' of performance was increasing expand dim, and I think increasing v_dim? However, the ratio between v_dim and the qk_dim should always be 8, the dim reduction parameter ties the qk_dim to the normal dim so you might have to fiddle a bit there if so
this network seems to really, really, really like going a bit narrower and deeper for some reason
I did! that's how i ended up on the hlb-gpt release, oddly enough, lol, ended up experimenting with efficient layer layouts as I couldn't find a super fast MOE solution, and here we are
I did do some initial tree stuff, but nothing too complicated, and it didn't pass my complexity and/or speed requirement test, so I put it back on the shelf until I could think of something with a similar vibe that was super efficient
ive tried training on stochastic depths just as like a warmup kinda thing (had some serious bungee-cording vs the original, 'train all layers at once' strat), don't have completely thought out thoughts on that but it is very much an interesting topic for suresies.... ❤️ :'))))
oh for FFN you just do like d_model->d_model ('expanded' 1x)->d_model? im missing something about the meaning of local vs remote in this context
and you use a truly giant V eh? thats interesting i never saw that give me much gains
oh maybe this V size vs QK is related to not using attention heads!
This is a fused mlp+attention model, we do attention in the nonlinear latent space of the geglu! and some proportion of it is assigned to local-only (no attention) and some to remote. mixing and matching seems to incur a computation penalty with little gain, though there may be some methods/avenues out of that. Not entirely sure!
ah I grok your meaning of local vs remote now
Yeah I have sort of a suspicion for this, I feel like the nonlinear stuff is sort of doing the work of information containment that an attention head would do, just maybe more effectively as still the attention heads are linear? maybe?
I'm honestly not sure, that thought is still like 35-45% thought out or whatever
the local vs remote part seems to me related to the stuff @boreal moss had discussed with me and then here earlier about splitting off a subpart of the embedding to do attention on
so that you can do 'smaller attention' on a huge embedding
yes it is also exceedingly computation efficient
yeah i love that part
many, many fewer kernel launches, and im guessing the training stability is much better too as you're not having to forcibly pass things through a longer residual for certain kinds of operations
though I don't have a good example at hand that shows that exactly
just sort of, if you had some kind of feature that required 3 attention operations or whatever, having to pass through 3 residuals instead of 6 bodes extremely well for not just computation time, but also training stability I think
that's again a bit arbitrary and not necessarily grounded in the reality of what's happening there, just trying to paint a bit of a rough picture of some of the reasoning there
yeah i tend not to care too much about kernel launches bc I pretty much require torch.compile which does away with that as an issue if done right
but stability thru fewer layer-like things could certainly be a benefit
and it still is probably quite a bit faster since it can use GPU parallelism more fully
well some of my experiments are just like ~40-100 seconds or so which is less than a lot of compiles so for now at least im staying flexy (am experimenting with some potential partial compiles for the longer runs tho)
yes, a fused kernel for this would be absolutely unreal
yeah hey im very into rapid iteration - i feel like no one else but you and I like that in the ML world 🤣
i have no idea why, the days where I run 4 long experiments vs 200-300 fast experiments are like so night and day different lol
agreed 100%
long as the proxy scales (and my workflow is built to try to make sure that it does)
yeah that part im unsure about tbh
there are a ton of things ive found which work amazing until like 1-2GTok in
you can interleave the geglu layer downward projection + add to residual for example with the attention layer, which I think would make it even more memory efficient, however I am not entirely sure about this one
yeah
well part of this too I think was thinking about the transformer scaling issues
the linear attention value I think doesn't scale precisely because it is linear
it can't suppress stuff
so in order to do so the network has to absolutely spike
and only the large models seem to handle that
(also a bear for training stability)
are you talking about linear attention? or the value of attention in traditional softmax dot product self attention?
linear values
at least that's my hypothesis for why the spiking occurs
i really should check to see if that happens in these networks as well
we dont find a lot of problems in training stability for RWKV btw
that's lovely
theres a lot of normalization that goes on which may help with that
and linear_attention/linear transformer is just kind of fundamentally different too
that's a fair point, and something i'm sort of scared of honestly. only in the 3.2ishB token ranges for now
your setup is a little more like mamba in a way where they expand to get V, do SSM style 'attention', then contract again - they dont have a separate FFN
well also, the data-dependent gating seems to be an interesting/novel thing, i was reading through your code to try to understand it a bit better, and the double-lerp "check if trajectory is here based upon some data-dependent scaling, then scale the delta in trajectory by some lookup value if so" (if I understood that code correctly) method seemed really interesting, for example
gosh I need to get over the SSM hump
I feel like I might become a state space model fanatic if I understood the whole general idea/had a good toolkit behind it. Sorta gotta still work on wrapping my head around that.
you can read the paper if its easier in the RWKV-papers channel or preprint should be out Monday 🤞
i could give it a go! I basically have math symbol dyslexia or the like so if it's heavy in that then I may just be best off sticking to reading the annotated code, that made sense to me (though it is certainly a new paradigm to work through! ❤️ :')))) )
im not amazing at math symbols even without any dyslexia component and I too prefer code if it's simple and well written
SSM is real simple conceptually if you need a quick explainer...
took me a while to understand from the papers tho, theyre a real bear
at some point that would be good, my brain is fried from coding right now 😅 . maybe later would be really cool! :3 🙂
sure, anytime 🙂
re RWKV-6 I tried to add some other explainer stuff to the paper too so maybe its readable enough even tho its not code exactly - and the recurrent formulation math looks just like two lines of code, which is nice
ok so to clarify my understanding, if written unfused you:
att = attention(W_att_a(x) * GELU(W_att_b(x)))
ffn = W_ffn_a(x) * GELU(W_ffn_b(x))
out = cat([att, ffn])
and W_att_a etc. are all (d_model, d_model)
yep! (ignoring the q and k of course here)
cool, did you try using different ratios for the ffn side vs att?
yes! This is what I was talking about with the v_dim business up above here
ah i see
the way the v_dim is written is to try to encourage the user/etc to think of the whole thing as a shared space with different allocations for different places
yep
they all seemed to perform not quite as well, but that one seemed to have the least proportional hit to the number of parameters being changed
(i think i resized the net though to make sure the overall number remained roughly the same, not sure tho)
yeah so this gets back to the topic of this thread, which is what the proper expansion ratio really is
yep
it might be different for different things
at least for the 125M model as well, 1+1 seems to be holding strong on the more information dense wiki task
but maybe for like open pretraining, it will need a wider ratio
but it really does like going deep
the 125M model is like 28-32 layers deep lol
i kind of think your way of doing this is better and is like what mamba should be doing
XD
mamba is also on my long todo list of at least roughly understanding XD 😭 😭 😭 😭
they end up doing 'attention' on the fully expanded version (2x or whatever) so its slow
oh interesting
yeah at least for the keys i dont think its as necessary
which honestly makes sense, you do a lower dim check to see if you should move the activations, then the activation moving should take up the bulk of the work
if your precision on the lookups is super high then relative to the amount of work we're spending on the actual activations themselves it seems a bit wasteful
okay, gotcha
oh dang thats tiny
wow
so basically stacking lots of local stuff
similar to the shifting business
(like you were saying earlier)
yeah same idea, Bo uses it to allow there to be induction heads in a single layer
but its not relevant to anything we're discussing particularly
just a cheap added boost
gotcha
but you can see there how they expand 2x (or whatever size) up front
well it sort of looks like you're forced to keep the higher dims in this motif, as the sigmoid gate post-SSM is also in that higher dimension
in exchange for this they use 2x the layercount btw
bc they no longer have a separate FFN
2x or half?
2x!
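Back-of-envelope arithmetic for the 2x layer count, counting only the large projection matrices. The formulas are the usual approximations (attention as four d_model x d_model matrices, a 4x FFN, and a Mamba-style block as input and output projections around an expansion factor of 2); smaller terms such as norms, convolutions, and SSM parameters are ignored.

```python
def transformer_block_params(d_model, ffn_mult=4):
    attn = 4 * d_model * d_model             # Q, K, V, O projections
    ffn = 2 * ffn_mult * d_model * d_model   # up and down projections
    return attn + ffn

def mamba_style_block_params(d_model, expand=2):
    # Two input projections (d -> expand*d each) plus one output projection (expand*d -> d).
    return 3 * expand * d_model * d_model

d_model = 768
per_transformer_block = transformer_block_params(d_model)
per_mamba_block = mamba_style_block_params(d_model)
blocks_needed = per_transformer_block / per_mamba_block
```

Under these assumptions one attention+FFN block costs about as many parameters as two of the fused blocks, which lines up with doubling the layer count.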
oh
interesting
i did try a no-ffns motif
it honestly was not too terrible
not the best but really not terrible
having that split seems to be useful somehow
yeah i eventually got this style to work fine but it was slower
(and maybe cheaper too XD)
I would love to see this split latent space idea come to recurrent networks
happy to add it to RWKV going forward 🙂 just cant figure out how to shove it into my MoE which would be the easiest path
ive been thinking about the MOE+this layer version of things
maybe its just as simple as a silly expert gating thing
maybe not tho XD
well i have VERY simple crazy code now for moe that doesnt do much work
I guess just for the local side of things? XD though the nonlinear layers can do computation in transit too
yeah MoE on attention is nearly impossible
yeah i haven't played around with it. I'm assuming at least on the qk side of things, is it also the V side of things that gets weird?
im trying to come up with a succinct reasoning 🙂
it would make sense if its required that there be a fixed path for certain things, i could see (very loosely/roughly) how the training dynamics could get really weird if not
attention is looking back at past tokens, but if you make those tokens vary depending on the expert, thats... hard to imagine an implementation
yeah i was thinking about it sort of from the ODE-like interpretation of transformers
if you think about it as a process slowly moving vectors from one place to another, and each layer is specialized at one part of the process, then you honestly would have to switch them all at once, or not at all
and, the other layers sort of depend on that together as well (the mlp layers i thinks.... ❤️ :')))) )
I think the traditional MoE style of doing things assumes a very particular style of 'do XYZ routing locally per layer', which i don't think can (easily, at least) capture this more global kind of context to it
so i guess from that, I feel like moes on attention could work if done in an 'all-at-once', unified kind of manner
as you are at that point mainly just making a '1-of-n' selection of possible trajectory sets over the course of the entire network, from layer 0 to the output layer
i dont think it'll work otherwise really, as the local moes for the attention layers will be nearly almost nonsensical w.r.t. each other
that being said, i dont think that version of an explanation really works all that well for explaining how the FFN moes work, or would work w.r.t. a more globally-sliced attention MOE
(if that makes sense at all.... ❤️ :'))))
I'm sure there'd be some interesting set of slicing dimensions where you could slice almost vertically across attention MoE layers (subsets dimensionally, across the entire vertical depth of the network) instead of horizontally (big giant chunks at each block?), if this motif worked?
i think i gotta spend some time adjusting my brain to thinking about time-related MoE before I can have a reasonable thought about it 🤣
It might not, but I feel like that would sorta at least be a semi-required starting point unless there is some fantastic trickery going on here
fair enough, XD these are just the scattered ramblings of a slightly-woozy, highly-caffeinated researcher XDXDXDXD
getting back to this flowchart for a second you do your gating before the attention which is an interesting difference
there's a paper where they gate an entire layer ahead of time
we do ours afterwards, and that's kind of more standard now
it worked fine
they did this for resource reasons, it gives you time to fetch the expert from cache
pretty interesting!
so you make your decision, perform other operations and run your expert un-caching in parallel
rant_about_hash_routing.jpg
most things work, routing is black magic
I think I don't understand what's meant by gating the layer ahead of time here!
(like, question from me: during training, or during eval/inference?)
the routing layer is a full ff layer earlier
so it runs iirc
gate -> some_ff -> attn -> gated_ff
but for every ff
i might have that paper somewhere, not sure
they did this specifically because they were trying to use an moe while very memory constrained so the extra time between the gate and the gated ff was for fetching it
especially: allowed them to have more experts in colder RAM
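A toy sketch of the "gate early, fetch while other work runs" pattern for non-batched inference. Everything here is a stand-in: `COLD_STORE` plays the slow memory, `fetch_expert` sleeps to imitate a slow load, and `gate`, `other_work`, and `gated_ff` are hypothetical placeholders for the real layers.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Stand-in for experts kept in slow ("cold") memory.
COLD_STORE = {e: [float(e)] * 4 for e in range(8)}

def fetch_expert(expert_id):
    time.sleep(0.01)  # pretend this is a slow load
    return COLD_STORE[expert_id]

def gate(x, num_experts):
    # Hypothetical gate: any function of the current activations.
    return int(sum(abs(v) for v in x) * 1000) % num_experts

def other_work(x):
    # Stands in for the layers that sit between the gate and the gated FF.
    return [v * 2.0 for v in x]

def gated_ff(x, expert_weights):
    return [v + w for v, w in zip(x, expert_weights)]

def forward(x, pool, num_experts=8):
    expert_id = gate(x, num_experts)                # decide early
    pending = pool.submit(fetch_expert, expert_id)  # start loading in the background
    h = other_work(x)                               # runs while the fetch is in flight
    return gated_ff(h, pending.result())            # blocks only if the fetch is still running

x = [0.1, 0.2, 0.3, 0.4]
with ThreadPoolExecutor(max_workers=1) as pool:
    out = forward(x, pool)
```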
no wonder i keep forgetting golang trivia i am just filling my entire long term memory with this sort of thing
this is an inference time concern to be clear
when running non batched mostly
in batch regime you need roughly all of your experts so this gets you nothing
O ok I am mostly thinking about train time
still I guess this has bearing on when gating can occur profitably
yeah they were running some model under fairly absurd resource constraints
i would be surprised if it matters a lot where you put your gates relative to the MoE layer
Mixture-of-Experts (MoE) networks have been proposed as an efficient way to scale up model capacity and implement conditional computing. However, the study of MoE components mostly focused on the feedforward layer in Transformer architecture. This paper proposes the Mixture of Attention Heads (MoA), a new architecture that combines multi-head at...
they keep K and V the same across head experts, but allow Q and O to vary, to make it feasible
conveniently, this setup works just as well for linear transformers!
@rose vapor wrote a paper on GLU variants: #research message
I personally did not find sin to work in any circumstance in LLMs
it's interesting it was the best one tested
yeah saw that, was curious
@rose vapor did you ever try FOUR layers? (three multiplies)
the specific format he chose, I think would also result in gradient explosion in my tests
because the oscillations become very steep away from x=0
oh, he only tested applying one activation but not 3, rip
I didn't have the budget to try other data modalities other than vision, but I provide a code snippet to implement the SinGLU function
I suspect the results I got are probably very dependent on the data augments common in vision transformers such as label smoothing etc.
I'm not sure what the equivalents are in language, sorry.
What do you mean by FOUR layers?
x1 * x2 * x3 * x4
Ahh, nope
Just 1 - 3
With only one being passed through an activation as this is what SwiGLU does under the hood
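A small sketch of the "product of several projections, only one activated" family being discussed. The function name `multi_glu`, the choice of SiLU, and the toy weights are assumptions; this is not the snippet from the paper.

```python
import math

def silu(v):
    return v / (1.0 + math.exp(-v))

def linear(x, w):
    # x: length-n vector, w: n rows of length m -> length-m vector
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def multi_glu(x, weights, act=silu):
    # weights: one projection matrix per branch; only the first branch is activated.
    branches = [linear(x, w) for w in weights]
    out = [act(v) for v in branches[0]]
    for branch in branches[1:]:
        out = [a * b for a, b in zip(out, branch)]
    return out

x = [1.0, -2.0]
w1 = [[0.5, 1.0], [0.25, -1.0]]
w2 = [[1.0, 0.0], [0.0, 1.0]]
w3 = [[2.0, 0.0], [0.0, 2.0]]

two_way = multi_glu(x, [w1, w2])        # act(x W1) * (x W2), the SwiGLU shape
three_way = multi_glu(x, [w1, w2, w3])  # x1 * x2 * x3
```

The x1 * x2 * x3 * x4 variant asked about above would just be a fourth matrix in the list.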
yeah we tried a lot of this stuff early on in this channel on language modelling, you can probably look back and see it all
Awesome, will do
his test results are also interesting. this one sometimes does well
Yeah the results sort of couple by number of matrices. I suspect this is because of the general shape of the output
With 1st order GLUs having a linear quality and 2nd order being sort of parabolic
Which order works best depends on the activation, with 2nd orders working well for Sigmoid, but 1st orders working well for Sin and Tanh
With sin it's easy to understand why. The first matrix does frequency modulation and the second does amplitude.
I suspect the Pre-LayerNorm is crucial for SinGLU to work. I have numerical experiments showing this mostly kills oscillations in the loss landscape
so you managed to restrict the input of singlu to a region around x=0. that prevents the huge gradient oscillations
Exactly!
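A scalar illustration of why bounding the input matters for a sin-gated unit. The form used, sin(a*x) * (b*x), is an assumed one-dimensional reading of "first matrix does frequency, second does amplitude", and the layer norm is a plain textbook version without learned scale or bias.

```python
import math

def sin_glu_scalar(x, a=3.0, b=1.0):
    # Assumed scalar form: sin of one projection times a second, linear projection.
    return math.sin(a * x) * (b * x)

def grad(x, a=3.0, b=1.0):
    # d/dx [sin(a x) * b x] = a b x cos(a x) + b sin(a x)
    return a * b * x * math.cos(a * x) + b * math.sin(a * x)

def max_abs_grad(lo, hi, steps=2000):
    return max(abs(grad(lo + (hi - lo) * i / steps)) for i in range(steps + 1))

def layer_norm(v, eps=1e-5):
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)
    return [(a - mean) / math.sqrt(var + eps) for a in v]

near = max_abs_grad(-1.0, 1.0)   # gradient magnitude for inputs close to 0
far = max_abs_grad(9.0, 11.0)    # same window width, far from 0
normed = layer_norm([100.0, 200.0, 300.0, 400.0])
```

The gradient bound grows roughly linearly with |x|, so `far` comes out several times larger than `near`, while the normalized inputs stay within a couple of units of zero regardless of their original scale.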