#d_ff/d_model + swiglu tests


dawn vine
#

I can share these extra experts across layers, per your idea (that everyone seems to want to try)

fallen spear
#

it does seem like a free lunch

dawn vine
fallen spear
#

none of us remember what it is

#

it was probably a while ago

dawn vine
#

thanks!

#

still some question of whether it might be useful in the context of DeepSeek slicing tho and plus your all-layers idea

fallen spear
#

it is probably worth revisiting. since none of us remember which paper it was it is probably kind of old

#

it might be in the bibliography in the moe reading group channel somewhere

dawn vine
#

maybe a better use of parameters would be how they do it in Mamba, where they unify FFN and Attn into a single block, then double layer count

#

then you can apply the experts to both at once

#

(expansion and contraction matrices constitute an 'expert' here)

fallen spear
#

if you slice it maximally your "experts" are all single vectors and a single vector is agnostic for all purposes except magnitude

#

i shouldn't say magnitude, i guess i mean 'dimension' -- how many scalars are in it

dawn vine
#

yeah its true fundamentally 'where its most useful' is kind of a perpendicular concern to how MoE is done
which is nice

fallen spear
#

just, in principle if sharing "experts" around you can put them anywhere they are correctly sized

#

whether it is a good idea is more difficult to determine

dawn vine
#

btw, I know you're hardware challenged so maybe this doesn't matter, but in case it proves useful it was hard won knowledge for me: don't bother running MoE on multiple 4090s unless they're NVLinked - you have to use cards where Nvidia doesn't hobble the cross-card bus comms, or the communication will destroy your speed completely

fallen spear
#

i'm on 3090s now, nvlink is slow to show up, my primary current bottleneck is being easily distracted by devopsy stuff

dawn vine
#

well any consumer level card they make all the comms go across the CPU or some crazy thing, and it makes MoE untenable across multiple cards

fallen spear
#

yeah, unsurprising. i guess: unless you have a full copy of every expert on each card

#

but then it will have to be very small. etc

dawn vine
#

yeah but the whole point of multiple cards in this context is usually to use Expert Parallel

#

so you can fit it all in VRAM by splitting em across cards

fallen spear
#

yeah ... idk, the scale at which moe seems viable leads me to question doing them on less than an a100 x4

dawn vine
#

agreed

#

this is one reason it may be cool to use the cross layer experts tho

fallen spear
#

the itty bittiest moes are still monsters of vram

fallen spear
dawn vine
fallen spear
#

yup

#

imho, vs deepseek, hash routing's multihash is "cleaner" as an idea

dawn vine
#

instead of MoE we should call it slice-and-dice FFNs

fallen spear
#

but they barely mention it

#

it is a frustrating paper

#

because it demonstrates that things which should not work do work

dawn vine
#

i gotta look at multihash...

fallen spear
#

and then fails to extend it

#

sec

fallen spear
#

i have mentioned this paper ad nauseam in the moe reading group channel and also read it for the group, which is on eleuther yt

#

so will refrain from restating too much more of that

#

my mistake: also mentioned on page 3 towards the end

#

and that copy omits the appendix

#
dawn vine
#

so just use M hash functions, one for each of deepseek style M slices?

fallen spear
#

yeah. or: whatever you would normally use to route

#

hash routing is amazing bc it works at all

#

hashing the index of the token is a rock stupid method that still works

#

it is underexplored and can fail in weird ways and you probably don't wanna fuck with it if it's not your main goal to do so

dawn vine
#

sounds a little like this stupid/great paper
https://arxiv.org/abs/2401.02994
(basically, just run a totally random model of 8 for each subsequent token so they 'collaborate')

fallen spear
#

but their multihash is just "what if instead of routing once you routed eight times"

#

turns out it works great
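
For concreteness, here is a minimal sketch of routing by a fixed hash of the token id, done several times over. The function names (`hash_route`, `multihash_route`), the choice of SHA-256, and the salt format are all illustrative assumptions; this is not the hash routing paper's implementation, only the "route M times with M different hashes" idea from the chat.

```python
import hashlib


def hash_route(token_id: int, num_experts: int, salt: str = "") -> int:
    """Map a token id to an expert index with a fixed, untrained hash."""
    digest = hashlib.sha256(f"{salt}:{token_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_experts


def multihash_route(token_id: int, num_experts: int, num_hashes: int) -> list:
    """Route once per slice: hash k picks the expert used for slice k."""
    return [hash_route(token_id, num_experts, salt=f"hash{k}") for k in range(num_hashes)]


token_ids = [17, 42, 17, 9000]
routes = [multihash_route(t, num_experts=8, num_hashes=4) for t in token_ids]
for t, r in zip(token_ids, routes):
    print(t, r)
```

The same token id always lands on the same experts, while which tokens end up sharing an expert is arbitrary. Nothing here is trainable.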

#

this naturally raises the question of why not route d_model times

dawn vine
#

well thats your usual thesis, right? (assuming it can be done performantly)

fallen spear
#

yeah basically

#

if the routing itself is fast and doesn't degenerate, doing d_model routings is at worst the same as doing one routing

#

basically the hash routing paper suggests the type of conclusion i find the most enticing

#

which is that some large and complex component is worthless

#

in this case, trainable routings

dawn vine
#

can u do it performantly just using torch.gather?

fallen spear
#

maybe? i honestly do not know

#

i'm probably out for the night

dawn vine
#

k gnite! gonna keep running MoE experiments 🙂

#

thanks for the ideas!

boreal moss
soft bobcat
fallen spear
#

i saw that! it is interesting that llama seemed to have decided that they didn't really care how wide the layer was exactly as long as it was a good multiple of the gpu ... whatever the thing is called. pool? the thing

soft bobcat
#

if you use 3 input layers to the activation, the dimensions become 2x from 8/3x

#

however, I haven't yet found a layer that beats 2 input layers

#

I think it vaguely matters because of wave quantization; if both d_model and 8/3 d_model must be convenient for wave quantization, that could be an annoying restriction
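
The 8/3 and 2x figures fall out of a parameter count. A rough sketch, assuming no biases, a plain FFN with one up and one down projection at 4x, and a gated FFN with k input projections feeding the activation plus one down projection (`equiparam_ratio` is a name made up for this note):

```python
from fractions import Fraction


def equiparam_ratio(num_input_projections: int, baseline_ratio: int = 4) -> Fraction:
    """d_ff/d_model that matches the parameter count of a plain FFN.

    Plain FFN: one up and one down projection, 2 * baseline_ratio * d_model**2.
    Gated FFN: k input projections plus one down projection, (k + 1) * r * d_model**2.
    Setting the two equal gives r = 2 * baseline_ratio / (k + 1).
    """
    return Fraction(2 * baseline_ratio, num_input_projections + 1)


for k in (1, 2, 3):
    print(k, equiparam_ratio(k))
```

With k = 2 this gives 8/3 and with k = 3 it gives 2, matching the numbers quoted above.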

fallen spear
soft bobcat
#
NVIDIA Docs

GPUs accelerate machine learning operations by performing calculations in parallel. Many operations, especially those representable as matrix multipliers will see good acceleration right out of the box. Even better performance can be achieved by tweaking operation parameters to efficiently use GPU resources. The performance documents present the...

topaz marten
#

Would be interesting to design a model specifically for some batch size given some specific gpu

fallen spear
#

new thought about zerO init: would it suffice to add randomness by permuting your hadamards

#

assuming you have more than one

#

permuting them differently would produce a different result

#

assuming you did not permute them identically

fallen spear
#

@soft bobcat @still grail can yall document any fresh runs on weird activations here or ideas therefore so i don't have to search ot for it tomorrow

#

@sage jetty if you want to noise up the channel I would not be mad either

soft bobcat
#

sure, but most likely that's the final run

fallen spear
#

i will hopefully be running these on pythia 14m tomorrow

soft bobcat
#
```
x_mult = x1 * x2 * x3
x = torch.tanh(x1) * torch.sign(x_mult) * torch.pow(torch.abs(x_mult) + 1e-8, 2/3)
```
was fern's idea
#
x = torch.tanh(x1) * x2 * x3

was something I've had lying around, but it's not my best

still grail
#

yeah i think this is reasonable but hard to predict ahead of time when it would do well, it sorta makes sense to me at leasties thosies.... ❤️ :'))))

soft bobcat
#

the bad news is that I don't have a diversity of great activation functions lying around

#

some of them are just better than others

fallen spear
#

apparently clipping the up projection on pythia 70m only takes off 10m parameters

#

this seems incorrect but i'm going with it

still grail
fallen spear
still grail
#

Ye

fallen spear
#

10m total for ff still feels absurdly small

fallen spear
#

oh, fifty million of the params are vocabulary

#

okay sorry i realized it would be high but not that high

#

i've cut my non-embedding params in half, that feels correct actually

#

51m params embedding layer on a 70m model and i just shaved off half of my non-embedding parameters, what do we bet on relative performance
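
A back-of-envelope check of those numbers. The config values below (d_model 512, 6 layers, padded vocab of 50304, untied input and output embeddings) are assumed from Pythia-70m's published configuration, and reading "clipping the up projection" as shrinking d_ff from 4x to 1x d_model is a guess. Biases and layer norms are ignored.

```python
d_model, n_layers, vocab = 512, 6, 50304  # assumed Pythia-70m config

embedding = 2 * vocab * d_model           # untied input and output embeddings
attn_per_layer = 4 * d_model * d_model    # Q, K, V, O projections


def ff_per_layer(ratio: int) -> int:
    """Up plus down projection at d_ff = ratio * d_model (no biases)."""
    return 2 * ratio * d_model * d_model


full = n_layers * (attn_per_layer + ff_per_layer(4))
clipped = n_layers * (attn_per_layer + ff_per_layer(1))  # assumed meaning of "clipping"

print(f"embedding:            {embedding / 1e6:.1f}M")
print(f"non-embedding, 4x ff: {full / 1e6:.1f}M")
print(f"non-embedding, 1x ff: {clipped / 1e6:.1f}M")
print(f"removed:              {(full - clipped) / 1e6:.1f}M")
```

Under these assumptions the embeddings come to about 51.5M, the non-embedding weights to about 18.9M, and the cut removes about 9.4M, which is half of the non-embedding total.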

#

for the sake of argument i'll say identical but it won't be

#

it does not seem to give a shit

#

so far

#

will see once the entire thing is done

#

zoomed in last hundred steps

#

yeah so my bet is they converge before the run ends

#

like, completely

#

my second bet is that they don't converge before the run ends, but do if i zero initialize both for reruns

#

my third bet is that if someone lets me pursue this harebrained ablation at a non-toy-model size where ff is actually the dominant part of model size the closeness of fit remains regardless of whether either preceding bet pans out

#

at which point maybe you are better off training for ~4x as long

soft bobcat
#

they probably won't converge. that thing is a loss spike for green so it got a little worse at that point

fallen spear
#

i love it when my bets aren't entirely theoretical

fallen spear
#

for scale of how small the gap actually is

#

the reason i will be wrong is if the reason this is so minuscule is that 5/7ths of the model's brain is in its embedding layer

#

and the ff actually matters at non-toy scales

#

i expected it to be less nice and to have to fiddle the initialization

#

at this scale

#

so that's pleasant

tardy trench
#

What are you guys considering to be convergence?

fallen spear
#

lr has a minimum scale

tardy trench
#

Sorry was missing context here, nvm

fallen spear
#

yeah that's a ppl loss

#

my current theory is that the up projection in a transformer ff doesn't do anything but slightly help signal propagation

#

i was thinking about testing more weird activation functions and then i realized that this was still more interesting

#

and it calls for me to fiddle the initialization anyway which, if activation functions are mostly about signal propagation, is still a good test to have

fallen spear
#

kevin wins, they're diverging

#

will finish the run for completionism's sake

#

amusingly the divergences show up right after loss spikes

#

it's not divergent like not training, it's just not getting closer

fallen spear
fallen spear
#

divergences basically appear after loss spikes and then don't come back so i don't feel, uh, disinclined from the theory that this is actually because the up projection helps signal propagation

#

the smaller model isn't worse generically it's just modestly less stable

#

also, wandb draws the graph badly, on a different view it looks like there's a NaN/inf value at that loss spike at that point but i am pretty sure that's just an error

fallen spear
#

to restate the hypothesis and prove that i have thought about mup too hard: any time you change the activation function or the width, you are implicitly changing the ratio of the model gradient relative to the model weights if you aren't using an initialization that prevents that. this can help or inhibit signal propagation. if this change is the important one with regard to, e.g., changing activation functions then the benefit will tend to trail off if using an initialization that doesn't change this ratio with scale and otherwise making sure you propagate gradients well

#

the dumbest test of this hypothesis is not to change activation functions, which is complex, it is to ablate the dimension of the model under both the existing initialization and conditions which maybe make it more stable

tardy trench
fallen spear
thick briar
sage jetty
#

It ignores the scale of the gradients? Does this mean that if I'm using automatic mixed precision I won't need to scale them to avoid vanishing gradients as long as I'm using Adam?

fallen spear
#

i suspect optimizer behavior is complex enough that reasoning directly from their behavior at some limit doesn't hold in practice

#

ie adam is scale invariant assuming (some stuff regarding the second moment estimation) which means it's not scale invariant

#

it's normalizing the scale somewhat

sage jetty
#

Right, gotcha, so we still should be scaling loss to avoid vanishing gradient problem then...

fallen spear
#

an actually scale invariant adam update is a spherical cow with zero mass that experiences no friction

#

easy to reason about, doesn't exist
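
A small numeric sketch of that point, using a scalar re-implementation of the textbook Adam update (not any particular library's code; `adam_steps` is a name made up for this note). With `eps = 0` the update is unchanged when every gradient is multiplied by a constant. With the usual `eps = 1e-8`, sufficiently tiny gradients get swamped by `eps` and the updates shrink.

```python
import math


def adam_steps(grads, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return the per-step parameter updates Adam would apply to one scalar."""
    m = v = 0.0
    updates = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)   # bias-corrected first moment
        v_hat = v / (1 - beta2 ** t)   # bias-corrected second moment
        updates.append(lr * m_hat / (math.sqrt(v_hat) + eps))
    return updates


grads = [0.5, -0.2, 0.3]
tiny = [g * 1e-10 for g in grads]

ideal = adam_steps(grads, eps=0.0)
ideal_tiny = adam_steps(tiny, eps=0.0)   # same updates: the scale cancels
real_tiny = adam_steps(tiny, eps=1e-8)   # eps dominates sqrt(v_hat): updates shrink

print(ideal[0], ideal_tiny[0], real_tiny[0])
```

This only covers the optimizer arithmetic in float64. In fp16 a small gradient can underflow to zero before the optimizer ever sees it, which is the problem loss scaling addresses.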

sage jetty
#

That takes me back to high-school physics that does 😂

fallen spear
#

@buoyant turret pinged here because i have spammed other channels about this enough

#

i don't think the hopfield network thing is unreasonable

#

i do think it's unreasonable to assume that the original transformer paper got the good ratio correct at 4x in 2017 and it remains true at all scales of d_model

#

i am sorry/happy to have successfully conveyed my obsession at least enough that you're considering it

buoyant turret
#

ideal ratio? no. But when comparing capacity the question becomes does SwiGLU result in better intermediate 'keys' to do retrieval from 'hopfield network' which allows us to scale back to 8/3 compared to 4? I'm less concerned about the ideal ratio, and more concerned about what ratio has equivalent capacity

fallen spear
#

yeah the 8/3 is just equiparams to 4x for a simpler activation there is no reason to think it's ideal

#

i don't think there's even a very strong theoretical analysis of swiglu/geglu

#

everything i have found is just "empirically this seems to work"

#

including noam shazeer saying he has no idea why they work and attributing their success to divine benevolence

#

so if you did "assuming ff up projection works as a hopfield memory, switching from 4x swish to 8/3rds swiglu will have XYZ consequences for it" that would i think be novel

buoyant turret
#

I wonder if we could try replicating the ROME paper at varying FFN ratios with and without SwiGLU and just see wtf is up

#

only problem is we need to pretrain several models with varying FFN ratios with and without SwiGLU first before we can even try ROME

fallen spear
#

oh boy a whole bunch of pretraining runs, that sounds easy and not complex at all

#

also, now I have to read ROME

#

actually it looks complex but not horrifically computationally expensive

buoyant turret
fallen spear
#
buoyant turret
fallen spear
#

it maybe made sense at 2017 scales and also with 2017 attention it meant that you avoided the ff pass being "cold" for gpu utilization during the forward pass

#

like, you might as well put parameters there because otherwise probably compute that is provisioned to hold the attention calculation is idle

#

neither of those concerns currently seems reasonable

#

since i'm putting zerO initialization in neox i need a name that isn't terrible because "zero" is not a reasonable name for a config setting that isn't setting something to zeroes, i am renaming it to identity-hadamard or iu-hd (identity up, hadamard down) barring objection

buoyant turret
#

I'm testing 8/3x now and I arguably have even better compute saturation because I'm able to slightly increase the micro-batch size.

fallen spear
#

identity-hadamard is probably fine

thick briar
# buoyant turret ideal ratio? no. But when comparing capacity the question becomes does SwiGLU res...

The success of swiglu seems to suggest to me that the real bottleneck for a transformer is not necessarily the number of patterns (d_ff) it's able to recognize, but its ability to recognize them.

For a transformer to reason about an entire sentence or paragraph and predict the next token it basically needs to squeeze all the information about the paragraph into its hidden state of size d_model. Now, to make the model more expressive, we can do one of two things. One, we can increase d_model, giving the model a larger hidden state. Two, we can make the FF (pattern recognition) layer more complex. This allows us to encode more information within d_model than we could otherwise, because we have a more powerful method for processing the hidden state.

Ultimately I think the limiting factor for transformers is their hidden state, we either need to make it larger, or we need to give transformers ways to make more of limited space.

still grail
dawn vine
#

@fallen spear

#

Gemma: 16x FFN ratio!!!!

fallen spear
#

and trained for 6 trillion

dawn vine
#

apparently they disagree with your assessment violently

fallen spear
#

i am definitely abandoning my theory that they do the 6x up projection for some specific reason

#

i mean... that's a lot of budget to devote to 6 trillion tokens to end up even with models with a smaller up projection

dawn vine
#

like they had to sacrifice a ton of embedding space for that 16x

fallen spear
#

their embedding space is freakishly gigantic though

dawn vine
#

what? 3072 is small.. RWKV7B is 4096!

fallen spear
#

sorry, i thought you meant the dictionary

dawn vine
#

ah yes

fallen spear
#

250k is uh

#

it is too many

dawn vine
#

maybe that makes up for the small d_model somehow

fallen spear
#

they tie the in and out projections to avoid the model being too horrifyingly dominated by their vocabulary

#

but which seems like uh

#

a really weird choice

dawn vine
#

i love weight tying personally 🙂

fallen spear
#

it's glorious but doesn't scale tho

dawn vine
#

hard in various parallel situations for sure

#

screwed me up good on pipeline parallel

fallen spear
#

their in and out matrices should be 750m apiece and they save 750m by tying them

#

but also

#

jesus christ

#

bruv you have put 750m parameters into your tokenizer, been forced to tie it to keep the budget under control, and then you only trained it on english

#

their embedding layer doesn't fit on my gpu

soft bobcat
#

run the embedding layer on your CPU

#

it's just a lookup table

fallen spear
#

i am not planning to run their model at all i am just bemused generally

dawn vine
#

well i doubt these choices were stupid

#

so the 16x FFN must be a smartish deliberate tradeoff

fallen spear
#

it's a good tradeoff if it's effectively free for some hardware configuration

#

but i am positive that for their use case the embedding is not doing anything, english doesn't have a 250k-size vocabulary that it is essential to learn

dawn vine
#

yeah maybe at 7B, 16x FFN fits nicely on TPUs so its very fast to train

fallen spear
#

given that one assumes their bigger models are ... bigger, i assume the 6x ffn was just doing the same thing and expanding to fill all available space on a tpu

dawn vine
#

@fallen spear unrelated, i still havent tried token based routing on MoE but I was wondering do you know if there's any reason you can't make all the routing decisions up front on layer 0 with a scheme like that?

fallen spear
#

(yes)

dawn vine
fallen spear
#

you can make the decision whenever you want

#

yeah fair

dawn vine
#

I want to use it for real tho

fallen spear
#

i suspect i need to make my own sandbox for that sort of thing

dawn vine
#

or does it being kinda weirdly differently randomized on each layer matter

#

been trying to speed up deepspeed's slow MoE implementation and the routing seems to be part of the slowness

fallen spear
#

it got slightly better almost no matter what they did

#

you could just salt or rehash for different layers if you want to do that though
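
A sketch of the "decide everything up front, salt per layer" variant, assuming routing depends only on the token id. The helper name `routing_table` and the `layer{n}:` salt format are hypothetical. The whole table can be built before layer 0 runs because nothing in it depends on hidden states.

```python
import hashlib


def routing_table(token_ids, num_layers, num_experts):
    """Precompute expert indices for every (layer, position) from token ids alone."""
    table = []
    for layer in range(num_layers):
        row = []
        for tok in token_ids:
            # The layer index acts as the salt, so each layer gets its own mapping.
            digest = hashlib.sha256(f"layer{layer}:{tok}".encode()).digest()
            row.append(int.from_bytes(digest[:8], "big") % num_experts)
        table.append(row)
    return table


table = routing_table([17, 42, 17], num_layers=4, num_experts=8)
for layer, row in enumerate(table):
    print(layer, row)
```

Within a layer a repeated token id always maps to the same expert, and different layers are free to group tokens differently.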

dawn vine
#

ok, gotta find some time to try the hash route

#

if i can ever get consistent results with what ive already got 🤣

fallen spear
#

i mean my theory of hash routing is that nothing to do with routing matters except that it be consistent for some fixed subset of input that is of a large enough size to meaningfully specialize in

dawn vine
#

the idea that its related to the token makes sense, i dont think u can just make it literally random

#

but who knows

fallen spear
#

okay but which tokens share experts is random

dawn vine
#

yes, but at least each specific token is relevant to that expert always

fallen spear
#

consistency is the criterion that seems to be important, you have to know that X input will see Y expert again

fallen spear
#

i'll know if totally naively swapping the pythia initializations out for zero causes it to explode shortly here

dawn vine
#

it doesnt explode, but it just does crummy

fallen spear
#

ie, it should be strictly better

#

so exploding or doing badly are also signal

dawn vine
#

you cant just init any old thing to zero

#

some stuff has to be random so the gradient differs

fallen spear
#

zero init is actually a weird combination of identity matrices and hadamard matrices

#

because names are hard i guess

dawn vine
#

o ok

fallen spear
#

i called it identity-hadamard and it appears that it is indeed exploding

dawn vine
#

lol

soft bobcat
#

if you're using this, multiply the terms in the hadamard matrix by 1/sqrt(in-dimension)

fallen spear
#

compared to their recommended implementation?

#

they have a recommended scaling factor etc that i have not dug into

soft bobcat
#

oh, then try their recommended scaling instead

#

I just looked at Fig 1

fallen spear
#

that is the one that is currently exploding so

#

i strongly suspect their initialization just depends completely on, uh

#

a lot of things that are true of their original test and are not true of pythia

#
    @torch.no_grad()
    def linear_ZerO_init_(tensor: torch.Tensor):
        # Algorithm 1 in the paper.
        assert len(tensor.shape) == 2, "linear_ZerO_init_ only works on 2D tensors"
        m, n = tensor.shape

        if m <= n:
            tensor[:] = torch.nn.init.eye_(torch.empty(m, n))
        else:  # m > n
            tensor.to("cuda")
            clog_m = math.ceil(math.log2(m))
            p = 2 ** (clog_m)
            in_tensor = torch.nn.init.eye_(torch.empty(m, p, dtype=tensor.dtype)).to(
                "cuda"
            )
            had = (hadamard(p, dtype=tensor.dtype) / 2 ** (clog_m / 2)).to("cuda")
            intermediate = in_tensor @ had
            tensor[:] = intermediate @ torch.nn.init.eye_(
                torch.empty(p, n, dtype=tensor.dtype)
            ).to("cuda")
            tensor.to("cpu")
        return tensor
soft bobcat
#

their scaling factor is equivalent to mine
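
To make the equivalence concrete: the snippet above divides the Hadamard matrix by `2 ** (clog_m / 2)`, which is the same number as `sqrt(p)` for `p = 2 ** clog_m`. A pure-Python check, using a Sylvester construction as a stand-in for the `hadamard` helper (whose actual implementation is not shown in the chat) and a hypothetical dimension of 48:

```python
import math


def sylvester_hadamard(p: int):
    """Build a p x p Hadamard matrix of +1/-1 entries; p must be a power of two."""
    assert p >= 1 and p & (p - 1) == 0, "p must be a power of two"
    h = [[1]]
    while len(h) < p:
        h = [row + row for row in h] + [row + [-x for x in row] for row in h]
    return h


m = 48                                # hypothetical dimension, not from the chat
clog_m = math.ceil(math.log2(m))
p = 2 ** clog_m                       # next power of two, here 64

scale_from_snippet = 1 / 2 ** (clog_m / 2)
scale_one_over_sqrt = 1 / math.sqrt(p)

had = [[x * scale_one_over_sqrt for x in row] for row in sylvester_hadamard(p)]
row_norms = [math.sqrt(sum(x * x for x in row)) for row in had]
dot01 = sum(a * b for a, b in zip(had[0], had[1]))

print(scale_from_snippet, scale_one_over_sqrt, row_norms[0], dot01)
```

With this scaling every row has unit norm and distinct rows are orthogonal. Note that `p` is the padded power-of-two size, so it equals the layer's actual dimension only when that dimension is already a power of two.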

fallen spear
#

i might have to squint at it harder

#

all i can see are dtypes and vram allocation problems atm

#

i am not sure how long i am going to wait before i kill this run

#

it's soaring so majestically

soft bobcat
#

just kill it, you won't learn anything

fallen spear
#

@granite plover i can't remember who else likes zero init but it doesn't work if you just drop it in

soft bobcat
#

you can try printing torch.std_mean() of your layers with the two inits to diagnose any issues, if you want

#

I personally wouldn't bother unless I were really interested in ZerO init

fallen spear
#

i am really interested but given that it's tightly coupled to the optimizer and god knows what other hparams i should do this in a truly tiny sandbox at some point

#

basically: this isn't math really, but it isn't engineering either, if you try to do it engineer-ily you just get a lot of null results i think

#

because every remotely scaled setup is already extremely tuned and sits at a specific optimum for its hparams that you almost cannot help but disrupt

soft bobcat
#

nah, I don't believe that. it's been easy to find small improvements in LLMs tuning dumb things

#

even fern's repo got improved significantly by naut and it had crazy amounts of hparam search before that

fallen spear
#
  1. link, i probably have it somewhere but could use it
  2. that is probably true but for chasing down specifically initialization-dependent effects here it seems like it'd be a lot nicer to look at things (activations, initializations) in something much more like a sandbox and make adjustments there rather than guessing at needed adjustments at any kind of scale
#

if extra lucky it starts to feel closer to math and just generalizes easily

soft bobcat
#

airbench #implementation-details message

#

fern's repo is in a strange state where giving the EMA higher resolution makes it perform worse

#

it could be that the EMA is not parametrized correctly, because I didn't check the math. but I found it unusual

#

airbench inherits from fern, who inherits from David Page, who inherits from one of the dawnbench leaders. it was SOTA from the start and each person made major speedups on top

fallen spear
#

neat. anyway, i think for chasing this hypothesis specifically i am out of remotely reasonable ideas that I wouldn't want to play with in a sandbox first

still grail
#

Is there something you're seeing that's indicating that the EMA would be parameterized incorrectly?

soft bobcat
#

no, I have no evidence of any error. I just found it extremely unusual that a coarser approximation does better

#

now I'm thinking about the fp16 weights though...

still grail
#

Well it is a lookahead optimizer, so that will have some impact

#

I've had similar questions of coarseness at the end, and have tried a few different configs, but even at the end of training the semi-coarse EMA seems to be king

soft bobcat
still grail
soft bobcat
#

512/batchsize is not a big number

still grail
#

I think the sum part cancels it out

#

= 512 is my understanding

soft bobcat
#

torch amp starts at 65536, only goes down when necessary, and even rises to accommodate small gradients at later stages

still grail
#

Ye

soft bobcat
#

my current suspicion is that EMA is sensitive to fp16 quantization

still grail
#

Alright

soft bobcat
#

meh, it's all conjecture. without me actually putting in the work to test anything, I don't feel my guesses have value

buoyant turret
#

what the fuck is Gemma doing? 2048 d_model with 16384 d_ffn????????

buoyant turret
#

AND 3072 d_model with 24576 d_ffn?????????

#

THEY'RE USING GELU??? IS IT AT LEAST GATED HOLD UP

#

oh my god

#

it's gated GELU

#

with 8x ffn ratio

#

what the fuck

#

not 8x up 4x down. it's 16x up, 8x down.

#

so actually an 8x intermediate size
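
The "16x up, 8x down" reading can be checked with a quick count, using only the d_model and d_ffn values quoted in the messages above and assuming a gated FFN with separate gate and up projections and no biases (the function names are made up for this note):

```python
def gated_ffn_params(d_model: int, d_ffn: int) -> int:
    """Gate, up, and down projections, no biases."""
    return 3 * d_model * d_ffn


def plain_ffn_params(d_model: int, ratio: int = 4) -> int:
    """Classic up and down projections at d_ff = ratio * d_model, no biases."""
    return 2 * ratio * d_model * d_model


for d_model, d_ffn in [(2048, 16384), (3072, 24576)]:
    gated = gated_ffn_params(d_model, d_ffn)
    plain = plain_ffn_params(d_model)
    print(d_model, d_ffn // d_model, gated, gated / plain)
```

Both sizes have an intermediate width of 8x d_model. Gate and up together are the "16x", and per layer the FFN holds three times the parameters of a classic 4x ungated FFN at the same d_model.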

#

and apparently builds on "advances made with gemini"

#

so has google ablated and found larger FFN really does work significantly better???

fallen spear
#

they showed the mf 6 trillion tokens and it's like. dead even with comparable models

buoyant turret
fallen spear
#

leaderboards are noisy and influenced by the fiddly bits with how you do your ft

#

so they are a definite datapoint for "at scale maybe it doesn't even matter"

#

huge ratio? sure. small? also fine. whatever gives you good utilization

#

i would think it was a stronger point for that hypothesis if we knew for sure how many tokens mistral had seen

fallen spear
# granite plover noooooooo

yeah sorry. i assume the hadamard is scaled a lil' differently than the existing inits and the rest of the hparams are tuned to that scale

fallen spear
fallen spear
fallen spear
buoyant turret
fallen spear
#

call it 1/8th faster neglecting embeddings

#

it is "equiparams" but just seems like it would be really dependent on hardware configuration

buoyant turret
#

lemme check the gemma code. because they also might be doing parallel MLP... nope. serial attention then MLP. it's such an odd bunch of design choices. some seemingly for speed, some seemingly counter-intuitive

fallen spear
#

... actually, do they count the embedding params in their param count?

fallen spear
fallen spear
#

and they have like .75B of them

buoyant turret
#

you can't train parallel MLP and then deploy serial MLP because you're missing a whole bunch of layer norms

fallen spear
#

oh you mean something actually different, i am just thinking of model parallel

buoyant turret
#

no i mean attention(LN(x))+mlp(LN(x))+x where the LN is shared between them both
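
The difference in data flow, sketched with plain callables. The serial version is written as a standard pre-norm block with two separate norms, which is an assumption about what "serial" means here. The scalar stand-ins for `ln`, `attn`, and `mlp` are toys so the example runs without any framework.

```python
def parallel_block(x, ln, attn, mlp):
    """One shared norm feeds attention and MLP; both are added to the residual."""
    h = ln(x)
    return x + attn(h) + mlp(h)


def serial_block(x, ln_attn, ln_mlp, attn, mlp):
    """Pre-norm serial block: the MLP sees the output of the attention sublayer."""
    x = x + attn(ln_attn(x))
    return x + mlp(ln_mlp(x))


# Toy scalar stand-ins so the data flow can be run and inspected.
ln = lambda x: x / 2
attn = lambda h: h + 1
mlp = lambda h: 3 * h

x = 4.0
print(parallel_block(x, ln, attn, mlp))
print(serial_block(x, ln, ln, attn, mlp))
```

The serial form needs a second norm whose input has no counterpart in the parallel form, which is the mismatch raised a few messages earlier about training one way and deploying the other.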

fallen spear
#

ahhh, yeah

buoyant turret
#

PaLM did that, but Gemma does not

#

they're also using a key dim of 256 which is fucking nuts

fallen spear
#

tbh given the error in the gemma report about the norms i am not sure i trust any of their reported architectures to be correct

#

i trust that the reported architectural choices in every report were probably done at some point in GDM adjacent to a given release

buoyant turret
#

im looking at their actual code, so report be damned, this is what's actually happening

fallen spear
#

but possibly not all at once and possibly not in the released model

fallen spear
buoyant turret
#

i mean... there could be errors in the pytorch implementation, but since pytorch and JAX load from the same parameters file and I assume their JAX implementation is de facto... there probably arent any errors in their pytorch implementation architecturally

fallen spear
#

i think it is correct that code will reflect the actual model architecture

#

i am just a little iffy about using palm as a reference since we only know its architecture from their published reports

buoyant turret
#

the fact that they (iirc) ablated the difference between serial and parallel attention/mlp at least leads me to believe that part is correct in their report

fallen spear
#

i definitely believe they ran that experiment and am totally unsure if i should believe that anything available by api or release actually reflects it

still grail
buoyant turret
#

256 head dim... but 16 heads for the 7B model... with a hidden size of 3072?

#

what the fuck?

#

what the fuck is this model?

#

2B has 8 Q heads and 1 KV head. 2048/8=256 so that lines up.

7B has 16 Q heads and 16 KV heads. 3072/16 != 256.

buoyant turret
#

but they don't have 12 heads. they have 16.

fallen spear
#

overlap, maybe?

buoyant turret
#

nope.

fallen spear
#

lolwut

buoyant turret
fallen spear
#

so it's not actually 256?

#

like, one of the assumptions here must be false

buoyant turret
#

no it is actually 256

#

config says head dim is 256.
self.head_dim pulls from config.head_dim

#

for the 7B model QKV are all 3072*4096 matrices

fallen spear
#

and they downproject them after doing attention?

buoyant turret
#

yup. O is 4096*3072
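
The shape arithmetic behind this exchange, using only the head counts, head dim, and d_model values quoted in the chat (not re-checked against the released configs). `attention_shapes` is a helper invented for this note, with shapes written as (in_features, out_features).

```python
def attention_shapes(d_model, n_q_heads, n_kv_heads, head_dim):
    """Projection shapes implied by a config, written as (in_features, out_features)."""
    q_width = n_q_heads * head_dim
    kv_width = n_kv_heads * head_dim
    return {
        "q": (d_model, q_width),
        "k": (d_model, kv_width),
        "v": (d_model, kv_width),
        "o": (q_width, d_model),
        "q_width_matches_d_model": q_width == d_model,
    }


# Values as quoted in the chat, not re-checked against the released configs.
gemma_2b = attention_shapes(d_model=2048, n_q_heads=8, n_kv_heads=1, head_dim=256)
gemma_7b = attention_shapes(d_model=3072, n_q_heads=16, n_kv_heads=16, head_dim=256)
print(gemma_2b)
print(gemma_7b)
```

For the 2B numbers the total query width equals d_model. For the 7B numbers it is 16 * 256 = 4096, wider than the 3072 residual stream, so attention itself carries an up projection on the way in and a down projection in O.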

fallen spear
#

that is incredibly weird and makes me wonder what they are actually doing for attn

#

like, is this actually a standard qkv calculation or

still grail
#

What is this magical strangeness.....

fallen spear
#

not only do they think i am wrong about the ff up projection they put one into their qkv too

buoyant turret
fallen spear
#

i am glad at least that the qkv impl is not also an eldritch horror

#

this does sort of break intuition about what qkv is doing

buoyant turret
#

JAX implementation looks pretty normal too. they're using the dot product attention function from FLAX, but I assume that's just normal qkv

fallen spear
#

purely abstractly this basically gives you extra heads that don't "make sense" in a normal transformer, no?

#

and you also get to use your output projection to actually do some logic while downprojecting

#

my next question is: did they mean to do this or did someone whiff their math

#

and if someone did whiff their math, did they end up with this final architecture specifically because this configuration "just worked better" in a way that might be attributable to the attn up/down projection

buoyant turret
#

who fucking knows

still grail
fallen spear
still grail
#

I'd guess but I'm not entirely sure

fallen spear
#

gonna be fun to test when i am not on my lunch break

dawn vine
#

I'm definitely gonna try some of this stuff w rwkv

#

I like the fewer layers trade for ffn and better wall clock

#

I also wonder if we can merge these changes with the way mamba integrates ffn and attn up/down projections into a single block for extra benefit

#

Tho that would break my moe

dawn vine
#

WAIT... from Gemma:

We use a subset of the SentencePiece tokenizer (Kudo and Richardson, 2018) of Gemini for compatibility.
Does that mean that Gemini uses an even bigger vocabulary size???

soft bobcat
#

imagine using a 512k+ vocab size instead of putting some of your 1000+ ML researchers to figuring out tokenization

dawn vine
#

What kind of 'compatibility' exactly are they achieving...

#

And are some of those other tokens reserved for multi-modal maybe?

fallen spear
fallen spear
thick briar
#

which shouldn't change the intuition for how information flows in attention

thick briar
#

Attention takes up a lot of parameters but does no real computation over the sequence, it's just an information mixer

#

If parallelism wasn't useful then I would advocate making the FFN intermediate dimension as small as possible and duplicating FFN layers to make up the difference, increasing the depth and thus computational complexity of the model

dawn vine
#

kind of seems like better way to achieve parallelism is bigger d_model and smaller FFN tho

#

can always shrink for attention

#

or use subset of d_model

boreal moss
#

so has anyone tried the trick sometimes used in cnns, which is to use only part of the model width for the time mixing layer (attention in this case) and let the rest go straight through? this way it is possible to use a smaller ffn multiplier and keep attention dim low

#

I was talking about this to @fallen spear some time ago

#

usually in cnns you can do conv on only half of the model dim and it doesn't change performance much or at all but runs considerably faster
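
A minimal sketch of that trick, in NumPy rather than a real framework: only the first `frac` of the channels goes through a time mixing function and the rest passes straight through. The names `partial_time_mix` and `causal_mean` are made up for illustration, and the running mean is only a stand-in for a conv or attention layer.

```python
import numpy as np

def partial_time_mix(x, mix_fn, frac=0.5):
    """Apply mix_fn along time to the first `frac` of channels; pass the rest through."""
    d_mix = int(x.shape[-1] * frac)
    mixed = mix_fn(x[:, :d_mix])
    return np.concatenate([mixed, x[:, d_mix:]], axis=-1)

def causal_mean(h):
    # stand-in time mixer: running mean over the current and earlier positions
    return np.cumsum(h, axis=0) / np.arange(1, h.shape[0] + 1)[:, None]

x = np.random.default_rng(0).normal(size=(6, 8))  # (time, channels)
y = partial_time_mix(x, causal_mean)
```

The mixer only ever sees half of the width, so its cost and parameter count shrink accordingly, while the untouched half still reaches the FFN.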

buoyant turret
boreal moss
#

and that is very counterintuitive, I don't know why but it doesn't look at all that you need much mixing in that dimension in comparison to ffn parameter count

#

I'm talking from my own experiments

buoyant turret
#

i wonder if the lack of channel-spatial mixing weights lets the convolutions learn features more easily?

#

kinda like how I've noticed in some cases LoRA converges faster than full-finetuning because there are literally fewer variables to optimize

boreal moss
#

the same happens with large conv kernels, they learn very slow, but when you add parallel additive small conv kernel (which is completely redundant) it converges much faster, then for inference you just absorb that small kernel into the big one and compute just that
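
A small sketch of the "absorb the small kernel into the big one" step, assuming 1D convolutions, odd kernel sizes and zero padding (the helper `conv1d_same` is hypothetical, written only for this check). Because convolution is linear in the kernel, summing the two branches equals one convolution with the small kernel zero-padded and added to the centre of the big one.

```python
import numpy as np

def conv1d_same(x, k):
    """Cross-correlate a 1D signal with an odd-length kernel, zero-padded to keep length."""
    pad = len(k) // 2
    xp = np.pad(x, pad)
    return np.array([np.dot(xp[i:i + len(k)], k) for i in range(len(x))])

rng = np.random.default_rng(0)
x = rng.normal(size=32)
big = rng.normal(size=7)    # large kernel (slow to learn on its own)
small = rng.normal(size=3)  # redundant parallel small kernel

# training time: two parallel branches, outputs summed
y_train = conv1d_same(x, big) + conv1d_same(x, small)

# inference time: fold the small kernel into the centre of the big one
merged = big.copy()
off = (len(big) - len(small)) // 2
merged[off:off + len(small)] += small
y_infer = conv1d_same(x, merged)
```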

dawn vine
still grail
#

Depthwise conv is horribly time and learning inefficient

#

You get half a kernel for the price of two!

#

It's a good idea in theory but really only seems to be best suited for certain cpu-only inference options that aren't 3x3 conv friendly. ❤️ :'))))

still grail
still grail
still grail
#

Yeah good point, I forgot about that bypass

#

Kernel launches are just way too slow though, gotta go fast gotta do it all at once! XD ;PPPP

boreal moss
#

depthwise convs are very slow on gpus, but it is rather a happy little accident that this trick works so well

boreal moss
#

that you can just not compute half of them and it doesn't tank performance

still grail
#

It depends upon the problem, I believe.

boreal moss
#

I see that behavior on image and language models

still grail
# boreal moss I see that behavior on image and language models

Yeah I guess I could see down sampling working okay, one of the big problems is it still almost always requires a second kernel to either merge the info for passing or re-upsampling which can be, er, rather pricy in the super-efficient edge cases. 😬😬😬😬

#

It would be nice to be able to do it in a super efficient way though

#

Haven't been able to find one quite yet for that AFAIU

dawn vine
still grail
#

I'm a bit confused I think. Generally you have to do a concatenate or the like in order to avoid a bifurcation of kernels

dawn vine
#

im assuming torch.compile to remove multiple kernel calls tho

#

under the hood it wouldn't have to actually concat and could just work on the first half in-place, it could just copy off the original for autograd use later before it begins work

#

either way tho, we may just be in very different headspaces here... one extra CUDA kernel per layer is a very small price to pay for reduced compute/params from where im standing 🙂

thick briar
# dawn vine or use subset of d_model

maybe FFNs can only use a subset too, but they would be overlapping, like even-layered FFNs can use the top 2/3 of d_model, odd-layered FFNs can use the bottom 2/3

#

and maybe attention can use the 1/2 in the middle with 1/4 on both sides

#

that way we have a larger hidden state with a "message passing" area between FFNs

thick briar
tardy trench
#

And if you try to replace the queries with trainable vectors, perf drops even further

thick briar
#

But in terms of actually "thinking about" the input sequence, attention doesn't do anything

#

Its job is to mix info for the benefit of the FFN layers

#

I believe though that at large model sizes we have too much attention

#

Which is why GQA works so well

tardy trench
#

GQA/MQA basically work because of non-identifiability (https://arxiv.org/abs/2007.00810) but I think you still need approximately n_query_heads*d_v = d_model for good perf. I think anything that decreases the LHS will hurt performance since the attn block residual will have rank < d_model.
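
A quick numerical check of the rank claim, as a sketch: random matrices stand in for the concatenated head outputs and the output projection, and `attn_block_rank` is a hypothetical helper. Whatever the attention weights do, the block output is (concatenated heads) times W_o, so its rank cannot exceed n_query_heads*d_v.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_model = 64, 16

def attn_block_rank(n_heads, d_v):
    heads = rng.normal(size=(T, n_heads * d_v))      # stand-in for concatenated head outputs
    w_o = rng.normal(size=(n_heads * d_v, d_model))  # output projection
    return np.linalg.matrix_rank(heads @ w_o)

rank_full = attn_block_rank(n_heads=4, d_v=4)   # n_heads * d_v == d_model
rank_small = attn_block_rank(n_heads=4, d_v=2)  # n_heads * d_v <  d_model
print(rank_full, rank_small)
```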

#

At that point, if you use n_query_heads*d_v < d_model, you're basically betting that the optimizer can't make good use of the extra dimensions

#

Maybe for a given param budget, ff is a better investment, but if you're holding ff constant, using less attn params will always hurt performance for a sufficiently complex task

#

Maybe we don't disagree, idk

thick briar
#

For example llama-70b, do we really need to retrieve in full d_model information 56 times in a row?

thick briar
tardy trench
thick briar
#

GQA working basically means that each token needs to transmit far less information to its peers than we previously thought

tardy trench
#

Technically yes, but the capacity difference between using multihead and multiquery is actually quite small. Multiquery basically just factors all the key projections as W_k*W_rh, and then moves the W_rh term to the corresponding query projection W_qh (and analogously for the value and output projections)
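
To make the factoring concrete, here is a sketch that checks the algebra for the attention scores, under the assumption that each head's key projection really can be written as W_k*W_rh (general multi-head keys are not constrained to do so). The shapes and variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_model, d_head, n_heads = 5, 16, 4, 3

X = rng.normal(size=(T, d_model))
W_k = rng.normal(size=(d_model, d_head))           # single shared key projection
W_q = rng.normal(size=(n_heads, d_model, d_head))  # per-head query projections W_qh
W_r = rng.normal(size=(n_heads, d_head, d_head))   # per-head factor W_rh

all_match = True
for h in range(n_heads):
    # multi-head view: per-head keys built as W_k @ W_rh
    scores_mha = (X @ W_q[h]) @ (X @ (W_k @ W_r[h])).T
    # multi-query view: W_rh moved into the query projection, keys shared
    scores_mqa = (X @ (W_q[h] @ W_r[h].T)) @ (X @ W_k).T
    all_match = all_match and np.allclose(scores_mha, scores_mqa)
```

The same move applies to the value and output projections, since the value path is also linear.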

thick briar
tardy trench
thick briar
thick briar
tardy trench
#

Yes, so far!

fallen spear
#

selectively mixing info is thinking about the sequence

thick briar
#

If attention did computational work that was actually useful, then parallel attention/MLP layers would suffer a lot of degradation

fallen spear
thick briar
#

fair point

dawn vine
#

@fallen spear finally tried the most stupid possible hash routing MoE and it did great... i literally used expert_id = token_id % 8

#

same on every single layer
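
For reference, a tiny sketch of that routing rule in plain Python. The 50,000 vocab size and the sample token ids are made up; only `expert_id = token_id % 8` comes from the message above.

```python
from collections import Counter

NUM_EXPERTS = 8

def route(token_id):
    # fixed, non-learned routing: the same token always hits the same expert
    return token_id % NUM_EXPERTS

token_ids = [17, 4, 256, 17, 1023, 8]
routes = [route(t) for t in token_ids]

# over a contiguous vocab the id space is split evenly across experts
counts = Counter(route(t) for t in range(50_000))
```

Note that an even split over ids says nothing about load balance on real text, where token frequencies are heavily skewed.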

still grail
#

Come on don't chimken out nowwwwwwwww......

Dooo it doooo it doooo it doooo it

fallen spear
fallen spear
still grail
fallen spear
#

one of the failure modes of hash routing was when they made modifications that created too many buckets

#

the bucket:expert ratio appears to be sort of delicate

still grail
#

Surely it worked here, clearly and obviously it works in the limit

#

You must not have been brave enough

fallen spear
#

i mean i guess the issue was there were too many more buckets than experts

still grail
fallen spear
#

it is possible that proliferating experts is fine

#

the way they proliferated buckets btw was to hash tok and prev_tok instead of just tok

still grail
#

Here's a (maybe) fresh idea: hierarchical bucketing. Have one always on bucket. The second bucket is chosen of two. The third is chosen of four, etc.

Should bypass the information clustering per token issues a bit.

Buckets are all uniform in size.

#

This basically, if the grouping is done "correctly", should let the "active buckets" switch at appropriate levels of granularity per the active incoming info stream

fallen spear
#

i don't understand this at all

still grail
#

How you do token grouping can I believe be learned in a differentiable manner using the same exact freaking structure of fast feed forward networks. Should work similarly too, I thinks?

still grail
#

Allows for hierarchically increasing refinement specific to each input

dawn vine
#

I love that this sparked u guys coming up with insane amazing ideas

still grail
#

Making it learnable helps reduce "wasted" information learned across buckets by forcing it to be deferred to higher in the hierarchy, but in a differentiable manner to avoid any silliness/complexity from a given scheme -- let's let it be data-defined! 😄 :')))) ❤️

still grail
#

I'm honestly not sure where in the heck this is coming from myself lol

#

But I'm just letting it roll, ya know? XDXDXDXD 😭😭😭😭👍

dawn vine
#

Please let it roll!!!

#

I'm in the middle of moving so I don't understand the tree idea yet but will reread later. I love fast ffns

fallen spear
#

we have a token T, it hits a routing layer (which might just be the embedding layer), what occurs?

still grail
# still grail Binary tree where output active expert group is a concatenation of the weights f...

@fallen spear does that make sense now.

So like, the head node is always on. This is the always active group. The second depth layer is selected by [some strategy], preferably data dependent like being chosen based on the raw token or something similar to that (can also do like for example a conv of an n-markov chain of the input embeddings of previous tokens, for example).

Anyways, the second group is chosen based on that, and concatenated to the running set of weights as before.

I believe it should work out naively too w.r.t. the swiglu/gelu nonsense or whatever. Just basically do your own build-a-bear "Build Your own Ball [of weights]" kinda dealio.

still grail
#

@fallen spear I guess if we're doing our MLP/whatever routing layers based upon the token embeddings, we can basically do something where we do the branch traversal as in FFN.

So, we always get the base node of the tree "for free", these are the always on expert weights.

Then we do our decision layer, and get our sigmoidal left/right branching for the first leaf node of the tree. During training, this can be the sigmoid-gated sum of the two values, during inference it can be a hard selection based upon the sign of the value (assuming no bias and all'of'dat).

This is the second weight group "block". This is kept in a list with the first weight group. If our tree is 8 nodes deep, for example, then each weight group block is of the size mlp_depth // 8. Additionally, this means that our tree will have 2^n-1 total weight blocks, each with depth mlp_depth. This can be a thing that is pretty sizeable pretty fast but also I think that's alright as it's also potentially pretty flexible space wise.

They all get added back in to the main values at the end of the linear block, so thankfully we (should is the keyword heresies) be able to independently pick-and-choose each block as needed over the course of training. And, of course, because it's FFN stuff, it gets jointly optimized which can be both a good and a bad thing, but at least it's sorta simpler engineering-wise (once it's up and running IMPE at leastsies....) so that bodes well fur scaling, I thinks.

This is exceptionally nice as well as it embeds the idea of a hierarchy of knowledge in the network, and it switches on and off as needed depending upon the input data. This feels much more similar as best as I understand to how the human brain function and how the astrocytes function w.r.t. affect associative lookups, so it at least feels to me like it's some kind of step in the right direction.
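
A minimal sketch of the left/right branching described above, assuming one scalar decision logit per node: a sigmoid-gated blend of the two child outputs during training, and a hard pick by sign at inference. The function name `branch` and the list-of-floats outputs are illustrative only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def branch(logit, left_out, right_out, training):
    """Blend two child outputs softly while training, pick one by sign at inference."""
    if training:
        g = sigmoid(logit)
        return [(1.0 - g) * l + g * r for l, r in zip(left_out, right_out)]
    return right_out if logit > 0 else left_out

left, right = [0.0, 0.0], [2.0, 2.0]
soft = branch(0.0, left, right, training=True)    # halfway blend
hard_r = branch(3.0, left, right, training=False)
hard_l = branch(-3.0, left, right, training=False)
```

The two modes only agree when the gate saturates, so some pressure toward confident gates is presumably needed before switching to hard selection.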

#

I'm still waking up and have no idea what the heck mood this is that I'm in but it's nice, I haven't had consistent "idea storms" for a few years now, so this is generally a rather confusing but pleasant experience for me. :'3333 ❤️ 🥲 ❤️

fallen spear
#

one of the nagging problems with moe is that experts are very similar to each other

#

(also I understand this now)

fallen spear
still grail
#

Yeah that's why a binary tree is good I think

#

It's not perfect of course, but like you can of course adjust ratios (of layer num_parameters and branching factors, etc, etc) to accommodate for that.

still grail
#

The output projection, I thinks? I guess?

fallen spear
still grail
#

What's also cool is that if you want uneven depths in your tree, or some other kind of truncation/compression, you can do so as well, at least mathematically IMPU

#

It would screw with batching a bit (well a binary tree expert selection mechanism would be a bit annoying to code period, but also might be quite useful as well. ❤️ :')))) )

dawn vine
#

so this would be like a further set of levels of what i'm already succeeding with

still grail
#

So this then is basically just stretching it along the tree traversal route

#

Since some levels of weights are probably somewhat group-specific, but aren't entirely "all or nothing" like an always on, or some extremely group-specific weight impe....

dawn vine
#

recently this has been as a continuation from an already-trained RWKV model w/ pre-existing FFN

#

so this adds onto it, starting with all zeros and slowly learning to differentiate

#

seems to work great

dawn vine
still grail
#

Now I'm a bit confused lol

#

For me, the always on bit is weights that are basically always used with no gating

dawn vine
still grail
#

Then an FFN is used to progressively select the conditional branches of weight layers

dawn vine
still grail
#

Not Fast Feedforward network

dawn vine
#

yeah sorry not FFFN

still grail
#

Yeah I think that was my bad

dawn vine
#

I tried other ways of combining the base FFN and the chosen expert, but additive seems better maybe especially when starting from an existing trained model

still grail
#

The nice thing about token specific selection I guess if I'm understanding this correctly is it makes a lot of the "predictive" (or predictive related) aspects of things easier to manage

#

And here, I'm meaning that the outputs of the two are added together

#

Oh, I guess there's the sigmoid gated additive interpolation or whatever

dawn vine
#

yeah my recent implementation is like

```
out = x + ffn(xn) + moe(xn, tokens)   # tokens passed so i can hash em
```
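
A self-contained NumPy sketch of what that line could look like, not the actual implementation: it assumes a ReLU FFN, `token_id % n_experts` routing, and reads "starting with all zeros" as zero-initialised expert output projections, so the added experts begin as a no-op on top of the pretrained FFN.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_experts, T = 8, 16, 8, 5

def ffn(x, w_in, w_out):
    return np.maximum(x @ w_in, 0.0) @ w_out  # plain ReLU FFN

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

base_in = rng.normal(size=(d_model, d_ff))
base_out = rng.normal(size=(d_ff, d_model))
exp_in = rng.normal(size=(n_experts, d_model, d_ff))
exp_out = np.zeros((n_experts, d_ff, d_model))  # assumption: zero init, experts start silent

def moe(xn, tokens):
    out = np.zeros_like(xn)
    expert_ids = tokens % n_experts  # hash routing on the raw token id
    for e in range(n_experts):
        idx = np.nonzero(expert_ids == e)[0]
        if idx.size:
            out[idx] = ffn(xn[idx], exp_in[e], exp_out[e])
    return out

x = rng.normal(size=(T, d_model))
tokens = np.array([3, 11, 8, 42, 3])
xn = layer_norm(x)
out = x + ffn(xn, base_in, base_out) + moe(xn, tokens)
```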
still grail
#

And theoretically that could be fused into one kernel without breaking the bank, I thinks (maybe a bit moar complexity thosies....)

#

Buh it seems to be hyperscaling friendly basically! 😄 :')))) 👍

dawn vine
#

well you're missing some stuff like Expert Parallel requiring some comms in the middle so hard to do that in practice

still grail
dawn vine
still grail
#

Variance of expert selection has got to be a bear, lolz

#

I would like to propose selective expert leaf dropping as a variance/routing management strategy in the binary tree case

dawn vine
still grail
#

One could make the block sizes get smaller as you go towards the leaves (or maybe just keeping the tree not too terribly deep), but leaves maybe could be dropped in favor of a "generic" set of weights learned as a backup in case some batch inference scheduler determines that it would be too costly to run some kernel for a horribly minuscule amount of nodes

still grail
dawn vine
#

might be a bit slower since i gotta do several levels instead of one, but i like the idea

still grail
#

The tree does include some notion of structure though so I think any binning strategy should try to take that into account

#

Otherwise it's just throwing darts in higher dimensions thensies. ❤️ :'))))

dawn vine
#

I also would like to try switching out the experts with LowRank approximations and/or Butterfly Matrices

dawn vine
#

the main thing seems to be to have the same 'expert' always be consulted for a given token

still grail
dawn vine
#

so this would still keep things consistent even w/ a tree based on each bit of the hash

#

it doesnt need anything informationally, it just allows us to store more info in fewer parameters

still grail
#

😬

#

Lemme think about this

#

Because hashes are supposed to be rather spicy in how they jump about

dawn vine
#

yah thats their goal, but in MoE apparently the only important thing is that the same expert is consulted for the same token always

still grail
#

Ideally, (intellectually, which is def different from experiments as we all know lolz), one would want a higher node in a tree to correspond to a token group

dawn vine
#

so as long as token X always goes the same route through the tree i think we're fine
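
A sketch of a tree route built from the bits of a key, in plain Python. It uses the raw token id bits where a real hash would go (an assumption for readability), and `tree_path` is a hypothetical helper. Level 0 is the always-on node; each further level picks left or right from one bit, so a given token always takes the same route.

```python
def tree_path(token_id, depth):
    """Return the chosen node index at each level; level d has 2**d nodes."""
    path = [0]  # level 0: the always-on block
    node = 0
    for level in range(1, depth):
        bit = (token_id >> (level - 1)) & 1  # one bit of the key per level
        node = node * 2 + bit
        path.append(node)
    return path

depth = 4
path = tree_path(5, depth)       # 5 == 0b101
total_blocks = 2 ** depth - 1    # weight blocks in the whole tree
```

Tokens that share low bits share the upper part of their route, which is only meaningful if the key is chosen so that similar tokens share bits; with a real hash the sharing is arbitrary.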

still grail
#

The tree is meant to dedupe that information by clustering it into hierarchical nodes

#

So that way, we have token routing consistency, but then also similar tokens get routed through similar branches of the tree

#

Hence, more "free space" to learn various concepts

#

The problem of course is that this is data conditional, but thankfully if you do it based on the token itself, then you can just skip the routing function entirely and split the learned embedding up to route it. I think? That should be efficient?

dawn vine
#

I don't know if I believe in the 'similarity' idea any more... @fallen spear dissuaded me with this hash routing stuff

#

because if similarity mattered, hash routing shouldn't work as well as learned gated routing

#

but it does

still grail
#

routing_table = nn.Embedding(num_tokens, tree_depth * total_num_layers)
# one sigmoid gate per (tree level, model layer) for every input token
routed = routing_table(inputs).sigmoid().view(*inputs.shape, tree_depth, total_num_layers)
# optimization hack: repeat the branching strategy on each side of the tree when weight averaging. Should converge albeit being a bit strange

#

Here's an efficient input-token-dependent learned routing table that uses a routing symmetry approximation to make the process much simpler (and doesn't need an FFN, to boot!)

still grail
#

Like it feels like one that def needs an explanation

#

(not saying you have to give it, I'm meaning it seems like it's something I'm curious about sorta)

dawn vine
# still grail (not saying you have to give it, I'm meaning it seems like it's something I'm cu...

it seems that the reality of the situation is as follows:
a FFN has the capacity to learn a certain amount, but it can learn how to deal with lots of different kinds of inputs (proven by how we use them traditionally for every input!)
so if you add more FFNs, you just gotta make sure the inputs roughly match each time so that you're now spreading the computation across these, allowing you to train and inference as fast as before but with much more parameters available

still grail
#

Just because duplicating knowledge feels kinda like a Bad Thing, especially if we can control things relating to it

still grail
#

Having consistent routing is def good

#

Would be insane otherwise

#

But one problem is that the amount of paper parameters and the learned parameters are a bit different I think due to the network having to double, triple, quadruple, octuple learn things, etc

#

Consistency is def good because it keeps a match between those things

#

But a question is how to factorize speedily the process so that shared learned things sorta all generally live in the same place, as it were.

dawn vine
#

oh also we could reduce the size of the FFN at each successive layer

#

because it's handling fewer tokens

still grail
#

Yeah I was thinking about that for the leaf nodes in the binary tree example

#

I could see there being distributed nodes that basically fall back to blocksparse as the leaves themselves can get rather specialized

#

Which is interesting as it sorta ties throughput to useful model capacity, with tons of throughput, you can add more and more leaf nodes to the binary tree without as much direct negative impact iiucsies

dawn vine
#

would be real interesting to add half size each level, so you end up linear param ct in the number of layers

#

i guess that kind of kills the benefits of GPU parallelism if you keep making it less and less parallelizable

still grail
#

Yeah that would be interesting

#

That feels friendlier to scaling people at least lolz

#

Because eventually nodes can handle multiple layers at once (and you can colo them!!!!!!)

#

The halving is intriguing and I wonder how well it works

dawn vine
#

yeah its crazy

#

im interested by it

#

but i dont know how to implement it so its not super slow

still grail
#

Does the embedding routing trick from earlier make sense

dawn vine
#

whats the difference between tree_depth and total_num_layers - is total_num_layers the layers in the model?

still grail
#

This would predict a routing tree for every MLP basically at the start

#

That might even be redundant

#

But I'm not sure and wanted to err on the side of caution

#

With the depth halving one could maybe just calculate all of the experts at once and then fold them in with a shallow depth tree (say, depth 3-4 or something like that)

#

To avoid linear layer weight madness

dawn vine
#

OH i forgot that each layer of depth in the tree is independent 🙂

#

its all just additive, right?

#

so its super parallelizable in some sense

still grail
#

Yeah

dawn vine
#

coooool

still grail
#

So you can sigmoid post hoc

#

To basically get the same behavior

#

But keeping the weights dense

dawn vine
#

that's like FFFN

still grail
#

Yep

#

Oh haha also I think you can use 2:1 structured sparsity for each leaf node by just interleaving weights together and selecting one of two (or just inverting) the binary weight masks, maybe that's silly but I think it's hilarious.

#

Maybe specialized solutions would be better

#

But your halving idea made me think about that

#

They can always be loaded interleaved as one "set of weights" in memory, then the mask can be shifted up or down by one to "let through" the proper outputs for each item lolz

dawn vine
#

so basically (for the 1/2 size each level idea) for a tree of depth D we just compute D normal sized FFN's in parallel, right? then we can select a subset of the results based on the routing table
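
A NumPy sketch of that "compute D full-width FFNs, then select" idea, under some assumptions: ReLU FFNs, level `l` holds `2**(l+1)` experts that each own a contiguous slice of the `d_ff` hidden units, and routing uses the low bits of the token id. Masking hidden units before the down projection gives the same result as running only the chosen small expert, which the sparse reference loop verifies.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, depth, T = 4, 8, 3, 6

w_in = rng.normal(size=(depth, d_model, d_ff))   # one full-width FFN per tree level
w_out = rng.normal(size=(depth, d_ff, d_model))
x = rng.normal(size=(T, d_model))
tokens = np.array([0, 1, 2, 3, 6, 7])

def tree_ffn_dense(x, tokens):
    """Compute every expert at each level, then mask down to the routed slice."""
    out = np.zeros_like(x)
    for l in range(depth):
        n_exp = 2 ** (l + 1)
        width = d_ff // n_exp                      # expert size halves each level
        h = np.maximum(x @ w_in[l], 0.0)           # (T, d_ff): all experts at once
        owner = np.arange(d_ff) // width           # which expert owns each hidden unit
        mask = owner[None, :] == (tokens % n_exp)[:, None]
        out += (h * mask) @ w_out[l]
    return out

def tree_ffn_sparse(x, tokens):
    """Reference: run only the selected small expert per token and level."""
    out = np.zeros_like(x)
    for t in range(len(tokens)):
        for l in range(depth):
            n_exp = 2 ** (l + 1)
            width = d_ff // n_exp
            e = tokens[t] % n_exp
            sl = slice(e * width, (e + 1) * width)
            out[t] += np.maximum(x[t] @ w_in[l][:, sl], 0.0) @ w_out[l][sl]
    return out

dense = tree_ffn_dense(x, tokens)
sparse = tree_ffn_sparse(x, tokens)
```

The dense version wastes compute on experts that get masked out, but it is a handful of ordinary matmuls, which is probably good enough for running tests.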

still grail
#

Yes exactly

dawn vine
#

so cool

still grail
#

Wow I felt really giddy when you wrote that lol

#

This is really freaking cool isn't it ':DDDD

dawn vine
#

I really love it. Feeling giddy too

#

I have no idea if it works hahahahah

#

But it's really awesome

#

after considering it for a minute, my guess is that just like FFFN, it's only really helpful at inference time

still grail
#

I think a good blending strategy would be to apply the structure of the routing table to the tree to mix things together. So for example if the node structure is .1 .8 .2 during training, we can apply this structure to each node. Because we're not predicting the entire tree, only one index.

This has an advantage of not taking up much space, and also oddly enough it implies structure across the tree earlier on in training, which your hashing stuff shows us I think shouldn't be catastrophic for the model (and it should basically factorize out as the branch selection gets more and more popular)

still grail
still grail
dawn vine
#

yeah i have a [very frozen] community project about doing that branch locking for FFFN

still grail
#

Probably with some small penalty by having learned shared info between experts, but as you noted earlier, if the network can take the abuse of hashing, then a little bit would be not too terrible maybe

#

I have a feeling it will be a logarithmic kinda thingie

dawn vine
#

I think the key is just to make your choices early on in training and stick to em

still grail
#

Yeah that works I guess

#

I think as long as the bulk of the higher level conceptual layers (like higher up in the tree) shared between sub-nodes are generally informatively positioned, then that's more important than the structure of the particular specialist layers

dawn vine
#

I'm trying to think of a 'fair' test for my existing code that would add a second layer and work somewhat in this fashion

#

i guess i can use 4 normal sized experts at depth 1, and 8 half sized experts at depth 2 (to total the same as my current 8 normal sized experts)

still grail
#

Would it be 1/4, 1/4, 1/8, 1/8, 1/8, 1/8 depths?

#

(for the two layers)

dawn vine
#

well currently I choose 1 of 8 1x sized FFNs and add the result to the depth 0 1x sized FFN
so I was going to change that
from 1FFN + 1 of 8 1FFN
to 1FFN + 1 of 4 1FFN + 1 of 8 0.5FFN

#

more compute, but same parameters
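
The arithmetic behind that, as a small sketch with one full-size FFN as the unit of both parameters and per-token compute (`tally` is a hypothetical helper; one expert is assumed active per level):

```python
def tally(levels):
    """levels: list of (num_experts, size) pairs, size in units of one full FFN."""
    total = sum(n * size for n, size in levels)   # stored parameters
    active = sum(size for _, size in levels)      # per-token compute, one expert per level
    return total, active

current = tally([(1, 1.0), (8, 1.0)])              # 1FFN + 1 of 8 1FFN
proposed = tally([(1, 1.0), (4, 1.0), (8, 0.5)])   # 1FFN + 1 of 4 1FFN + 1 of 8 0.5FFN
print(current, proposed)
```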

still grail
#

Oh okay, gotcha, sorta b-tree kinda-like business goin' on ovah hier. ❤️ :')))) 👍

dawn vine
#

yeah only bc i need some baseline to test against

#

and it started life as an 8 way b-tree of depth 2

still grail
#

See you invented this first, you have a Schmid window schmid schmidhuber

dawn vine
#

haha you invented it too

#

i cant believe there are TWO schmidhuber emojis LOL

still grail
# dawn vine i cant believe there are TWO schmidhuber emojis LOL

I'm sorta a little excited. No need for details if there aren't any (sorry if so 😬) but I'm sorta curious about experiment plans if there's anything at the moment. Maybe I should look into trying it out under hlb-gpt at some point toosies, seems like a decent fit for it. ❤️ :'))))

dawn vine
#

would love to hear if you do

#

and if you need MoE code I just wrote some new very short code that works within deepspeeds MoE stuff

#

but tbh you're better off testing it without all the trimmings, in which case its ez nuff to just write (without expert parallelism support etc)

still grail
#

If there's any good raw PyTorch code I can take a gander, otherwise I can write it myself

#

It doesn't seem too complicated, especially as the routing embedding simply assumes that the switch controls every pair of branches the same way at each layer

dawn vine
#

yeah the only tricky part is that the switching has to be per token

#

and yet you want to operate in parallel

still grail
#

Yeah I'm probably gonna be lazy and just post hoc merge

dawn vine
#

but you can just do all the calculations and choose which to use

still grail
#

But I'd like to maybe take advantage of 1:2 / 2:1 / whatever sparsity if it's not horribly borked, to do the leaf interleaving trick

#

Oh. Hm. Maybe there are some fun ideas heresies. My experience leads me to believe that a universal, singular kernel launch is faster, but stills.......

#

Hm

dawn vine
#

well u can experiment with it being slow and crummy and wasteful and if you find something here is good we can then worry about an efficient implementation

#

as long as its fast enough to run some good tests

still grail
#

Yeah exactly

#

I've beat my head into the wall too often looking for clever fast implementations when sometimes just trying a bunch of dumb, (or maybe better yet, "dumb") shit works out pretty well in da endsies. ❤️ :")))) 👍

dawn vine
#

@still grail i was thinking some more about all this, and got to thinking about the relationship with DeepSeekMoE where they split up the FFN experts in to a zillion tiny parts which they mix and match

#

seems like our tree approach has similarities, except that we were going to choose certain wider experts as lower levels of the tree, where deepseekMoE is more like a shotgun, choosing say 50 out of 1000 same sized tiny experts

dawn vine
#

in the limit, deepseekMoE of a d_model x d_ffn x d_model FFN becomes d_ffn different minimal d_model x 1 x d_model FFNs
so what if we hash shuffled the actual minimal ffns, such that out 1000 we choose say 50 out of which we create a dynamically generated d_model x 50 x d_model FFN
effectively, each token gets its own custom FFN made out of 50 of 1000 minimal parts
this seems ideal but maybe very slow to compute
so maybe the idea is the tree approximates it but is less tricky for GPUs

fallen spear
still grail
fallen spear
#

it works okay

dawn vine
fallen spear
#

but if you enforce a consistency loss between experts it forces them to share information

#

you can also have

#

a lora

#

it can even be full rank

#

but

#

one of the matrices shared and the remainder experts

dawn vine
# fallen spear a lora

splitting the FFN up a la deepseekMoE is essentially decomposing it into its constituent LoRAs

fallen spear
#

there are a few sort of clear ways to reduce duplication

fallen spear
#

i disagree

#

lora can be full rank and the interaction between the decomposed matrices is multiplicative

#

these run "side by side" with each other

dawn vine
# fallen spear i disagree

how come? the math is identical isnt it if you split apart all the middle layer neurons from a big FFN into their own tiny FFN and sum the results of all of those tiny ones?

fallen spear
dawn vine
#

maybe we mean different things when we say LoRA

fallen spear
#

possibly

dawn vine
#

i really just mean a bottleneck 'FFN'

#

LoRA tho is typically adding the results of such a bottleneck to the base value of a full FFN

fallen spear
#

yeah this is true

dawn vine
#

sorry, I guess I'm abusing terms when I say LoRA and mean just a bottleneck

fallen spear
#

i am proposing no bottleneck

#

and no base

dawn vine
#

all I was saying is that a d_ffn wide FFN is made up of d_ffn 1-wide bottlenecks, summed
and that maximally, this is what DeepSeekMoE does as a decomposition
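
A quick NumPy check of that decomposition, as a sketch with a ReLU FFN and made-up sizes. It relies on the activation acting on each hidden unit independently, so the FFN output is exactly the sum of `d_ffn` separate `d_model x 1 x d_model` pieces.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_model, d_ffn = 5, 8, 12

x = rng.normal(size=(T, d_model))
w_in = rng.normal(size=(d_model, d_ffn))
w_out = rng.normal(size=(d_ffn, d_model))

def relu(z):
    return np.maximum(z, 0.0)

# the usual d_ffn-wide FFN
full = relu(x @ w_in) @ w_out

# the same thing as a sum of d_ffn one-neuron-wide bottleneck FFNs
parts = sum(np.outer(relu(x @ w_in[:, j]), w_out[j]) for j in range(d_ffn))
```

Dropping terms from that sum is then the same as selecting a subset of minimal experts.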

fallen spear
#

you simply express each ffn as a product of two matrices

#

i mean, that is true

fallen spear
#

one is not

#

presumably shared knowledge will end up in the shared one

dawn vine
#

'each ffn' as product meaning each expert?

#

sorry, are you saying that the expert's weights should be the results of two matrices?

fallen spear
#

yes

dawn vine
#

ahh

fallen spear
#

and one of those two should be shared over experts
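
One way to read that proposal, as a hedged sketch: each expert's up projection is generated as the product of a matrix shared by all experts and a small expert-specific matrix. Which factor is shared, the square `d_model x d_model` shape (no bottleneck), and all names here are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_model, d_ff, n_experts = 5, 8, 16, 4

shared = rng.normal(size=(d_model, d_model))             # shared over all experts, no bottleneck
per_expert = rng.normal(size=(n_experts, d_model, d_ff)) # expert-specific factor

def expert_w_in(e):
    # the expert's weights are never stored directly, only generated as a product
    return shared @ per_expert[e]

x = rng.normal(size=(T, d_model))
e = 2
via_generated = x @ expert_w_in(e)         # build the weights, then apply them
via_shared_first = (x @ shared) @ per_expert[e]  # apply the shared factor once, reuse for every expert
```

Since the two orders give the same result, the shared factor can be applied once per token, and gradients from every expert flow into it.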

dawn vine
#

the mixup was that I thought you were referring to how to calculate the expert, but you were referring to a kind of generative fast-weights idea of how to create the actual feedforward weights themselves rather than learn them directly

fallen spear
#

tbh i have never understood why there isn't more weird matrix factorization going on in general

#

we have a very solid proof of concept that in one specific case the technique gives you vast gains in efficiency

still grail
#

I think it's that it can create a lot of potentially unwanted instability over the course of training

#

Depending on how it's done

dawn vine
#

well this is pretty out there... i mean, i havent seen any code ever that learns weights indirectly like this

#

presumably due to stability reasons

#

in other words, i love it

fallen spear
#

not knowing why something isn't done makes me think I should not do it

dawn vine
#

oh wait, i thought that was supposed to be a double negative... does it make you want to do it or not?

#

🤣

fallen spear
#

it makes me want to do it, not knowing makes me assume that there's some good reason I should not that i don't know yet

#

instability is an interesting problem here

#

speaking of

still grail
#

I'd say go for it if you want if you're wanting my personal perspective on it, you just have to keep in mind that having weights generate weights as a function of other weights will sorta make the Jenga tower a bit more rocky unless you're rather clever somehow with it. :3 ❤️ :'))))) 👍

fallen spear
#

i have given up on using neox for ordinary experiments, it is just really not hackable enough

dawn vine
#

its specifically designed for LLM experiments

#

everyone who has used it loves it (that's not many people tho, and im sure it can be improved)

fallen spear
dawn vine
fallen spear
#

welp

dawn vine
#

i dont actually specifically want you to use it, it just seems convenient for the task at hand

#

since it was born out of my frustration w/ all the existing tooling

fallen spear
#

i have it and two other things bookmarked as maybes

dawn vine
#

it definitely has deficiencies.. for example its not designed to load existing weights particularly from other projects

#

its just designed for doing experiments w new architectural changes

#

esp on consumer hardware

fallen spear
#

i think i am definitely in an experiment running place, i will probably fiddle around before i settle on what's comfortable

#

i have uh

dawn vine
#

if u find something better for that i'd love to know that too!

#

bc i might use it 🙂

fallen spear
#

i have caught an excess of things to do in general so i am probably out of commission at least another week

#

the other two things i know of with similar are chili and neuralink's thing and another sandbox someone has

#

lemme get off of mobile

dawn vine
#

cool, link me to em when u get a chance id like to check them out

dawn vine
fallen spear
#

i have another one somewhere

dawn vine
#

hmm nanodl looks the most relevant, but it's jax and not focused on data-based configurability and ease of running

still grail
#

@dawn vine i think i figured out a good general landing zone for a dynamic weight communication motif

#

(whoops, sent it earlies deresies 😬 😬 😬 😬 😭 😭 😭 😭 )

#

i think if'n we do indeed lean on the input token (or something similarly fixed) as the key for choosing the experts within some paradigm, and that holds up in some way (which i think it should, even based on the hashing business that you were doing earlier), then all methods that rely on that for dynamic weight generation can, i think, do AOT (ahead-of-time) lookup and communication, as long as the function combining those values is linear

#

I know that's for dynamic weight generation which is a little different

dawn vine
#

u can ping me any time ill have it off if im asleep or whatever 🙂

still grail
#

but being able to do it AOT should hide some of the latency of communication a bit

#

just simply because the only dependency for generating weights is "input token", and so that means it's easier to do all of the layers sorta all at once instead of with a sync each time

#

might not be useful but just something i was chewing on w.r.t the distributed side of the dynamic weight generation side/part of the problem. :3 🙂
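
A minimal sketch of the "ahead of time" point, under stated assumptions: if the expert choice depends only on the input token, the routing for every layer can be tabulated before any layer runs. The `route` function below is a hypothetical stand-in (any deterministic function of token and layer would do), and `plan_routes` and `experts_needed` are illustrative names, not anything from an existing codebase. It covers only the lookup side, not the linear weight-combination part.

```python
def route(token_id, layer_id, num_experts):
    # Hypothetical token-keyed router, used only for illustration.
    return (token_id * (2 * layer_id + 3)) % num_experts

def plan_routes(token_ids, num_layers, num_experts):
    """Build the full per-layer routing table before the forward pass starts."""
    return [[route(t, layer, num_experts) for t in token_ids]
            for layer in range(num_layers)]

def experts_needed(plan):
    """Per layer, the experts that will have to be fetched or messaged."""
    return [sorted(set(layer_routes)) for layer_routes in plan]
```

Since nothing in `plan_routes` reads an activation, the transfers for all layers could in principle be scheduled up front instead of after a sync at each layer.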

dawn vine
#

like either way u gotta wait for the prior layer/block to complete so you have the embeddings to transfer over
but yeah at least you don't THEN have to operate on them before knowing where to send em

#

i skimmed a paper that did a kind of pipelined work to allow the comms to get pipelined as well

#

i assume they split up the prior layer's work into chunks, so that each chunk could start getting sent across the wire when it completes

still grail
#

@dawn vine @fallen spear still working on this, definitely coming along the train of thought that 'switching is more expensive up front even if it gives more capacity', I've visited the idea of branching/cloning weights in the past so I want to revisit/keep visiting these things.

#

As far as the binary tree stuff goes, I'm still exploring this! Trying to find a semi-efficient proxy that's maybe's a halfway solutions heresies.... ❤️ :')))) But I am definito makin' some progress.

#

Right now I'm exploring something semi-completely different that I've visited a number of times over the past few months, some of @fallen spear's comments keep spurring me to look at it again.

#

Some promising initial progress on that, but no word on it yets. ❤️ :'))))

dawn vine
#

Cool, I've been temporarily distracted from MoE by other stuff but eager to get back to it!!

dawn vine
#

@still grail im back at it on MoE experiments now if you want to revisit

dawn vine
#

Unclear if that's because the new hash function was just luckier or it actually matters

#

I multiplied by a prime number for each layer: expert_idx = (token_idx * primes[layer_id]) % num_experts
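
A small sketch of that formula, assuming `token_idx` is the token's vocabulary id. The prime list and the `hash_route` name are made up for illustration; only the expression itself comes from the message above.

```python
# Hypothetical per-layer primes. They are all odd, so they share no factor
# with a power-of-two expert count.
PRIMES = [3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41]

def hash_route(token_ids, layer_id, num_experts, primes=PRIMES):
    """Return one expert index per token for the given layer."""
    p = primes[layer_id]
    return [(t * p) % num_experts for t in token_ids]
```

With 8 experts, layer 0 sends tokens 0..3 to experts `[0, 3, 6, 1]` and layer 1 sends them to `[0, 5, 2, 7]`, so the same token lands on different experts at different depths. One caveat: if a layer's prime shares a factor with `num_experts`, only `num_experts / gcd(p, num_experts)` experts are ever selected in that layer.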

fallen spear
dawn vine
#

i may really use this in production lol
kinda scared

#

i also have a maybe even more nutso routing mechanism (like how I actually choose where to send each token, not just what the hash said to do) thats like beyond stupid

#

but as a result its fast and wastes no 'capacity' 🙂

#

coming soon to a RWKV near you ™️

still grail
#

like its turned into an entire freaking pre-release for hlb-gpt that has nothing to do with moes (not really at least, lolz)

#

and for some reason i have cleaned up the entire freaking codebase and annotated it more

#

this has been about one of the strangest tangents i've maybe taken in my life but well, here we are lol

#

hopefully it pans out, doing tuning experiments now that wikitext-103 is back online :3 ❤️

dawn vine
#

lol! okay well sounds productive in some direction at least!!!

fallen spear
still grail
granite plover
#

fighting the urge to buy the wikitext-103-raw-v1.zip domain just to serve it

still grail
#

i love the community redundancy so freaking much, lolz ❤️ ❤️ ❤️ ❤️ :'))))

dawn vine
#

sorry @still grail I figure this is maybe the right place to discuss MoE-like and FFN size related things after all 😁

#

anyway I was interested to see you chose 2x ffn in HLB-GPT as the default

#

I'm about to embark on a larger 1.5B training run for my RWKV-6 MoE using hash routing

#

seems to work great, and I use a somewhat unusual configuration where I put MoE as a second FFN after the normal pretrained one, but additively (it starts at zero contribution and I use this to continue a pretrained non-MoE model)
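
A rough sketch of that layout on a single token vector, in plain Python. Two things here are assumptions rather than details from the message: that "after, but additively" means a second residual branch applied to the dense FFN's output, and that the zero starting contribution comes from zeroing the expert's output projection. All names are illustrative.

```python
def ffn(x, w_in, w_out):
    """Tiny FFN on one token vector: relu(x @ w_in) @ w_out."""
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, col))) for col in zip(*w_in)]
    return [sum(h * w for h, w in zip(hidden, col)) for col in zip(*w_out)]

def zero_init_expert(w_in, d_model):
    """Expert whose output projection is all zeros, so it adds nothing at first."""
    d_hidden = len(w_in[0])
    return (w_in, [[0.0] * d_model for _ in range(d_hidden)])

def block(x, dense, experts, expert_idx):
    """Pretrained FFN with its residual, then the routed expert added on top."""
    x = [xi + yi for xi, yi in zip(x, ffn(x, *dense))]
    return [xi + yi for xi, yi in zip(x, ffn(x, *experts[expert_idx]))]
```

At initialization the block returns exactly what the pretrained dense path would return, which is what lets training continue from a non-MoE checkpoint without a jump in loss.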

#

would be great to hear where you landed on your MoE experimentation and if you ended up with any of the tree-like stuff in there

#

and I think this is all interesting to consider relatively with regard to Mixture of Depths

still grail
dawn vine
#

cool, thanks!

still grail
# dawn vine anyway I was interested to see you chose 2x ffn in HLB-GPT as the default

you might rock your own socks off at this, but this is actually technically a 1x parameter! 1x for local, 1x for remote. I couldn't find a configuration that performed better, twiddling the v_dim and the expand_dims around (univariate + multivariate experimentation case) didn't seem to do so well. some of that might just be linear/blocksize stuff/whatever, some of it might be something else, it felt more like a 'something else' than a raw hardware efficiency thing as runtimes seemed to be not too dissimilar between them.

The one direction that did seem to have the 'lowest loss' of performance was increasing expand dim, and I think increasing v_dim? However, the ratio between v_dim and the qk_dim should always be 8, the dim reduction parameter ties the qk_dim to the normal dim so you might have to fiddle a bit there if so

#

this network seems to really, really, really like going a bit narrower and deeper for some reason

still grail
#

I did do some initial tree stuff, but nothing too complicated, and it didn't pass my complexity and/or speed requirement test, so I put it back on the shelf until I could think of something with a similar vibe that was super efficient

still grail
dawn vine
#

oh maybe this V size vs QK is related to not using attention heads!

still grail
dawn vine
#

ah I grok your meaning of local vs remote now

still grail
#

I'm honestly not sure, that thought is still like 35-45% thought out or whatever

dawn vine
#

the local vs remote part seems to me related to the stuff @boreal moss had discussed with me and then here earlier about splitting off a subpart of the embedding to do attention on

#

so that you can do 'smaller attention' on a huge embedding

still grail
#

yes it is also exceedingly computation efficient

dawn vine
still grail
#

many, many fewer kernel launches, and im guessing the training stability is much better too as you're not having to forcibly pass things through a longer residual for certain kinds of operations

#

though I don't have a good example at hand that shows that exactly

#

just sort of, if you had some kind of feature that required 3 attention operations or whatever, having to pass through 3 residuals instead of 6 bodes extremely well for not just computation time, but also training stability I think

#

that's again a bit arbitrary and not necessarily grounded in the reality of what's happening there, just trying to paint a bit of a rough picture of some of the reasoning there

dawn vine
#

yeah i tend not to care too much about kernel launches bc I pretty much require torch.compile which does away with that as an issue if done right
but stability thru fewer layer-like things could certainly be a benefit

#

and it still is probably quite a bit faster since it can use GPU parallelism more fully

still grail
#

well some of my experiments are just like ~40-100 seconds or so which is less than a lot of compiles so for now at least im staying flexy (am experimenting with some potential partial compiles for the longer runs tho)

still grail
dawn vine
#

yeah hey im very into rapid iteration - i feel like no one else but you and I like that in the ML world 🤣

still grail
#

i have no idea why, the days where I run 4 long experiments vs 200-300 fast experiments are like so night and day different lol

dawn vine
#

agreed 100%

still grail
#

long as the proxy scales (and my workflow is built to try to make sure that it does)

dawn vine
#

yeah that part im unsure about tbh

#

there are a ton of things ive found which work amazing until like 1-2GTok in

still grail
still grail
#

well part of this too I think was thinking about the transformer scaling issues

#

the linear attention value I think doesn't scale precisely because it is linear

#

it can't suppress stuff

#

so in order to do so the network has to absolutely spike

#

and only the large models seem to handle that

still grail
dawn vine
#

are you talking about linear attention? or the value of attention in traditional softmax dot product self attention?

still grail
#

linear values

#

at least that's my hypothesis for why the spiking occurs

#

i really should check to see if that happens in these networks as well

dawn vine
#

we dont find a lot of problems in training stability for RWKV btw

still grail
#

that's lovely

dawn vine
#

theres a lot of normalization that goes on which may help with that

#

and linear_attention/linear transformer is just kind of fundamentally different too

still grail
dawn vine
still grail
# dawn vine and linear_attention/linear transformer is just kind of fundamentally different ...

well also, the data-dependent gating seems to be an interesting/novel thing, i was reading through your code to try to understand it a bit better, and the double-lerp "check if trajectory is here based upon some data-dependent scaling, then scale the delta in trajectory by some lookup value if so" (if I understood that code correctly) method seemed really interesting, for example

still grail
#

I feel like I might become a state space model fanatic if I understood the whole general idea/had a good toolkit behind it. Sorta gotta still work on wrapping my head around that.

dawn vine
still grail
dawn vine
#

SSM is real simple conceptually if you need a quick explainer...

#

took me a while to understand from the papers tho, theyre a real bear

still grail
dawn vine
#

re RWKV-6 I tried to add some other explainer stuff to the paper too so maybe its readable enough even tho its not code exactly - and the recurrent formulation math looks just like two lines of code, which is nice

dawn vine
still grail
dawn vine
#

cool, did you try using different ratios for the ffn side vs att?

still grail
#

shared norm layer and all of that

#

(which i like a lot)

still grail
dawn vine
#

ah i see

still grail
dawn vine
#

The one direction that did seem to have the 'lowest loss' of performance was increasing expand dim, and I think increasing v_dim

still grail
#

the way the v_dim is written is to try to encourage the user/etc to think of the whole thing as a shared space with different allocations for different places

#

yep

#

they all seemed to perform not quite as well, but that one seemed to have the least proportional hit to the number of parameters being changed

#

(i think i resized the net though to make sure the overall number remained roughly the same, not sure tho)

dawn vine
#

yeah so this gets back to the topic of this thread, which is what the proper expansion ratio really is

still grail
#

yep

#

it might be different for different things

#

at least for the 125M model as well, 1+1 seems to be holding strong on the more information dense wiki task

#

but maybe for like open pretraining, it will need a wider ratio

#

but it really does like going deep

#

the 125M model is like 28-32 layers deep lol

dawn vine
#

i kind of think your way of doing this is better and is like what mamba should be doing

still grail
#

XD

still grail
dawn vine
#

they end up doing 'attention' on the fully expanded version (2x or whatever) so its slow

still grail
#

oh interesting

#

yeah at least for the keys i dont think its as necessary

#

which honestly makes sense, you do a lower dim check to see if you should move the activations, then the activation moving should take up the bulk of the work

dawn vine
still grail
#

if your precision on the lookups is super high then relative to the amount of work we're spending on the actual activations themselves it seems a bit wasteful

still grail
#

yeah

dawn vine
#

just think of SSM as attention here

#

conv is akin to the LERP we do in RWKV

still grail
#

okay, gotcha

dawn vine
#

its just like a kernel size 3 Conv1D

#

can ignore it

still grail
#

oh dang thats tiny

#

wow

#

so basically stacking lots of local stuff

#

similar to the shifting business

#

(like you were saying earlier)

dawn vine
#

yeah same idea, Bo uses it to allow there to be induction heads in a single layer

#

but its not relevant to anything we're discussing particularly

#

just a cheap added boost

still grail
#

gotcha

dawn vine
#

but you can see there how they expand 2x (or whatever size) up front

still grail
#

well it sort of looks like you're forced to keep the higher dims in this motif, as the sigmoid gate post-SSM is also in that higher dimension

dawn vine
#

in exchange for this they use 2x the layercount btw

#

bc they no longer have a separate FFN

still grail
#

2x or half?

dawn vine
#

2x!
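
Back-of-envelope arithmetic for why the layer count doubles, counting only the big projection matrices. The 4x FFN expansion and the expand factor of 2 are assumptions for the sketch; SSM parameters, the conv, norms, biases, and embeddings are all ignored.

```python
def transformer_block_params(d, ffn_mult=4):
    attn = 4 * d * d            # Q, K, V, and output projections
    ffn = 2 * ffn_mult * d * d  # up and down projections
    return attn + ffn

def mamba_block_params(d, expand=2):
    # Projections only: 2 * expand * d^2 going in (main and gate branches),
    # expand * d^2 coming back out.
    return 3 * expand * d * d
```

Under those assumptions a transformer block is about `12 * d^2` parameters and a merged block about `6 * d^2`, so two merged blocks roughly match one attention + FFN pair.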

still grail
#

oh

#

interesting

#

i did try a no-ffns motif

#

it honestly was not too terrible

#

not the best but really not terrible

#

having that split seems to be useful somehow

dawn vine
#

yeah i eventually got this style to work fine but it was slower

still grail
#

(and maybe cheaper too XD)

#

I would love to see this split latent space idea come to recurrent networks

dawn vine
still grail
#

ive been thinking about the MOE+this layer version of things

#

maybe its just as simple as a silly expert gating thing

#

maybe not tho XD

dawn vine
#

well i have VERY simple crazy code now for moe that doesnt do much work

still grail
dawn vine
#

yeah MoE on attention is nearly impossible

still grail
dawn vine
still grail
dawn vine
#

attention is looking back at past tokens, but if you make those tokens vary depending on the expert, thats... hard to imagine an implementation

still grail
#

yeah i was thinking about it sort of from the ODE-like interpretation of transformers

#

if you think about it as a process slowly moving vectors from one place to another, and each layer is specialized at one part of the process, then you honestly would have to switch them all at once, or not at all

#

and, the other layers sort of depend on that together as well (the mlp layers i thinks.... ❤️ :')))) )

#

I think the traditional MoE style of doing things assumes a very particular style of 'do XYZ routing locally per layer', which i don't think can (easily, at least) capture this more global kind of context to it

#

so i guess from that, I feel like moes on attention could work if done in an 'all-at-once', unified kind of manner

#

as you are at that point mainly just making a '1-of-n' selection of possible trajectory sets over the course of the entire network, from layer 0 to the output layer

#

i dont think it'll work otherwise really, as the local moes for the attention layers will be almost nonsensical w.r.t. each other

#

that being said, i dont think that version of an explanation really works all that well for explaining how the FFN moes work, or would work w.r.t. a more globally-sliced attention MOE

#

(if that makes sense at all.... ❤️ :'))))

#

I'm sure there'd be some interesting set of slicing dimensions where you could slice almost vertically across attention MoE layers (subsets dimensionally, across the entire vertical depth of the network) instead of horizontally (big giant chunks at each block?), if this motif worked?

dawn vine
#

i think i gotta spend some time adjusting my brain to thinking about time-related MoE before I can have a reasonable thought about it 🤣

still grail
#

It might not, but I feel like that would sorta at least be a semi-required starting point unless there is some fantastic trickery going on here

still grail
dawn vine
# dawn vine

getting back to this flowchart for a second you do your gating before the attention which is an interesting difference

fallen spear
#

there's a paper where they gate an entire layer ahead of time

dawn vine
#

we do ours afterwards, and that's kind of more standard now

fallen spear
#

it worked fine

#

they did this for resource reasons, it gives you time to fetch the expert from cache

dawn vine
fallen spear
#

so you make your decision, perform other operations and run your expert un-caching in parallel

#

rant_about_hash_routing.jpg

#

most things work, routing is black magic

dawn vine
still grail
fallen spear
#

so it runs iirc

#

gate -> some_ff -> attn -> gated_ff

#

but for every ff

#

i might have that paper somewhere, not sure

#

they did this specifically because they were trying to use an moe while very memory constrained so the extra time between the gate and the gated ff was for fetching it

#

especially: allowed them to have more experts in colder RAM
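
A toy sketch of that overlap, not the paper's implementation: the gate decides early, the expert fetch runs in a background thread, and the layer only blocks on it when the gated FF is actually needed. `COLD_STORE`, `gate`, `fetch_expert`, and `attention` are all hypothetical placeholders.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for experts parked in slower memory.
COLD_STORE = {0: "expert-0 weights", 1: "expert-1 weights"}

def gate(x):
    # Hypothetical gate: picks an expert from the layer input alone.
    return int(sum(x)) % len(COLD_STORE)

def fetch_expert(idx):
    # Stand-in for a slow load from colder RAM or disk.
    return COLD_STORE[idx]

def attention(x):
    # Placeholder for the work that overlaps with the fetch.
    return [2 * v for v in x]

def layer(x, pool):
    idx = gate(x)                             # 1. decide early
    pending = pool.submit(fetch_expert, idx)  # 2. start the fetch in the background
    h = attention(x)                          # 3. do the other work meanwhile
    expert = pending.result()                 # 4. wait only if the fetch is not done yet
    return h, expert
```

Called as `layer(x, pool)` inside `with ThreadPoolExecutor(max_workers=1) as pool:`. The trick only pays off when a token needs a small subset of experts, which matches the later point that batching makes it moot.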

#

no wonder i keep forgetting golang trivia i am just filling my entire long term memory with this sort of thing

fallen spear
#

when running non batched mostly

#

in batch regime you need roughly all of your experts so this gets you nothing

dawn vine
#

O ok I am mostly thinking about train time

#

still I guess this has bearing on when gating can occur profitably

fallen spear
#

yeah they were running some model under fairly absurd resource constraints

#

i would be surprised if it matters a lot where you put your gates relative to the MoE layer

dawn vine
# still grail fair enough, XD these are just the scattered ramblings of a slightly-woozy, high...
#

they keep K and V the same across head experts, but allow Q and O to vary, to make it feasible

dawn vine
#

conveniently, this setup works just as well for linear transformers!

soft bobcat
#

@rose vapor wrote a paper on GLU variants: #research message

#

I personally did not find sin to work in any circumstance in LLMs

#

it's interesting it was the best one tested

dawn vine
#

yeah saw that, was curious

#

@rose vapor did you ever try FOUR layers? (three multiplies)

soft bobcat
#

the specific format he chose, I think would also result in gradient explosion in my tests

#

because the oscillations become very steep away from x=0

#

oh, he only tested applying one activation but not 3, rip

rose vapor
#

I didn't have the budget to try other data modalities other than vision, but I provide a code snippet to implement the SinGLU function

#

I suspect the results I got are probably very dependent on the data augments common in vision transformers such as label smoothing etc.

#

I'm not sure what the equivalents are in language, sorry.

rose vapor
dawn vine
#

x1 * x2 * x3 * x4

rose vapor
#

Ahh, nope

#

Just 1 - 3

#

With only one being passed through an activation as this is what SwiGLU does under the hood
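
For concreteness, a sketch of the "n projections, n-1 multiplies, one activation" family being discussed, on a single vector in plain Python. The `multi_glu` name and the choice of SiLU as the default activation are assumptions for illustration; this is not the snippet from the paper.

```python
import math

def matvec(x, w):
    """x (length d) times w (d rows, h columns) -> length h."""
    return [sum(xi * wij for xi, wij in zip(x, col)) for col in zip(*w)]

def silu(z):
    return z / (1.0 + math.exp(-z))

def multi_glu(x, weights, act=silu):
    """act(x @ W1) * (x @ W2) * ... * (x @ Wn), elementwise."""
    first, *rest = [matvec(x, w) for w in weights]
    out = [act(v) for v in first]  # only the first branch gets the activation
    for proj in rest:
        out = [o * p for o, p in zip(out, proj)]
    return out
```

Two weight matrices give the SwiGLU-style gate, and four give the `x1 * x2 * x3 * x4` case asked about above.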

dawn vine
#

yeah we tried a lot of this stuff early on in this channel on language modelling, you can probably look back and see it all

rose vapor
#

Awesome, will do

soft bobcat
#

his test results are also interesting. this one sometimes does well

rose vapor
#

Yeah the results sort of couple by number of matrices. I suspect this is because of the general shape of the output

#

With 1st order GLUs having a linear quality and 2nd order being sort of parabolic

#

Which order works best depends on the activation, with 2nd orders working well for Sigmoid, but 1st orders working well for Sin and Tanh

#

With sin it's easy to understand why. The first matrix does frequency modulation and the second does amplitude.
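
A sketch of that reading, assuming the form `sin(x @ W_freq) * (x @ W_amp)`. The argument names are invented here to mirror the frequency/amplitude description, and this is not the paper's own snippet.

```python
import math

def matvec(x, w):
    """x (length d) times w (d rows, h columns) -> length h."""
    return [sum(xi * wij for xi, wij in zip(x, col)) for col in zip(*w)]

def sin_glu(x, w_freq, w_amp):
    """Assumed form: sin(x @ w_freq) * (x @ w_amp), elementwise."""
    phase = matvec(x, w_freq)  # sets how fast the sine oscillates with the input
    scale = matvec(x, w_amp)   # sets how large the output is
    return [math.sin(p) * s for p, s in zip(phase, scale)]
```

Because the gradient through the sine branch scales with the size of `w_freq` and the linear branch grows with the input, keeping the inputs normalized and near zero is one plausible reason the pre-LayerNorm matters here.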

#

I suspect the Pre-LayerNorm is crucial for SinGLU to work. I have numerical experiments showing this mostly kills oscillations in the loss landscape

soft bobcat
#

so you managed to restrict the input of singlu to a region around x=0. that prevents the huge gradient oscillations

rose vapor
#

Exactly!