#d_ff/d_model + swiglu tests


fallen spear
#

Created so I can stop spamming #research and #off-topic with it

#

original thread in #off-topic begins here: #off-topic message

#

thread in #research : #research message

#

Hypothesis to test:

  1. d_ff/d_model is optimal at exactly 1
  2. swiglu and geglu are worse than gelu, and specifically only appeared to be better because the adjustment for isoflops brings d_ff/d_model closer to one
glacial schooner
#

in the swiglu/geglu case they adjust the FF ratio without changing param count and without changing the fraction of params allocated to FF vs attn

#

(bc the meaning of "FF ratio" changes with those activations)

#

with a non-GLU FF, if you change the ratio, then either you change total param count or you change the FF vs attn balance

#

so it's not obvious to me that "use the FF ratio from the GLU paper, but without the GLU" has a clear meaning
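
To make the "FF ratio means something different with a GLU" point concrete, here is a small sketch that counts feed-forward weights only (no biases). It assumes the usual layout of two matrices for a plain FF and three for a gated one; the function names are made up for illustration.

```python
from fractions import Fraction

def ff_params(d_model, ratio):
    """Weights in a plain FF block: up (d x r*d) plus down (r*d x d)."""
    return 2 * ratio * d_model ** 2

def glu_ff_params(d_model, ratio):
    """Weights in a gated FF block: up, gate, and down projections."""
    return 3 * ratio * d_model ** 2

def param_matched_glu_ratio(ratio):
    """GLU ratio that keeps the FF weight count equal to a plain FF at `ratio`."""
    return Fraction(2, 3) * ratio

d_model = 768
plain = ff_params(d_model, 4)
matched = param_matched_glu_ratio(4)          # Fraction(8, 3), about 2.67
gated = glu_ff_params(d_model, matched)
print(plain, gated, float(matched))
```

Under these assumptions a plain ratio of 4 corresponds to a gated ratio of 8/3: total params and the FF vs attn split stay put while the number called "ratio" moves.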

fallen spear
#

Proposed test:
Retrain pythia with d_ff/d_model from {1, 2} and activation of {gelu, swiglu, geglu}
Will do smallest pythia first

glacial schooner
#

so, not keeping the param count identical?

fallen spear
#

not the number of parameters

#

isoflops is a bad test

#

isoflops requires you to alter at least two hparams at once

#

in this case specifically if swiglu/geglu are worse or equal they have more parameters

#

so they really should be better

glacial schooner
#

if only d_ff is changing, it seems intuitive that larger d_ff would perform better (up to some point where training goes unstable or smth). bc the larger d_ff model "could" always just ignore some of its neurons, if that were optimal

#

but ofc that's just a guess.

fallen spear
#

Okay I am tired of typing the whole expression so I am going to call d_ff/d_model “hidden ratio” henceforth

#

I have seen zero empirical evidence that hidden ratio above one is good

#

So far as I can tell it was set at four in the original transformers paper and never questioned

#

Swiglu/geglu modify it to maintain isoflops

#

For weak indicia that one is the optimal ratio see images attached to this post

#

Goal is to get stronger indicia

glacial schooner
#

makes sense, if you're keeping total params constant

#

(which was true in both of those images)

#

otherwise, higher ratio = more params, and it seems hard for more params to hurt given the general smooth scaling properties of transformers

#

another way to look at it is: adding a new FF block is not too different from doubling the size of an existing one. like if the depth is already large, and representations are changing smoothly across layers

fallen spear
#

The new paper says it keeps d_model constant

glacial schooner
#

so it's hard to have a situation where half of the FF neurons at layer N are actively harmful, without it also being the case that if you added layer N+1, its entire FF would be harmful. which would be wild

glacial schooner
#

hence "41M," etc labels

fallen spear
#

I’m happy with a totally flat line across hidden ratio

#

It proves the point cleanly

fallen spear
fallen spear
#

In fact, cleaner to show inflection:

granite plover
#

Could remake this plot with the x axis in log scale?

fallen spear
#

the uptick in the biggest model in ppl as you reduce dim is the only datapoint that looks at all favorable, the others only gain ppl after the hidden size drops below the model size

dawn vine
#

regarding the stated hypothesis, from personal experience (including a single training run I just did to make sure) for small models (say 123m params) ff_ratio>1 is better than ff_ratio==1 when keeping everything else the same and training for the same number of tokens
whether or not it's a worthwhile tradeoff in place of other kinds of resizing, I don't know

#

maybe if you adjust for the amount of layers to make the same total parameter count, ff_ratio=1 is optimal; that's what impact_depth_width_results.csv implies to me

dawn vine
# fallen spear Hypothesis to test: 1) d_ff/d_model is optimal at exactly 1 2) swiglu and geglu ...

Ran three short 50M-token tests, and if adjusted for parameters by adding layers, your hypothesis won:

params=123m ff_ratio=3 n_layers=12 d_model=768 loss=3.664
params=123m ff_ratio=1 n_layers=18 d_model=768 loss=3.656
params=123m ff_ratio=1 n_layers=12 d_model=900 loss=3.666

But with caveats: despite having the same parameter count as the first model, the 18 layer model was about 25% slower in actual training time per token

tardy trench
dawn vine
#

yeah, the other thing to take into account is that mundane things like autograd may cause your memory usage to be a lot higher with 18 layers than 12. Mine certainly was.

tardy trench
dawn vine
#

Despite the slowdown, I think this ff_ratio=1 may be a good tradeoff, at least at lower scales... it seemed a lot more effective than adding parameters in more traditional ways in terms of speed versus loss improvement!

fallen spear
#

will also increase attn cost but that amortizes with scale

dawn vine
#

argh very sorry, I quoted bad numbers above.. the result is the same but it's a closer call than I thought

#

will update in a minute

#

fixed above:

params=123m ff_ratio=3 n_layers=12 d_model=768 loss=3.664
params=123m ff_ratio=1 n_layers=18 d_model=768 loss=3.656
params=123m ff_ratio=1 n_layers=12 d_model=900 loss=3.666
fallen spear
#

i think param count off?

dawn vine
#

yeah sorry i keep making bad edits

#

it was right above, i just tried to copy/paste and deleted part and retyped it incorrectly 🙂

fallen spear
#

you good, i just don’t wanna be reasoning from off numbers

dawn vine
#

even slower with bigger d_model

#

oh one other thing I should mention, this is a little nonstandard - it's using essentially the rwkv ffn and my own modified attention sublayer so take it with a grain of salt

fallen spear
#

i'm going to guess it has to break the matmul into two ops

dawn vine
#

but it should be a reasonable approximation

fallen spear
#

main reason to use a standard transformer is just for the 1:1 comparison

#

does make it harder to reason about why it would be breaking the matmul up though

dawn vine
#

yeah i just had this lying around and figured id run one experiment.. then that became three

fallen spear
#

i am blessed with bad local hardware which gives me the go-get-a-runtime friction to prevent me from yoloing

dawn vine
#

its easy for me to run it on standard models - i have them implemented too - just had the data from a run here already

dawn vine
fallen spear
dawn vine
#

yeah, ive used jarvis and vast mostly

#

agreed about the 0 availability, it sucks

#

i almost went out and bought a 4090 a couple months ago bc of that

#

i may still

fallen spear
#

i am broke but tempted to buy a used a100 on credit which is not a good impulse

dawn vine
#

4090 is fine if you can get along with 24gb

#

roughly same speed as a100 in my experience

fallen spear
#

memory constraint limits some stuff but fair enough

#

… wait I do have access to a 4090

#

here’s hoping he’s awake

dawn vine
#

surprisingly, it looks like the larger d_model version is only about as good as the original wide ffn

#

updated the chart above

#

well, that was interesting!

fallen spear
#

thanks, it does suggest pretty strongly that 1 isn’t actually the good number

soft bobcat
#

do you have estimates of the standard deviations of these losses, even if they're just manual guesses?

dawn vine
soft bobcat
#

no, that's perfect. it's exactly the kind of plot that we can stare at to eyeball std devs

dawn vine
#

well, interestingly 1 was the good number here, if you didn't care about runtime and adjusted for parameters by adding layers

#

I didn't necessarily expect that, though I really had no idea what to expect!

#

all I knew going in was that 1 is severely suboptimal if you can just add size to the FFN without worrying about cost

fallen spear
#

i think the paper that set me off did that tradeoff and it continues to be true until about 400m params with ratio at about 2

#

ratio below that not tested for that many params

#

i went nuts because i’ve been bothered by a lack of ablations to d_ff for a while

dawn vine
#

makes sense - i was definitely interested when I saw this thread and thought about it at all

fallen spear
#

gotta say, the relatively small gain in loss for throwing out 75% of the parameters still an interesting result

#

but those are some nice and steady losses, it’s impressive actually

dawn vine
#

its on a small randomized part of the pile validation set

#

the training loss chart looks a lot choppier 😉

fallen spear
#

lol fair

soft bobcat
#

since I'm studying learning rates right now, I'll comment that I believe learning rate is supposed to decrease when depth increases

fallen spear
#

loss goes up a smidge as they max out on depth towards the end

#

hparams identical across all runs

dawn vine
fallen spear
#

yeah uh

#

this table looks real noisy to me

fallen spear
#

is this actually telling me the effective rank of a ff layer of this size is less than ten

fallen spear
dawn vine
#

what's interesting about it is that at 12 layers (for this model, the 134m one), it seems 'optimal' in that you pack as much linearly independent info as possible into each layer

#

[probably not] coincidentally, that's exactly how many layers people usually use for such a d_model

fallen spear
dawn vine
#

the main result I see is that it looked like in both 134m and 374m they were optimal ppl at around 2.66x ff_ratio, but a lower ff_ratio more like 1.5 for the tiny 41m model

#

which interestingly does not at all match my tests from last night

#

"When 𝑑model/𝑑ff > 1 (red dashed rule), perplexity slowly increases"

#

I don't see that as being true in their data at all

#

perplexity looks clearly lower to the left of that red line for all of the models

#

not hugely, but it's still lower!

fallen spear
#

and the csv above it

dawn vine
#

yeah, im talking about the ratio not the ff size which is what your plot shows

fallen spear
#

i am trapped on mobile but i think the only one that gets worse before crossing ratio of 1 is the biggest one

#

yeah you have to go get d_model to figure where it should be on that plot

#

for each curve

#

can also check the csv for it

#

i am ignoring every metric besides validation ppl tbh

dawn vine
#

oh i didnt expand the csv, only saw first few rows... looking now

fallen spear
#

i guess the middle size gets slightly worse

#

but they also … add four layers

dawn vine
#

wow my eyeballing was almost exactly right

#

best ratio for the biggest ones was around 2.66

#

smallest one very hard to tell from the data

#

but likely somewhere between 0.7 and 1.4

#

as mentioned, this is directly in opposition to my test results last night

#

which showed 1.0 as being useful when increasing layers accordingly

#

(w caveat of slowdown in training)

fallen spear
#

i would suspect hparam issue for depth

#

but also, that’s a different and weaker hypothesis to test than that hidden dim is useless or only marginally useful

dawn vine
#

i just ran more tests w/ different LR for the deep model.. limited evidence but lower LR was not better

dawn vine
#

but in reality I'm not sure I can train fast enough with deeper models to make it 'worthwhile' to go to 1.0

fallen spear
#

specifically bc it harms both training and inference time like that, yeah

dawn vine
#

got other ideas besides increasing d_model? that didn't pan out for me

fallen spear
#

switch activations to something fancier

#

add attn heads

#

too involved to test on a lark, but: add attn sink tokens

dawn vine
#

yeah increased attn size/heads could be helpful

fallen spear
#

which doesn’t add params but does increase flops

dawn vine
#

i will try more heads (so bigger total attn size)

fallen spear
#

can do moe with four experts at ratio 1 for same cost

#

eight experts if only the down projection is an expert

#

i guess i am assuming no params in routing function or that such params are negligible

dawn vine
#

i cant try MoE easily, and personally I'm more interested in practical ways to increase function of the base transformer than adding different architectures like that into the mix

#

but i can try heads/attention sizing easily

#

that's not to say MoE isn't a valid idea

fallen spear
#

both moes and factorization are hobby horses i’d want to keep as last resorts

#

bc they don’t test the hypothesis cleanly

#

entirely separate cans of worms

fallen spear
#

but much like quantization that could only work after training is done

#

so i’m not chasing it because it doesn’t test the hypothesis meaningfully

#

oh you could also have up to four separate ff + geglu for equiparams, four times slower though

dawn vine
#

I ran a new test with ff_ratio=1 and more attention heads instead of more layers and it did the best so far

dawn vine
#

my next question is: if you always use ff_ratio=1, can you skip the hidden dimension entirely and just run your activation function on the input and project to output via a single linear layer?

fallen spear
dawn vine
#

my testing agrees with you 🙂

fallen spear
#

... what is the most convoluted activation function

#

i think the answer is "attention", actually

fallen spear
dawn vine
fallen spear
#

which is very silly

#

and also slow

dawn vine
#

haha

#

well, being rwkv-like, this has all included a gating on the FFN as well

#

so it sort of does

fallen spear
#

and throwing a qkv in there would do that

#

but i think we're verging on trolling <_<

dawn vine
#

trolling ourselves lol

#

it's not a perfect test bc of parameter count mismatch but removing the gate in exchange for another layer was not a good trade

#

(just tried that)

#

my best revised hypothesis so far, from my testing results, is that ff_ratio=1 is a good trade for more heads, but that it may or may not be actually worth it in terms of overall training time slowdown

#

that slowdown will increase as you increase context length

#

whereas a larger hidden_dim is unrelated to context length

#

as a result, adding layers could be a more worthwhile trade, but might be more likely to cause you vram problems

fallen spear
#

i’m gonna do a 1:1 pythia run at ratio 2 and 1 when i can

#

i might only go to a couple checkpoints on each

dawn vine
fallen spear
#

#research message

#

it was spy

#

there are other confounding variables maybe idk

dawn vine
#

yeah not quite what I was thinking of

fallen spear
dawn vine
#

but you're right, it wouldn't apply for rwkv

feral cairn
#

So what's the upshot of the current tests @dawn vine @fallen spear

dawn vine
#

so where exactly the tradeoff becomes worthwhile is unclear, and probably dependent on a variety of factors (e.g. for attention, more heads will cost a lot for large context lengths, and for many layers vram or even convergence may become a problem)

feral cairn
dawn vine
fallen spear
#

I’m looking at a straight pythia when I can set up an env which did not occur last night

dawn vine
#

was just trying to get a feel for the outcomes so I used whatever I had handy, but ended up doing a bunch of tests once it got interesting 🤣

#

so ymmv with more traditional models or other ffn styles

fallen spear
#

"larger d_ff does literally nothing" doesn't seem like it's doing well as a hypothesis although I'd be curious how it behaves at higher d_model especially, "putting the d_ff params somewhere else is usually better" looks pretty good to me rn

#

although, honestly: the fact that ablating d_ff only hurts it a little bit in this case is still an interesting result

#

most ablations hurt ... more than that, I think?

#

on a completed result if it's got a flat and steady ppl gap like it does in the preliminary test I'd want to calculate how much more training would close it since it is ablating approx 75% of params and that is a plausibly worthwhile tradeoff

dawn vine
fallen spear
dawn vine
#

this thing about the rank being only 10 is still hurting my brain

soft bobcat
#

this is a sparsity idea inspired by the hidden dimension expansion. is there a way to be intermediate between (hidden=input dimension) and (hidden = 4x input dimension)? sure, let's suppose the hidden dimension is 4x the input dimension.

divide the hidden dimensions into 4 blocks, and label them 00 01 10 11.
divide the input dimensions into two sections: A and B.

section A outputs to 11 and 01. section B outputs to 11 and 10. block 00 receives no inputs; it's just there for notational convenience. as a consequence, each neuron outputs to half the hidden dimensions. the output connections are left fully connected for now.

#

here's the intuition: the binary code of the hidden dimensions says "what coordinates I will accept input from". 11 takes input from everything, so it's fully connected. 00 takes input from nothing, so it's not connected. the 11 block reproduces the ff_ratio=1 topology. the other dimensions are for passing through parameters without being able to process them, if that's what the extra dimensions are doing.

in the original transformer model, the purpose of those extra dimensions is probably to expand the number of nonlinearities that can be created. but maybe some of those nonlinearities don't require the full flexibility of all N dimensions; that is what the 10 and 01 blocks are for, as they do partial processing.

the division into A and B is hacky and mathematically unpleasant, so it won't do much more than passing through parameters.

the division can be extended, such as into 3-digit binary codes + sections ABC. in the limit, the construction actually makes sense both mathematically and computationally, since hidden dimensions correspond to tuples of input dimensions. but we're nowhere close to that.
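
A sketch of the connectivity described above, as a 0/1 mask over the up projection. It assumes each of the four blocks has as many hidden units as there are inputs, and that sections A and B are simply the first and second halves of the input; both choices are illustrative, not from the thread.

```python
def block_mask(d_in):
    """Connectivity mask for hidden = 4 * d_in, blocks ordered 00, 01, 10, 11.

    mask[h][i] is 1 when hidden unit h receives input i.
    Section A is the first half of the inputs, section B the second half.
    """
    half = d_in // 2
    section_a = set(range(half))
    section_b = set(range(half, d_in))
    accepts = {
        "00": set(),                      # no inputs, notational only
        "01": section_a,                  # partial: A only
        "10": section_b,                  # partial: B only
        "11": section_a | section_b,      # fully connected, the ff_ratio=1 part
    }
    mask = []
    for code in ("00", "01", "10", "11"):
        for _ in range(d_in):
            mask.append([1 if i in accepts[code] else 0 for i in range(d_in)])
    return mask

mask = block_mask(8)
connections = sum(map(sum, mask))                         # nonzero up-projection weights
fan_out = [sum(row[i] for row in mask) for i in range(8)] # hidden units fed by each input
print(connections, fan_out)
```

With this layout the up projection keeps half of the weights of a dense 4x expansion, and every input feeds exactly half of the hidden units.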

dawn vine
#

my idea was more like this:
I adjusted a ff_ratio=1 FFN to learn additive differences i.e. instead of w_out(w_in(x)) it's w_out(x + w_in(x))
that still worked pretty good!
so then lora it i.e. w_out(x + w_in_b(w_in_a(x))) where w_in_a and w_in_b bring it down to 1/4 size and back

#

or maybe skip the w_out entirely and run the activation function in the middle of the lora sandwich

#

so far the lora part hasn't panned out very well 🙂

#

but I may have bad initialization
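
For reference, a minimal pure-Python sketch of the two forms written above. The activation is left out because the expressions as typed omit it, and the zero init for w_in_b is an assumption borrowed from common LoRA practice, not something stated in the thread.

```python
import random

def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

def additive_ffn(x, w_in, w_out):
    """w_out(x + w_in(x)), as written in the message (activation omitted)."""
    h = matvec(w_in, x)
    return matvec(w_out, [xi + hi for xi, hi in zip(x, h)])

def lora_additive_ffn(x, w_in_a, w_in_b, w_out):
    """w_out(x + w_in_b(w_in_a(x))) with a d/4 bottleneck."""
    h = matvec(w_in_b, matvec(w_in_a, x))
    return matvec(w_out, [xi + hi for xi, hi in zip(x, h)])

rng = random.Random(0)
d, r = 8, 2                                   # r = d // 4
x = [rng.uniform(-1, 1) for _ in range(d)]
w_in = rand_matrix(d, d, rng)
w_out = rand_matrix(d, d, rng)
w_in_a = rand_matrix(r, d, rng)               # down to d/4
w_in_b = [[0.0] * r for _ in range(d)]        # zero init (assumed), back up to d

y_full = additive_ffn(x, w_in, w_out)
y_lora = lora_additive_ffn(x, w_in_a, w_in_b, w_out)
full_in_params = d * d
lora_in_params = 2 * d * r
print(full_in_params, lora_in_params)
```

With the zero init the low-rank branch starts as a no-op, so the block begins as plain w_out(x), and the input side holds half as many weights as the full d x d version.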

soft bobcat
# dawn vine i like the direction you're going, but I'm a little unclear on what blocks 01 an...

suppose there was no nonlinearity. then ff_ratio = 1 would be all that is necessary. since ff_ratio > 1 is creating benefits, that means there needs to be nonlinearities in more hyperplanes than there are input dimensions. but consider what these nonlinearities look like: most likely some are very nonlinear, and some are mostly linear. a 01 block has a reduced ability to express a nonlinearity - so it's more appropriate for a mostly linear axis.

#

so if we look at it from an expressibility point of view, if we have a mix of nonlinear and linear things, we put the linear things in the 10 and 01 buckets, which frees up the 11 buckets to express the nonlinear things. the 10/01 buckets are cheap and less powerful, good for linear things

#

and since we're in the high-dimensional regime, being expressible could mean that it's trainable too

dawn vine
soft bobcat
#

activation function on all three

#

consider what an activation function on 10 can do: its hyperplane can only handle dimensions from B. that kind of sucks. but that may be good enough for some nonlinearities, and the really hard nonlinearities train their way into being in 11 instead

fallen spear
#

the final result for models might generally be low rank but gradients are not necessarily of low rank

shut badge
#

So the original work here seems to claim that for a fixed parameter count you want more depth, rather than ff expansion. I was excited to try using a deeper but "thinner" model, but then I realized that the deeper + thinner model is going to be insanely more computationally expensive. Having very wide MLPs is good because it's very efficient and very parallelizable. So anyways, reducing MLP width a lot didn't even afford enough compute to add one more Attn+MLP while being compute equivalent

#

I really dislike "parameter equivalent" comparisons that ignore the compute requirement. No free lunch here.

#

If anything, my benchmarking lends a lot of credence to PaLM 1's very high 8x expansion.

fallen spear
#

that’d be an interesting experiment too

#

but: parameter equivalent ignores compute (and anything else weird you do while doing it), compute equivalent ignores inference time (and maybe something else I am missing). ablation is nice because it lets you check exact amount of degradation along some axis

#

... actually that's an interesting perspective, computing isoflops against their param-matched thing, presumably there's a minimum for perf/flops on their graph somewhere

#

i guess your objection isn’t compute so much as wall clock time; the projection adds little or no time because it’s nicely parallel and without it the ff is a lot thinner than attn so it’s a relative dead spot in the utilization

fallen spear
#

current status

dawn vine
fallen spear
fallen spear
#

but: still leaves the ff pass as an underutilized block of time in the gpu

#

i am increasingly convinced to find more ways to jam parameters into the activation function

dawn vine
fallen spear
#

i am definitely still woolgathering for what function makes more sense though

#

basically geglu/swiglu replace an activation function of one variable with an activation function of two, if you assume up projection is the worst option you should be able to go from one big matrix to doing an element wise multiplication of four matrices with (virtually any) nonlinearity applied to them and beat the up projection as an operation

#

i am trying to consider only one index in the activation and think of functions of three or four scalar variables that have any obviously desirable or intuitive properties like how geglu/swiglu are “gating”

dawn vine
#

Then why not just use two ff_ratio=1 ffn w gates in a row

fallen spear
dawn vine
#

I guess that's still fewer parameters bc gate is only 1 way

fallen spear
#

alternatively do (gate + gate + gate) * nongate

#

my brain wants there to be an elegant analytic thing that makes some kind of sense to eat up params

fallen spear
#

other possibilities: maxout, multiply nonlinear gates together

feral cairn
#

Have y'all read the muP / hyperparameter transfer work? I'm wondering if it implies we don't need to search through the whole search space for every model size

soft bobcat
dawn vine
fallen spear
fallen spear
fallen spear
#

but: given, as a confounding variable, that they are also varying depth: that doesn’t appear to be the case for existing work

#

it is possible it’s only scale dependent at smaller scales

fallen spear
fallen spear
#

it is so simple

soft bobcat
#

mixture of 1D activation functions + bilinear GLU = can express any product of usual functions, as well as their sum. such as polynomial approximation, or x^2 e^-x, etc

#

plus function composition, like e^(-x^2), composing x->e^x and x->x^2

fallen spear
soft bobcat
#

you need two layers in this case

#

honestly, it's really like a universal function approximator. take any expression, decompose it into single operations (such as by Reverse Polish Notation). then each layer computes one more term of the expression

fallen spear
#

i can see how to do it with successive layers, i'm a little more iffy because that adds depth

soft bobcat
#

yeah, no need to increase depth here. for a single layer, improvements should already exist

#

well, maybe. the benefit of having different activation functions in a layer is second order, which means the benefit is stronger in longer training runs than in shorter ones. but I wouldn't change the length of a training run just to inspect this change

#

here, what I mean by first order is "this neuron is outputting 5 but it should output 10", and then it moves from 5 to 10 linearly. second order means the change is through curvature rather than slope

fallen spear
#

That makes sense to me, it is sort of interesting which mathematical points are clear and which aren't

fallen spear
#

and: it would have been concurrent work with gpt-neox, not sure if there's a strong case for gpt-neox to use the initialization etc it does instead of this one

dawn vine
soft bobcat
# dawn vine To clarify, does gelu(linear(x)) * sigmoid(linear_gate(x)) meet this definition ...

here's an illustration of what I mean by second order: if you have points (0, 0), (1, 1), (2, 4), then a linear neuron can try to fit these points, and a quadratic neuron can fit them. the competition between these two neurons is second order. it takes time to train weight away from the linear neuron into the quadratic neuron. being second order in this context is a bad thing; it just means training is slower.

my proposal is to have a mixture of existing well-performing activation functions within a single layer (like ReLU and GLU), rather than to increase the dimension of an activation function.
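
The three-point illustration can be checked directly. This sketch fits the best straight line by ordinary least squares in exact arithmetic and compares its squared error with that of y = x^2; the helper name is made up.

```python
from fractions import Fraction

points = [(0, 0), (1, 1), (2, 4)]

def best_line(pts):
    """Ordinary least-squares slope and intercept, in exact arithmetic."""
    n = len(pts)
    mean_x = Fraction(sum(x for x, _ in pts), n)
    mean_y = Fraction(sum(y for _, y in pts), n)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

slope, intercept = best_line(points)
linear_sse = sum((y - (slope * x + intercept)) ** 2 for x, y in points)
quadratic_sse = sum((y - x * x) ** 2 for x, y in points)
print(slope, intercept, linear_sse, quadratic_sse)
```

The line gets most of the way there (slope 2, intercept -1/3, squared error 2/3), which is why a linear neuron can hold on to the weight for a while before the quadratic one, with zero error, takes over.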

dawn vine
soft bobcat
#

in terms of achieving diversity, yes. having two activation functions is better than one.

#

there is an unpleasant niggle there where multiplying two functions causes quadratic behavior, which causes gradient explosion (to a small extent) and hence should slow training down a bit. so gating may not be free. but tests seem to indicate it's worth it anyway

dawn vine
#

Yeah I've seen it always be worth it

#

But if two is good, is three better?

#

Because gating is already a somewhat standard FFN improvement and maybe it works for the reason you outlined

#

Also, people already reduce the ffn width to accommodate the gate's parameter increase

#

Which is exactly the kind of thing we are discussing

soft bobcat
#

a three-gated neuron could be better, but it's just speculation from my part. I don't have a mathematical understanding of how to balance the tiny gradient expansion issue with the improvement from diversity

dawn vine
#

Your point about gradient explosion is well taken

#

I like the idea of allowing the ffn to simulate a more complex function tho

#

Via say a fourth order polynomial

soft bobcat
#

conceptually, mixing one-dimension and two-dimension activation functions (especially bilinear rather than GeGLU) is very similar to having skip-connections

fallen spear
#

turns out i had corrupt files from git lfs and the loading script was crashing silently :P

#

going to do a manual parity check once current lfs is done since apparently lfs doesn't have a parity check???

fallen spear
#

uhhhh so i finally did the math correctly

#

if you do mxm instead of mx4m projection you save 6m^2 parameters

#

and then if you jam all those params into your up projection, your inner dimension is 6m, and you can then run a complicated nonlinearity on them to reduce the resulting state down to size m

#

giving you a d_ff to d_model of 6 if you count from before the nonlinearity

#

and 1 if you count after

#

at equiflops

#

i was wondering why palm reported a ratio of 6

dawn vine
#

im missing something, don't really understand what parameters are going where in that

#

how are the experiments going?

fallen spear
#

you take an input vector of size m, you run one m by 6m matrix or equivalently six separate m by m matrices

#

you perform (some nonlinearity that reduces size of vector by a factor of six) and then you apply another m by m matrix

#

you have d_ff/d_model of six if you count before the nonlinearity is applied and are at equal params to standard transformer attention
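
A weights-only tally of the layouts being compared, as a sketch. It assumes no biases, a parameter-free reduction step, and a standard block at ratio 4; the function names are made up.

```python
def standard_ff_params(m, ratio=4):
    """Up (m x ratio*m) plus down (ratio*m x m), weights only."""
    return 2 * ratio * m * m

def wide_up_ff_params(m, fan=6):
    """`fan` separate m x m up matrices, a parameter-free reduction, one m x m down."""
    return fan * m * m + m * m

m = 512
standard = standard_ff_params(m)              # 8 m^2
slim = standard_ff_params(m, ratio=1)         # 2 m^2
wide_up = wide_up_ff_params(m)                # 7 m^2
print(standard // (m * m), slim // (m * m), wide_up // (m * m))
```

By this count, going from ratio 4 to ratio 1 frees 6m^2 weights, and six m by m up matrices plus one m by m down come to 7m^2 against 8m^2 for the standard block, so how close to parity the layout lands depends on what else gets counted.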

#

palm reports a ratio of six and as far as i’m aware nobody knows what nonlinearity they use or why it’s six

#

so … gonna go with that theory

#

it’s sort of a natural extension of swiglu or geglu

shut badge
#

this kinda reminds me of the sigmoid gating in RWKV

fallen spear
#

realized you could also gate the residual connection

dawn vine
#

Only way it worked was t, 2-t

fallen spear
#

Can you break down how it was gated in more detail? Not sure I get it.

dawn vine
#

Sure.


t = residual_mix(x)
x = x * t + fn(x) * (2-t)
#

No guarantees it works in a more general setting, and at least one person tried it and claimed the gating ended up stuck at like .9 for all channels

#

But it consistently gave me a small loss benefit in training. Never looked into it enough to know if that was still true at validation or test time
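
A runnable sketch of the t, 2-t gating above on plain Python lists. The thread does not say what residual_mix is, so the per-channel affine map followed by a sigmoid scaled to (0, 2) is purely an assumption, as are the parameter names.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def gated_residual(x, fn, mix_weight, mix_bias):
    """x * t + fn(x) * (2 - t), elementwise, with t in (0, 2)."""
    fx = fn(x)
    out = []
    for xi, fi, w, b in zip(x, fx, mix_weight, mix_bias):
        t = 2.0 * sigmoid(w * xi + b)   # hypothetical residual_mix: per-channel gate
        out.append(xi * t + fi * (2.0 - t))
    return out

def toy_fn(v):
    """Stand-in for the sublayer (attention or FFN)."""
    return [math.tanh(vi) for vi in v]

x = [0.5, -1.0, 2.0]
zeros = [0.0, 0.0, 0.0]
y = gated_residual(x, toy_fn, zeros, zeros)
print(y)
```

With the gate parameters at zero, t is 1 on every channel and the block reduces to the ordinary x + fn(x) residual, so training starts from the standard behavior.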

dawn vine
fallen spear
#

(on the plus side i should have a really clean environment i can reuse whenever i climb out of my current hole)

fallen spear
#

todo: figure out how folks are doing their residual connections, I can use some of the literally six different matrices we have to play with to gate and scale it, probably with a swiglu or geglu. will have fundamentally different call signature inasmuch as normally the FeedForward is self-contained

shut badge
#

I can try some of these tomorrow!

fallen spear
dawn vine
fallen spear
#

the ones to beat are geglu and swiglu since they’re currently considered sota

#

my guess is going to be that several of these are better bc the params are better literally anywhere but in a down proj

#

(geglu and swiglu not represented in that gist)

fallen spear
#

writing all those out gave me more ideas

boreal moss
#

more than 3 parallel layers multiplied together makes it perform worse, adding any activation function anywhere makes it worse, I was testing on <1M parameter models and it was better than gelu by a large margin, doesn't look like it holds for bigger models

fallen spear
#

there is a basic problem where there are too many different things you can do so specificity is helpful

soft bobcat
#

I expect this to be significantly easier to train than YinLU: pooled = torch.sin(x_1) + torch.exp(-torch.pow(x_2, 2)) + torch.tanh(x_3) + torch.nn.functional.gelu(x_4) + x_5 * x_6. the reason why is that multiplying lots of random variables together will cause gradient imbalance (= gradient explosion, even if normalized by layer norm), since occasionally the multiplication will be very big and usually not so big.
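
A scalar sketch of that pooled expression using only the math module in place of the torch calls. The exact erf form of GELU is used, and splitting a length-6m vector into six contiguous slices is an assumption about how x_1 through x_6 are laid out.

```python
import math

def gelu(v):
    """Exact GELU: v * Phi(v), with Phi the standard normal CDF."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))

def pooled(x1, x2, x3, x4, x5, x6):
    """Sum of five terms over six pre-activations, as in the message above."""
    return (math.sin(x1) + math.exp(-x2 ** 2) + math.tanh(x3)
            + gelu(x4) + x5 * x6)

def pooled_layer(pre, m):
    """Reduce a length-6m pre-activation vector to length m, one slice per x_k."""
    chunks = [pre[k * m:(k + 1) * m] for k in range(6)]
    return [pooled(*vals) for vals in zip(*chunks)]

out = pooled_layer([0.0] * 12, 2)
print(out)
```

Only the x_5 * x_6 term multiplies two inputs; the other four are bounded or roughly linear in a single input, which is the property the message is pointing at.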

soft bobcat
#

of course, the reason why the original activation function is called "YinLU" is because Gyges deserves all the credit for the idea, so by Stigler's law, I got the title

fallen spear
#

i propose fourierlu: sum of sin(x_1*x_n) for n from 2 to 6

#

i cannot write it right this second
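
One possible reading of that definition, as a scalar sketch on a single group of six pre-activations. The function names and the choice to return gradients only for x_2 through x_6 are illustrative.

```python
import math

def fourierlu(xs):
    """Sum of sin(x_1 * x_n) for n = 2..6, on one group of six scalars."""
    x1, rest = xs[0], xs[1:6]
    return sum(math.sin(x1 * xn) for xn in rest)

def grad_wrt_rest(xs):
    """d/dx_n of the sum: x_1 * cos(x_1 * x_n), for n = 2..6."""
    x1 = xs[0]
    return [x1 * math.cos(x1 * xn) for xn in xs[1:6]]

group = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
print(fourierlu(group), grad_wrt_rest(group))
```

When x_1 is exactly zero the output and the gradients for x_2 through x_6 all vanish, although the gradient with respect to x_1 itself (the sum of x_n * cos(x_1 * x_n)) generally does not.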

fallen spear
#

i can at least remember that yinlu is “precisely that function that was transcribed verbatim from a message by kevin yin”

#

i have no idea which is which for the others

fallen spear
soft bobcat
#

yes, inspecting the activations and L2 weights of the final network is a good idea. in my original plan, which is just a mixture of different neurons inside a layer, I would dynamically change the proportions of the activation functions to reward the ones that did better for a problem. but since all 6 activation functions are summed together in your network, their proportions are fixed to 1/6, so it's all-or-nothing

#

note that it's the nonlinearity you care about inspecting, not exactly the zeroing out: what separates one activation from another is whether the inputs cause it to travel along a wide range and express a diversity of its outputs. if the inputs stay within a small domain (like [-0.1, 0.1] for cos(x)), the activation function isn't special; it looks just like a parabola over that stretch, just like any other activation function

fallen spear
fallen spear
#

i guess unless "collapse of activations to zero" is one of the collapse modes

soft bobcat
#

performance of fourierlu will probably suck though, for the same reason that sigmoid fell out of favor

#

there's an alternative Fourier formulation, where you take x_1 and x_2 as a complex number x_1 + i x_2, and same for x_3 and x_4, and x_5 and x_6. so it becomes a complex exponential e^(z_1 z_2) + e^(z_1 z_3). however, e^(x) may be a very crazy activation function, and it also produces two outputs instead of one. although exp is the more natural fourier structure, unless exp(x) behaves well (I doubt it), it will also not work

fallen spear
#

but then i think maybe it reduces to an inner product

fallen spear
soft bobcat
fallen spear
#

i think also you can rescale your linear layer for cases where you end up below or above 2pi?

soft bobcat
fallen spear
#

My concern would be numerical stability if any of the X end up very large, having perhaps gone through many cycles of pi

soft bobcat
#

with this set of weights and biases, you can't shift by 2 pi anywhere

#

sin((w_1 x_1 + b_1)(w_2 x_2 + b_2) + 2pi) gives the same value, but there isn't a way to merge the 2pi into any of the constants

#

whereas if we had sin((w_1 x_1 + b_1) + 2pi), we could write b = b_1 + 2pi and do a shift

fallen spear
#

huh, maybe it would need normalization of input then

#

“move normalization into the ff layer” was not on my bingo card

soft bobcat
#

I don't see how the input could get large either. it's a sum of many sines (all capped at 1), times a dense matrix. so the input x will only get as large as the terms in the dense matrix

#

oh wait, the input is from the attention layer, not another fourierlu layer

fallen spear
#

well, attn + linear

#

i would suspect that empirically with reasonable init and reasonable lr it will not be a problem

#

but it's hard to rule out completely

soft bobcat
#

my knowledge of norm functions is also pure guessing, and my intuition doesn't work

#

sometimes batch norm is needed, sometimes layer norm, etc

fallen spear
#

my other concern is zeroing out inputs, especially if x_1 manages to become zero

#

since sin(0) is 0

#

i dimly suspect slight nudge of noise might be a good idea

soft bobcat
#

you could change sin to cos. maybe it will make a difference, maybe not

fallen spear
#

i think same problem, at input 0 you have constant output and no gradient

soft bobcat
#

right, cos is even worse

#

so sin(x_1 x_2) is effectively bilinear when x_1 or x_2 is near 0

fallen spear
#

i think in some high level way they are the same problem and have the same solution, which I might be predisposed to see as gauss for fern reasons

soft bobcat
#

nah: sine is bilinear when x_1 or x_2 is near 0. but cosine is fixed to 1 when either x_1 or x_2 is near 0 and the non-zero variable has no effect. it's a quadratic vs linear problem
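The sine-versus-cosine point above can be checked numerically. This is a minimal sketch, assuming fourierlu means "sum of fn(x_1 * x_n) for n from 2 to 6" as proposed earlier in the thread; `fourier_pool` and `d_dx2` are hypothetical helper names, not code from the gist.

```python
import math

def fourier_pool(xs, fn=math.sin):
    """Pool six values into one: sum of fn(x_1 * x_n) for n = 2..6."""
    x1, rest = xs[0], xs[1:]
    return sum(fn(x1 * xn) for xn in rest)

def d_dx2(xs, fn, eps=1e-6):
    """Central-difference estimate of d(pool)/d(x_2)."""
    hi, lo = list(xs), list(xs)
    hi[1] += eps
    lo[1] -= eps
    return (fourier_pool(hi, fn) - fourier_pool(lo, fn)) / (2 * eps)

small = [0.01, 0.5, -0.3, 0.2, 0.7, -0.1]   # x_1 close to zero
dead = [0.0] + small[1:]                    # x_1 exactly zero

# sin: gradient w.r.t. x_2 is about x_1 (first order in x_1, the bilinear regime)
sin_grad = d_dx2(small, math.sin)
# cos: gradient w.r.t. x_2 is about -x_1**2 * x_2 (second order in x_1)
cos_grad = d_dx2(small, math.cos)

# With x_1 == 0 the other five inputs stop mattering at all.
sin_dead = fourier_pool(dead, math.sin)
cos_dead = fourier_pool(dead, math.cos)
```

With `x_1 = 0.01` the sine gradient is on the order of `x_1`, while the cosine gradient is on the order of `x_1**2`, which is the "quadratic vs linear" distinction.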

fallen spear
#

i read like four papers about trigonometric activations and cannot remember if they addressed this specific thing which seems very prominent when just looking at it mathematically

#

they're pretty opinionated about initializations so worth thinking of

soft bobcat
#

this paper illustrates how ReLU can only represent low-frequency details efficiently

fallen spear
#

i think all of these are interesting from InnerProductPooling on down

shut badge
#

Any prioritization of which ones you think are most interesting?

#

That's a lot of ideas!

#

I'll start with the first one, except i plan to use an expansion of 4*4/3 instead of 6 to keep it consistent with my baseline.

fallen spear
#

6 should be equiparams with a standard ratio of four

#

basically: the extra params are coming out of the down projection

#

i guess we could also push the extra params into the down projection

shut badge
#

my baseline is geglu which was equiparams with gelu using expansion of 4

#

oh, wait, yea I see 🙂

fallen spear
#

yes, equiparams with that in this regime should be 6 up and 1 down

soft bobcat
#

is there a linear layer before AveragePoolingWithGelu?

fallen spear
fallen spear
#

a significant fraction of the activations in the middle of the file are “why not”

#

and being unable to come up with why not

#

anything that multiplies in more than three things is almost certainly a bad idea

#

the max and average pool towards the beginning are almost certainly not good but might still beat geglu

soft bobcat
#

but this is an LLM; how could average pooling possibly help? adjacent neurons are not related to each other

fallen spear
#

my rationale is basically that geglu does not make any sense either and anything which pulls params away from the width of the down/up proj is probably an improvement by default

#

i don’t see a good, rational reason for avg pool to work

fallen spear
shut badge
#

I'll try YinLU, inner product pool, fourier pool, and one of the geglus first

fallen spear
#

since it’s additive instead of multiplicative it seems less likely to explode

shut badge
soft bobcat
#

average pooling does not change what is expressible; it only changes the training dynamics. whatever function is expressed with dense matrix->avg pool->GeLU can also be expressed by (dense matrix with avg pool baked in)->GeLU. similarly, you can reverse the transformation with (dense matrix with avg pool inverse)->avg pool->GeLU to produce just (dense matrix) -> GeLU. that means AveragePoolingWithGelu is effectively just a GeLU layer. in practice, the training will be different since it's imposing an inductive bias on training of "nearby neurons have some similarity", but that's not an inductive bias that is appropriate in the setting

#

I would also have to think about the training dynamics; it's possible that the gradient undoes the avg pool, but I have no intuition about that
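The "avg pool baked in" argument can be sketched in a few lines: pooling is linear, so it folds into the preceding weight matrix, and an elementwise GeLU applied afterwards sees identical inputs either way. This assumes non-overlapping windows of size `k` purely for illustration (the gist's pooling may differ); `matvec`, `avg_pool` and `fold_avg_pool` are hypothetical names.

```python
import random

def matvec(W, x):
    """Dense layer without bias: W is a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def avg_pool(h, k):
    """Non-overlapping average pooling with window k over a flat vector."""
    return [sum(h[i:i + k]) / k for i in range(0, len(h), k)]

def fold_avg_pool(W, k):
    """Average each group of k consecutive rows of W into a single row."""
    return [[sum(col) / k for col in zip(*W[i:i + k])]
            for i in range(0, len(W), k)]

random.seed(0)
d_in, d_hidden, k = 4, 6, 2
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_hidden)]
x = [random.gauss(0, 1) for _ in range(d_in)]

pooled_after = avg_pool(matvec(W, x), k)         # dense -> avg pool
pooled_folded = matvec(fold_avg_pool(W, k), x)   # dense with the pool baked in
max_gap = max(abs(a - b) for a, b in zip(pooled_after, pooled_folded))
```

The two paths agree up to floating-point rounding, so any difference between the architectures would have to come from training dynamics, not from what is expressible.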

fallen spear
shut badge
fallen spear
soft bobcat
#

up-projection does improve expressivity of the nonlinearity. for example, consider a 1D input, then a super-wide hidden dimension with ReLU activations. the hidden dimension lets you fit a piecewise linear function, and the number of kinks you're allowed in the piecewise linear fit is the dimension

#

or, consider Fourier activation. the larger the hidden dimension, the higher the coefficients in the sines can be, and hence the sharper the fit can be (because it can express higher frequencies)

fallen spear
#

i am convinced that avg pool is probably not even worth testing

#

but: you can have a piecewise linear function with (say) up to 4,000 kinks, does it benefit you if you spend 4x as much compute and can in theory fit one of 16,000 kinks?

#

or, more clearly: does it benefit you a lot?

soft bobcat
#

my intuition here is that training becomes a problem first. especially the asymptotics: fitting x^2 with a ton of kinks will be a big mess. it's simply not the appropriate activation

#

so even if you fit y = x^2 for x in [-100, 100] perfectly with 4000 kinks, you will never have the right asymptotic and then the fit will naturally become very bad at x=1000
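A rough sketch of the asymptotics point: a one-hidden-layer ReLU net on a 1D input is piecewise linear with one kink per hidden unit, so beyond the last kink it can only continue in a straight line. The construction below hand-builds the interpolant of x^2 with equally spaced kinks instead of training anything, which is an assumption made to keep the example deterministic; `relu_interpolant` is a hypothetical helper.

```python
def relu(z):
    return z if z > 0.0 else 0.0

def relu_interpolant(f, lo, hi, n_kinks):
    """Piecewise-linear interpolant of f on [lo, hi], written as a sum of ReLUs.

    Uses n_kinks equally spaced interior kinks, one hidden ReLU unit per kink.
    """
    n_seg = n_kinks + 1
    knots = [lo + (hi - lo) * i / n_seg for i in range(n_seg + 1)]
    slopes = [(f(knots[i + 1]) - f(knots[i])) / (knots[i + 1] - knots[i])
              for i in range(n_seg)]

    def g(x):
        y = f(lo) + slopes[0] * (x - lo)
        for i in range(1, n_seg):
            # each ReLU unit adds the change in slope at its kink
            y += (slopes[i] - slopes[i - 1]) * relu(x - knots[i])
        return y

    return g

def square(x):
    return x * x

g = relu_interpolant(square, -100.0, 100.0, 4000)
inside_err = abs(g(50.0) - square(50.0))   # tiny inside the fitted range
outside = g(1000.0)                        # linear extrapolation past x = 100
true_outside = square(1000.0)
```

Inside [-100, 100] the error is negligible, but at x = 1000 the extrapolated value falls far short of the true 1,000,000 because the slope is frozen at roughly 200.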

fallen spear
#

https://arxiv.org/abs/2002.05202v1 I read this in a good amount of detail and the indication seems to be that basically anything other than straight up and down projection is preferable

#

but: it limits itself to nonlinearities of two variables, from which we get the now-sota geglu/swiglu

soft bobcat
#

I also share that takeaway, but be careful with the statistics. the p-values are probably not even 0.05

fallen spear
#

i don't think we can even meaningfully calculate a p-value on them

#

kat specifically says that her experience has been that geglu beats swiglu even though that paper has swiglu win

soft bobcat
#

sure you can, there are two distributions: 1-variable and 2-variable activations. you can calculate standard deviations, and there's a statistical tool that lets you test if two sample distributions have different means

soft bobcat
#

so according to normal statistical standards, this paper doesn't meet the bar for significance. it's still a good paper because of the cost of the experiments

fallen spear
#

so to reiterate motivation: based on papers linked up top with a fair degree of apparent reproducibility, pulling params out of ff up/down and putting them into depth generally wins at scales of I think 50-million-ish params, undertested (imho) regimes are: simply ablating those params outright and checking what the impact is as scale increases, using weird pooling-ish activation functions that extend on the regime used by geglu/swiglu (as stuffed into that gist), pulling the d_ff params and using them to increase d_model, stuffing those params into extra attention heads, and (occurred-to-me-today): using some of those params to gate the residual connection

#

weird pooling functions are just kind of the most compelling because it's straightforward to test with and also it scratches the itch to do weird math

#

i guess: and if it works it presents the same "clear win" criteria that swiglu/geglu do where in principle you aren't trading off anything

#

also, at scale the up/down proj dominates model RAM footprint and it seems incredibly silly to me if they are not serving a very good purpose

#

the idea that llama is a 70b model and could profitably be a 20b model at negligible loss and there isn't meaningful empirical testing to say otherwise kind of offends me

fallen spear
#

also, the fact that palm has no published crunchy technical details but reports a d_ff/d_model of six makes me suspect they are doing something like this to do a more complex nonlinearity that ultimately reduces the activation size to d_model, because six is the ratio you'd get if you tried to do this and remain at equiparams

#

which i figured out ... a day or two ago, even though I'd been obsessing about this for some time

#

so i would actually guess that google internal research has already done this and found it to beat swiglu/geglu

#

i have resisted the urge to tag the google/deepmind folks in here to ask them

#

it seems like that would be impolite

soft bobcat
fallen spear
#

it is frustrating that there is no mathematically obvious way to do it, which puts us into the "there are really an infinite number of six-to-one functions" regime

fallen spear
fallen spear
#

like, one of the linked papers has them at ten by their measuring method

#

it seems bizarre that an m by 4m matrix, when trained, has an effective rank of ten if the projection to 4m is doing anything important

boreal moss
#
"ff1"
fc1 = nn.Linear(64, 256)
fc2 = nn.Linear(256, 64)
x = F.gelu(fc1(x))
x = fc2(x)


"ff2"
fc1 = nn.Linear(64, 128)
fc2 = nn.Linear(64, 128)
fc3 = nn.Linear(64, 128)
fc4 = nn.Linear(128, 64)
x = fc1(x) * fc2(x) * fc3(x)
x = fc4(x)


"ff3"
fc1 = nn.Linear(64, 128)
fc2 = nn.Linear(64, 128)
fc3 = nn.Linear(64, 128)
fc4 = nn.Linear(128, 64)
x = fc1(x) * fc2(x) * F.gelu(fc3(x))
x = fc4(x)
fallen spear
#

fc2 in ff1 looks backwards to me, i am guessing it is meant to be baseline gelu ff block?

soft bobcat
#

ff3 improving more than ff2 at the far end shows a behavior that should be pretty typical: the gelu breaks symmetry, so it has more expressivity, but training that expressivity is second-order rather than first-order. so the benefit from symmetry breakage appears only after a longer period of training

fallen spear
#

that gives me an itch to put gelu gates on the fourier pool

soft bobcat
#

I also notice from this chart that all the training samples are loaded in the same order for each training run, which is ok

boreal moss
#

yes

#

also ff2 and ff3 curves are a little more smooth than ff1

soft bobcat
boreal moss
#

I see it, but to be sure I would have to actually calculate it 🤣

fallen spear
#

i added a gated fourier pool to the gist just for fun

#

could also put the gate around the entire thing though

boreal moss
#

this is a very special case with small ffns: it was a 1D causal convolutional network with 8 layers, and the task was character-level next-token prediction on tiny stories

fallen spear
boreal moss
#

but my point is that you all should try those multiplicative ffns without any additional activation functions

fallen spear
#

with the exception of outright max/avg pooling layers i am kind of agnostic about which of these functions make any sense, i think it mostly makes sense to test those that are analogous to existing workable functions and/or that have good diversity with each other

#

"no actual activation function" definitely fits the bill

#

but: there's a whole body of stuff about how small scale stuff benefits from modifications that are neutral or bad at higher scale

boreal moss
#

yes

#

from my experience, at very small scale more expressive activation functions can make a huge difference, on the order of adding 25% more parameters, while at larger scale the same modifications make almost no difference, as if big models have no use for more expressive activation functions

fallen spear
#

the currently-sota swiglu/geglu stuff is, as has been pointed out, relatively statistically marginal

shut badge
#

Alright I had to redo the baseline training but now I've started the first exp, you can see the progress at https://wandb.ai/jonm/MLPs

#

Since avg pool looks like it's not going to do as well, will probably cut it off early.

soft bobcat
#

yes, there's no reason to believe that avg pool will do well

fallen spear
#

i am surprised it works at all

#

... also surprised that yinlu looks so nice so far

shut badge
#

Should have implemented eval perplexity though, token accuracy is probably a bad metric

#

either way the val loss should be instructive

fallen spear
#

was gonna say, unless you are planning on running it rather long and it's rather big they shouldn't differ a lot

#

trajectories tend to be the same unless data is so small you can overfit or training is so good you grok, I think?

shut badge
#

Yeah, I mean, sometimes I have seen things slowly converge to a better result, so I prefer to wait a bit longer. But these are fairly short runs overall as far as typical llm training goes

#

Will try fourier next

fallen spear
#

fourier is my special boy and the one i will be the most sad about if it doesn't work

#

i have no good reason to believe it will work, and this feeling is entirely irrational

soft bobcat
#

YinLU2 starts at really high loss. I'm pretty sure the reason is the torch.exp(-torch.pow(x_2, 2)) term, which is 1 at x=0

soft bobcat
#

I think that means that y = 0 must be true when x = 0. either the NN intrinsically wants this property, or initialization wants this property. so YinLU2 is bad, and it would make more sense to use "pooled = torch.sin(x_1) + x_2 + torch.tanh(x_3) + torch.nn.functional.gelu(x_4) + x_5 * x_6", where the Gaussian is replaced by a simple no-activation passthrough

#

alternatively, torch.exp(-torch.pow(x_2, 2)) - 1 satisfies y = 0 when x = 0

fallen spear
#

yinlu’s 3 and 4 are born

soft bobcat
#

one interesting result from the tests is how well YinLU 1 trained in the initial stages. it looks like layer norm lets you get away with quite a bit of multiplication

fallen spear
#

tbh enough of them are doing well enough that i suspect we’re not getting stellar signal, which is a positive result but doesn’t narrow our options a ton

soft bobcat
#

which graph are you focusing on? I'm using loss/val

fallen spear
#

assuming my slight colorblindness doesn’t hurt me too much here, we get: baseline, trigeglu, inner, yinlu2

#

that yinlu2 has such an obvious improvement and is potentially good for analysis makes it nicest prospect at the moment i think

#

i will ignore this and spend all my time trying to figure out how to make sin activation happen

soft bobcat
fallen spear
#

huh. it does. wild

#

i think it’s the only one that stacks products that high, no?

soft bobcat
#

of the ones tested, yes

fallen spear
#

my guess would be that it’s either one of the specific activations or the act of doing so many products, the diversity of functions is neat but does not seem like it can possibly be optimal

soft bobcat
#

it can't be a specific activation; activations can only suck but they can't be amazing

fallen spear
#

then, at least early, the cumulative product is very good

#

a very silly result

#

i am starting to wonder if there is a level of ridiculousness that cannot possibly be a good idea

#

sum of power set of all products is clearly too ridiculous

#

in principle it needs to be an O(n) op

soft bobcat
#

power set of products of a, b, c is just (a+1)(b+1)(c+1)

fallen spear
#

… someday I will be good at math

#

So we can actually just do that

soft bobcat
#

learning from the Gaussian issue, the correct activations should be instead "(a+1)(b+1)(c+1)-1" and "abc", where abc is from trombocyt

#

the difference between these two is that the first formula (with 3 inputs) is linear when a, b, c are near 0, and the second formula is cubic
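The identity behind this can be verified by brute force. A small sketch, with hypothetical function names: the sum of the products of every subset of the inputs (counting the empty subset as 1) equals the product of (x + 1), which turns an exponential-size sum into an O(n) op.

```python
import math
from itertools import combinations

def sum_of_subset_products(xs):
    """Sum of prod(S) over every subset S of xs; the empty subset contributes 1."""
    return sum(math.prod(s)
               for r in range(len(xs) + 1)
               for s in combinations(xs, r))

def power_product(xs):
    """The same quantity in O(n): the product of (x + 1)."""
    return math.prod(x + 1 for x in xs)

a, b, c = 0.3, -1.2, 2.0
brute = sum_of_subset_products([a, b, c])
closed = power_product([a, b, c])

# Subtracting 1 removes the empty-subset term, so the output is 0 at the origin.
at_zero = power_product([0.0, 0.0, 0.0]) - 1
```

Expanding (a+1)(b+1)(c+1) - 1 gives a + b + c + ab + ac + bc + abc, so near the origin the linear terms dominate, whereas abc alone is cubic there.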

soft bobcat
boreal moss
#
fc1 = nn.Linear(64, 64)
fc2 = nn.Linear(64, 64)
fc3 = nn.Linear(64, 64)
fc4 = nn.Linear(64, 64)
fcout = nn.Linear(256, 64)

x1 = fc1(x)*fc2(x)*fc3(x)
x2 = fc1(x)*fc2(x)*fc4(x)
x3 = fc1(x)*fc3(x)*fc4(x)
x4 = fc2(x)*fc3(x)*fc4(x)
x = fcout(torch.cat((x1, x2, x3, x4), dim=-1))

this weirdo is just a little better in small scale than "tri" linear

fallen spear
#

i remain somewhat vexed that there is no obviously correct way to pile together params in this way

#

i guess i can find another good defense of stacking as many products together as possible

#

matmul already does addition

#

any function you can represent by addition is already well represented

fallen spear
shut badge
#

Generally seems some stuff is about as good as geglu. This is a strong baseline already (already better than vanilla GPT2 by a good margin), so not bad!

shut badge
fallen spear
# shut badge yes

the first gist is too long so I am stuffing my extra brainworms into another one

#

it will have that and two more yinlus

#

and a power-product thing and two more fouriers

#

because ??? reasons

shut badge
#

If there's anything from the original list i should try as well lmk

fallen spear
#

i feel like the trigeglu and inner product are trying to tell me something but i don't know what

#

the other geglu derivatives might give us some clue about what is working and why

#

and, I guess, product pool, let me check to make sure I wrote it reasonably

fallen spear
shut badge
#

Seems the loss still starts pretty high with YinLU3

fallen spear
soft bobcat
#

right, the failure of this activation function isn't because of f(0) = 1, it's something else

fallen spear
soft bobcat
#

this activation function is just bad somehow

shut badge
#

sin, tanh, and gelu all have different output ranges, does it make sense to add them together?

soft bobcat
#

it does, but my guess is that sin and tanh are so bad that they don't make sense even in a mixture

shut badge
#

er, well sin/tanh and gelu do

soft bobcat
#

so forget about YinLU4, it's also almost certainly terrible

fallen spear
#

i suspect baselining against product pool makes sense since it just forgoes the actual functions and does multiplication

#

... and powerproductpool

#

"eliminate effects besides the product operation", more or less

soft bobcat
fallen spear
#

we could maybe make a later note to ablate it to verify which of the components is harming it

#

on the "activation functions can't be awesome, they can only suck" theory

shut badge
#

This seems like something that could be done through neural architecture search (not that I have the compute resources for that or anything)

fallen spear
#

... hardmaru did something like that

#

I am coming at this from the "what sort of seems to make sense to me mathematically" direction but that is possibly the wrong direction

shut badge
#

perhaps i should try this too as another baseline

fallen spear
#

five trainable parameters is a good number for us

soft bobcat
#

them freezing the UAF parameters in the middle is very strange

#

"This is done to reduce the over-fitting of the model and to prevent training instability", which means they saw training instability. why would this instability occur?

fallen spear
#

what are they actually training and does it even have normalization on it

#

NNs are naturally unstable, instability in a sandbox problem tells us very little

#

the VGG 8 layer CNN

#

yeah

#

don't worry about their experiments imho, their math looks incredibly clever though

#

it worked once in a tiny sandbox and the math in principle approximates any activation, good enough for me

#

... grad students at uvic

shut badge
#
        act = (
            torch.log(1 + torch.exp(x_1 * (x + x_2) * x_3 * x**2))
            - torch.log(1 + torch.exp(x_4 * (x - x_2)))
            + x_5 * x_6
        )
fallen spear
#

i think you can just use the sixth param as the input to the activation

shut badge
#

o yea

fallen spear
#

but i do not yet understand their math

fallen spear
#

"we did something incredibly clever on a zero resource budget" thank god for resource constraints thank you so much

#

yeah, this is a much smarter approach to the problem than "make up equations and see if one of them eventually sticks"

shut badge
#

And yeah no need to worry about training stability I think, pre-norm transformers are incredibly stable at this scale

#

famous last words though

fallen spear
#

i believe firmly in jinxing myself as hard as possible any time i feel the impulse

#

are you really living if everything you say doesn't sound like famous last words

shut badge
#

another note is that all the linear layers are initialized with orthogonal matrices. I could imagine with some of these activation functions that you might want to initialize each (dim, dim) chunk of the (dim, dim*6) matrix differently depending on how each of the 6 components is used in the activation

fallen spear
#

every time i contemplate the possible effects of initialization on these oddball activations it makes me twitch

#

is a good future todo though

fallen spear
#

i like waves okay

#

... it is possible some of the failed experiments can be salvaged with initializations specific to them but there is no strong reason to choose any of them especially to do this with

fallen spear
#

it looks like it died

#

which is a shame, because it is so clever

#

if there are any of these worth salvaging, that one is worth salvaging

#

actually I think that's a NaN which is an odd result, we might need to add an epsilon to something

shut badge
#

lol, so that UAF did actually diverge. definitely jinxed it

fallen spear
#

i think i see at least part of the issue

#

taking x_1 as the input, should be:

act = torch.log(1 + torch.exp(x_2*(x_1 + x_3) + x_3*torch.pow(x_1, 2))) - torch.log(1 + torch.exp(x_4*(x_1-x_3))) + x_6

#

had an extra multiplication in the first exponent which should have been an addition, may have made it prone to diverging

#

or maybe it's diverging anyway because exponents be like that

#

who can say
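One likely source of the NaN, independent of the extra multiplication, is that `log(1 + exp(z))` overflows once `z` is large: in tensor code `exp` returns inf and the difference of two infs is NaN. Below is a scalar sketch of a numerically stable form. It assumes the activation has the shape "softplus of a quadratic, minus softplus of a linear term, plus an offset" with five coefficients; the names `a, b, c, d, e` are placeholders and do not claim to match the x_2..x_6 mapping used above.

```python
import math

def softplus(z):
    """log(1 + exp(z)), computed without overflowing for large |z|."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def naive_softplus(z):
    """Direct transcription; overflows once z exceeds roughly 709."""
    return math.log(1 + math.exp(z))

def uaf(x, a, b, c, d, e):
    """Assumed form: softplus(a*(x + b) + c*x**2) - softplus(d*(x - b)) + e."""
    return softplus(a * (x + b) + c * x * x) - softplus(d * (x - b)) + e

# The stable version stays finite where the naive one fails.
big = uaf(100.0, 1.0, 0.0, 1.0, 1.0, 0.0)

try:
    naive_softplus(1000.0)
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True
```

For tensors, `torch.nn.functional.softplus` plays the role of `softplus` here, so no hand-tuned epsilon should be needed for this particular failure mode.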

#

not entirely clear it makes sense to have a trainable activation function per index instead of a trainable activation function for the entire layer

#

but 🤷

dawn vine
#

Lol so complicated! Why not just make it resemble a 4 term polynomial like out(a(x)v(x) + b(x)v(x)^2 + c(x)v(x)^3 + d(x)v(x)^4) where abcdv and out are your six nn.linears

boreal moss
#

I was playing with weird parametric activation functions, and performance was highly dependent on initialization. those were non-monotonic functions, and they just got stuck if, on the path from the initial parameters to the optimal parameters, the function had to change its number of minima and maxima

fallen spear
#

it’s not great

#

worth noting because i am sort of stumped: our constraints are total trainable params of 16*d_model^2 and two sequential matmuls + i guess some wiggle time equivalent to elementwise operations, it doesn’t necessarily need to be strictly a trainable linear up projection to 6 times as large then dim reducing nonlinearity to 1 then matmul

#

i am sort of thinking of doing the up projection first non-trainably and then down-proj

soft bobcat
#

has PReLU ever worked in real life? (i.e. in practical models)

dawn vine
fallen spear
#

worth testing

dawn vine
#

another option is to vary the activation function by channel/segment, which can give almost the identical polynomial result with way less work

fallen spear
dawn vine
#

the result is that the initial proj_in chooses what gets treated with which exponent

#

and the final proj_out can simulate the addition of the components of the polynomial

#

you can even vary the percentage dedicated to each exponent (including exponent zero)

#
c = self.coefficients # just some trainable parameters, these don't need to vary based on x
v = Wvariables(x)
y = cat([c[0], c[1]*v, c[2]*(v**2), c[3]*(v**3), c[4]*(v**4) etc. ])
return Wout(y)
fallen spear
#

oh i get it

#

that’s clever

dawn vine
#

probably way better usage of rank

fallen spear
#

i guess: in principle it doesn’t matter at all if we have a trainable vector in the layer

#

because it is so small compared to a linear layer

dawn vine
#

well im just trying to let it train in whatever function approximators it likes using as few parameters as possible

fallen spear
#

sure, but what is v coming from?

#

my first impulse is “just make it a trainable vector”

dawn vine
#

yes those W are trainable weights

#

who knows what it wants to do hehe

dawn vine
#

i changed it a bunch above

#

basic idea is we don't need to spend 4*n^2 just generating coefficients for polynomials, since each set of possible coefficients just represents a different function

#

actually, they don't have to come from x at all

#

so uh isn't this just essentially the simplest possible trainable activation function lol

fallen spear
#

possibly

fallen spear
#

I’m out for most of the next week probably fwiw

exotic musk
exotic musk
#

oh, it seems a long time! sorry.

exotic musk
dawn vine
#

will let ya know in a few minutes hopefully i'll have some results... just was too busy to try til just now

dawn vine
#

saw some folks using that as a replacement for relu

#

seems to work slightly worse than my existing activation but broadly similar

#

this is the code I tried:

class LearnedPolynomial(nn.Module):
    def __init__(self, dim:int):
        super().__init__()
        self.c0 = nn.Parameter(torch.zeros(dim))
        self.c1 = nn.Parameter(torch.ones(dim))
        self.c2 = nn.Parameter(torch.ones(dim))
        self.c3 = nn.Parameter(torch.zeros(dim))
        self.c4 = nn.Parameter(torch.zeros(dim))
    def forward(self, x):
        y = self.c0 + self.c1*x + self.c2*(x**2) + self.c3*(x**3) + self.c4*(x**4)
        return y
soft bobcat
#

typo on x**3 next to c4

dawn vine
#

good catch

#

rerunning with that bugfix

#

first test was almost indistinguishable from my 'standard' rwkv-style channel mix relu^2 activation (this is on a model using traditional attention tho)

soft bobcat
#

my guess is y = GLUGeLU(old y formula inside) will do a bit better. the logic is to create a flattening at the negative end. but my idea could be nonsense

dawn vine
#

the concat?

soft bobcat
dawn vine
#

ah

#

WOW its winning now that i fixed the bug!

#

good catch indeed

#

cant believe this worked, even limitedly

soft bobcat
#

another possible reason for GeLU: it increases expressibility by one degree of freedom. for the raw polynomial, if you scale all the input weights and decrease all the output weights, then the overall net is the same. so some of the parameters are redundant.

caveat: ReLU has this same issue and everybody liked ReLU for a long time
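The ReLU caveat can be made concrete with a one-neuron toy. This sketch only shows the rescaling symmetry for ReLU and its absence for plain GELU; it does not settle whether wrapping the learned polynomial in GELU helps. `tiny_net` is a hypothetical helper.

```python
import math

def relu(z):
    return max(z, 0.0)

def gelu(z):
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))

def tiny_net(x, w_in, w_out, act):
    """One hidden unit: scale in, apply the activation, scale out."""
    return w_out * act(w_in * x)

s = 3.0
xs = [-2.0, -0.5, 0.5, 2.0]

# Scaling the input weight by s and the output weight by 1/s leaves a ReLU
# unit unchanged, so that direction in parameter space is redundant.
relu_gap = max(abs(tiny_net(x, 1.0, 1.0, relu) - tiny_net(x, s, 1.0 / s, relu))
               for x in xs)

# GELU is not positively homogeneous, so the same rescaling changes the output.
gelu_gap = max(abs(tiny_net(x, 1.0, 1.0, gelu) - tiny_net(x, s, 1.0 / s, gelu))
               for x in xs)
```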

#

crap, I meant gelu instead of glu: the gaussian error unit

exotic musk
#

@dawn vine Just replace the GeLU with GeLU(LearnedPolynomial(1)(\cdot)) in the transformer, train it simply, and let's have a look!

#

0,1,0,0,0 to make no change as initial results
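A scalar sketch of why that initialization matters: with coefficients 0,1,0,0,0 the polynomial is the identity, so GeLU(poly(x)) starts out as plain GeLU, whereas the c1 = c2 = 1 initialization in the class above does not. The helper names are hypothetical, and this is plain Python, not the nn.Module.

```python
import math

def gelu(x):
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def poly(x, coeffs):
    """c0 + c1*x + c2*x**2 + ..., evaluated with Horner's rule."""
    y = 0.0
    for c in reversed(coeffs):
        y = y * x + c
    return y

identity_init = [0.0, 1.0, 0.0, 0.0, 0.0]   # the 0,1,0,0,0 suggestion
ones_on_x2 = [0.0, 1.0, 1.0, 0.0, 0.0]      # c1 = c2 = 1, as in the class above

xs = [-2.0, -0.5, 0.0, 0.5, 2.0]
gap_identity = max(abs(gelu(poly(x, identity_init)) - gelu(x)) for x in xs)
gap_x2 = max(abs(gelu(poly(x, ones_on_x2)) - gelu(x)) for x in xs)
```

`gap_identity` is zero, so a run started this way is directly comparable to the GeLU baseline at step 0; `gap_x2` is large, which is one confound to keep in mind when reading the early curves.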

dawn vine
#

well this is all confounded by me using rwkv style ffn at the moment
i can run others ez, say llama2 or whatever, but i dont have the baselines already run

#

to be clear, the rwkv one works better in my experience

exotic musk
#

rwkv is OK. no significant difference, I think

dawn vine
#

well it normally uses relu^2 activation

#

anyway i can try gelu around it

#

it started out strong without gelu and ended up being almost identical to the classic relu^2 version over time

exotic musk
#

oh... relu^2 is relu(relu()), or max(0,x^2)? I have not seen it before

soft bobcat
#

max(0, x)^2

dawn vine
#

gelupoly winning, but i forgot to zero out the x^2 term

exotic musk
#

curious about the results, and the learned values from c0 to c4

dawn vine
#

so may be extremely hard to visualize

soft bobcat
#

it can still be useful to do statistics: if nothing else, it helps initialization

thick briar
#

this is fascinating 👀

dawn vine
#

the other idea here was that this kind of activation may allow us to ditch the expansion/contraction in the FFN

#

because it lets it model things that otherwise required that

thick briar
dawn vine
#

the expressive power is really why I thought this might be a useful way to reduce FFN size

soft bobcat
exotic musk
soft bobcat
dawn vine
#

exactly!

#

i gotta go to sleep, but I can pick this up again tomorrowish

exotic musk
#

good night

dawn vine
#

haha straight gelu won by a bit

#

gnite!

exotic musk
dawn vine
#

I did some followup experiments, nothing really new to report... but I did try a very interesting FFN that didn't do well:
(this is just a sketch of it)

    self.coefficients = nn.Parameter(torch.linspace(0,1,D))
    def forward(self, x : Tensor):
        return self.w_shrink(torch.cat([self.coefficients.expand_as(x), x, x**2, x**3], dim=-1))
#

the idea was along the lines of my original thought, which is that the projection should be able to mix various polynomials out of the ingredients in that concatenated set of learned coefficients and x raised to various powers

fallen spear
#

the only one in the gist that i have a lot of hope in presently is powerproductpool, maybe add a gelu gate to it

fallen spear
#

on reflection so I don't stuff this into research:

#

it's possible arithmetic intensity on the up projection is low if it triggers swaps so that's another axis of tradeoff

#

worth tracking, anyway

fallen spear
# fallen spear it's possible arithmetic intensity on the up projection is low if it triggers sw...

that is to say: it is possible that even though the feedforward looks, mathematically, like it is a low-memory operation compared to attention (which has three entire matrices), so your utilization will be closer to 100% if you make your feedforward stage wider, the up-projection to x4 will trigger cache misses and so eat up a bunch of time swapping stuff into and out of vram when an x1 would not

#

so you'd gain on wall-clock time substantially when gearing it lower
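One way to track this is a back-of-the-envelope arithmetic-intensity estimate (FLOPs per byte moved) for the projection. The sketch below uses an idealised model in which every operand is read or written exactly once, with 2-byte elements; the function name and the example sizes are assumptions, and real cache behaviour is hardware-specific and not captured here.

```python
def matmul_intensity(tokens, d_in, d_out, bytes_per_el=2):
    """FLOPs per byte for a (tokens x d_in) @ (d_in x d_out) matmul.

    Idealised: inputs, weights and outputs each cross the memory bus once.
    """
    flops = 2 * tokens * d_in * d_out
    bytes_moved = bytes_per_el * (tokens * d_in + d_in * d_out + tokens * d_out)
    return flops / bytes_moved

d_model = 4096

# Single-token decoding: weight traffic dominates and intensity is about 1.
decode_x4 = matmul_intensity(1, d_model, 4 * d_model)

# A 2048-token batch: compare an x1 and an x4 up-projection.
train_x1 = matmul_intensity(2048, d_model, d_model)
train_x4 = matmul_intensity(2048, d_model, 4 * d_model)
```

Under this simple model the x4 projection is not less intense than the x1 one at large batch, so if wall-clock time does favour the narrower layer, the cause would have to be something the model ignores, such as tiling or cache capacity. That is worth measuring directly.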

soft bobcat
#

I was thinking about effective rank, because if FF layers are undergoing rank collapse, might as well add -(effective rank) to the loss. but I think it's not effective at capturing rank collapse, because it's dominated by the single highest singular value.

#

if one singular value is bigger than all the others, the effective rank will be low. but I doubt that matters

#

the reason I became interested in rank collapse is this paper, which says attention-only transformers, without FF, collapse to rank 1: https://arxiv.org/abs/2103.03404

#

another paper tries to rearrange the attention and feedforwards, and I'm considering how the two papers interact

dawn vine
#

How about vs where they're the same unified layer, e.g. mamba

fallen spear
#

iirc there are a couple of ways to do it, none of them are actually svd

fallen spear
#

i confess to not having gone as deeply into that as i would like to have

soft bobcat
dawn vine
#

Mamba didn't invent it, but basically they expand 2x then use that like v then gate and contract
So the attention equivalent is done at 2x
But residual is still 1x
No separate ffn

soft bobcat
#

oh hmm, I read it wrong. I thought it was dividing by the first eigenvalue

fallen spear
fallen spear
soft bobcat
#

it appears erank just always gives low numbers. here's an example: I specify an infinite list of eigenvalues 1/x^2. that seems perfectly reasonable and our matrix isn't collapsed or anything

#

the sum of all these eigenvalues is pi^2/6, so I divide by this.

#

and the effective rank I get is 4.94

#

so this innocuous distribution of eigenvalues has very low effective rank

fallen spear
#

weird definition

soft bobcat
#

I actually like it other than the fact that it seems to have low informational value

#

it scales the eigenvalues so they sum to 1. then checks the Shannon entropy. all of these are natural operations

#

I think it's just really sensitive to the highest eigenvalues. because in my pi^2/6 example, the first eigenvalue, 1, takes 60% of the whole distribution
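
a minimal sketch of that definition (normalize the eigenvalues so they sum to 1, take the Shannon entropy), applied to the 1/x^2 spectrum above. this assumes the exponential-of-entropy form of effective rank; the function name `effective_rank` and the cutoff `n_terms` are made up for illustration, and the exact number depends on where the infinite list is cut off.

```python
import math

def effective_rank(eigenvalues):
    """exp of the Shannon entropy of the normalized eigenvalue distribution."""
    total = sum(eigenvalues)
    probs = [v / total for v in eigenvalues if v > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)

# truncated version of the 1/x^2 spectrum; n_terms is an arbitrary cutoff
n_terms = 100_000
spectrum = [1.0 / (k * k) for k in range(1, n_terms + 1)]

erank = effective_rank(spectrum)
top_share = spectrum[0] / sum(spectrum)  # share held by the largest eigenvalue

# for comparison: a flat spectrum of n equal eigenvalues has effective rank n
flat_erank = effective_rank([1.0] * 16)
```

with this cutoff the result lands near 5 and the top eigenvalue holds roughly 60% of the mass, while a flat 16-value spectrum comes out at 16.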

fallen spear
#

i will have to think about it, I am not entirely sure how much sense it makes to measure those things

soft bobcat
#

I decided it does make sense. if the input is a random vector, the effective rank should give you a good idea of how many ranks you need

#

so for a matrix with effective rank 10, using a 100-dimensional low-rank approximation should work ok

#

it's only if the input vector is non-random (such as if it avoids the eigenvector with the high eigenvalue) that effective rank breaks down

fallen spear
#

there should be some function which takes a matrix and its effective rank and gives back a low rank matrix that is a lossless approximation

soft bobcat
#

my guess is it's already been tested and didn't work, so now I'm thinking about why

fallen spear
#

would this work if we had a trainable eff_rank size matrix matmul'd into a static actual_rank size orthonormal matrix

#

(or: some function of eff_rank size)

soft bobcat
#

so even though the effective rank is small, the whole matrix matters

soft bobcat
#

this was the point of this channel

#

low effective rank suggests that what is needed is a very complex activation function, but not a large dimension

#

Smerky tried that. no improvements

fallen spear
#

oh, yes, that

#

i think it needs more trying

#

but also i am being picky about my env

soft bobcat
#

nanoGPT might work, but I haven't tried, it's on my backlog

fallen spear
#

i will probably hack and slash a few repos together into something i feel good about and that fits gracefully on my local

#

i would do pythia but trying to build it gave me hives

#

will probably mimic existing pythia model size/settings tho

dawn vine
#

oh btw I also just got a new model working that's relevant here bc it uses a variation on Gated Attention Unit (GAU) 2202.10447
which is a combination of FFN and Attention into a single layer with only a 2x expansion... uses fewer params so you can double the layer count

#

took some fiddling, but now I have it really performing well (versus equivalents w/ separate 3x expansion FFNs)

dawn vine
#

having some significant success even with 1.5x expansion with this method where you get to increase the number of layers

#

makes me wonder if multiple layers in a row of smaller expansion FFNs would have been a more effective use of parameters than traditional transformer alternation of wider ffn and atn

fallen spear
#

you have to do them sequentially though, no?

#

i ask because: equiflops, but increases depth of computation and therefore time, maybe

#

in general depth has always been better than width the question is whether there's a regime with no tradeoff

fallen spear
#

i think i am personally very convinced that parameters are better off literally anywhere but in the ff

thick briar
#

Have you seen eqbench? It's a new benchmark for emotional comprehension and intelligence. The Mixtral model, which otherwise matches or surpasses similarly sized models in benchmarks like MMLU, etc., performs barely better than the Mistral models 8x smaller than it

#

I hypothesize that this is due to the lack of depth. 32 layers is just not enough for complex emotional understanding.

#

As you said, depth is better than width in pretty much every way, and while deeper models are harder to train, they're also more parameter-efficient. I believe the next open-source models will be much deeper than the ones that we already have.

#

As an aside, another advantage of scaling depth instead of width is that it gives SSM/other efficient LM architectures a larger hidden state than they would have otherwise.

dawn vine
#

meaning that we can't get rid of channel mixing, but we can do it with a much less wasteful algorithm

#

that's my hypothesis for why moving parameters away from it has seemed helpful to date

fallen spear
#

i am not up to date on the semi-RNNs bouncing around, if any of their mixer steps can plausibly be done with a low parameter budget one of their functions works as what we're calling an "activation function"

#

and also a half baked theory: it is possible that the benefit from d_ff/d_model is highly sensitive to initialization. bad initializations will benefit more from the upscale, because it is more likely that some of the parameters are closer to a good initialization since there are more of them. a good initialization may benefit less

boreal moss
#

Most likely most of what ffn does is just memorizing stuff, so I doubt that in language models anything will be **much** better than MLP

fallen spear
#

it seems almost optimally inefficient to me

soft bobcat
#

if the nonlinearity is fixed as e^(i c x) for trainable weight c, then 4x up projection gets you 4x the spectrum. although a trainable nonlinearity could probably accomplish similar things

fallen spear
#

is that accurate

soft bobcat
#

yes. it's like the second half is a rotation, then a nonlinearity in the rotated direction

boreal moss
#

linear 1x and on that 4x shift/scale and nonlinearity

fallen spear
#

if your value is below zero so is its rescale

#

you get no information from knowing that x is 3 and 3x is 9

#

you should have the same number of linear regions

soft bobcat
#

I'm not sure what you mean by value being below 0, because it's all pre-activation. you can take as an example d_model = 1, d_FF = 2, ReLU activation. then you get two kinks. if d_FF = 1, you get one kink

boreal moss
#

Just like 4 different parametric activation functions per every neuron

#

In some cases it will be more efficient but it is some kind of inductive bias

fallen spear
#

no, I am thinking of the input

#

oh, i will correct

#

so if we assume x is a vector of breadth (concretely) 4, and our FF is a 16x4, 12 of the values of FF(x) are linearly dependent on the other 4

#

if G is relu, we should have the same number of linear regions as if FF was 4x4

#

because a given value of FF(x) will be above 0 if and only if all of its rescales are also

#

so H(x) is not more expressive, there is no set of things in H(x) with 16x4 FF that couldn't be represented with a 4x4

soft bobcat
#

are you considering FF to not have weights?

fallen spear
#

no, it has weights

#

it is a normal boring very vanilla FF

#

just: it has to be collinear because it is of rank 4

soft bobcat
#

ReLU(1+x), ReLU(2+x), ReLU(3+x)...

fallen spear
#

oh, you mean bias

boreal moss
#

And what happens when you add bias

soft bobcat
#

sorry, yes, I meant bias. but note that if one of the inputs is 1, it's equivalent to having bias

fallen spear
#

okay, bias means i am wrong. i do peripherally remember that people were removing bias from their networks though
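
a small sketch of the 1-D case from the example above (d_model = 1), where each hidden unit is ReLU(w*x + b) and can only switch at x = -b/w. the helper `kink_locations` is hypothetical, and this does not cover d_model > 1, where the boundaries are hyperplanes rather than points.

```python
def kink_locations(weights, biases, tol=1e-12):
    """Distinct x where some unit ReLU(w*x + b) switches on or off (1-D input)."""
    kinks = set()
    for w, b in zip(weights, biases):
        if abs(w) > tol:
            kinks.add(round(-b / w, 9))
    return sorted(kinks)

# d_model = 1, d_ff = 2, no bias: both units switch at x = 0
no_bias = kink_locations([1.0, 3.0], [0.0, 0.0])

# d_model = 1, d_ff = 2, with bias: ReLU(1 + x) and ReLU(2 + x)
with_bias = kink_locations([1.0, 1.0], [1.0, 2.0])

# widening without bias still leaves a single switching point
wide_no_bias = kink_locations([0.5, 1.0, 2.0, 4.0], [0.0] * 4)
```

without bias every unit is a rescale of x, so extra width adds no new switching points in 1-D; with bias each unit can put its kink somewhere different.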

boreal moss
#

Yes and it still works

soft bobcat
#

so bias and no bias are basically the same, if you allow the NN to optionally decide to fix one x_i to always 1

boreal moss
#

Weird

#

Oh right

#

You just need one source of constant value on the input of the network and fnns can emulate bias
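
A minimal sketch of that equivalence, with made-up numbers: folding the bias into an extra weight column and feeding a constant 1 as an extra input gives the same output as an explicit bias. The helper names are illustrative only.

```python
def linear(weights, bias, x):
    """y = W x + b for a list-of-rows weight matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def linear_no_bias(weights, x):
    """y = W x, no bias term."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

W = [[0.5, -1.0], [2.0, 0.25]]
b = [3.0, -4.0]
x = [1.5, -2.0]

# fold the bias into an extra weight column and append a constant 1 to the input
W_aug = [row + [bi] for row, bi in zip(W, b)]
x_aug = x + [1.0]

y_bias = linear(W, b, x)
y_const = linear_no_bias(W_aug, x_aug)
```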

fallen spear
#

i had a previous theory that it might make sense to reduce activation width and just give it a set of constants

#

this came specifically from the dettmers thing about outliers

#

where it basically used outlier weights as constants

#

but that is tangential; thank you for refining my mathematical intuition for this one

fallen spear
#

intuition was that if networks were doing a lot of work to create constants you should give them constants so they could not

boreal moss
#

That's called bias 🤣

fallen spear
#

trainable bias can in theory do this but in practice has to work to do it

#

when it is so simple that it is basically a noop

#

i don't know what portion of grad has to be oriented to creating a +50 or +100 weight in a well-trained llm but it has to be some significant fraction of it if it's perpetually pushing that value up and keeping it from going down

#

bias neurons like this didn't show huge gains in a previous era but previous eras didn't have well-trained networks spontaneously deciding a single index had to always be unbelievably high

boreal moss
#

Is scale of the value making a difference in context of optimizers like Adam?

soft bobcat
#

you can just relax the weight decay for biases if you need a huge one. if you init huge biases, they will be huge, but in the wrong direction

fallen spear
#

it can ignore them if it doesn't want them

#

it consumes a negligible portion of width, e.g. 17 values for float32

#

for clarity: this is a bias activation, not a bias that ever gets applied. it's a constant value that is always there

#

so every FF would go from eg 1024x1024 to 1024x1007 and then you'd concat in your biases

soft bobcat
#

so the middle neurons of the FF layer are like {H(x), 1, 2, 4, 8...}?

fallen spear
#

yes, and similarly for negative powers of two

#

this is completely tangential to the previous thing

#

just thinking about using inputs as constants reminded me of it

soft bobcat
#

1, 2, 4, 8 are all equivalent as inputs; all they change is scale of gradients and weight decay

fallen spear
#

specifically allowing linear combinations of them makes all numbers representable in the next output

soft bobcat
#

but there are weights multiplied to these biases. so 1w = 2 (w/2) = 4(w/4) = ...

fallen spear
#

yes but weights are trainable and don't like to fix directly on integers or powers of two or any such direct numerical thing

soft bobcat
#

so what you mean is {H(x), 1, 2, 4, 8...} and then not multiply the constants by weights?

fallen spear
#

it made sense to just always have an available arbitrary-sized activation

#

and then if we were doing one because the network needed one for a specific numerical reason it made sense to try to do a number of them

#

not because I have any operation I specifically want to enable, because in principle you can do different operations with a different set of constants

#

because "some numerical operation that networks may find useful might become computable, or more easily computable, if they have a set of constants across scales"

soft bobcat
#

reading through Dettmers's post, these constants will not accomplish the same thing. because his outliers move to different dimensions over mini-batches

#

so they have actual signal

#

when you create these constants, you are "making them available" but what you're really doing is bypassing weight decay and gradient scaling for these particular dimensions

fallen spear
#

his outliers eventually fix to single indices

soft bobcat
#

in particular, if you have a 2^30 constant, the rest of your network will never train

soft bobcat
#

assuming he means "residual stream" by "hidden state", my guess is that he's using pre-norm transformers so the purpose of these outliers is to change the scale of the output of layer normalization

fallen spear
#

neat

boreal moss
#

maybe the question how compressible is trained mlp should be answered first?

fallen spear
#

i think the general answer is that they are absurdly compressible once trained but this does not mean they can be compressed when training

boreal moss
#

what makes you think that?

fallen spear
#

you can sparsify and quantize networks with relatively little performance loss

#

i would say "no performance loss" but people have apparently stopped caring if their compression techniques were lossy

#

not very long ago people still did that

#

and it generally worked

boreal moss
#

quantization yes, but that is a more general thing that suggests we use unnecessarily high precision in general. sparsifying, I don't think so

fallen spear
#

pretraining in lower precision doesn't work

boreal moss
#

sparsifying is more like quantization of some number of weights to 1 bit

fallen spear
#

sparsifying is throwing out those weights completely

boreal moss
#

no

#

don't you need to know which weights? 🙂

fallen spear
#

not at runtime, no

boreal moss
#

you don't understand what I'm saying, or are you talking about some kind of structured sparsifying?

boreal moss
#

so a subset of the weight matrix is zeros, so those don't need to be multiplied, but you still need to store a binary matrix marking which ones are zeros
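
a rough accounting sketch of that point, assuming a one-bit-per-entry presence mask plus the surviving weights stored densely. the function names and the 16-bit weight width are assumptions for illustration; real sparse formats (CSR and friends) have different overheads.

```python
def dense_bits(n_weights, bits_per_weight=16):
    """Storage for a dense matrix."""
    return n_weights * bits_per_weight

def masked_sparse_bits(n_weights, density, bits_per_weight=16):
    """1-bit presence mask for every entry plus the surviving weights."""
    n_nonzero = round(n_weights * density)
    return n_weights + n_nonzero * bits_per_weight

n = 1024 * 4096
dense = dense_bits(n)
half_sparse = masked_sparse_bits(n, density=0.5)
very_sparse = masked_sparse_bits(n, density=0.1)
```

under these assumptions the mask costs one bit per original weight no matter how sparse the matrix gets, so it sets a floor on the storage.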

thick briar
#

How I understand it, the FFN is a key-value store. q (the input) is matched with k (up_proj) to produce the "attention" scores that determine which items in v (down_proj) are added to the residual stream.

The LLM stores patterns in k, and what it should think if those patterns are matched against in v.

This interpretation makes me suspect that k is overparameterized. While there are probably some patterns that are dependent on all dimensions of the input sequence, it seems more likely that the majority of patterns are not. We could have multiple separate up-projections, some dependent on some dimensions of the input sequence and some dependent on all dimensions.

#

We might also consider how LLMs might be forced to store knowledge. If FFNs are a key-value store, there are only so many keys to match against. Likely, the LLM divides knowledge into categories and assigns each category a key-value pair. When the key detects that the input is looking for knowledge from that category, it spits out all the knowledge it has about that category into the residual stream for future access.

#

This seems extraordinarily inefficient. A better way might be to have an extremely large number of keys that match with a subset of the dimensions of the input sequence, coupled with an extremely large number of values that output onto a subset of the residual stream. This would allow for the LLM to implement fine-grained knowledge storage and retrieval.

soft bobcat
#

if you don't understand the part in [ ], you can skip it

#

actually maybe not; the idea looks nonsensical unless you have intuition about how manifolds should behave

boreal moss
# thick briar This seems extraordinarily inefficient. A better way might be to have an extreme...

if every key is from a different subset, it is just sparsifying, and sparsifying is just a special case of quantization. if those subsets are relatively small (so most of the entries in the matrix are zeros), then the kolmogorov complexity has to be low, but inference on gpus will not benefit from that. only if we can somehow find some kind of structure in that sparse matrix can we leverage it and design a more efficient architecture with the right inductive biases

fallen spear
#

and iirc you can usually sparsify so far you have to be zeroing out entire rows

soft bobcat
# fallen spear can confirm that i don't understand it

here's a non-mathy explanation: a residual stream needs to represent every possible concept that might want to be represented, simultaneously. so if we want to represent a car in the residual stream, we also have to represent that car's relationship to Paris - zero. and its eloquence when giving a speech - zero. most concepts simply aren't related. so it'd be better if we could have a low-dimensional vector, specifically for car-related things, and not have to store all the unrelated parts

#

but if we do this, then our low dimensional vectors no longer have any obvious relationship to other low dimensional vectors. we must store those relationships between pairs of related concepts. so if we have cars and doors, and we want to do Q and K stuff with them - in the regular transformer, we apply Q and K to the giant-dimensional vector. here, we must store a transition matrix that relates cars and doors to each other

boreal moss
#

@fallen spear I didn't read the whole thread but I see that you were using the graph from https://arxiv.org/pdf/2310.19956.pdf that suggests that for small models the standard mlp ratio of 4 is too big, bad and wrong. this is complete BS, because their 41M model with mlp ratio=4 is a two-layer model, and that is the reason it is junk, not the mlp ratio. for mlp ratio=1 it has 4 layers, so enough to be working okish, and that's why it is better. I also think that all their 41M models are highly suboptimal because they just set the model dim for 41M parameters wrong.

soft bobcat
#

the reason swiglu underperforms geglu seems to simply be the scale of x. if you use swiglu(1.702x)/1.702, it matches geglu. and likewise, if you use swiglu(x/1.702)*1.702, it does a lot worse https://wandb.ai/ad8e/tinystories3?workspace=user-ad8e
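
one likely reason the 1.702 factor lines things up: GeLU(x) is commonly approximated as x * sigmoid(1.702x), and SiLU(1.702x)/1.702 is exactly that expression. a quick numeric check of the activation shapes only, as a sketch (the function names are mine and this says nothing about the training runs themselves):

```python
import math

def gelu(x):
    # exact GeLU: x * Phi(x), with Phi the standard normal CDF
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def silu(x):
    # SiLU / swish: x * sigmoid(x)
    return x / (1.0 + math.exp(-x))

def scaled_silu(x, c=1.702):
    # silu(c*x)/c == x * sigmoid(c*x), the usual sigmoid approximation of GeLU
    return silu(c * x) / c

xs = [i / 100.0 for i in range(-600, 601)]
gap_scaled = max(abs(gelu(x) - scaled_silu(x)) for x in xs)
gap_plain = max(abs(gelu(x) - silu(x)) for x in xs)
```

on this grid the rescaled SiLU stays much closer to GeLU than the plain one does, which is consistent with the rescaled swiglu behaving like geglu.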

#

however, I have to go play tennis and will be back in a few hours

#

that probably means init is really important, and that I have to figure out muP things

soft bobcat
#

most of the variants on a(x1) * b(x2) * c(x3) perform basically the same, for functions a, b, c

#

they are all much better than regular GeLU and all have comparable performance to each other and GeGLU, as long as a b c are ok-ish

#

for example, exp cannot be any of the three

soft bobcat
#

ReLU^2 is giving pretty bad results. I think because the gradient is not capped, and instead grows as |x|

#

logsigmoid(x) * sin(x) also has bad results and shares this gradient issue

soft bobcat
#

testing results on small models:
High confidence:

  1. multiplying multiple activation functions is better than not multiplying. that means these things all work and are much better than GeLU:
    GeLU(x1) * x2 = GeGLU
    GeLU(x1) * GeLU(x2)
    GeLU(x1) * sinc(x2)
    GeLU(x1) * ReLU^2(x2)
    GeLU(x1) * ELU(x2)
    x1 * logsigmoid(x2) * sin(x2)
  2. you can go up to 3 and the performance is the same, up to noise:
    tanh * x2 * logsigmoid
    gelu * x2 * elu
    GeLU * linear * linear

Medium confidence:

  1. scale of the gradient of an activation function needs to not grow too fast
    exp is unsuitable as an activation function, though different inits may change things
    quadratic (ReLU^2) may build up gradient issues. this may be bad at higher scales. same with logsigmoid(x) * sin(x). multiplying two quadratics may cause issues
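
a sketch of what "multiplying multiple activation functions" means mechanically: the up-projection output is split into equal chunks, one function is applied per chunk, and the chunks are multiplied elementwise. `gated_activation` is a hypothetical helper, not the code used for the runs above, and the projections themselves are left out.

```python
import math

def gelu(x):
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def identity(x):
    return x

def gated_activation(pre, fns):
    """Split `pre` into len(fns) equal chunks, apply one function per chunk,
    and multiply the chunks elementwise."""
    k = len(fns)
    if len(pre) % k != 0:
        raise ValueError("pre-activation length must be divisible by len(fns)")
    width = len(pre) // k
    out = [1.0] * width
    for i, fn in enumerate(fns):
        chunk = pre[i * width:(i + 1) * width]
        out = [o * fn(v) for o, v in zip(out, chunk)]
    return out

pre = [1.0, -2.0, 0.5, 3.0, 2.0, -1.0]  # stand-in for the up-projection output

geglu = gated_activation(pre, [gelu, identity])                 # GeLU(x1) * x2
gelu_gelu = gated_activation(pre, [gelu, gelu])                 # GeLU(x1) * GeLU(x2)
three_way = gated_activation(pre, [math.tanh, identity, gelu])  # three factors
```

note that a k-way product needs the up-projection to be k times as wide as the hidden width that reaches the down-projection.
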
#

theory speculation, unconfirmed by experiment:
symmetry seems to be bad. so GeLU(x1) * GeLU(x2) is worse than GeLU(x1) * x2. and x1 * x2 is worse than GeGLU
linear also has symmetry. maybe there would be a better activation function than linear, in GeGLU
perhaps only one of the activation functions should have a 0-zone

soft bobcat
#

my current best activation function is tanh(x1) * x2^2

#

it does a little better than GeGLU, but the improvement is much less than the improvement of GeGLU > GeLU

dawn vine
#

a lot of the time this stuff doesn't hold up on larger models, annoyingly

#

@boreal moss tried a ton of stuff and found some good ones for small models tho

soft bobcat
#

my models are 10M, trained for 50M chars. the benefits show early in training and then are marginal later, but persist. they're above the noise floor but they're not that important

#

I am a bit skeptical of cubic functions right now because functions with higher gradients seem to show instability

boreal moss
#

@soft bobcat Yes, but when I added a "very leaky" tanh to the "trilinear" ffn, it looks like it tolerates higher learning rates

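import torch
import torch.nn as nn

class TrilinearFFN(nn.Module):
    def __init__(self, dim, ffdim):
        super().__init__()
        self.fc1 = nn.Linear(dim, ffdim * 3)  # one up-projection, later split into three factors
        self.fc2 = nn.Linear(ffdim, dim)

    def forward(self, x):
        x = self.fc1(x)
        x = torch.tanh(x) + 0.1 * x  # "very leaky" tanh
        a, b, c = torch.chunk(x, 3, dim=-1)
        return self.fc2(a * b * c)  # product of the three chunks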

didn't test this extensively, just ran one test and compared to no activation function for dim=256 ffdim=512

soft bobcat
#

sounds reasonable, the tanh should restrict the gradients to a reasonable range

#

the three-part FFNs I tested were no better than GeGLU; mostly, they were equivalent. maybe I didn't look in the right places

soft bobcat
#

this probably explains why the effective rank was so low

soft bobcat
#

SE Gyges has been gone for a week now

icy sentinel
#

you might want to tag him on twitter if it's important, otherwise will be back when back

icy sentinel
#

what "accordingly" means I have no idea

soft bobcat
#

I mentioned Gyges a few times to send him info, which can be found by Ctrl+F mentions of his name

icy sentinel
#

Appreciate it 🙂

soft bobcat
#

all initializations for a fixed model are muP, in the old meaning of muP. because muP did not give any indication of correct inits, and only said how to transfer inits from one scale to another

#

the new muP paper does give math on how to scale params, and I'm working it out on my model now

icy sentinel
#

i would guess that for very simple modifications (ie reducing layer width) you can simply rescale to maintain the infinite-width limit

#

for less simple ones like geglu god have mercy on your soul

soft bobcat
#

I think all the transformer activation functions are pretty simple now

#

muP only tells you how to scale. it's mostly independent of your activation function

#

and what you need to figure out for the activation function is what the multiplier for that function should be. which is guesswork

#

at least, for transformers this is true. for a series of FC layers without normalization, it would be more difficult

icy sentinel
#

i think i finally have an environment i approve of as a reproducible lab for a/b tests and all i had to do was build a machine, rework the dockerization of gpt-neox, and make a new branch off of v1.0 of it

icy sentinel
# boreal moss @fallen spear I didn't read the whole thread but I see that you were usi...

this table is also here: https://arxiv.org/abs/2001.08361

#

if anyone has straight tested this at higher sizes I have not seen it

#

now that i have an environment i approve of i think i will a/b test at ratio 4, 2, 1, with and without scaling the initialization

boreal moss
#

this one shows similar minimum and simultaneously proves my point, here ratio=0.5 is worse than ratio=4

icy sentinel
#

it is a little bit frustrating that there is relatively little good testing on this count but the information we do have does not seem to indicate strongly that the up projection has a very large effect; it has an effect, but it is not large, and allocating the same parameters to either depth or attention heads seems to outperform

icy sentinel
#

or at least be time consuming

#

it is a good idea though

#

i have "understand rwkv and mamba" on my todo somewhere

dawn vine
#

what are you comparing currently? pythia?

icy sentinel
#

yes

#

and by comparing i mostly mean "contributing code to gpt-neox"

dawn vine
#

what size model? blink tends to chide me that I have to test on L32D2048 or it doesnt always hold up for larger models

#

i've 'discovered' amazing things that work great on smaller models like L12D768 but die at 400m tok trained on larger ones

#

this kind of scale effect may be especially relevant here for FFN sizing

#

its pretty annoying to train a model that large tho 😦

icy sentinel
#

yeah i am trying to make sure my setup has a 1:1 larger-scale analogue so it's easier to bump upwards

#

effectively: rerunning same analysis is a matter of swapping out like three params and running it on a real node

soft bobcat
#

(I'm still figuring out depth scaling though)

icy sentinel
#

if you're figuring out mup perfectly gpt-neox needs its mupperizer fixed

icy sentinel
#

if i'm doing it nice and manually

soft bobcat
#

currently I'm waiting on my boss to see how much I'm allowed to talk about

#

gpt-neox is using the old mup repo so it will never be fixed unless someone PRs an implementation of the new spectral muP paper

#

the old mup repo is useless

winter grotto
#

i wonder why

wind wadi
#

The sum of two linear maps can have rank up to the sum of both maps' ranks, and with K experts + top K sampling you're giving the model more chances (& incentive?) to separate signals from individual experts and produce a higher rank map per MoE layer. If the MoE layer projects two inputs from different time steps into two different subspaces, then the attention mechanism could restore information lost by the low rank projection(?). So I guess you're not specifically bounded by the smaller expert dimension? I'm probably wrong somewhere; I need to sleep.

winter grotto
#

you're probably correct i just don't know how linear algebra works

icy sentinel
#

this is roughly the same trick as multihash routing just without the hash

#

you define your standard sized ffn as many smaller ones

#

it is equivalent

#

you do however get to do routing many times

#

in the limit you define experts of size one matrix row each and route separately for each

#

it is a good trick and i have been trying not to think about it bc there are too many things you can do with MoE

boreal moss
#

I have to ask, what exactly we are searching for now here?

icy sentinel
#

will probably do that again

#

i personally still want to a/b test directly simple ablation on ratio

#

i just got lost for a prolonged period when trying to ensure i had a reproducible setup

#

my standard for 'reproducible' comes from maintaining build systems and is somewhat more stringent than is considered normal or sane in ml

#

or in maintaining build systems

boreal moss
#

your point was to find best parameter efficient ratio right? I think the problem is that you can't really decouple dff/dmodel from other things, so you can't really measure what you want to measure

icy sentinel
#

i can literally just reduce the degree of up projection and it's a standalone change

#

but to stay at equiparams, yeah, you have to fiddle the entire ff setup, again similar to geglu, if you want to keep things "the same" while fiddling activations

boreal moss
# icy sentinel but to stay at equiparams, yeah, you have to fiddle the entire ff setup, again s...

There may be a useful trick on how to change the dim of the model without changing the dim of the attn layer, just calculate the attn layer on a part of the tensor, and the rest go straight through, this is used in some CNNs for other reasons and works just fine, should work on transformer, example:
for ratio=4 you use model_d=1024 ff_d=4096 normal attention layer
for ratio=1 you use model_d=2048 ff_d=2048 and put half of the tensor through attention layer, another half go straight through.
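
A quick parameter count for the two configurations above, as a sketch. It assumes a plain two-matrix FFN and an attention layer with four square projections (Q, K, V, output), all without biases; the helper names are hypothetical.

```python
def ffn_params(d_model, d_ff):
    # up-projection plus down-projection, biases ignored
    return 2 * d_model * d_ff

def attn_params(d_attn):
    # Q, K, V and output projections, each d_attn x d_attn, biases ignored
    return 4 * d_attn * d_attn

# ratio = 4: ordinary block
wide_ff = ffn_params(1024, 4096) + attn_params(1024)

# ratio = 1: d_model doubled, attention only sees half of the channels
passthrough = ffn_params(2048, 2048) + attn_params(2048 // 2)

# ratio = 1 without the trick: attention over the full 2048 channels
full_attn = ffn_params(2048, 2048) + attn_params(2048)
```

Under these assumptions the ratio=4 block and the ratio=1 passthrough block have the same parameter count, while running attention over all 2048 channels would double it.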

icy sentinel
#

i think we only need that if we are moving params out of the ff into the d_model and we want to leave the attn alone

#

but noted

boreal moss
#

so, if 1x ratio is okish for standard FFN, bilinear types with the same parameter count will have 0.66x ratio. are they still better, or does this break at this point? has anyone tested this case?
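
the 0.66x figure comes from counting matrices: a plain FFN has two (up and down), a bilinear/GLU-type FFN has three (gate, value, down), so matching parameters means scaling d_ff by 2/3. a small sketch, with hypothetical helper names and biases ignored:

```python
def plain_ffn_params(d_model, ratio):
    # two matrices: d_model x d_ff and d_ff x d_model
    d_ff = ratio * d_model
    return 2 * d_model * d_ff

def glu_ffn_params(d_model, ratio):
    # three matrices: gate, value, and down-projection
    d_ff = ratio * d_model
    return 3 * d_model * d_ff

def glu_ratio_for_same_params(plain_ratio):
    # solve 3 * r_glu == 2 * r_plain
    return plain_ratio * 2.0 / 3.0

d_model = 768
matched = glu_ratio_for_same_params(1.0)
matched_at_4 = glu_ratio_for_same_params(4.0)
```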

dawn vine
#

since the theory here is that ffn_ratio is an inefficient use of parameters, do you guys have opinions on DeepSeekMoE? it adds together the results of a ton of different low ffn_ratio networks (selected as 'fine grained expert segments') to create a high ffn_ratio network akin to a traditional high ffn_ratio expert/FFN

#

the idea being that a single FFN w/ ffn_ratio=4 is the same as the sum of 8 smaller FFNs w/ ffn_ratio=0.5
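
a tiny sketch of that equivalence with everything active (no router): because the activation is elementwise, slicing the hidden units into 8 groups and summing the 8 small FFN outputs reproduces the full FFN. sizes and weights here are toy values, and the ReLU FFN without biases is an assumption for illustration, not DeepSeekMoE's actual layer.

```python
import random

def relu(v):
    return v if v > 0.0 else 0.0

def ffn(x, w_up, w_down):
    """w_up[i] holds the input weights of hidden unit i, w_down[i] its output vector."""
    out = [0.0] * len(x)
    for up_row, down_row in zip(w_up, w_down):
        h = relu(sum(w * xi for w, xi in zip(up_row, x)))
        out = [o + h * d for o, d in zip(out, down_row)]
    return out

random.seed(0)
d_model, d_ff, n_experts = 4, 16, 8  # ratio 4, split into 8 pieces of ratio 0.5
w_up = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(d_ff)]
w_down = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(d_ff)]
x = [random.gauss(0, 1) for _ in range(d_model)]

full = ffn(x, w_up, w_down)

# each "expert" owns a contiguous slice of hidden units
size = d_ff // n_experts
summed = [0.0] * d_model
for e in range(n_experts):
    part = ffn(x, w_up[e * size:(e + 1) * size], w_down[e * size:(e + 1) * size])
    summed = [s + p for s, p in zip(summed, part)]
```

the difference in an actual MoE is that a router picks only some of the slices per token, so the sum runs over a subset.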

dawn vine
#

also, @fallen spear have you considered increasing d_model but reducing both FFN and attention dimensions as a potentially maximally effective use of parameters?

fallen spear
#

attn is its own can of worms

dawn vine
#

yeah but maybe a huge embedding size is whats really important

fallen spear
#

it is definitely one of the main important things

dawn vine
#

i'm doing a lot of work with MoE at the moment so I'm thinking about this stuff a lot since its inherently very FFN centric usually

fallen spear
#

that is part of what led me to being this deranged about FFNs, yeah

dawn vine
#

do u have opinions about what I was saying about deepseekMoE above?

fallen spear
#

i have a pre-existing obsession with "hash routing", which should probably not be called that, and one of the things they did in that paper is in the same vein as deepseekmoe

#

@dawn vine #off-topic message previous rant here

#

i had another one where i was speculating about routing with an lsh, as i sometimes do and should really get around to testing

dawn vine
#

im fine with getting scooped, but i will also try it 😉

fallen spear
#

it seems like that should work, and also splitting into successively smaller and smaller merge-able experts a la deepseek/multihash should work

#

and they are both free

#

ie, they do not make the model more expensive in any meaningful way

dawn vine
#

theres literally no reason it shouldnt work as well as layers, since at worst its just the same

fallen spear
#

routing fails to train

#

for both

#

assuming routing is well-behaved, both approaches are strictly better than layers ordinarily are. maybe cache behavior is worse though

#

you cannot anticipate which layer is next

#

but assuming routing does not degenerate and cache behavior does not kill you they both seem like sure things

dawn vine
#

this is all for a bolt-on to RWKV and I have the interesting problem that token-shift is not convenient for MoE, so for now I'm using untokenshifted additional experts and considering the base rwkv chanmix as a kind of 'shared expert'

fallen spear
#

i know what some of those words are

dawn vine
#

I will rephrase: this is all for a bolt-on to RWKV, which uses a non-standard FFN, so I'm just keeping the non-standard FFN as is and adding (literally adding the results) of extra experts