#d_ff/d_model + swiglu tests
Created so I can stop spamming #research and #off-topic with it
original thread in #off-topic begins here: #off-topic message
thread in #research : #research message
Hypothesis to test:
- d_ff/d_model is optimal at exactly 1
- swiglu and geglu are worse than gelu, and specifically only appeared to be better because the adjustment for isoflops brings d_ff/d_model closer to one
in the swiglu/geglu case they adjust the FF ratio without changing param count and without changing the fraction of params allocated to FF vs attn
(bc the meaning of "FF ratio" changes with those activations)
with a non-GLU FF, if you change the ratio, then either you change total param count or you change the FF vs attn balance
so it's not obvious to me that "use the FF ratio from the GLU paper, but without the GLU" has a clear meaning
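To make the point about "FF ratio" changing meaning concrete, here is a rough parameter count, a sketch that ignores biases and assumes the usual layout of two matrices for a plain FFN and three (up, gate, down) for SwiGLU/GeGLU. The function name and the d_model value are illustrative, not taken from any of the papers discussed.

```python
def ffn_params(d_model: int, d_ff: int, gated: bool) -> int:
    """Weight count of one feed-forward block, ignoring biases.

    Plain (e.g. GELU): up (d_model x d_ff) + down (d_ff x d_model).
    Gated (SwiGLU/GeGLU): one extra gate projection shaped like the up projection.
    """
    n_matrices = 3 if gated else 2
    return n_matrices * d_model * d_ff


d_model = 768  # illustrative width
plain = ffn_params(d_model, 4 * d_model, gated=False)           # ratio 4, no gate
glu_same_width = ffn_params(d_model, 4 * d_model, gated=True)   # ratio 4, gated
glu_iso = ffn_params(d_model, (8 * d_model) // 3, gated=True)   # ratio 8/3, gated

print(plain, glu_same_width, glu_iso)
```

Under these assumptions a gated FFN at ratio 8/3 has the same weight count as a plain FFN at ratio 4, which is why the ratio can move for GLU variants without touching the FF vs attn balance.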
Proposed test:
Retrain pythia with d_ff/d_model from {1, 2} and activation of {gelu, swiglu, geglu}
Will do smallest pythia first
so, not keeping the param count identical?
the ff ratio in this case referring to the width of the activation
not the number of parameters
isoflops is a bad test
isoflops requires you to alter at least two hparams at once
in this case specifically if swiglu/geglu are worse or equal they have more parameters
so they really should be better
if only d_ff is changing, it seems intuitive that larger d_ff would perform better (up to some point where training goes unstable or smth). bc the larger d_ff model "could" always just ignore some of its neurons, if that were optimal
but ofc that's just a guess.
Okay I am tired of typing the whole expression so I am going to call d_ff/d_model “hidden ratio” henceforth
I have seen zero empirical evidence that hidden ratio above one is good
So far as I can tell it was set at four in the original transformers paper and never questioned
Swiglu/geglu modify it to maintain isoflops
For weak indicia that one is the optimal ratio see images attached to this post
Goal is to get stronger indicia
makes sense, if you're keeping total params constant
(which was true in both of those images)
otherwise, higher ratio = more params, and it seems hard for more params to hurt given the general smooth scaling properties of transformers
another way to look at it is: adding a new FF block is not too different from doubling the size of an existing one. like if the depth is already large, and representations are changing smoothly across layers
I am unsure if this is the case
The new paper says it keeps d_model constant
so it's hard to have a situation where half of the FF neurons at layer N are actively harmful, without it also being the case that if you added layer N+1, its entire FF would be harmful. which would be wild
they did, and varied d_ff and layer count together to maintain total params constant
hence "41M," etc labels
I'm going to fuse their two tables that report relationship of layers to hidden size and that report results and graph them, I think
Could remake this plot with the x axis in log scale?
the uptick in the biggest model in ppl as you reduce dim is the only datapoint that looks at all favorable, the others only gain ppl after the hidden size drops below the model size
regarding the stated hypothesis, from personal experience (including a single training run I just did to make sure) for small models (say 123m params) ff_ratio>1 is better than ff_ratio==1 when keeping everything else the same and training for the same number of tokens
whether or not it's a worthwhile tradeoff in place of other kinds of resizing, I don't know
maybe if you adjust for the amount of layers to make the same total parameter count, ff_ratio=1 is optimal; that's what impact_depth_width_results.csv implies to me
Ran three short 50M-token tests, and if adjusted for parameters by adding layers, your hypothesis won:
params=123m ff_ratio=3 n_layers=12 d_model=768 loss=3.664
params=123m ff_ratio=1 n_layers=18 d_model=768 loss=3.656
params=123m ff_ratio=1 n_layers=12 d_model=900 loss=3.666
But with caveats: despite having the same parameter count as the first model, the 18 layer model was about 25% slower in actual training time per token
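A back-of-the-envelope way to see how many layers a lower ff_ratio "buys" at fixed parameter count. This is only a sketch: it ignores embeddings, biases and norms, assumes 4 * d_model^2 weights for attention, and models the gate as one extra d_model x d_model matrix (an assumption loosely based on an RWKV-style channel mix). The actual model used in the runs above is nonstandard, so the numbers will not match it exactly.

```python
def block_params(d_model: int, ff_ratio: float, gated_ffn: bool = False) -> int:
    """Approximate weights in one transformer block (no biases, no norms)."""
    attn = 4 * d_model * d_model                   # Q, K, V, output projections
    ffn = int(2 * ff_ratio * d_model * d_model)    # up + down projections
    gate = d_model * d_model if gated_ffn else 0   # assumed d_model x d_model gate
    return attn + ffn + gate


def matching_layers(d_model, n_layers, old_ratio, new_ratio, gated_ffn=False):
    """Layer count at new_ratio that comes closest to the original block budget."""
    budget = n_layers * block_params(d_model, old_ratio, gated_ffn)
    return round(budget / block_params(d_model, new_ratio, gated_ffn))


plain_estimate = matching_layers(768, 12, 3, 1)
gated_estimate = matching_layers(768, 12, 3, 1, gated_ffn=True)
print(plain_estimate, gated_estimate)
```

With these assumptions, 12 layers at ff_ratio=3 trade for roughly 19 or 20 layers at ff_ratio=1, in the same neighborhood as the 18 layers used in the test.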
It would be interesting to control for wall time as well as parameter count, and see whether the optimal ratio changes.
One recent paper that controls for wall time is Kaddour et al., 2023 (optimizer evaluation). They adjust the learning rate schedule and number of training steps, and measure perplexity after training only.
yeah, the other thing to take into account is that mundane things like autograd may cause your memory usage to be a lot higher with 18 layers than 12. Mine certainly was.
That makes sense. For smaller model sizes, I also wonder if GPU utilization might be a bit lower for models with narrower d_ff, which could also contribute to the slowdown.
Easily possible. And as your size increases, you might need an insane number of layers to displace the parameters generated by ff_ratio=4
Despite the slowdown, I think this ff_ratio=1 may be a good tradeoff, at least at lower scales... it seemed a lot more effective than adding parameters in more traditional ways in terms of speed versus loss improvement!
If you want similar wall clock time at isoflops increase model dim instead of layers
will also increase attn cost but that amortizes with scale
argh very sorry, I quoted bad numbers above.. the result is the same but it's a closer call than I thought
will update in a minute
fixed above:
params=123m ff_ratio=3 n_layers=12 d_model=768 loss=3.664
params=123m ff_ratio=1 n_layers=18 d_model=768 loss=3.656
params=123m ff_ratio=1 n_layers=12 d_model=900 loss=3.666
ok, trying that now
wait, if that’s not what the first row of most recent is then how did it gain params?
i think param count off?
yeah sorry i keep making bad edits
it was right above, i just tried to copy/paste and deleted part and retyped it incorrectly 🙂
you good, i just don’t wanna be reasoning from off numbers
even slower with bigger d_model
oh one other thing I should mention, this is a little nonstandard - it's using essentially the rwkv ffn and my own modified attention sublayer so take it with a grain of salt
i'm going to guess it has to break the matmul into two ops
but it should be a reasonable approximation
ngl i like rwkv better
main reason to use a standard transformer is just for the 1:1 comparison
does make it harder to reason about why it would be breaking the matmul up though
yeah i just had this lying around and figured id run one experiment.. then that became three
i am blessed with bad local hardware which gives me the go-get-a-runtime friction to prevent me from yoloing
its easy for me to run it on standard models - i have them implemented too - just had the data from a run here already
hehe yeah im running this on some remote 4090
vast? i keep hitting places that have 0 availability
yeah, ive used jarvis and vast mostly
agreed about the 0 availability, it sucks
i almost went out and bought a 4090 a couple months ago bc of that
i may still
i am broke but tempted to buy a used a100 on credit which is not a good impulse
4090 is fine if you can get along with 24gb
roughly same speed as a100 in my experience
memory constraint limits some stuff but fair enough
… wait I do have access to a 4090
here’s hoping he’s awake
surprisingly, it looks like the larger d_model version is only about as good as the original wide ffn
updated the chart above
well, that was interesting!
thanks, it does suggest pretty strongly that 1 isn’t actually the good number
do you have estimates of the standard deviations of these losses, even if they're just manual guesses?
sorry, I don't... but they stayed in a very consistent horserace so they're not wiggling around a lot
no, that's perfect. it's exactly the kind of plot that we can stare at to eyeball std devs
well, interestingly 1 was the good number here, if you didn't care about runtime and adjusted for parameters by adding layers
I didn't necessarily expect that, though I really had no idea what to expect!
all I knew going in was that 1 is severely suboptimal if you can just add size to the FFN without worrying about cost
i think the paper that set me off did that tradeoff and it continues to be true until about 400m params with ratio at about 2
ratio below that not tested for that many params
i went nuts because i’ve been bothered by a lack of ablations to d_ff for a while
makes sense - i was definitely interested when I saw this thread and thought about it at all
gotta say, the relatively small gain in loss for throwing out 75% of the parameters is still an interesting result
but those are some nice and steady losses, it’s impressive actually
its on a small randomized part of the pile validation set
the training loss chart looks a lot choppier 😉
lol fair
since I'm studying learning rates right now, I'll comment that I believe learning rate is supposed to decrease when depth increases
… actually that explains a result in the paper
loss goes up a smidge as they max out on depth towards the end
hparams identical across all runs
well maybe its even better than my experiments implied then, since I added 50% more layers but didn't adjust LR at all
....
is this actually telling me the effective rank of a ff layer of this size is less than ten
oh, references: https://arxiv.org/abs/2001.08361
https://arxiv.org/abs/2310.19956
I think yes, in the sense of something a bit less than 'how many linearly independent columns/rows are there'
what's interesting about it is, that at 12 layers (for this 134m-param model), it seems 'optimal' in that you pack as much linearly independent info as possible into each layer
[probably not] coincidentally, that's exactly how many layers people usually use for such a d_model
I have to recheck but it seems somewhat at odds with results here: https://arxiv.org/abs/2206.06072
they used vision transformers and a more involved type of decomposition
the main result I see is that it looked like in both 134m and 374m they were optimal ppl at around 2.66x ff_ratio, but a lower ff_ratio more like 1.5 for the tiny 41m model
which interestingly does not at all match my tests from last night
"When 𝑑model/𝑑ff > 1 (red dashed rule), perplexity slowly increases"
I don't see that as being true in their data at all
perplexity looks clearly lower to the left of that red line for all of the models
not hugely, but it's still lower!
i mean this is from the same data
and the csv above it
yeah, im talking about the ratio not the ff size which is what your plot shows
i am trapped on mobile but i think the only one that gets worse before crossing ratio of 1 is the biggest one
yeah you have to go get d_model to figure where it should be on that plot
for each curve
can also check the csv for it
i am ignoring every metric besides validation ppl tbh
oh i didnt expand the csv, only saw first few rows... looking now
wow my eyeballing was almost exactly right
best ratio for the biggest ones was around 2.66
smallest one very hard to tell from the data
but likely somewhere between 0.7 and 1.4
as mentioned, this is directly in opposition to my test results last night
which showed 1.0 as being useful when increasing layers accordingly
(w caveat of slowdown in training)
i would suspect hparam issue for depth
but also, that’s a different and weaker hypothesis to test than that hidden dim is useless or only marginally useful
i just ran more tests w/ different LR for the deep model.. limited evidence but lower LR was not better
yeah it may be only marginally useful above 1.0
but in reality I'm not sure I can train fast enough with deeper models to make it 'worthwhile' to go to 1.0
i’d want to establish exactly how marginal and then where else equal params can go, tbh depth seems like the worst contender
specifically bc it harms both training and inference time like that, yeah
got other ideas besides increasing d_model? that didn't pan out for me
switch activations to something fancier
add attn heads
too involved to test on a lark, but: add attn sink tokens
yeah increased attn size/heads could be helpful
which doesn’t add params but does increase flops
i will try more heads (so bigger total attn size)
MoE also
can do moe with four experts at ratio 1 for same cost
eight experts if only the down projection is an expert
i guess i am assuming no params in routing function or that such params are negligible
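The expert counts can be checked with a quick weight tally. This is a sketch that, like the message above, treats router parameters as negligible and ignores biases; the function names are made up for illustration.

```python
def dense_ffn_params(d_model: int, ratio: int) -> int:
    """Up + down projection weights of a dense FFN."""
    return 2 * ratio * d_model * d_model


def moe_ffn_params(d_model: int, ratio: int, n_experts: int, shared_up: bool = False) -> int:
    """Expert weights only; routing parameters are assumed negligible."""
    up = ratio * d_model * d_model
    down = ratio * d_model * d_model
    if shared_up:
        # One shared up projection, one down projection per expert.
        return up + n_experts * down
    return n_experts * (up + down)


d = 768  # illustrative width
dense_ratio4 = dense_ffn_params(d, 4)
four_experts = moe_ffn_params(d, 1, 4)
eight_down_experts = moe_ffn_params(d, 1, 8, shared_up=True)
print(dense_ratio4, four_experts, eight_down_experts)
```

By this count four ratio-1 experts match a dense ratio-4 FFN exactly, while eight down-projection experts with a shared up projection come to 9 * d^2 versus 8 * d^2, so "eight" is a round figure (seven would match exactly).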
i cant try MoE easily, and personally I'm more interested in practical ways to increase function of the base transformer than adding different architectures like that into the mix
but i can try heads/attention sizing easily
that's not to say MoE isn't a valid idea
both moes and factorization are hobby horses i’d want to keep as last resorts
bc they don’t test the hypothesis cleanly
entirely separate cans of worms
eg another indirect test is to try to replace well trained up projections in existing llms with similar ones of ratio 1
but much like quantization that could only work after training is done
so i’m not chasing it because it doesn’t test the hypothesis meaningfully
oh you could also have up to four separate ff + geglu for equiparams, four times slower though
I ran a new test with ff_ratio=1 and more attention heads instead of more layers and it did the best so far
my next question is: if you always use ff_ratio=1, can you skip the hidden dimension entirely and just run your activation function on the input and project to output via a single linear layer?
I would think this would harm performance, the number of nonlinearities along a path generally matters a lot
my testing agrees with you 🙂
... what is the most convoluted activation function
i think the answer is "attention", actually
you can go the other way and make it two matrices of intermediate state
sorry, not sure I understand what you mean
instead of having two matrices to up project and down project you can run three or four
which is very silly
and also slow
haha
well, being rwkv-like, this has all included a gating on the FFN as well
so it sort of does
if we want to follow the swiglu-esque route I am thinking of "what would eat up parameters"
and throwing a qkv in there would do that
but i think we're verging on trolling <_<
trolling ourselves lol
it's not a perfect test bc of parameter count mismatch but removing the gate in exchange for another layer was not a good trade
(just tried that)
my best revised hypothesis so far, from my testing results, is that ff_ratio=1 (hidden dim equal to d_model) is a good trade for more heads, but that it may or may not be actually worth it in terms of overall training time slowdown
that slowdown will increase as you increase context length
whereas a larger hidden_dim is unrelated to context length
as a result, adding layers could be a more worthwhile trade, but might be more likely to cause you vram problems
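A rough FLOP count illustrates why the attention trade gets more expensive with context length while the FFN does not. This is a sketch under stated assumptions: a multiply-add counts as 2 FLOPs, attention is counted in full (no causal halving), and only the score and value-mixing steps are counted for attention, not the Q/K/V/output projections.

```python
def attn_mixing_flops_per_token(n_ctx: int, d_attn: int) -> int:
    """QK^T scores plus the attention-weighted sum over V, per token."""
    return 4 * n_ctx * d_attn


def ffn_flops_per_token(d_model: int, d_ff: int) -> int:
    """Up and down projections, per token; no dependence on context length."""
    return 4 * d_model * d_ff


d_model = 768  # illustrative width
for n_ctx in (512, 2048, 8192):
    print(n_ctx,
          attn_mixing_flops_per_token(n_ctx, d_model),
          ffn_flops_per_token(d_model, 3 * d_model))
```

Growing d_attn (more heads) scales a term that is itself multiplied by n_ctx, whereas growing d_ff scales a term that is constant per token.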
i’m gonna do a 1:1 pythia run at ratio 2 and 1 when i can
i might only go to a couple checkpoints on each
this comment made me think about something I was already gonna try but never got around to... lora style ffn
someone in #research did that but i do not remember who
#research message
it was spy
there are other confounding variables maybe idk
yeah not quite what I was thinking of
wait: this doesn't apply to rwkv, does it?
i use traditional dense attention with some surrounding modifications in the time-mix section, but rwkv style FFN (channel mix)
but you're right, it wouldn't apply for rwkv
So what's the upshot of the current tests @dawn vine @fallen spear
for params=123m d_model=768, my tests showed that ff_ratio=1 is in fact better perplexity if you scale up n_layers or preferably n_heads accordingly to maintain parameter count
but the caveat is the training goes somewhat slower per token ingested and may use more vram
so where exactly the tradeoff becomes worthwhile is unclear, and probably dependent on a variety of factors (e.g. for attention, more heads will cost a lot for large context lengths, and for many layers vram or even convergence may become a problem)
And what activation function are you using?
rwkv style FFN w/ GELU and a sigmoid gate
I’m looking at a straight pythia when I can set up an env which did not occur last night
was just trying to get a feel for the outcomes so I used whatever I had handy, but ended up doing a bunch of tests once it got interesting 🤣
so ymmv with more traditional models or other ffn styles
"larger d_ff does literally nothing" doesn't seem like it's doing well as a hypothesis although I'd be curious how it behaves at higher d_model especially, "putting the d_ff params somewhere else is usually better" looks pretty good to me rn
although, honestly: the fact that ablating d_ff only hurts it a little bit in this case is still an interesting result
most ablations hurt ... more than that, I think?
on a completed result if it's got a flat and steady ppl gap like it does in the preliminary test I'd want to calculate how much more training would close it since it is ablating approx 75% of params and that is a plausibly worthwhile tradeoff
my guess is that as it becomes a larger % of the model's parameters, ablating is gonna hurt a lot more
My argument for the reverse would be that as the d_model becomes wider state will be less bottlenecked and therefore less likely to benefit from up-projection
this thing about the rank being only 10 is still hurting my brain
this is a sparsity idea inspired by the hidden dimension expansion. is there a way to be intermediate between (hidden=input dimension) and (hidden = 4x input dimension)? sure, let's suppose the hidden dimension is 4x the input dimension.
divide the hidden dimensions into 4 blocks, and label them 00 01 10 11.
divide the input dimensions into two sections: A and B.
section A outputs to 11 and 01. section B outputs to 11 and 10. block 00 receives no inputs; it's just there for notational convenience. as a consequence, each neuron outputs to half the hidden dimensions. the output connections are left fully connected for now.
here's the intuition: the binary code of the hidden dimensions says "what coordinates I will accept input from". 11 takes input from everything, so it's fully connected. 00 takes input from nothing, so it's not connected. the 11 block reproduces the ff_ratio=1 topology. the other dimensions are for passing through parameters without being able to process them, if that's what the extra dimensions are doing.
in the original transformer model, the purpose of those extra dimensions is probably to expand the number of nonlinearities that can be created. but maybe some of those nonlinearities don't require the full flexibility of all N dimensions; that is what the 10 and 01 blocks are for, as they do partial processing.
the division into A and B is hacky and mathematically unpleasant, so it won't do much more than passing through parameters.
the division can be extended, such as into 3-digit binary codes + sections ABC. in the limit, the construction actually makes sense both mathematically and computationally, since hidden dimensions correspond to tuples of input dimensions. but we're nowhere close to that.
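One way to read the 00/01/10/11 construction is as a 0/1 connectivity mask on the up projection. The sketch below builds that mask in plain Python; the bit convention (first bit accepts section B, second bit accepts section A) is an assumption chosen to match "A outputs to 11 and 01, B outputs to 11 and 10", and the function name is hypothetical.

```python
def block_mask(d_in: int):
    """0/1 mask of shape (4 * d_in, d_in): mask[h][i] == 1 if input i feeds hidden unit h.

    Inputs: section A is the first half, section B the second half.
    Hidden units: four equal blocks of d_in units, labelled 00, 01, 10, 11.
    """
    if d_in % 2:
        raise ValueError("d_in must be even to split into sections A and B")
    half = d_in // 2
    mask = []
    for label in ("00", "01", "10", "11"):
        accepts_b = int(label[0] == "1")
        accepts_a = int(label[1] == "1")
        row = [accepts_a] * half + [accepts_b] * half
        mask.extend(list(row) for _ in range(d_in))  # d_in hidden units per block
    return mask


mask = block_mask(4)
connected = sum(sum(row) for row in mask)
print(connected, "of", len(mask) * len(mask[0]), "connections kept")
```

Multiplying the up-projection weights elementwise by this mask keeps half of the connections: the 11 block is the dense ff_ratio=1 part, and each input feeds exactly half of the hidden units.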
i like the direction you're going, but I'm a little unclear on what blocks 01 and 10 help with
my idea was more like this:
I adjusted a ff_ratio=1 FFN to learn additive differences i.e. instead of w_out(w_in(x)) it's w_out(x + w_in(x))
that still worked pretty good!
so then lora it i.e. w_out(x + w_in_b(w_in_a(x))) where w_in_a and w_in_b bring it down to 1/4 size and back
or maybe skip the w_out entirely and run the activation function in the middle of the lora sandwich
so far the lora part hasn't panned out very well 🙂
but I may have bad initialization
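For concreteness, the additive and LoRA-style variants can be written out on plain lists. This is only a sketch: the messages above do not say where the nonlinearity sits, so applying GELU to x + w_in(x) before w_out is an assumption, as are the helper names.

```python
import math


def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def gelu(v):
    """Exact GELU via the error function."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def additive_ffn(x, w_in, w_out, act=gelu):
    """w_out(act(x + w_in(x))): w_in only has to learn a difference from x."""
    h = [xi + di for xi, di in zip(x, matvec(w_in, x))]
    return matvec(w_out, [act(hi) for hi in h])


def lora_additive_ffn(x, w_in_a, w_in_b, w_out, act=gelu):
    """Same, with w_in factored as w_in_b @ w_in_a through a narrow bottleneck."""
    delta = matvec(w_in_b, matvec(w_in_a, x))
    h = [xi + di for xi, di in zip(x, delta)]
    return matvec(w_out, [act(hi) for hi in h])
```

With a bottleneck of d_model/4, the two factors hold d_model^2 / 2 weights in total instead of d_model^2 for a full w_in.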
suppose there was no nonlinearity. then ff_ratio = 1 would be all that is necessary. since ff_ratio > 1 is creating benefits, that means there needs to be nonlinearities in more hyperplanes than there are input dimensions. but consider what these nonlinearities look like: most likely some are very nonlinear, and some are mostly linear. a 01 block has a reduced ability to express a nonlinearity - so it's more appropriate for a mostly linear axis.
so if we look at it from an expressibility point of view, if we have a mix of nonlinear and linear things, we put the linear things in the 10 and 01 buckets, which frees up the 11 buckets to express the nonlinear things. the 10/01 buckets are cheap and less powerful, good for linear things
and since we're in the high-dimensional regime, being expressible could mean that it's trainable too
do you skip activation function on 01 and 10 and just run it on 11?
activation function on all three
consider what an activation function on 10 can do: its hyperplane can only handle dimensions from B. that kind of sucks. but that may be good enough for some nonlinearities, and the really hard nonlinearities train their way into being in 11 instead
my impression from other lora tricks during pretraining is that lora is actually kind of jank during pretraining
the final result for models might generally be low rank but gradients are not necessarily of low rank
So the original work here seems to claim that for a fixed parameter count you want more depth, rather than ff expansion. I was excited to try using a deeper but "thinner" model, but then I realized that the deeper + thinner model is going to be insanely more computationally expensive. Having very wide MLPs is good because it's very efficient and very parallelizable. So anyways, reducing MLP width a lot didn't even afford enough compute to add one more Attn+MLP while being compute equivalent
I really dislike "parameter equivalent" comparisons that ignore the compute requirement. No free lunch here.
If anything, my benchmarking lends a lot of credence to PaLM 1's very high 8x expansion.
that’d be an interesting experiment too
but: parameter equivalent ignores compute (and anything else weird you do while doing it), compute equivalent ignores inference time (and maybe something else I am missing). ablation is nice because it lets you check exact amount of degradation along some axis
... actually that's an interesting perspective, computing isoflops against their param-matched thing, presumably there's a minimum for perf/flops on their graph somewhere
i guess your objection isn’t compute so much as wall clock time; the projection adds little or no time because it’s nicely parallel and without it the ff is a lot thinner than attn so it’s a relative dead spot in the utilization
current status
How about adding more attention heads instead of layers? That was more effective for me than adding layers in my tests (but may still be expensive if you're using traditional mha and not some linear attention variant)
Probably better than the approach in the paper for the test
but: still leaves the ff pass as an underutilized block of time in the gpu
i am increasingly convinced to find more ways to jam parameters into the activation function
I'm still confused about your idea on that
geglu or swiglu basically trade up projection size for a more complex trainable activation, any function that takes extra params can be used to do the same but more
i am definitely still woolgathering for what function makes more sense though
basically geglu/swiglu replace an activation function of one variable with an activation function of two, if you assume up projection is the worst option you should be able to go from one big matrix to doing an element wise multiplication of four matrices with (virtually any) nonlinearity applied to them and beat the up projection as an operation
i am trying to consider only one index in the activation and think of functions of three or four scalar variables that have any obviously desirable or intuitive properties like how geglu/swiglu are “gating”
Then why not just use two ff_ratio=1 ffn w gates in a row
increases depth but doesn’t sound bad
I guess that's still fewer parameters bc gate is only 1 way
alternatively do (gate + gate + gate) * nongate
my brain wants there to be an elegant analytic thing that makes some kind of sense to eat up params
other possibilities: maxout, multiply nonlinear gates together
Have y'all read the muP / hyperparameter transfer work? I'm wondering if it implies we don't need to search through the whole search space for every model size
from an expressibility point of view, if you have two activation functions that are different but equally good, then using both within a layer should be at least slightly better, as long as it's free computation-wise to divide a layer into two blocks and calculate different activation functions on each block
I have - which hyperparameters were you thinking of that may need adjustment for this project?
I have read it, my impression is that at the smaller scales we’re using things like depth are just fundamentally non transferrable because they behave differently at scale; i think i’d be more comfortable making that call for given adjustments iff those adjustments seemed to be scale invariant as we doubled total size a time or two
it should be very close to free
eg for simplest case if we had a straight line for effectiveness of ratios 1, 2, 4 across a couple scales i’d be comfortable that the ff ratio’s properties probably weren’t scale dependent
but: given, as a confounding variable, that they are also varying depth: that doesn’t appear to be the case for existing work
it is possible it’s only scale dependent at smaller scales
also going to stop being an empiricist and say: i should probably check if the muP's theoretical justification for transferability would apply to d_ff ratio
also: this is clever and if it cleanly beats existing ffs it will be amusing
it is so simple
mixture of 1D activation functions + bilinear GLU = can express any product of usual functions, as well as their sum. such as polynomial approximation, or x^2 e^-x, etc
plus function composition, like e^(-x^2), composing x->e^x and x->x^2
i don't see how it does composition but maybe if i squint at it longer
you need two layers in this case
honestly, it's really like a universal function approximator. take any expression, decompose it into single operations (such as by Reverse Polish Notation). then each layer computes one more term of the expression
i can see how to do it with successive layers, i'm a little more iffy because that adds depth
yeah, no need to increase depth here. for a single layer, improvements should already exist
well, maybe. the benefit of having different activation functions in a layer is second order, which means the benefit is stronger in longer training runs than in shorter ones. but I wouldn't change the length of a training run just to inspect this change
here, what I mean by first order is "this neuron is outputting 5 but it should output 10", and then it moves from 5 to 10 linearly. second order means the change is through curvature rather than slope
That makes sense to me, it is sort of interesting which mathematical points are clear and which aren't
... heh, I go looking at muP transfer and their primary reference is this bit of NTK: https://arxiv.org/abs/2011.14522
and: it would have been concurrent work with gpt-neox, not sure if there's a strong case for gpt-neox to use the initialization etc it does instead of this one
To clarify, does gelu(linear(x)) * sigmoid(linear_gate(x)) meet this definition for second order? And if so, are you proposing we add more terms so as to get to third or fourth order?
here's an illustration of what I mean by second order: if you have points (0, 0), (1, 1), (2, 4), then a linear neuron can try to fit these points, and a quadratic neuron can fit them. the competition between these two neurons is second order. it takes time to train weight away from the linear neuron into the quadratic neuron. being second order in this context is a bad thing; it just means training is slower.
my proposal is to have a mixture of existing well-performing activation functions within a single layer (like ReLU and GLU), rather than to increase the dimension of an activation function.
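A minimal sketch of what "a mixture of activation functions within a single layer" could look like: the hidden vector is split into equal contiguous blocks and each block gets its own activation. ReLU and GELU are used here as stand-ins for one-dimensional activations; gated (two-input) units such as GLU are left out for brevity, and the function names are illustrative.

```python
import math


def relu(v):
    return max(0.0, v)


def gelu(v):
    """Exact GELU via the error function."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def mixed_activation(h, activations=(relu, gelu)):
    """Apply a different activation to each contiguous block of the hidden vector."""
    n = len(activations)
    if len(h) % n:
        raise ValueError("hidden size must divide evenly among the activations")
    block = len(h) // n
    return [activations[i // block](v) for i, v in enumerate(h)]


print(mixed_activation([-1.0, 2.0, -1.0, 2.0]))
```

Since each block is still an elementwise op over a slice, the extra cost over a single activation should be small, in line with the "very close to free" expectation voiced earlier.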
Sorry, I'm basically asking if gating already accomplishes this via use of a second activation function
in terms of achieving diversity, yes. having two activation functions is better than one.
there is an unpleasant niggle there where multiplying two functions causes quadratic behavior, which causes gradient explosion (to a small extent) and hence should slow training down a bit. so gating may not be free. but tests seem to indicate it's worth it anyway
Yeah I've seen it always be worth it
But if two is good, is three better?
Because gating is already a somewhat standard FFN improvement and maybe it works for the reason you outlined
Also, people already reduce the ffn width to accommodate the gate's parameter increase
Which is exactly the kind of thing we are discussing
a three-gated neuron could be better, but it's just speculation from my part. I don't have a mathematical understanding of how to balance the tiny gradient expansion issue with the improvement from diversity
Your point about gradient explosion is well taken
I like the idea of allowing the ffn to simulate a more complex function tho
Via say a fourth order polynomial
conceptually, mixing one-dimension and two-dimension activation functions (especially bilinear rather than GeGLU) is very similar to having skip-connections
turns out i had corrupt files from git lfs and the loading script was crashing silently :P
going to do a manual parity check once current lfs is done since apparently lfs doesn't have a parity check???
uhhhh so i finally did the math correctly
if you do mxm instead of mx4m projection you save 6m^2 parameters
and then if you jam all those params into your up projection, your inner dimension is 6m, and you can then run a complicated nonlinearity on them to reduce the resulting state down to size m
giving you a d_ff to d_model of 6 if you count from before the nonlinearity
and 1 if you count after
at equiflops
i was wondering why palm reported a ratio of 6
im missing something, don't really understand what parameters are going where in that
how are the experiments going?
you take an input vector of size m, you run one m by 6m matrix or equivalently six separate m by m matrices
you perform (some nonlinearity that reduces size of vector by a factor of six) and then you apply another m by m matrix
you have d_ff/d_model of six if you count before the nonlinearity is applied and are at equal params to standard transformer attention
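The bookkeeping here is easy to get tangled in, so here is the weight count written out. It is a sketch that ignores biases and assumes the pooling nonlinearity itself has no parameters; the function names are illustrative.

```python
def standard_ffn(m: int) -> int:
    """m -> 4m -> m: up and down projections."""
    return m * 4 * m + 4 * m * m


def ratio_one_ffn(m: int) -> int:
    """m -> m -> m: two square matrices."""
    return 2 * m * m


def pooled_ffn(m: int, fanout: int) -> int:
    """m -> fanout*m, pooled back to m by a parameter-free nonlinearity, then m -> m."""
    return fanout * m * m + m * m


m = 768  # illustrative width
saved = standard_ffn(m) - ratio_one_ffn(m)
print(saved // (m * m), pooled_ffn(m, 6) // (m * m), pooled_ffn(m, 7) // (m * m))
```

The saving is 6 * m^2 as stated. Under this count the six-matrix layout plus one output matrix totals 7 * m^2, one square matrix short of the standard 8 * m^2; a fan-out of seven would match exactly.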
palm reports a ratio of six and as far as i’m aware nobody knows what nonlinearity they use or why it’s six
so … gonna go with that theory
it’s sort of a natural extension of swiglu or geglu
happy to try this or other exps if folks are curious, have the setup for it.
this kinda reminds me of the sigmoid gating in RWKV
realized you could also gate the residual connection
I've tried it. Was a small but measurable advantage
Only way it worked was t, 2-t
Can you break down how it was gated in more detail? Not sure I get it.
Sure.
t = residual_mix(x)
x = x * t + fn(x) * (2-t)
No guarantees it works in a more general setting, and at least one person tried it and claimed the gating ended up stuck at like .9 for all channels
But it consistently gave me a small loss benefit in training. Never looked into it enough to know if that was still true at validation or test time
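Written out elementwise, the gated residual looks like the sketch below. How t is produced (the residual_mix function) is not specified above, so here it is simply passed in as a per-channel vector.

```python
def gated_residual(x, fx, t):
    """Elementwise x * t + fn(x) * (2 - t), with t a per-channel gate."""
    return [xi * ti + fi * (2.0 - ti) for xi, fi, ti in zip(x, fx, t)]


x = [1.0, 2.0, 3.0]
fx = [0.5, -1.0, 4.0]  # stand-in for fn(x)
print(gated_residual(x, fx, [1.0, 1.0, 1.0]))
```

With t = 1 everywhere this reduces to the ordinary residual x + fn(x), which is presumably why the (t, 2 - t) pairing behaves better than an arbitrary gate: the two weights always sum to 2.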
@fallen spear are there larger ffn exchange experiments you need done? E.g. a larger version of my swap ff_ratio for attention heads test
i have gotten myself stuck in environment hell, does it work for you if I just send FF layer variants at equiparams
(on the plus side i should have a really clean environment i can reuse whenever i climb out of my current hole)
https://gist.github.com/segyges/4e8c65913df415e8c214d8e27dadbb8b
take your pick
the only ones I am sort of serious about are ProductPoolingWithGelu and YinLU
todo: figure out how folks are doing their residual connections, I can use some of the literally six different matrices we have to play with to gate and scale it, probably with a swiglu or geglu. will have fundamentally different call signature inasmuch as normally the FeedForward is self-contained
I can try some of these tomorrow!
actually, maybe try product pooling without the gelu too
@boreal moss already tried some variations very similar to these, but with only 4x. They worked pretty well, especially for smaller and/or non LLM models.
the ones to beat are geglu and swiglu since they’re currently considered sota
my guess is going to be that several of these are better bc the params are better literally anywhere but in a down proj
(geglu and swiglu not represented in that gist)
writing all those out gave me more ideas
more than 3 parallel layers multiplied together makes it perform worse, adding any activation function anywhere makes it worse, I was testing on <1M parameter models and it was better than gelu by a large margin, doesn't look like it holds for bigger models
can you give the exact activations used that performed poorly once scaled at all?
there is a basic problem where there are too many different things you can do so specificity is helpful
I expect this to be significantly easier to train than YinLU: pooled = torch.sin(x_1) + torch.exp(-torch.pow(x_2, 2)) + torch.tanh(x_3) + torch.nn.functional.gelu(x_4) + x_5 * x_6. the reason why is that multiplying lots of random variables together will cause gradient imbalance (= gradient explosion, even if normalized by layer norm), since occasionally the multiplication will be very big and usually not so big.
i christen it yinlu2
of course, the reason why the original activation function is called "YinLU" is because Gyges deserves all the credit for the idea, so by Stigler's law, I got the title
i propose fourierlu: sum of sin(x_1*x_n) for n from 2 to 6
i cannot write it right this second
tbh i wonder how many examples of stigler’s law are due to name exhaustion, naming one after you was the last step before i just started concatenating silly prefixes onto each other
i can at least remember that yinlu is “precisely that function that was transcribed verbatim from a message by kevin yin”
i have no idea which is which for the others
i suspect this is also a good test because we should see some of the terms zero out consistently by operation which will give us a good metric for which carry more signal
yes, inspecting the activations and L2 weights of the final network is a good idea. in my original plan, which is just a mixture of different neurons inside a layer, I would dynamically change the proportions of the activation functions to reward the ones that did better for a problem. but since all 6 activation functions are summed together in your network, their proportions are fixed to 1/6, so it's all-or-nothing
note that it's the nonlinearity you care about inspecting, not exactly the zeroing out: what separates one activation from another is whether the inputs cause it to travel along a wide range and express a diversity of its outputs. if the inputs stay within a small domain (like [-0.1, 0.1] for cos(x)), the activation function isn't special; it looks just like a parabola over that stretch, just like any other activation function
in retrospect it's not a linear unit, so call it "fourier pooling" or something
also i think this removes the need for layer norm because the output is bounded to a maximum of 5
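a small sketch of the fourier pooling proposal, assuming PyTorch and assuming the six inputs are the six equal chunks of the up-projected vector. the function name `fourier_pool` is made up. since it sums five sines, the output magnitude cannot exceed 5 regardless of how large the inputs get.

```python
import torch


def fourier_pool(h):
    """Sum of sin(x_1 * x_n) for n = 2..6, where x_1..x_6 are chunks of h."""
    x = h.chunk(6, dim=-1)
    return sum(torch.sin(x[0] * x[n]) for n in range(1, 6))


# deliberately large inputs, to show the output stays bounded
h = torch.randn(8, 6 * 16) * 10
out = fourier_pool(h)
```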
i guess unless "collapse of activations to zero" is one of the collapse modes
I agree, no layer norm needed
performance of fourierlu will probably suck though, for the same reason that sigmoid fell out of favor
there's an alternative Fourier formulation, where you take x_1 and x_2 as a complex number x_1 + i x_2, and same for x_3 and x_4, and x_5 and x_6. so it becomes a complex exponential e^(z_1 z_2) + e^(z_1 z_3). however, e^(x) may be a very crazy activation function, and it also produces two outputs instead of one. although exp is the more natural fourier structure, unless exp(x) behaves well (I doubt it), it will also not work
you can take angle or magnitude for the resulting complex to make it scalar
but then i think maybe it reduces to an inner product
i am not sure this applies btw specifically because sigmoid gradient approaches zero as x goes to infinity
ok, that's a good point. in terms of activation functions, all intuition is thrown away and we are reduced to guessing, so the sigmoid lesson is sufficiently dissimilar that it doesn't transfer
i think also you can rescale your linear layer for cases where you end up below or above 2pi?
what are the weights and biases for the case sin(x_1 x_2)? is it sin((w_1 x_1 + b_1)(w_2 x_2 + b_2))?
yes
My concern would be numerical stability if any of the X end up very large, having perhaps gone through many cycles of pi
with this set of weights and biases, you can't shift by 2 pi anywhere
sin((w_1 x_1 + b_1)(w_2 x_2 + b_2) + 2pi) gives the same value, but there isn't a way to merge the 2pi into any of the constants
whereas if we had sin((w_1 x_1 + b_1) + 2pi), we could write b = b_1 + 2pi and do a shift
huh, maybe it would need normalization of input then
“move normalization into the ff layer” was not on my bingo card
I don't see how the input could get large either. it's a sum of many sines (all capped at 1), times a dense matrix. so the input x will only get as large as the terms in the dense matrix
oh wait, the input is from the attention layer, not another fourierlu layer
well, attn + linear
i would suspect that empirically with reasonable init and reasonable lr it will not be a problem
but it's hard to rule out completely
my knowledge of norm functions is also pure guessing, and my intuition doesn't work
sometimes batch norm is needed, sometimes layer norm, etc
my other concern is zeroing out inputs, especially if x_1 manages to become zero
since sin(0) is 0
i dimly suspect slight nudge of noise might be a good idea
you could change sin to cos. maybe it will make a difference, maybe not
i think same problem, at input 0 you have constant output and no gradient
right, cos is even worse
so sin(x_1 x_2) is effectively bilinear when x_1 or x_2 is near 0
i think in some high level way they are the same problem and have the same solution, which I might be predisposed to see as gauss for fern reasons
nah: sine is bilinear when x_1 or x_2 is near 0. but cosine is fixed to 1 when either x_1 or x_2 is near 0 and the non-zero variable has no effect. it's a quadratic vs linear problem
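the sine vs cosine difference at x_1 = 0 can be checked with autograd. a minimal sketch, assuming PyTorch; the value 0.7 for x_2 is arbitrary.

```python
import torch

x1 = torch.tensor(0.0, requires_grad=True)
x2 = torch.tensor(0.7, requires_grad=True)

# gradients of sin(x1 * x2) and cos(x1 * x2) with respect to (x1, x2)
g_sin = torch.autograd.grad(torch.sin(x1 * x2), (x1, x2))
g_cos = torch.autograd.grad(torch.cos(x1 * x2), (x1, x2))

# sin: d/dx1 = x2 * cos(0) = x2, so x1 still receives a gradient; d/dx2 = x1 * cos(0) = 0
# cos: both partials carry a factor sin(0) = 0, so nothing flows to either input
```

so with sine, a zeroed x_1 can still be pushed away from zero by its own gradient, while with cosine both inputs are stuck to first order.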
i read like four papers about trigonometric activations and cannot remember if they addressed this specific thing which seems very prominent when just looking at it mathematically
https://arxiv.org/abs/2006.09661 this one, I think, was the best mathematical treatment
they're pretty opinionated about initializations so worth thinking of
this paper illustrates how ReLU can only represent low-frequency details efficiently
updated, with more bad ideas: https://gist.github.com/segyges/4e8c65913df415e8c214d8e27dadbb8b
i think all of these are interesting from InnerProductPooling on down
Any prioritization of which ones you think are most interesting?
That's a lot of ideas!
I'll start with the first one, except i plan to use an expansion of 4*4/3 instead of 6 to keep it consistent with my baseline.
6 should be equiparams with a standard ratio of four
basically: the extra params are coming out of the down projection
i guess we could also push the extra params into the down projection
my baseline is geglu which was equiparams with gelu using expansion of 4
oh, wait, yea I see 🙂
yes, equiparams with that in this regime should be 6 up and 1 down
is there a linear layer before AveragePoolingWithGelu?
check top module in the file for intended impl, everything gets up projected x6
for my money: inner product pool, the geglu derivatives towards the end, the fourier pool are the most interesting
a significant fraction of the activations in the middle of the file are “why not”
and being unable to come up with why not
anything that multiplies in more than three things is almost certainly a bad idea
the max and average pool towards the beginning are almost certainly not good but might still beat geglu
but this is an LLM; how could average pooling possibly help? adjacent neurons are not related to each other
my rationale is basically that geglu does not make any sense either and anything which pulls params away from the width of the down/up proj is probably an improvement by default
i don’t see a good, rational reason for avg pool to work
i should say: away from the width of the activations
I'll try YinLU, inner product pool, fourier pool, and one of the geglus first
start with yinlu2 probably
since it’s additive instead of multiplicative it seems less likely to explode
and here's my training codebase just for reference, the baseline is the "ibt" model: https://github.com/jonmorton/tart
average pooling does not change what is expressible; it only changes the training dynamics. whatever function is expressed with dense matrix->avg pool->GeLU can also be expressed by (dense matrix with avg pool baked in)->GeLU. similarly, you can reverse the transformation with (dense matrix with avg pool inverse)->avg pool->GeLU to produce just (dense matrix) -> GeLU. that means AveragePoolingWithGelu is effectively just a GeLU layer. in practice, the training will be different since it's imposing an inductive bias on training of "nearby neurons have some similarity", but that's not an inductive bias that is appropriate in the setting
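the "avg pool baked into the dense matrix" direction of this argument can be checked numerically. a sketch assuming PyTorch, non-overlapping pooling windows of size k, and no bias; the sizes are arbitrary.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_in, d_hidden, k = 8, 12, 3
W = torch.randn(d_hidden, d_in)
x = torch.randn(5, d_in)

# path A: dense matrix -> avg pool over groups of k adjacent neurons -> GeLU
h = x @ W.T                                                        # (5, 12)
pooled = F.avg_pool1d(h.unsqueeze(1), kernel_size=k).squeeze(1)    # (5, 4)
out_a = F.gelu(pooled)

# path B: fold the pooling into the weights with a (4 x 12) averaging matrix P
P = torch.kron(torch.eye(d_hidden // k), torch.full((1, k), 1.0 / k))
W_baked = P @ W                                                    # (4, 8)
out_b = F.gelu(x @ W_baked.T)
```

both paths compute the same function; only the parameterization the optimizer sees is different.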
I would also have to think about the training dynamics; it's possible that the gradient undoes the avg pool, but I have no intuition about that
i think the same argument can be made for up projection; in principle up-projection x4 and down-projection x4 doesn't seem like it should increase expressivity, in practice it does help at least a little because it's easier for the nonlinearity to "find" good values in the 4x larger activation, apparently?
yann lecun said one of his biggest lessons from the success of transformers is the power of multiplicative interactions though XD
sure, just, six of them in one thing seems like maybe a bit much
up-projection does improve expressivity of the nonlinearity. for example, consider a 1D input, then a super-wide hidden dimension with ReLU activations. the hidden dimension lets you fit a piecewise linear function, and the number of kinks you're allowed in the piecewise linear fit is the dimension
or, consider Fourier activation. the larger the hidden dimension, the higher the coefficients in the sines can be, and hence the sharper the fit can be (because it can express higher frequencies)
i am convinced that avg pool is probably not even worth testing
but: you can have a piecewise linear function with (say) up to 4,000 kinks, does it benefit you if you spend 4x as much compute and can in theory fit one of 16,000 kinks?
or, more clearly: does it benefit you a lot?
my intuition here is that training becomes a problem first. especially the asymptotics: fitting x^2 with a ton of kinks will be a big mess. it's simply not the appropriate activation
so even if you fit y = x^2 for x in [-100, 100] perfectly with 4000 kinks, you will never have the right asymptotic and then the fit will naturally become very bad at x=1000
https://arxiv.org/abs/2002.05202v1 I read this in a good amount of detail and the indication seems to be that basically anything other than straight up and down projection is preferable
but: it limits itself to nonlinearities of two variables, from which we get the now-sota geglu/swiglu
I also share that takeaway, but be careful with the statistics. the p-values are probably not even 0.05
i don't think we can even meaningfully calculate a p-value on them
kat specifically says that her experience has been that geglu beats swiglu even though that paper has swiglu win
sure you can, there are two distributions: 1-variable and 2-variable activations. you can calculate standard deviations, and there's a statistical tool that lets you test if two sample distributions have different means
also my experience, but I think it's just noise
so according to normal statistical standards, this paper doesn't meet the bar for significance. it's still a good paper because of the cost of the experiments
so to reiterate motivation: based on papers linked up top with a fair degree of apparent reproducibility, pulling params out of ff up/down and putting them into depth generally wins at scales of I think 50-million-ish params, undertested (imho) regimes are: simply ablating those params outright and checking what the impact is as scale increases, using weird pooling-ish activation functions that extend on the regime used by geglu/swiglu (as stuffed into that gist), pulling the d_ff params and using them to increase d_model, stuffing those params into extra attention heads, and (occurred-to-me-today): using some of those params to gate the residual connection
weird pooling functions are just kind of the most compelling because it's straightforward to test with and also it scratches the itch to do weird math
i guess: and if it works it presents the same "clear win" criteria that swiglu/geglu do where in principle you aren't trading off anything
also, at scale the up/down proj dominates model RAM footprint and it seems incredibly silly to me if they are not serving a very good purpose
the idea that llama is a 70b model and could profitably be a 20b model at negligible loss and there isn't meaningful empirical testing to say otherwise kind of offends me
i would say it just needs more trials but each trial costs a fortune at scale and scaling annihilates so many results from smaller models
also, the fact that palm has no published crunchy technical details but reports a d_ff/d_model of six makes me suspect they are doing something like this to do a more complex nonlinearity that ultimately reduces the activation size to d_model, because six is the ratio you'd get if you tried to do this and remain at equiparams
which i figured out ... a day or two ago, even though I'd been obsessing about this for some time
so i would actually guess that google internal research has already done this and found it to beat swiglu/geglu
i have resisted the urge to tag the google/deepmind folks in here to ask them
it seems like that would be impolite
I agree. I expect some private research companies are already using mixture-of-activations, and your version of it seems no worse than other versions, given that intuition about activations is noisy and difficult
it is frustrating that there is no mathematically obvious way to do it, which puts us into the "there are really an infinite number of six-to-one functions" regime
also this looks substantially similar to what I was starting to write yesterday and i am unsure if i am going to just clone it tbh, probably not
actually to extend this: every indication of the rank of transformer up/down proj matrices I have seen indicates that they are of extremely low rank
like, one of the linked papers has them at ten by their measuring method
it seems bizarre that an m by 4m matrix, when trained, has an effective rank of ten if the projection to 4m is doing anything important
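for reference, one simple way to measure rank on an m by 4m matrix. the linked paper's measuring method is not given here, so this is only an assumption-laden sketch using `torch.linalg.matrix_rank` on a synthetic matrix that is built to have rank ten.

```python
import torch

torch.manual_seed(0)
m, r = 64, 10

# synthetic stand-in for a trained up projection: an m x 4m matrix of rank r
W = torch.randn(m, r, dtype=torch.float64) @ torch.randn(r, 4 * m, dtype=torch.float64)

rank = torch.linalg.matrix_rank(W).item()   # numerical rank from the singular values
singular_values = torch.linalg.svdvals(W)   # sorted descending; only the first r are non-negligible
```

on real trained weights the singular values decay gradually rather than dropping to zero, so any "effective rank" number depends on the threshold or entropy measure chosen.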
"ff1"
fc1 = nn.Linear(64, 256)
fc2 = nn.Linear(256, 64)
x = F.gelu(fc1(x))
x = fc2(x)
"ff2"
fc1 = nn.Linear(64, 128)
fc2 = nn.Linear(64, 128)
fc3 = nn.Linear(64, 128)
fc4 = nn.Linear(128, 64)
x = fc1(x) * fc2(x) * fc3(x)
x = fc4(x)
"ff3"
fc1 = nn.Linear(64, 128)
fc2 = nn.Linear(64, 128)
fc3 = nn.Linear(64, 128)
fc4 = nn.Linear(128, 64)
x = fc1(x) * fc2(x) * F.gelu(fc3(x))
x = fc4(x)
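the three variants above wrapped into runnable modules, as a sketch assuming PyTorch with default `nn.Linear` biases. the class names are made up, and ff2 and ff3 share one class with a `use_gelu` flag. counting parameters shows the variants are close to equiparams.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FF1(nn.Module):
    """Baseline: up-project 4x, GeLU, down-project."""
    def __init__(self, d: int = 64):
        super().__init__()
        self.fc1 = nn.Linear(d, 4 * d)
        self.fc2 = nn.Linear(4 * d, d)

    def forward(self, x):
        return self.fc2(F.gelu(self.fc1(x)))


class TriFF(nn.Module):
    """ff2 (use_gelu=False) and ff3 (use_gelu=True): three 2x projections multiplied."""
    def __init__(self, d: int = 64, use_gelu: bool = False):
        super().__init__()
        self.fc1 = nn.Linear(d, 2 * d)
        self.fc2 = nn.Linear(d, 2 * d)
        self.fc3 = nn.Linear(d, 2 * d)
        self.fc4 = nn.Linear(2 * d, d)
        self.use_gelu = use_gelu

    def forward(self, x):
        third = self.fc3(x)
        if self.use_gelu:
            third = F.gelu(third)
        return self.fc4(self.fc1(x) * self.fc2(x) * third)


def count(module):
    return sum(p.numel() for p in module.parameters())


ff1, ff2, ff3 = FF1(), TriFF(use_gelu=False), TriFF(use_gelu=True)
x = torch.randn(3, 64)
shapes = [f(x).shape for f in (ff1, ff2, ff3)]
counts = (count(ff1), count(ff2), count(ff3))
```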
fc2 in ff1 looks backwards to me, i am guessing it is meant to be baseline gelu ff block?
a pretty sexy graph though
ff3 improving more than ff2 at the far end shows a behavior that should be pretty typical: the gelu breaks symmetry, so it has more expressivity, but training that expressivity is second-order rather than first-order. so the benefit from symmetry breakage appears only after a longer period of training
that gives me an itch to put gelu gates on the fourier pool
I also notice from this chart that all the training samples are loaded in the same order for each training run, which is ok
smoothness as in less variance in the loss function? I don't see that in the chart
I see it, but to be sure I would have to actually calculate it 🤣
i added a gated fourier pool to the gist just for fun
could also put the gate around the entire thing though
this is a very special case with small ffns: it was a 1D causal convolutional network with 8 layers, and the task was character-level next-token prediction on tiny stories
i added a second gated fourier pool
but my point is that you all should try those multiplicative ffns without any additional activation functions
with the exception of outright max/avg pooling layers i am kind of agnostic about which of these functions make any sense, i think it mostly makes sense to test those that are analogous to existing workable functions and/or that have good diversity with each other
"no actual activation function" definitely fits the bill
but: there's a whole body of stuff about how small scale stuff benefits from modifications that are neutral or bad at higher scale
yes
from my experience, at very small scale more expressive activation functions can make a huge difference, on the order of adding 25% more parameters, while at larger scale the same modifications make almost no difference, as if big models have no use for more expressive activation functions
the currently-sota swiglu/geglu stuff is, as has been pointed out, relatively statistically marginal
Alright I had to redo the baseline training but now I've started the first exp, you can see the progress at https://wandb.ai/jonm/MLPs
Since avg pool looks like it's not going to do as well, will probably cut it off early.
yes, there's no reason to believe that avg pool will do well
👀
Should have implemented eval perplexity though, token accuracy is probably a bad metric
either way the val loss should be instructive
was gonna say, unless you are planning on running it rather long and it's rather big they shouldn't differ a lot
trajectories tend to be the same unless data is so small you can overfit or training is so good you grok, I think?
Yeah, I mean, sometimes I have seen things slowly converge to a better result, so I prefer to wait a bit longer. But these are fairly short runs overall as far as typical llm training goes
Will try fourier next
fourier is my special boy and the one i will be the most sad about if it doesn't work
i have no good reason to believe it will work, and this feeling is entirely irrational
YinLU2 starts at really high loss. I'm pretty sure the reason is the torch.exp(-torch.pow(x_2, 2)) term, which is 1 at x=0
I think that means that y = 0 must be true when x = 0. either the NN intrinsically wants this property, or initialization wants this property. so YinLU2 is bad, and it would make more sense to use "pooled = torch.sin(x_1) + x_2 + torch.tanh(x_3) + torch.nn.functional.gelu(x_4) + x_5 * x_6", where the Gaussian is replaced by a simple no-activation passthrough
alternatively, torch.exp(-torch.pow(x_2, 2)) - 1 satisfies y = 0 when x = 0
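a quick check of the value at zero for the three variants just discussed, assuming PyTorch. the function names are made up (the thread's own numbering of YinLU 3 and 4 is not pinned down here). this only shows the value at the origin and makes no claim about why the runs start at high loss.

```python
import torch
import torch.nn.functional as F


def yinlu2(x1, x2, x3, x4, x5, x6):
    # original form with the Gaussian term, which equals 1 at x2 = 0
    return torch.sin(x1) + torch.exp(-torch.pow(x2, 2)) + torch.tanh(x3) + F.gelu(x4) + x5 * x6


def yinlu2_passthrough(x1, x2, x3, x4, x5, x6):
    # Gaussian replaced by a plain passthrough of x2
    return torch.sin(x1) + x2 + torch.tanh(x3) + F.gelu(x4) + x5 * x6


def yinlu2_shifted(x1, x2, x3, x4, x5, x6):
    # Gaussian shifted down by 1 so it vanishes at x2 = 0
    return torch.sin(x1) + (torch.exp(-torch.pow(x2, 2)) - 1) + torch.tanh(x3) + F.gelu(x4) + x5 * x6


zeros = [torch.zeros(4) for _ in range(6)]
at_zero = (yinlu2(*zeros), yinlu2_passthrough(*zeros), yinlu2_shifted(*zeros))
```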
yinlu’s 3 and 4 are born
one interesting result from the tests is how well YinLU 1 trained in the initial stages. it looks like layer norm lets you get away with quite a bit of multiplication
tbh enough of them are doing well enough that i suspect we’re not getting stellar signal, which is a positive result but doesn’t narrow our options a ton
which graph are you focusing on? I'm using loss/val
assuming my slight colorblindness doesn’t hurt me too much here, we get: baseline, trigeglu, inner, yinlu2
that yinlu2 has such an obvious improvement and is potentially good for analysis makes it the nicest prospect at the moment i think
i will ignore this and spend all my time trying to figure out how to make sin activation happen
yinlu 1 was stopped pretty early, so it doesn't show up in this list. its performance is basically the same as baseline, up to noise
… fairly absurd
huh. it does. wild
i think it’s the only one that stacks products that high, no?
of the ones tested, yes
my guess would be that it’s either one of the specific activations or the act of doing so many products, the diversity of functions is neat but does not seem like it can possibly be optimal
it can't be a specific activation; activations can only suck but they can't be amazing
then, at least early, the cumulative product is very good
a very silly result
i am starting to wonder if there is a level of ridiculousness that cannot possibly be a good idea
sum of power set of all products is clearly too ridiculous
in principle it needs to be an O(n) op
power set of products of a, b, c is just (a+1)(b+1)(c+1)
learning from the Gaussian issue, the correct activations should be instead "(a+1)(b+1)(c+1)-1" and "abc", where abc is from trombocyt
the difference between these two is that the first formula (with 3 inputs) is linear when a, b, c are near 0, and the second formula is cubic
more easily, you can see this if there are only two input variables: (a+1)(b+1) - 1 = ab + a + b, and note that a and b are linear. so even if a is 0, the b term will still cause changes in output
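the power-set identity can be checked by brute force. a sketch in plain Python; the helper name `sum_of_subset_products` is made up, and integers are used so that the comparison is exact.

```python
from itertools import combinations
from math import prod


def sum_of_subset_products(values):
    """Sum of the products of every subset of values (the empty product counts as 1)."""
    total = 0
    for r in range(len(values) + 1):
        for subset in combinations(values, r):
            total += prod(subset)
    return total


a, b, c = 2, 3, 5
lhs = sum_of_subset_products([a, b, c])   # 1 + a + b + c + ab + ac + bc + abc
rhs = (a + 1) * (b + 1) * (c + 1)         # the O(n) closed form
```

subtracting 1 from either side drops the empty-subset term, which is what makes the result vanish when all inputs are 0.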
fc1 = nn.Linear(64, 64)
fc2 = nn.Linear(64, 64)
fc3 = nn.Linear(64, 64)
fc4 = nn.Linear(64, 64)
fcout = nn.Linear(256, 64)
x1 = fc1(x)*fc2(x)*fc3(x)
x2 = fc1(x)*fc2(x)*fc4(x)
x3 = fc1(x)*fc3(x)*fc4(x)
x4 = fc2(x)*fc3(x)*fc4(x)
x = fcout(torch.cat((x1, x2, x3, x4), dim=-1))
this weirdo is just a little better in small scale than "tri" linear
i remain somewhat vexed that there is no obviously correct way to pile together params in this way
i guess i can find another good defense of stacking as many products together as possible
matmul already does addition
any function you can represent by addition is already well represented
Kicked this off as YinLU3
as modified from yinlu2, correct?
Generally seems some stuff is about as good as geglu. This is a strong baseline already (already better than vanilla GPT2 by a good margin), so not bad!
yes
the first gist is too long so I am stuffing my extra brainworms into another one
it will have that and two more yinlus
and a power-product thing and two more fouriers
because ??? reasons
If there's anything from the original list i should try as well lmk
i feel like the trigeglu and inner product are trying to tell me something but i don't know what
the other geglu derivatives might give us some clue about what is working and why
and, I guess, product pool, let me check to make sure I wrote it reasonably
something other than "products are good"
Seems the loss still starts pretty high with YinLU3
they basically vary: whether there is a quadratic term inside the gelu activation or multiplied into it
right, the failure of this activation function isn't because of f(0) = 1, it's something else
https://gist.github.com/segyges/20bb00cf43fe617c2fdb7c6465698853
gist number two
this activation function is just bad somehow
sin, tanh, and gelu all have different output ranges, does it make sense to add them together?
it does, but my guess is that sin and tanh are so bad that they don't make sense even in a mixture
er, well sin/tanh and gelu do
so forget about YinLU4, it's also almost certainly terrible
i suspect baselining against product pool makes sense since it just forgoes the actual functions and does multiplication
... and powerproductpool
"eliminate effects besides the product operation", more or less
and YinLU3 can be stopped too, its loss is unlikely to ever be competitive
we could maybe make a later note to ablate it to verify which of the components is harming it
on the "activation functions can't be awesome, they can only suck" theory
This seems like something that could be done through neural architecture search (not that I have the compute resources for that or anything)
... hardmaru did something like that
I am coming at this from the "what sort of seems to make sense to me mathematically" direction but that is possibly the wrong direction
perhaps i should try this too as another baseline
five trainable parameters is a good number for us
them freezing the UAF parameters in the middle is very strange
"This is done to reduce the over-fitting of the model and to prevent training instability", which means they saw training instability. why would this instability occur?
what are they actually training and does it even have normalization on it
NNs are naturally unstable, instability in a sandbox problem tells us very little
the VGG 8 layer CNN
yeah
don't worry about their experiments imho, their math looks incredibly clever though
it worked once in a tiny sandbox and the math in principle approximates any activation, good enough for me
... grad students at uvic
act = (
torch.log(1 + torch.exp(x_1 * (x + x_2) * x_3 * x**2))
- torch.log(1 + torch.exp(x_4 * (x - x_2)))
+ x_5 * x_6
)
i think you can just use the sixth param as the input to the activation
o yea
but i do not yet understand their math
my first guess was "someone in science who is very smart but not in ML" but students in uni without serious AI research program checks out
"we did something incredibly clever on a zero resource budget" thank god for resource constraints thank you so much
yeah, this is a much smarter approach to the problem than "make up equations and see if one of them eventually sticks"
And yeah no need to worry about training stability I think, pre-norm transformers are incredibly stable at this scale
famous last words though
i believe firmly in jinxing myself as hard as possible any time i feel the impulse
are you really living if everything you say doesn't sound like famous last words
another note is that all the linear layers are initialized with orthogonal matrices. I could imagine with some of these activation functions that you might want to initialize each (dim, dim) chunk of the (dim, dim*6) matrix differently depending on how each of the 6 components is used in the activation
every time i contemplate the possible effects of initialization on these oddball activations it makes me twitch
is a good future todo though
if anyone was thinking "is this because you like trig functions and the one paper said to initialize specifically for trig functions" they get a gold star
i like waves okay
... it is possible some of the failed experiments can be salvaged with initializations specific to them but there is no strong reason to choose any of them especially to do this with
i am pretty sure i understand it now and i am envious that i did not write it
it looks like it died
which is a shame, because it is so clever
if there are any of these worth salvaging, that one is worth salvaging
actually I think that's a NaN which is an odd result, we might need to add an epsilon to something
lol, so that UAF did actually diverge. definitely jinxed it
i think i see at least part of issue
taking x_1 as the input, should be:
act = torch.log(1 + torch.exp(x_2*(x_1 + x_3) + x_3*torch.pow(x_1, 2))) - torch.log(1 + torch.exp(x_4*(x_1-x_3))) + x_6
had an extra multiplication in the first exponent which should have been an addition, may have made it prone to diverging
or maybe it's diverging anyway because exponents be like that
who can say
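on the NaN: one way to take overflow in exp out of the picture is to write each log(1 + exp(z)) as a softplus. a sketch assuming PyTorch, and assuming the UAF form ln(1 + e^(A(x+B) + C x^2)) - ln(1 + e^(D(x-B))) + E, with the six chunks assigned to x, A, B, C, D, E in that order. that assignment is an assumption and differs from the expression above, which uses x_3 in two places and never uses x_5.

```python
import torch
import torch.nn.functional as F


def uaf(h):
    """Softplus form of the UAF; h holds six equal chunks (x, A, B, C, D, E)."""
    x, a, b, c, d, e = h.chunk(6, dim=-1)
    return F.softplus(a * (x + b) + c * x**2) - F.softplus(d * (x - b)) + e


def uaf_naive(h):
    """Same formula written with explicit log(1 + exp(.)), which can overflow."""
    x, a, b, c, d, e = h.chunk(6, dim=-1)
    return (torch.log(1 + torch.exp(a * (x + b) + c * x**2))
            - torch.log(1 + torch.exp(d * (x - b))) + e)


torch.manual_seed(0)
small = torch.randn(4, 6 * 8) * 0.5   # moderate values: both forms agree
big = small * 100                     # large values: exp() overflows in the naive form

out_small = uaf(small)
out_big = uaf(big)
```

this only removes the overflow inside the activation itself; whether training still diverges for other reasons is a separate question.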
not entirely clear it makes sense to have a trainable activation function per index instead of a trainable activation function for the entire layer
but 🤷
Lol so complicated! Why not just make it resemble a 4 term polynomial like out(a(x)v(x) + b(x)v(x)^2 + c(x)v(x)^3 + d(x)v(x)^4) where abcdv and out are your six nn.linears
I was playing with weird parametric activation functions and performance was highly dependent on initialization. those were non-monotonic functions, and they just got stuck if the function had to change its number of minima and maxima on the path from the initial parameters to the optimal ones
one of them does this or almost this already
it’s not great
worth noting because i am sort of stumped: our constraints are total trainable params of 16*d_model^2 and two sequential matmuls + i guess some wiggle time equivalent to elementwise operations, it doesn’t necessarily need to be strictly a trainable linear up projection to 6 times as large then dim reducing nonlinearity to 1 then matmul
i am sort of thinking of doing the up projection first non-trainably and then down-proj
has PReLU ever worked in real life? (i.e. in practical models)
yeah I see PolyLU but you didn't have the output projection, which is probably super important
worth testing
another option is to vary the activation function by channel/segment, which can give almost the identical polynomial result with way less work
i have no idea
i don’t understand this one
well the polynomial is multiplying by v(x)^n... so if you proj_in then have each quarter of that be multiplied by v(x)^1, v(x)^2, etc. then proj_out
the result is that the initial proj_in chooses what gets treated with which exponent
and the final proj_out can simulate the addition of the components of the polynomial
you can even vary the percentage dedicated to each exponent (including exponent zero)
c = self.coefficients # just some trainable parameters, these don't need to vary based on x
v = Wvariables(x)
y = cat([c[0], c[1]*v, c[2]*(v**2), c[3]*(v**3), c[4]*(v**4) etc. ])
return Wout(y)
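a runnable version of the pseudocode above, as a sketch assuming PyTorch. it assumes one scalar trainable coefficient per power and that `Wvariables` keeps the width at d_model; the thread leaves both open. the class name `PolyConcatFF` is made up.

```python
import torch
import torch.nn as nn


class PolyConcatFF(nn.Module):
    def __init__(self, d_model: int, degree: int = 4):
        super().__init__()
        self.degree = degree
        self.w_variables = nn.Linear(d_model, d_model)             # "Wvariables"
        self.coefficients = nn.Parameter(torch.ones(degree + 1))   # c[0] .. c[degree]
        self.w_out = nn.Linear((degree + 1) * d_model, d_model)    # "Wout"

    def forward(self, x):
        v = self.w_variables(x)
        # c[0] * v**0 is a constant block; the remaining blocks are c[n] * v**n
        terms = [self.coefficients[n] * v**n for n in range(self.degree + 1)]
        return self.w_out(torch.cat(terms, dim=-1))


d_model = 16
ff = PolyConcatFF(d_model)
out = ff(torch.randn(2, 7, d_model))
```

the output projection is what lets the layer add the powers back together with learned weights, so the coefficients mostly set the relative scale of each block.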
probably way better usage of rank
i guess: in principle it doesn’t matter at all if we have a trainable vector in the layer
because it is so small compared to a linear layer
well im just trying to let it train in whatever function approximators it likes using as few parameters as possible
sure, but what is v coming from?
my first impulse is “just make it a trainable vector”
i changed it a bunch above
basic idea is we don't need to spend 4*n^2 just generating coefficients for polynomials, since each set of possible coefficients just represents a different function
actually, they don't have to come from x at all
so uh isn't this just essentially the simplest possible trainable activation function lol
possibly
I’m out for most of the next week probably fwiw
I have seen a similar idea in the model approximation field. But they merely fit a polynomial function to activations (e.g. GeLU). I think this idea is worth a try
cool, trying it right now 🙂
oh, it seems a long time! sorry.
the main problem comes from converging. If you have some results, i'd like to see the training loss
will let ya know in a few minutes hopefully i'll have some results... just was too busy to try til just now
im using coefficients inits 0,1,1,0,0 so it starts out as x+x^2 and changes from there
saw some folks using that as a replacement for relu
seems to work slightly worse than my existing activation but broadly similar
this is the code I tried:
import torch
import torch.nn as nn

class LearnedPolynomial(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        # per-channel coefficients, initialized to 0,1,1,0,0 so it starts out as x + x^2
        self.c0 = nn.Parameter(torch.zeros(dim))
        self.c1 = nn.Parameter(torch.ones(dim))
        self.c2 = nn.Parameter(torch.ones(dim))
        self.c3 = nn.Parameter(torch.zeros(dim))
        self.c4 = nn.Parameter(torch.zeros(dim))

    def forward(self, x):
        # degree-4 polynomial applied elementwise
        y = self.c0 + self.c1*x + self.c2*(x**2) + self.c3*(x**3) + self.c4*(x**4)
        return y
typo on x**3 next to c4
good catch
rerunning with that bugfix
first test was almost indistinguishable from my 'standard' rwkv-style channel mix relu^2 activation (this is on a model using traditional attention tho)
my guess is y = GLUGeLU(old y formula inside) will do a bit better. the logic is to create a flattening at the negative end. but my idea could be nonsense
the concat?
ah
WOW its winning now that i fixed the bug!
good catch indeed
cant believe this worked, even limitedly
another possible reason for GeLU: it increases expressibility by one degree of freedom. for the raw polynomial, if you scale all the input weights and decrease all the output weights, then the overall net is the same. so some of the parameters are redundant.
caveat: ReLU has this same issue and everybody liked ReLU for a long time
crap, I meant gelu instead of glu: the gaussian error unit
@dawn vine Just replace the GeLU with GeLU(LearnedPolynomial(1)(\cdot)) in the transformer, train it simply, and let's have a look!
init with 0,1,0,0,0 so it makes no change to the initial results
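A minimal sketch of why the 0,1,0,0,0 init "makes no change": with those coefficients the polynomial is the identity, so GeLU(poly(x)) is exactly GeLU(x) at step zero. This is plain scalar Python rather than the per-channel PyTorch module above; `poly`, `gelu_poly` and the two `*_INIT` names are made up for illustration.

```python
import math

def gelu(x: float) -> float:
    """Exact GeLU via the Gaussian CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def poly(x: float, coeffs) -> float:
    """Evaluate c0 + c1*x + ... + c4*x^4 with Horner's rule."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * x + c
    return result

def gelu_poly(x: float, coeffs) -> float:
    """GeLU wrapped around the polynomial, as proposed in the thread."""
    return gelu(poly(x, coeffs))

IDENTITY_INIT = (0.0, 1.0, 0.0, 0.0, 0.0)   # poly(x) == x
X_PLUS_X2_INIT = (0.0, 1.0, 1.0, 0.0, 0.0)  # poly(x) == x + x^2

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(x, gelu(x), gelu_poly(x, IDENTITY_INIT), gelu_poly(x, X_PLUS_X2_INIT))
```

With the x + x^2 init the wrapped activation already differs from plain GeLU at step zero, which is one way a forgotten x^2 term could confound a comparison.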
well this is all confounded by me using rwkv style ffn at the moment
i can run others ez, say llama2 or whatever, but i dont have the baselines already run
to be clear, the rwkv one works better in my experience
rwkv is OK. no significant difference, I think
well it normally uses relu^2 activation
anyway i can try gelu around it
it started out strong without gelu and ended up being almost identical to the classic relu^2 version over time
oh... relu^2 is relu(relu()), or max(0,x^2)? I have not seen it before
max(0, x)^2
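To make the distinction in the question concrete, here is a small sketch of the two readings. The function names are made up for illustration.

```python
def relu_squared(x: float) -> float:
    """max(0, x)^2: apply ReLU first, then square. Negative inputs give 0."""
    return max(0.0, x) ** 2

def square_then_relu(x: float) -> float:
    """max(0, x^2): the same as x^2, because a square is never negative."""
    return max(0.0, x ** 2)

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(x, relu_squared(x), square_then_relu(x))
```

The two agree for non-negative inputs and differ for negative ones, where only the first reading zeroes the output.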
gelupoly winning, but i forgot to zero out the x^2 term
curious about the results, and the learned values from c0 to c4
the whole idea is they may vary per channel!
so may be extremely hard to visualize
it can still be useful to do statistics: if nothing else, it helps initialization
this is fascinating 👀
the other idea here was that this kind of activation may allow us to ditch the expansion/contraction in the FFN
because it lets it model things that otherwise required that
maybe a 2d graph (x channel, y c*)?
the expressive power is really why I thought this might be a useful way to reduce FFN size
by these statistics, I just mean simple bar charts, one for each c_n
yeah. It might be interesting if we plot GeLU(Poly(x)) and have a look at how it differs from GeLU()
I believe in this. gating methods and Gyges's various methods all require a huge number of weights per input. but simple trainable activations capture much of that without the parameter explosion
good night

I did some followup experiments, nothing really new to report... but I did try a very interesting FFN that didn't do well:
(this is just a sketch of it)
self.coefficients = nn.Parameter(torch.linspace(0, 1, D))

def forward(self, x: Tensor):
    # concat the learned coefficients with x, x^2, x^3 and let w_shrink mix them
    return self.w_shrink(torch.cat([self.coefficients.expand_as(x), x, x**2, x**3], dim=-1))
the idea was along the lines of my original thought, which is that the projection should be able to mix various polynomials out of the ingredients in that concatenated set of learned coefficients and x raised to various powers
the only one in the gist that i have a lot of hope in presently is powerproductpool, maybe add a gelu gate to it
on reflection so I don't stuff this into research:
it's possible arithmetic intensity on the up projection is low if it triggers swaps so that's another axis of tradeoff
worth tracking, anyway
that is to say: it is possible that even though the feedforward looks, mathematically, like it is a low-memory operation compared to attention (which has three entire matrices), so your utilization will be closer to 100% if you make your feedforward stage wider, the up-projection to x4 will trigger cache misses and so eat up a bunch of time swapping stuff into and out of vram when an x1 would not
so you'd gain on wall-clock time substantially when gearing it lower
I was thinking about effective rank, because if FF layers are undergoing rank collapse, might as well add -(effective rank) to the loss. but I think it's not effective at capturing rank collapse, because it's dominated by the single highest singular value.
if one singular value is bigger than all the others, the effective rank will be low. but I doubt that matters
the reason I became interested in rank collapse is this paper, which says attention-only transformers, without FF, collapse to rank 1: https://arxiv.org/abs/2103.03404
Attention-based architectures have become ubiquitous in machine learning, yet our understanding of the reasons for their effectiveness remains limited. This work proposes a new way to understand self-attention networks: we show that their output can be decomposed into a sum of smaller terms, each involving the operation of a sequence of attentio...
another paper tries to rearrange the attention and feedforwards, and I'm considering how the two papers interact
How about vs where they're the same unified layer, e.g. mamba
how is effective rank measured that leads to this?
iirc there are a couple of ways to do it, none of them are actually svd
same way as in this paper
i confess to not having gone as deeply into that as i would like to have
I don't know the mamba architecture yet, so I can't answer
Mamba didn't invent it, but basically they expand 2x then use that like v then gate and contract
So the attention equivalent is done at 2x
But residual is still 1x
No separate ffn
oh hmm, I read it wrong. I thought it was dividing by the first eigenvalue
oh, you actually have ieee access?
thanks
it appears erank just always gives low numbers. here's an example: I specify an infinite list of eigenvalues 1/x^2. that seems perfectly reasonable and our matrix isn't collapsed or anything
the sum of all these eigenvalues is pi^2/6, so I divide by this.
then I calculate the Shannon entropy: https://www.wolframalpha.com/input?i=sum+of+-1%2Fx^2+*+(6+%2F+pi^2)+*+ln(1%2Fx^2+*+(6+%2F+pi^2))+from+1+to+infinity
and the effective rank I get is 4.94
so this innocuous distribution of eigenvalues has very low effective rank
weird definition
I actually like it other than the fact that it seems to have low informational value
it scales the eigenvalues so they sum to 1. then checks the Shannon entropy. all of these are natural operations
I think it's just really sensitive to the highest eigenvalues. because in my pi^2/6 example, the first eigenvalue, 1, takes 60% of the whole distribution
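The calculation described above (normalise the spectrum to sum to 1, take the Shannon entropy, exponentiate) can be sketched as follows. The truncation length `N_TERMS` is an assumption made so that the infinite 1/n^2 list becomes a finite sum; the tail it drops is small.

```python
import math

def effective_rank(values) -> float:
    """exp(Shannon entropy) of the spectrum after normalising it to sum to 1."""
    total = sum(values)
    probs = [v / total for v in values if v > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)

N_TERMS = 100_000  # assumption: truncate the infinite 1/n^2 spectrum here
spectrum = [1.0 / (n * n) for n in range(1, N_TERMS + 1)]

erank = effective_rank(spectrum)
top_share = spectrum[0] / sum(spectrum)  # fraction of mass in the largest value

print("effective rank of 1/n^2 spectrum:", erank)
print("share of the largest value:", top_share)
print("effective rank of 10 equal values:", effective_rank([1.0] * 10))
```

The truncated sum lands at roughly 5, with the largest value holding about 60% of the mass, while a flat spectrum of 10 equal values gives an effective rank of 10.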
i will have to think about it, I am not entirely sure how much sense it makes to measure those things
I decided it does make sense. if the input is a random vector, the effective rank should give you a good idea of how many ranks you need
so for a matrix with effective rank 10, using a 100-dimensional low-rank approximation should work ok
it's only if the input vector is non-random (such as if it avoids the eigenvector with the high eigenvalue) that effective rank breaks down
i think this needs to be tested empirically
there should be some function which takes a matrix and its effective rank and gives back a low rank matrix that is a lossless approximation
my guess is it's already been tested and didn't work, so now I'm thinking about why
this won't work: in the 1/n^2 case, you need the whole infinite-dimensional matrix, to be lossless
because effective rank != rank?
would this work if we had a trainable eff_rank size matrix matmul'd into a static actual_rank size orthonormal matrix
(or: some function of eff_rank size)
right. effective rank is only an approximation. so it tells you something like "you'll capture 99% of the eigenvalues" but you'll never get all of them
what I expect is that people have tried low-rank approximations to the low-effective-rank matrices, but that they don't work
oh, that's us. we tried this
so even though the effective rank is small, the whole matrix matters
wait when did we try this
this was the point of this channel
low effective rank suggests that what is needed is a very complex activation function, but not a large dimension
Smerky tried that. no improvements
GeGLU is exactly that
oh, yes, that
i think it needs more trying
but also i am being picky about my env
nanoGPT might work, but I haven't tried, it's on my backlog
i will probably hack and slash a few repos together into something i feel good about and that fits gracefully on my local
i would do pythia but trying to build it gave me hives
will probably mimic existing pythia model size/settings tho
You're welcome to use my training repo (that I keep delaying open sourcing) if you want something that's easy to configure and supports various transformer types. It also can run remotely on vast or wherever bc it streams the pile
I am fiddling with hardware at the moment but would love to look at the code as reference at a minimum
oh btw I also just got a new model working that's relevant here bc it uses a variation on Gated Attention Unit (GAU) 2202.10447
which is a combination of FFN and Attention into a single layer with only a 2x expansion... uses fewer params so you can double the layer count
took some fiddling, but now I have it really performing well (versus equivalents w/ separate 3x expansion FFNs)
having some significant success even with 1.5x expansion with this method where you get to increase the number of layers
makes me wonder if multiple layers in a row of smaller expansion FFNs would have been a more effective use of parameters than traditional transformer alternation of wider ffn and atn
you have to do them sequentially though, no?
i ask because: equiflops, but increases depth of computation and therefore time, maybe
in general depth has always been better than width the question is whether there's a regime with no tradeoff
i think i am personally very convinced that parameters are better off literally anywhere but in the ff
The sequentiality is what provides the expressive power though
Have you seen eqbench? It's a new benchmark for emotional comprehension and intelligence. The Mixtral model, which otherwise matches or surpasses similarly sized models in benchmarks like MMLU, etc., performs barely better than the Mistral models 8x smaller than it
I hypothesize that this is due to the lack of depth. 32 layers is just not enough for complex emotional understanding.
As you said, depth is better than width in pretty much every way, and while deeper models are harder to train, they're also more parameter-efficient. I believe the next open-source models will be much deeper than the ones that we already have.
As an aside, another advantage of scaling depth instead of width is that it gives SSM/other efficient LM architectures a larger hidden state than they would have otherwise.
interesting point, and I think that goes for transformers as well - you just recalc it or cache it
maybe standard ff is just the dumbest possible thing that implements what we want, and we need better tech for channel mixing just like we have better tech for time mixing (called attention and its variants)
meaning that we can't get rid of channel mixing, but we can do it with a much less wasteful algorithm
that's my hypothesis for why moving parameters away from it has seemed helpful to date
... that's an interesting thought, what are all the time mixing variants currently out there?
i am not up to date on the semi-RNNs bouncing around, if any of their mixer steps can plausibly be done with a low parameter budget one of their functions works as what we're calling an "activation function"
and also a half baked theory: it is possible that sensitivity to d_ff/d_model depends heavily on initialization. bad initializations will benefit more from the upscale, because it is more likely that some of the parameters are closer to a good initialization since there are more of them. a good initialization may benefit less
Most likely most of what ffn does is just memorizing stuff, so I doubt that in language models anything will be **much** better than MLP
anything which is expressible with a 4x up projection could be expressed otherwise since 75% of the up projection will be linearly dependent on each other
it seems almost optimally inefficient to me
i look forward to hearing of a mathematical nicety that makes this less true
if the nonlinearity is fixed as e^(i c x) for trainable weight c, then 4x up projection gets you 4x the spectrum. although a trainable nonlinearity could probably accomplish similar things
if you feel inclined: we have, as our ideal scenario, a scale up by two. exactly one half of the outputs before nonlinearity are rescales of the other half. we get extra signal because every case where nonlinearity behavior varies based upon the rescale gives us an extra divide between linear regions in the represented function
is that accurate
yes. it's like the second half is a rotation, then a nonlinearity in the rotated direction
linear 1x and on that 4x shift/scale and nonlinearity
if we take relu as our ideal activation i don't see how this works
if your value is below zero so is its rescale
you get no information from knowing that x is 3 and 3x is 9
you should have the same number of linear regions
I'm not sure what you mean by value being below 0, because it's all pre-activation. you can take as an example d_model = 1, d_FF = 2, ReLU activation. then you get two kinks. if d_FF = 1, you get one kink
Just like 4 different parametric activation functions per every neuron
In some cases it will be more efficient but it is some kind of inductive bias
we have some FF + activation; the FF is FF(x), the activation is G(x), and their composition is G(FF(x)) which I'll also call H(x) for disambiguation. what we care about is the expressivity of H(x)
no, I am thinking of the input
oh, i will correct
so if we assume x is a vector of breadth (concretely) 4, and our FF is a 16x4, 12 of the values of FF(x) are linearly dependent on the other 4
if G is relu, we should have the same number of linear regions as if FF was 4x4
because a given value of FF(x) will be above 0 if and only if all of its rescales are also
so H(x) is not more expressive, there is no set of things in H(x) with 16x4 FF that couldn't be represented with a 4x4
are you considering FF to not have weights?
no, it has weights
it is a normal boring very vanilla FF
just: it has to be collinear because it is of rank 4
ReLU(1+x), ReLU(2+x), ReLU(3+x)...
oh, you mean bias
And what happens when you add bias
sorry, yes, I meant bias. but note that if one of the inputs is 1, it's equivalent to having bias
okay, bias means i am wrong. i do peripherally remember that people were removing bias from their networks though
Yes and it still works
so bias and no bias are basically the same, if you allow the NN to optionally decide to fix one x_i to always 1
Weird
Oh right
You just need one source of constant value on the input of the network and fnns can emulate bias
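A small sketch of the point the exchange above converges on, for d_model = 1: without bias every hidden unit ReLU(w*x) changes slope at x = 0, so extra rescaled units add no new kinks; with bias each unit ReLU(w*x + b) gets its own kink at x = -b/w. The helper name `kink_points` is made up for illustration.

```python
def kink_points(weights, biases):
    """Inputs x where some hidden unit ReLU(w*x + b) switches on or off."""
    points = set()
    for w, b in zip(weights, biases):
        if w != 0:
            points.add(-b / w + 0.0)  # "+ 0.0" turns -0.0 into 0.0
    return sorted(points)

# Four hidden units that are rescales of each other, no bias: one shared kink.
no_bias = kink_points([1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 0.0, 0.0])

# ReLU(1+x), ReLU(2+x), ReLU(3+x): three distinct kinks.
with_bias = kink_points([1.0, 1.0, 1.0], [1.0, 2.0, 3.0])

print("no bias:", no_bias)
print("with bias:", with_bias)
```

Feeding a constant 1 as an extra input plays the same role as the bias, because the weight on that input becomes b.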
i had a previous theory that it might make sense to reduce activation width and just give it a set of constants
this came specifically from the dettmers thing about outliers
where it basically used outlier weights as constants
but that is tangential; thank you for refining my mathematical intuition for this one
(basically: every possible power of two for your precision)
intuition was that if networks were doing a lot of work to create constants you should give them constants so they could not
That's called bias 🤣
sure, i don't think that bias has been done
trainable bias can in theory do this but in practice has to work to do it
when it is so simple that it is basically a noop
i don't know what portion of grad has to be oriented to creating a +50 or +100 weight in a well-trained llm but it has to be some significant fraction of it if it's perpetually pushing that value up and keeping it from going down
bias neurons like this didn't show huge gains in a previous era but previous eras didn't have well-trained networks spontaneously deciding a single index had to always be unbelievably high
Is scale of the value making a difference in context of optimizers like Adam?
you can just relax the weight decay for biases if you need a huge one. if you init huge biases, they will be huge, but in the wrong direction
i don't see why it doesn't make sense to just give the network values equal to each allowed power of two to calculate with for when it wants a hard bias input
it can ignore them if it doesn't want them
it consumes a negligible portion of width, e.g. 17 values for float32
for clarity: this is a bias activation, not a bias that ever gets applied. it's a constant value that is always there
so every FF would go from e.g. 1024x1024 to 1024x1007, and then you'd concat in your biases
so the middle neurons of the FF layer are like {H(x), 1, 2, 4, 8...}?
yes, and similarly for negative powers of two
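A rough sketch of the "constant activations" idea as described: shrink the learned part of the hidden layer and concatenate a fixed bank of powers of two, positive and negative exponents. The exponent range of -8..8 is an assumption picked only because it yields 17 constants, matching the 1024 to 1007 example; the function names are hypothetical.

```python
def constant_bank(max_exponent: int):
    """Fixed activations 2^-k ... 2^k (the exponent range is an assumption)."""
    return [2.0 ** e for e in range(-max_exponent, max_exponent + 1)]

def with_constants(hidden, bank):
    """Concatenate the learned hidden activations with the constant bank."""
    return list(hidden) + list(bank)

bank = constant_bank(8)        # 17 constants: 2^-8 ... 2^8
hidden = [0.0] * 1007          # stand-in for H(x) with the reduced width
combined = with_constants(hidden, bank)

print(len(bank), len(combined))
print(bank[:3], bank[-3:])
```

The next layer still multiplies these constants by trainable weights, which is the basis of the objection that 1*w, 2*(w/2) and 4*(w/4) are the same function and differ only in gradient scale and weight decay.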
this is completely tangential to the previous thing
just thinking about using inputs as constants reminded me of it
1, 2, 4, 8 are all equivalent as inputs; all they change is scale of gradients and weight decay
specifically allowing linear combinations of them makes all numbers representable in the next output
but there are weights multiplied to these biases. so 1w = 2 (w/2) = 4(w/4) = ...
yes but weights are trainable and don't like to fix directly on integers or powers of two or any such direct numerical thing
so what you mean is {H(x), 1, 2, 4, 8...} and then not multiply the constants by weights?
the next layer can use it for whatever it wants to use it for, my first guess was that since outlier weights seemed designed to create freakishly large activations to use to zero out inputs ... sec, cite: https://timdettmers.com/2022/08/17/llm-int8-and-emergent-features/
it made sense to just always have an available arbitrary-sized activation
and then if we were doing one because the network needed one for a specific numerical reason it made sense to try to do a number of them
not because I have any operation I specifically want to enable, because in principle you can do different operations with a different set of constants
because "some numerical operation that networks may find useful might become computable, or more easily computable, if they have a set of constants across scales"
(my first thought was just INT_MAX because, again, it enables this zeroing operation)
reading through Dettmers's post, these constants will not accomplish the same thing. because his outliers move to different dimensions over mini-batches
so they have actual signal
when you create these constants, you are "making them available" but what you're really doing is bypassing weight decay and gradient scaling for these particular dimensions
his outliers eventually fix to single indices
in particular, if you have a 2^30 constant, the rest of your network will never train
but they don't remain constant
assuming he means "residual stream" by "hidden state", my guess is that he's using pre-norm transformers so the purpose of these outliers is to change the scale of the output of layer normalization
neat
e^(icx) is so elegant that i want it to work as a nonlinearity by the way
MLP is generally great, it's just costly... maybe there is something less expensive that does nearly as well even for this 'memorization' style task?
maybe the question how compressible is trained mlp should be answered first?
i think the general answer is that they are absurdly compressible once trained but this does not mean they can be compressed when training
what makes you think that?
you can sparsify and quantize networks with relatively little performance loss
i would say "no performance loss" but people have apparently stopped caring if their compression techniques were lossy
not very long ago people still did that
and it generally worked
quantization yes, but that is a more general thing that suggests that we use unnecessarily high precision in general. sparsifying, I don't think so
pretraining in lower precision doesn't work
sparsifying is more like quantization of some number of weights to 1 bit
sparsifying is throwing out those weights completely
not at runtime, no
you don't understand what I'm saying, or are you talking about some kind of structured sparsifying?
so subset of weight matrix are zeros so those don't need to be multiplied, but you still need to store binary matrix pointing which ones are zeros
How I understand the FFN is a key-value store. q (the input) is matched with k (up_proj) to produce the "attention" scores that determine which items in v (down_proj) are added to the residual stream.
The LLM stores patterns in k, and what it should think if those patterns are matched against in v.
This interpretation makes me suspect that k is overparameterized. While there are probably some patterns that are dependent on all dimensions of the input sequence, it seems more likely that the majority of patterns do not. We could have multiple separate up-projections, some dependent on some dimensions of the input sequence and some dependent on all dimensions.
We might also consider how LLMs might be forced to store knowledge. If FFNs are a key-value store, there are only so many keys to match against. Likely, the LLM divides knowledge into categories and assigns each category a key-value pair. When the key detects that the input is looking for knowledge from that category, it spits out all the knowledge it has about that category into the residual stream for future access.
This seems extraordinarily inefficient. A better way might be to have an extremely large number of keys that match with a subset of the dimensions of the input sequence, coupled with an extremely large number of values that output onto a subset of the residual stream. This would allow for the LLM to implement fine-grained knowledge storage and retrieval.
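The key-value reading described above can be sketched in a few lines. This is an illustration, not any model's actual FFN: it uses ReLU as the activation, treats each row of up_proj as a key and the matching row of down_proj (transposed) as a value, and the toy numbers are made up.

```python
def relu(v: float) -> float:
    return max(0.0, v)

def dot(a, b) -> float:
    return sum(x * y for x, y in zip(a, b))

def ffn_as_kv(q, keys, values):
    """scores_i = act(k_i . q); output = sum_i scores_i * v_i."""
    scores = [relu(dot(k, q)) for k in keys]
    out = [0.0] * len(values[0])
    for s, v in zip(scores, values):
        for i, v_i in enumerate(v):
            out[i] += s * v_i
    return scores, out

q = [1.0, 0.0]                                   # input from the residual stream
keys = [[5.0, 0.0], [0.0, 5.0], [-5.0, 0.0]]     # patterns (rows of up_proj)
values = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]    # what each pattern writes back

scores, out = ffn_as_kv(q, keys, values)
print("scores:", scores)
print("output:", out)
```

Only the first key matches the input here, so only the first value is written to the output. Every key reads all input dimensions, which is the part the message above suspects is overparameterized.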
my project is somewhat similar but more mathematically oriented: [a manifold is an atlas of charts]. then, a token is a collection of {low-dimensional vector, number that specifies an array index}. the array index is a lookup into a table of charts. charts have matrices which specify how vectors in two charts interact with each other, and these matrices form a graph
if you don't understand the part in [ ], you can skip it
actually maybe not; the idea looks nonsensical unless you have intuition about how manifolds should behave
if every key is from a different subset it is just sparsifying and sparsifying is just a special case of quantization, if those subset are relatively small (so most of the entries in the matrix are zeros), then kolmogorov complexity has to be low, but inference on gpus will not benefit from that, only if somehow we can find some kind of structure in that sparse matrix, we can leverage that and design more efficient architecture with the right inductive biases
"block sparse" is what I'm using, but it's more complicated than sparsifying existing dense networks
i realized later that this was correct, that said i think sparse matmul gets more efficient for this when very sparse (eg when you zero entire rows or columns)
and iirc you can usually sparsify so far you have to be zeroing out entire rows
can confirm that i don't understand it
or blocks i guess
here's a non-mathy explanation: a residual stream needs to represent every possible concept that might want to be represented, simultaneously. so if we want to represent a car in the residual stream, we also have to represent that car's relationship to Paris - zero. and its eloquence when giving a speech - zero. most concepts simply aren't related. so it'd be better if we could have a low-dimensional vector, specifically for car-related things, and not have to store all the unrelated parts
but if we do this, then our low dimensional vectors no longer have any obvious relationship to other low dimensional vectors. we must store those relationships between pairs of related concepts. so if we have cars and doors, and we want to do Q and K stuff with them - in the regular transformer, we apply Q and K to the giant-dimensional vector. here, we must store a transition matrix that relates cars and doors to each other
this makes sense to me, thank you
@fallen spear I didn't read the whole thread, but I see that you were using a graph from https://arxiv.org/pdf/2310.19956.pdf that suggests that for small models the standard mlp ratio of 4 is too big, bad and wrong. this is complete BS: their 41M model with mlp ratio=4 is a two-layer model, and that is the reason it is junk, not the mlp ratio. with mlp ratio=1 it has 4 layers, which is enough to work okish, and that's why it does better there. I also think all their 41M models are highly suboptimal because they just set the model dim for 41M parameters wrong.
the reason swiglu underperforms geglu seems to simply be the scale of x. if you use swiglu(1.702x)/1.702, it matches geglu. and likewise, if you use swiglu(x/1.702)*1.702, it does a lot worse https://wandb.ai/ad8e/tinystories3?workspace=user-ad8e
however, I have to go play tennis and will be back in a few hours
that probably means init is really important, and that I have to figure out muP things
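One way to see why the 1.702 rescaling would make the two gates behave alike: SiLU(1.702x)/1.702 equals x * sigmoid(1.702x), which is the standard sigmoid approximation of GeLU. The sketch below assumes the rescaling is applied to the gate nonlinearity only and compares the functions on a grid. The grid range is an arbitrary choice.

```python
import math

def gelu(x: float) -> float:
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def silu(x: float) -> float:
    return x / (1.0 + math.exp(-x))

def silu_scaled(x: float, s: float = 1.702) -> float:
    """SiLU(s*x)/s, i.e. x * sigmoid(s*x)."""
    return silu(s * x) / s

grid = [i / 100 for i in range(-600, 601)]  # x in [-6, 6]
max_gap_scaled = max(abs(silu_scaled(x) - gelu(x)) for x in grid)
max_gap_plain = max(abs(silu(x) - gelu(x)) for x in grid)

print("max |SiLU(1.702x)/1.702 - GeLU(x)|:", max_gap_scaled)
print("max |SiLU(x) - GeLU(x)|:", max_gap_plain)
```

The rescaled SiLU tracks GeLU much more closely than plain SiLU does, which is consistent with the observation that the gap between swiglu and geglu comes down to the scale of x.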
most of the variants on a(x1) * b(x2) * c(x3) perform basically the same, for functions a, b, c
they are all much better than regular GeLU and all have comparable performance to each other and GeGLU, as long as a b c are ok-ish
for example, exp cannot be any of the three
ReLU^2 is giving pretty bad results. I think because the gradient is not capped, and instead grows as |x|
logsigmoid(x) * sin(x) also has bad results and shares this gradient issue
testing results on small models:
High confidence:
- multiplying multiple activation functions is better than not multiplying. that means these things all work and are much better than GeLU:
GeLU(x1) * x2 = GeGLU
GeLU(x1) * GeLU(x2)
GeLU(x1) * sinc(x2)
GeLU(x1) * ReLU^2(x2)
GeLU(x1) * ELU(x2)
x1 * logsigmoid(x2) * sin(x2)
- you can go up to 3 and the performance is the same, up to noise:
tanh * x2 * logsigmoid
gelu * x2 * elu
GeLU * linear * linear
Medium confidence:
- scale of the gradient of an activation function needs to not grow too fast
exp is unsuitable as an activation function, though different inits may change things
quadratic (ReLU^2) may build up gradient issues. this may be bad at higher scales. same with logsigmoid(x) * sin(x). multiplying two quadratics may cause issues
theory speculation, unconfirmed by experiment:
symmetry seems to be bad. so GeLU(x1) * GeLU(x2) is worse than GeLU(x1) * x2. and x1 * x2 is worse than GeGLU
linear also has symmetry. maybe there would be a better activation function than linear, in GeLU
perhaps only one of the activation functions should have a 0-zone
my current best activation function is tanh(x1) * x2^2
it does a little better than GeGLU, but the improvement is much less than the improvement of GeGLU > GeLU
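For reference, the two gated forms being compared can be written elementwise as below. This is a sketch only: it assumes the up-projection output is split into two halves x1 and x2 (the usual GLU layout), and the helper names are made up.

```python
import math

def gelu(x: float) -> float:
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def geglu(x1: float, x2: float) -> float:
    """GeLU(x1) * x2"""
    return gelu(x1) * x2

def tanh_times_square(x1: float, x2: float) -> float:
    """tanh(x1) * x2^2"""
    return math.tanh(x1) * x2 ** 2

def apply_gated(pre_activation, fn):
    """Split the pre-activation vector in half and combine the halves with fn."""
    half = len(pre_activation) // 2
    return [fn(a, b) for a, b in zip(pre_activation[:half], pre_activation[half:])]

pre = [0.0, 1.0, 2.0, 3.0]  # x1 = [0, 1], x2 = [2, 3]
print(apply_gated(pre, geglu))
print(apply_gated(pre, tanh_times_square))
```

Both produce a hidden vector half as wide as the up-projection output, so the down-projection shape is the same for either choice.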
a lot of the time this stuff doesn't hold up on larger models, annoyingly
@boreal moss tried a ton of stuff and found some good ones for small models tho
my models are 10M, trained for 50M chars. the benefits show early in training and then are marginal later, but persist. they're above the noise floor but they're not that important
I am a bit skeptical of cubic functions right now because functions with higher gradients seem to show instability
@soft bobcat Yes, but when I added to "trilinear" ffn a "very leaky" tanh looks like it tolerates higher learning rates
# in __init__:
self.fc1 = nn.Linear(dim, ffdim*3)
self.fc2 = nn.Linear(ffdim, dim)
# in forward:
x = self.fc1(x)
x = torch.tanh(x) + 0.1*x  # "very leaky" tanh
x = torch.chunk(x, 3, -1)  # three ffdim-wide pieces
x = self.fc2(x[0] * x[1] * x[2])
didn't test this extensively, just ran one test and compared to no activation function for dim=256 ffdim=512
sounds reasonable, the tanh should restrict the gradients to a reasonable range
the three-part FFNs I tested were no better than GeGLU; mostly, they were equivalent. maybe I didn't look in the right places
this probably explains why the effective rank was so low
changing GeLU(x) -> GeLU(x) * x performs equivalently to GeGLU = GeLU(x1) * x2 https://wandb.ai/ad8e/tinystories single activation?workspace=user-ad8e
SE Gyges has been gone for a week now
i hear discord gets really salty if you do a chargeback on them
you might want to tag him on twitter if it's important, otherwise will be back when back
if you want to do some clever math, consider assuming that the existing initialization is muP (even if not) and modifying the init accordingly when you modify the activation
what "accordingly" means I have no idea
I mentioned Gyges a few times to send him info, which can be found by Ctrl+F mentions of his name
Appreciate it 🙂
all initializations for a fixed model are muP, in the old meaning of muP. because muP did not give any indication of correct inits, and only said how to transfer inits from one scale to another
the new muP paper does give math on how to scale params, and I'm working it out on my model now
i would guess that for very simple modifications (ie reducing layer width) you can simply rescale to maintain the infinite-width limit
for less simple ones like geglu god have mercy on your soul
I think all the transformer activation functions are pretty simple now
muP only tells you how to scale. it's mostly independent of your activation function
and what you need to figure out for the activation function is what the multiplier for that function should be. which is guesswork
at least, for transformers this is true. for a series of FC layers without normalization, it would be more difficult
i think i finally have an environment i approve of as a reproducible lab for a/b tests and all i had to do was build a machine, rework the dockerization of gpt-neox, and make a new branch off of v1.0 of it
my standards for reproducibility are possibly unreasonably high
this table is also here: https://arxiv.org/abs/2001.08361
We study empirical scaling laws for language model performance on the cross-entropy loss. The loss scales as a power-law with model size, dataset size, and the amount of compute used for training, with some trends spanning more than seven orders of magnitude. Other architectural details such as network width or depth have minimal effects within ...
if anyone has straight tested this at higher sizes I have not seen it
now that i have an environment i approve of i think i will a/b test at ratio 4, 2, 1, with and without scaling the initialization
this one shows similar minimum and simultaneously proves my point, here ratio=0.5 is worse than ratio=4
ratios below 1 are always bad, and this seems uncontroversial; if there is a clear demonstration that ratios above 4 are especially good I have not seen it
it is a little bit frustrating that there is relatively little good testing on this count but the information we do have does not seem to indicate strongly that the up projection has a very large effect; it has an effect, but it is not large, and allocating the same parameters to either depth or attention heads seems to outperform
have you seen the mamba architecture? single block type does both FFN and Linear Attention
ratio is 2... but they do the attention in the expanded dimensionality and use twice the layers (giving them twice the attention)
you could try it with 1x expansion instead and 4x layers and compare
if i switch to looking at mamba arch i will have to tweak my environment until i am convinced it gives me a clean a/b test of a hypothesis with it and this will make me very sad
or at least be time consuming
it is a good idea though
i have "understand rwkv and mamba" on my todo somewhere
what are you comparing currently? pythia?
what size model? blink tends to chide me that I have to test on L32D2048 or it doesnt always hold up for larger models
i've 'discovered' amazing things that work great on smaller models like L12D768 but die at 400m tok trained on larger ones
this kind of scale effect may be especially relevant here for FFN sizing
its pretty annoying to train a model that large tho 😦
yeah i am trying to make sure my setup has a 1:1 larger-scale analogue so it's easier to bump upwards
effectively: rerunning same analysis is a matter of swapping out like three params and running it on a real node
time to read muP! https://arxiv.org/abs/2310.17813
(I'm still figuring out depth scaling though)
ngl i was just going to blindly jump pythia sizes
if you're figuring out mup perfectly gpt-neox needs its mupperizer fixed
oh, but also: if i scale from ratio of 4 to 2 or 4 to 1, what is the right correction to apply to the init
if i'm doing it nice and manually
this is the initialization and LR I'm using; it takes in i and o, the input and output dimensions
currently I'm waiting on my boss to see how much I'm allowed to talk about
gpt-neox is using the old mup repo so it will never be fixed unless someone PRs an implementation of the new spectral muP paper
the old mup repo is useless
deepseek MoE seems to do fine with them
i wonder why
Probably the MoE part, yeah? Though you're still limiting max rank of the model to the (smaller) expert size. Even if a single expert is low rank the sum of experts can still be high rank?
i... don't feel like it works that way? you're only summing the outputs of the ffn, not the hidden states
The sum of two linear maps can have rank up to the sum of both maps' ranks (rank(A+B) <= rank(A) + rank(B)), so it can exceed either one alone, and with K experts + top K sampling you're giving the model more chances (& incentive?) to separate signals from individual experts and produce a higher rank map per MoE layer. If the MoE layer projects two inputs from different time steps into two different subspaces, then the attention mechanism could restore information lost by the low rank projection(?). So I guess you're not specifically bounded by the smaller expert dimension? I'm probably wrong somewhere; I need to sleep.
you're only summing the outputs of the ffn, not the hidden states
I think I need more explanation on this (bc I'm oblivious); I probably addressed the wrong issue
you're probably correct i just don't know how linear algebra works
have you not done a crash course that makes it to matrix rank yet
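The rank question above can be checked on tiny matrices. A minimal sketch in exact arithmetic: it only covers linear maps, so it is an analogy for the expert case, since real experts have a nonlinearity in the middle. The helper names `rank` and `add` are made up for illustration.

```python
from fractions import Fraction

def rank(m):
    """Rank of a matrix (list of rows) via exact Gaussian elimination."""
    rows = [[Fraction(v) for v in row] for row in m]
    n_cols = len(rows[0]) if rows else 0
    r = 0  # number of pivots found so far
    for c in range(n_cols):
        pivot = next((i for i in range(r, len(rows)) if rows[i][c] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            f = rows[i][c] / rows[r][c]
            rows[i] = [a - f * b for a, b in zip(rows[i], rows[r])]
        r += 1
    return r

def add(a, b):
    """Elementwise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

A = [[1, 0, 0], [0, 0, 0], [0, 0, 0]]      # rank 1
B = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]      # rank 1, different subspace
neg_A = [[-v for v in row] for row in A]   # rank 1, cancels A exactly

print(rank(A), rank(B), rank(add(A, B)))   # the sum reaches rank(A) + rank(B)
print(rank(add(A, neg_A)))                 # the sum can also drop below either rank
```

So summing low rank maps can give a higher rank map when they cover different subspaces, but nothing guarantees it.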
i had not read this previously
this is roughly the same trick as multihash routing just without the hash
you define your standard sized ffn as many smaller ones
it is equivalent
you do however get to do routing many times
in the limit you define experts of size one matrix row each and route separately for each
it is a good trick and i have been trying not to think about it bc there are too many things you can do with MoE
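The "it is equivalent" claim above can be sketched as follows, assuming a bias-free FFN with an elementwise activation (ReLU here, picked only for simplicity). Integer weights keep the comparison exact. All names and sizes are illustrative.

```python
import random

def relu(v):
    return [max(0, x) for x in v]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def ffn(w_in, w_out, x):
    """w_in: d_ff x d_model, w_out: d_model x d_ff, no biases."""
    return matvec(w_out, relu(matvec(w_in, x)))

random.seed(0)
d_model, d_ff, n_experts = 4, 16, 8   # ratio 4, split into 8 slices of ratio 0.5
chunk = d_ff // n_experts
w_in = [[random.randint(-3, 3) for _ in range(d_model)] for _ in range(d_ff)]
w_out = [[random.randint(-3, 3) for _ in range(d_ff)] for _ in range(d_model)]
x = [random.randint(-3, 3) for _ in range(d_model)]

full = ffn(w_in, w_out, x)

split = [0] * d_model
for e in range(n_experts):
    lo, hi = e * chunk, (e + 1) * chunk
    w_in_e = w_in[lo:hi]                      # rows lo..hi of the up projection
    w_out_e = [row[lo:hi] for row in w_out]   # columns lo..hi of the down projection
    y_e = ffn(w_in_e, w_out_e, x)
    split = [a + b for a, b in zip(split, y_e)]

print(full == split)
```

The equality only holds when every slice is active. Routing is what changes the function, by dropping some slices per token.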
I have to ask, what exactly are we searching for here now?
previous testing was all some combination of things which altered the d_ff/d_model ratio at equiparams a la geglu
will probably do that again
i personally still want to directly a/b test a simple ablation on the ratio
i just got lost for a prolonged period when trying to ensure i had a reproducible setup
my standard for 'reproducible' comes from maintaining build systems and is somewhat more stringent than is considered normal or sane in ml
or in maintaining build systems
your point was to find the most parameter-efficient ratio, right? I think the problem is that you can't really decouple d_ff/d_model from other things, so you can't really measure what you want to measure
i can literally just reduce the degree of up projection and it's a standalone change
but to stay at equiparams, yeah, you have to fiddle the entire ff setup, again similar to geglu, if you want to keep things "the same" while fiddling activations
There may be a useful trick for changing the dim of the model without changing the dim of the attn layer: compute the attn layer on only part of the tensor and let the rest go straight through. This is used in some CNNs for other reasons and works just fine, so it should work on a transformer. Example:
for ratio=4 you use model_d=1024 ff_d=4096 normal attention layer
for ratio=1 you use model_d=2048 ff_d=2048 and put half of the tensor through the attention layer, while the other half goes straight through.
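A quick check of the two configurations above, assuming bias-free layers, a two-matrix FFN, and four square attention projections (Q, K, V, output). `partial_apply` is a hypothetical helper showing the pass-through idea on a plain list, with a toy function standing in for attention.

```python
def ff_params(d_model, d_ff):
    # up projection + down projection, biases ignored
    return 2 * d_model * d_ff

def attn_params(d_attn):
    # Q, K, V and output projections, each d_attn x d_attn, biases ignored
    return 4 * d_attn * d_attn

def partial_apply(x, fn, d_attn):
    """Run fn on the first d_attn entries of x; pass the rest through unchanged."""
    return fn(x[:d_attn]) + x[d_attn:]

ratio4 = (ff_params(1024, 4096), attn_params(1024))
ratio1 = (ff_params(2048, 2048), attn_params(1024))  # attention sees half of 2048
print(ratio4, ratio1, ratio4 == ratio1)

# Toy stand-in for attention: negate the slice it is given.
print(partial_apply([1, 2, 3, 4], lambda h: [-v for v in h], d_attn=2))
```

Per-layer FF and attention counts match in the two setups. The embedding and unembedding matrices still grow with model_d, which this count ignores.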
oh that's neat
i think we only need that if we are moving params out of the ff into the d_model and we want to leave the attn alone
but noted
so, if 1x ratio is okish for a standard FFN, bilinear (GLU) types with the same parameter count will have a ratio of about 0.67x (2/3). are they still better, or does it break at that point? has anyone tested this case?
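The arithmetic behind that number, as a small sketch: a plain FFN has two d_model x d_ff matrices (up, down), a GLU variant has three (up, gate, down), so matching parameter counts means scaling d_ff by 2/3. Biases are ignored and the function name is made up.

```python
from fractions import Fraction

def glu_ratio_at_equal_params(plain_ratio):
    """Hidden ratio of a GLU FFN with the same parameter count as a plain FFN.

    Plain FFN: 2 * d_model * d_ff weights (up, down).
    GLU FFN:   3 * d_model * d_ff weights (up, gate, down).
    """
    return Fraction(2, 3) * plain_ratio

for r in (4, 2, 1):
    g = glu_ratio_at_equal_params(r)
    print(f"plain ratio {r} -> GLU ratio {g} (~{float(g):.3f})")
```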
since the theory here is that ffn_ratio is an inefficient use of parameters, do you guys have opinions on DeepSeekMoE? it adds together the results of a ton of different low ffn_ratio networks (selected as 'fine grained expert segments') to create a high ffn_ratio network akin to a traditional high ffn_ratio expert/FFN
the idea being that a single FFN w/ ffn_ratio=4 is the same as the sum of 8 smaller FFNs w/ ffn_ratio=0.5
also, @fallen spear have you considered increasing d_model but reducing both FFN and attention dimensions as a potentially maximally effective use of parameters?
i think i would want to trade off d_model and attention separately
attn is its own can of worms
yeah but maybe a huge embedding size is what's really important
it is definitely one of the main important things
i'm doing a lot of work with MoE at the moment so I'm thinking about this stuff a lot since it's inherently very FFN centric usually
that is part of what led me to being this deranged about FFNs, yeah
do u have opinions about what I was saying about deepseekMoE above?
i have a pre-existing obsession with "hash routing", which should probably not be called that, and one of the things they did in that paper is in the same vein as deepseekmoe
@dawn vine #off-topic message previous rant here
i had another one where i was speculating about routing with an lsh, as i sometimes do and should really get around to testing
that second one is something @boreal moss asked me to try in my MoE experiments just the other day 🙂 i haven't yet tho
since apparently at least three people have had this idea if you do not do it now you will get scooped
i'm fine with getting scooped, but i will also try it 😉
it seems like that should work, and also splitting into successively smaller and smaller merge-able experts a la deepseek/multihash should work
and they are both free
ie, they do not make the model more expensive in any meaningful way
there's literally no reason it shouldn't work as well as layers, since at worst it's just the same
routing fails to train
for both
assuming routing is well-behaved, both approaches are strictly better than layers ordinarily are. maybe cache behavior is worse though
you cannot anticipate which layer is next
but assuming routing does not degenerate and cache behavior does not kill you they both seem like sure things
this is all for a bolt-on to RWKV and I have the interesting problem that token-shift is not convenient for MoE, so for now I'm using untokenshifted additional experts and considering the base rwkv chanmix as a kind of 'shared expert'
i know what some of those words are
I will rephrase: this is all for a bolt-on to RWKV, which uses a non-standard FFN, so I'm just keeping the non-standard FFN as is and adding the results of extra experts on top (literally adding them)
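A rough sketch of that "base FFN as shared expert, plus added experts" shape. The routing details here (top-k selection, softmax over the selected scores) are assumptions for illustration and are not taken from the thread. `base_ffn` and `experts` are toy callables, not RWKV code.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def shared_plus_experts(x, base_ffn, experts, router_scores, k):
    """base_ffn(x) plus the gate-weighted outputs of the top-k routed experts."""
    top = sorted(range(len(experts)), key=lambda i: router_scores[i], reverse=True)[:k]
    gates = softmax([router_scores[i] for i in top])  # assumed gating rule
    out = base_ffn(x)                                 # always-on "shared expert"
    for g, i in zip(gates, top):
        y = experts[i](x)
        out = [o + g * v for o, v in zip(out, y)]     # literally add the results
    return out

# Toy stand-ins: the base doubles x, expert s scales x by s.
base = lambda x: [2 * v for v in x]
experts = [lambda x, s=s: [s * v for v in x] for s in (1, 10, 100)]
print(shared_plus_experts([1.0, 1.0], base, experts, router_scores=[0.0, 5.0, 5.0], k=2))
```

The base path runs for every token regardless of routing, so a collapsed router degrades to the original model plus a fixed correction rather than breaking it.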
