d_ff/d_model + swiglu tests | EleutherAI | Page 1

fallen spear Nov 2, 2023, 8:56 PM

#

Created so I can stop spamming #research and #off-topic with it

#

original thread in #off-topic begins here: #off-topic message

#

thread in #research : #research message

#

Hypothesis to test:

d_ff/d_model is optimal at exactly 1
swiglu and geglu are worse than gelu, and specifically only appeared to be better because the adjustment for isoflops brings d_ff/d_model closer to one

glacial schooner Nov 2, 2023, 9:09 PM

#

in the swiglu/geglu case they adjust the FF ratio without changing param count and without changing the fraction of params allocated to FF vs attn

#

(bc the meaning of "FF ratio" changes with those activations)

#

with a non-GLU FF, if you change the ratio, then either you change total param count or you change the FF vs attn balance

#

so it's not obvious to me that "use the FF ratio from the GLU paper, but without the GLU" has a clear meaning
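
To make the parameter accounting above concrete, here is a minimal sketch that counts feed-forward weights with and without a GLU. The function name `ffn_params` and the example width of 768 are illustrative assumptions, and biases are ignored.

```python
def ffn_params(d_model: int, d_ff: int, glu: bool) -> int:
    """Weight count of one feed-forward block, ignoring biases."""
    # A GLU variant has two input projections (gate and value); a plain FF has one.
    n_in_proj = 2 if glu else 1
    return n_in_proj * d_model * d_ff + d_ff * d_model


d_model = 768  # illustrative width, not taken from any run in this thread

dense = ffn_params(d_model, 4 * d_model, glu=False)
glu_same_width = ffn_params(d_model, 4 * d_model, glu=True)
# Shrinking d_ff by 2/3 brings the GLU block back to the dense parameter count.
glu_matched = ffn_params(d_model, 4 * d_model * 2 // 3, glu=True)

print(dense, glu_same_width, glu_matched)
```

Under this counting, a GLU block at the same hidden width carries 1.5x the weights of the plain block, which is why the hidden width and the parameter count cannot both be held fixed when the activation is swapped.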

fallen spear Nov 2, 2023, 9:11 PM

#

Proposed test:
Retrain pythia with d_ff/d_model from {1, 2} and activation of {gelu, swiglu, geglu}
Will do smallest pythia first

glacial schooner Nov 2, 2023, 9:11 PM

#

so, not keeping the param count identical?

fallen spear Nov 2, 2023, 9:11 PM

#

glacial schooner so it's not obvious to me that "use the FF ratio from the GLU paper, but without...

the ff ratio in this case referring to the width of the activation

#

not the number of parameters

#

isoflops is a bad test

#

isoflops requires you to alter at least two hparams at once

#

in this case specifically if swiglu/geglu are worse or equal they have more parameters

#

so they really should be better

glacial schooner Nov 2, 2023, 9:14 PM

#

if only d_ff is changing, it seems intuitive that larger d_ff would perform better (up to some point where training goes unstable or smth). bc the larger d_ff model "could" always just ignore some of its neurons, if that were optimal

#

but ofc that's just a guess.

fallen spear Nov 2, 2023, 9:16 PM

#

Okay I am tired of typing the whole expression so I am going to call d_ff/d_model “hidden ratio” henceforth

#

I have seen zero empirical evidence that hidden ratio above one is good

#

So far as I can tell it was set at four in the original transformers paper and never questioned

#

Swiglu/geglu modify it to maintain isoflops

#

For weak indicia that one is the optimal ratio, see the images attached to this post

#

Goal is to get stronger indicia

glacial schooner Nov 2, 2023, 9:17 PM

#

makes sense, if you're keeping total params constant

#

(which was true in both of those images)

#

otherwise, higher ratio = more params, and it seems hard for more params to hurt given the general smooth scaling properties of transformers

#

another way to look at it is: adding a new FF block is not too different from doubling the size of an existing one. like if the depth is already large, and representations are changing smoothly across layers

fallen spear Nov 2, 2023, 9:21 PM

#

glacial schooner (which was true in both of those images)

I am unsure if this is the case

#

The new paper says it keeps d_model constant

glacial schooner Nov 2, 2023, 9:21 PM

#

so it's hard to have a situation where half of the FF neurons at layer N are actively harmful, without it also being the case that if you added layer N+1, its entire FF would be harmful. which would be wild

glacial schooner Nov 2, 2023, 9:22 PM

#

fallen spear The new paper says it keeps d_model constant

they did, and varied d_ff and layer count together to maintain total params constant

#

hence "41M," etc labels

fallen spear Nov 2, 2023, 9:22 PM

#

I’m happy with a totally flat line across hidden ratio

#

It proves the point cleanly

fallen spear Nov 2, 2023, 10:04 PM

#

glacial schooner they did, and varied d_ff and layer count together to maintain total params cons...

I'm going to fuse their two tables that report relationship of layers to hidden size and that report results and graph them, I think

fallen spear Nov 2, 2023, 10:41 PM

#

#

📎 impact_depth_width_results.csv

#

In fact, cleaner to show inflection:

granite plover Nov 2, 2023, 11:23 PM

#

Could remake this plot with the x axis in log scale?

fallen spear Nov 2, 2023, 11:58 PM

#

granite plover Could remake this plot with the x axis in log scale?

#

the uptick in the biggest model in ppl as you reduce dim is the only datapoint that looks at all favorable, the others only gain ppl after the hidden size drops below the model size

dawn vine Nov 3, 2023, 1:12 AM

#

regarding the stated hypothesis, from personal experience (including a single training run I just did to make sure) for small models (say 123m params) ff_ratio>1 is better than ff_ratio==1 when keeping everything else the same and training for the same number of tokens
whether or not it's a worthwhile tradeoff in place of other kinds of resizing, I don't know

#

maybe if you adjust for the amount of layers to make the same total parameter count, ff_ratio=1 is optimal; that's what impact_depth_width_results.csv implies to me

dawn vine Nov 3, 2023, 2:00 AM

#

fallen spear Hypothesis to test: 1) d_ff/d_model is optimal at exactly 1 2) swiglu and geglu ...

Ran three short 50M-token tests, and if adjusted for parameters by adding layers, your hypothesis won:

params=123m ff_ratio=3 n_layers=12 d_model=768 loss=3.664
params=123m ff_ratio=1 n_layers=18 d_model=768 loss=3.656
params=123m ff_ratio=1 n_layers=12 d_model=900 loss=3.666

But with caveats: despite having the same parameter count as the first model, the 18 layer model was about 25% slower in actual training time per token

tardy trench Nov 3, 2023, 2:03 AM

#

dawn vine Ran three short 50mm token tests, and if adjusted for parameters by adding layer...

It would be interesting to control for wall time as well as parameter count, and see whether the optimal ratio changes.

One recent paper that controls for wall time is Kaddour et al., 2023 (optimizer evaluation). They adjust the learning rate schedule and number of training steps, and measure perplexity after training only.

dawn vine Nov 3, 2023, 2:06 AM

#

yeah, the other thing to take into account is that mundane things like autograd may cause your memory usage to be a lot higher with 18 layers than 12. Mine certainly was.

tardy trench Nov 3, 2023, 2:08 AM

#

dawn vine yeah, the other thing to take into account is that mundane things like autograd ...

That makes sense. For smaller model sizes, I also wonder if GPU utilization might be a bit lower for models with narrower d_ff, which could also contribute to the slowdown.

dawn vine Nov 3, 2023, 2:11 AM

#

tardy trench That makes sense. For smaller model sizes, I also wonder if GPU utilization migh...

Easily possible. And as your size increases, you might need an insane number of layers to displace the parameters generated by ff_ratio=4

#

Despite the slowdown, I think this ff_ratio=1 may be a good tradeoff, at least at lower scales... it seemed a lot more effective than adding parameters in more traditional ways in terms of speed versus loss improvement!

fallen spear Nov 3, 2023, 2:15 AM

#

dawn vine Ran three short 50mm token tests, and if adjusted for parameters by adding layer...

If you want similar wall clock time at isoflops increase model dim instead of layers

#

will also increase attn cost but that amortizes with scale

dawn vine Nov 3, 2023, 2:17 AM

#

argh very sorry, I quoted bad numbers above.. the result is the same but it's a closer call than I thought

#

will update in a minute

#

fixed above:

params=123m ff_ratio=3 n_layers=12 d_model=768 loss=3.664
params=123m ff_ratio=1 n_layers=18 d_model=768 loss=3.656
params=123m ff_ratio=1 n_layers=12 d_model=900 loss=3.666

dawn vine Nov 3, 2023, 2:24 AM

#

fallen spear If you want similar wall clock time at isoflops increase model dim instead of la...

ok, trying that now

fallen spear Nov 3, 2023, 2:25 AM

#

dawn vine ok, trying that now

wait, if that’s not what the first row of most recent is then how did it gain params?

#

i think param count off?

dawn vine Nov 3, 2023, 2:27 AM

#

yeah sorry i keep making bad edits

#

it was right above, i just tried to copy/paste and deleted part and retyped it incorrectly 🙂

fallen spear Nov 3, 2023, 2:28 AM

#

you good, i just don’t wanna be reasoning from off numbers

dawn vine Nov 3, 2023, 2:33 AM

#

even slower with bigger d_model

#

oh one other thing I should mention, this is a little nonstandard - it's using essentially the rwkv ffn and my own modified attention sublayer so take it with a grain of salt

fallen spear Nov 3, 2023, 2:43 AM

#

i'm going to guess it has to break the matmul into two ops

dawn vine Nov 3, 2023, 2:43 AM

#

but it should be a reasonable approximation

fallen spear Nov 3, 2023, 2:43 AM

#

dawn vine oh one other thing I should mention, this is a little nonstandard - it's using e...

ngl i like rwkv better

#

main reason to use a standard transformer is just for the 1:1 comparison

#

does make it harder to reason about why it would be breaking the matmul up though

dawn vine Nov 3, 2023, 2:44 AM

#

yeah i just had this lying around and figured id run one experiment.. then that became three

fallen spear Nov 3, 2023, 2:44 AM

#

i am blessed with bad local hardware which gives me the go-get-a-runtime friction to prevent me from yoloing

dawn vine Nov 3, 2023, 2:45 AM

#

its easy for me to run it on standard models - i have them implemented too - just had the data from a run here already

dawn vine Nov 3, 2023, 2:45 AM

#

fallen spear i am blessed with bad local hardware which gives me the go-get-a-runtime frictio...

hehe yeah im running this on some remote 4090

fallen spear Nov 3, 2023, 2:45 AM

#

dawn vine hehe yeah im running this on some remote 4090

vast? i keep hitting places that have 0 availability

dawn vine Nov 3, 2023, 2:45 AM

#

yeah, ive used jarvis and vast mostly

#

agreed about the 0 availability, it sucks

#

i almost went out and bought a 4090 a couple months ago bc of that

#

i may still

fallen spear Nov 3, 2023, 2:46 AM

#

i am broke but tempted to buy a used a100 on credit which is not a good impulse

dawn vine Nov 3, 2023, 2:48 AM

#

4090 is fine if you can get along with 24gb

#

roughly same speed as a100 in my experience

fallen spear Nov 3, 2023, 2:49 AM

#

memory constraint limits some stuff but fair enough

#

… wait I do have access to a 4090

#

here’s hoping he’s awake

dawn vine Nov 3, 2023, 2:51 AM

#

surprisingly, it looks like the larger d_model version is only about as good as the original wide ffn

#

updated the chart above

#

well, that was interesting!

fallen spear Nov 3, 2023, 2:55 AM

#

thanks, it does suggest pretty strongly that 1 isn’t actually the good number

soft bobcat Nov 3, 2023, 2:56 AM

#

do you have estimates of the standard deviations of these losses, even if they're just manual guesses?

dawn vine Nov 3, 2023, 2:58 AM

#

soft bobcat do you have estimates of the standard deviations of these losses, even if they'r...

sorry, I don't... but they stayed in a very consistent horserace so they're not wiggling around a lot

soft bobcat Nov 3, 2023, 2:58 AM

#

no, that's perfect. it's exactly the kind of plot that we can stare at to eyeball std devs

dawn vine Nov 3, 2023, 3:01 AM

#

well, interestingly 1 was the good number here, if you didn't care about runtime and adjusted for parameters by adding layers

#

I didn't necessarily expect that, though I really had no idea what to expect!

#

all I knew going in was that 1 is severely suboptimal if you can just add size to the FFN without worrying about cost

fallen spear Nov 3, 2023, 3:03 AM

#

i think the paper that set me off did that tradeoff and it continues to be true until about 400m params with ratio at about 2

#

ratio below that not tested for that many params

#

i went nuts because i’ve been bothered by a lack of ablations to d_ff for a while

dawn vine Nov 3, 2023, 3:05 AM

#

makes sense - i was definitely interested when I saw this thread and thought about it at all

fallen spear Nov 3, 2023, 3:06 AM

#

gotta say, the relatively small gain in loss for throwing out 75% of the parameters is still an interesting result

#

but those are some nice and steady losses, it’s impressive actually

dawn vine Nov 3, 2023, 3:08 AM

#

its on a small randomized part of the pile validation set

#

the training loss chart looks a lot choppier 😉

fallen spear Nov 3, 2023, 3:09 AM

#

lol fair

soft bobcat Nov 3, 2023, 3:09 AM

#

since I'm studying learning rates right now, I'll comment that I believe learning rate is supposed to decrease when depth increases

fallen spear Nov 3, 2023, 3:10 AM

#

soft bobcat since I'm studying learning rates right now, I'll comment that I believe learnin...

… actually that explains a result in the paper

#

loss goes up a smidge as they max out on depth towards the end

#

hparams identical across all runs

dawn vine Nov 3, 2023, 3:10 AM

#

soft bobcat since I'm studying learning rates right now, I'll comment that I believe learnin...

well maybe its even better than my experiments implied then, since I added 50% more layers but didn't adjust LR at all

fallen spear Nov 3, 2023, 3:30 AM

#

yeah uh

#

this table looks real noisy to me

fallen spear Nov 3, 2023, 4:44 AM

#

....

#

is this actually telling me the effective rank of a ff layer of this size is less than ten

fallen spear Nov 3, 2023, 5:08 AM

#

oh, references: https://arxiv.org/abs/2001.08361
https://arxiv.org/abs/2310.19956

dawn vine Nov 3, 2023, 2:26 PM

#

fallen spear is this actually telling me the effective rank of a ff layer of this size is les...

I think yes, in the sense of something a bit less than 'how many linearly independent columns/rows are there'

#

what's interesting about it is, that at 12 layers (for this 134m model), it seems 'optimal' in that you pack as much linearly independent info as possible into each layer

#

[probably not] coincidentally, that's exactly how many layers people usually use for such a d_model
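
For readers unfamiliar with the term, here is a rough sketch of two ways an "effective rank" can be measured from singular values. Both definitions are assumptions chosen for illustration and are not necessarily what the referenced paper computes; the function names and the toy matrix are hypothetical.

```python
import numpy as np


def effective_rank(w: np.ndarray) -> float:
    """Entropy-based effective rank: exp of the entropy of normalized singular values."""
    s = np.linalg.svd(w, compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]  # drop exact zeros so log() is defined
    return float(np.exp(-(p * np.log(p)).sum()))


def numerical_rank(w: np.ndarray, rel_tol: float = 1e-6) -> int:
    """Count singular values above a tolerance relative to the largest one."""
    s = np.linalg.svd(w, compute_uv=False)
    return int((s > rel_tol * s[0]).sum())


# Toy stand-in for an FF weight matrix: a 64 x 256 matrix built to have rank 8.
rng = np.random.default_rng(0)
low_rank = rng.standard_normal((64, 8)) @ rng.standard_normal((8, 256))

print(numerical_rank(low_rank), effective_rank(low_rank))
```

The entropy-based number is never larger than the count of nonzero singular values, which fits the "a bit less than the number of linearly independent columns/rows" reading above.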

fallen spear Nov 3, 2023, 2:42 PM

#

dawn vine I think yes, in the sense of something a bit less than 'how many linearly indepe...

I have to recheck but it seems somewhat at odds with results here: https://arxiv.org/abs/2206.06072
they used vision transformers and a more involved type of decomposition

dawn vine Nov 3, 2023, 2:50 PM

#

the main result I see is that it looked like in both 134m and 374m they were optimal ppl at around 2.66x ff_ratio, but a lower ff_ratio more like 1.5 for the tiny 41m model

#

which interestingly does not at all match my tests from last night

#

"When 𝑑model/𝑑ff > 1 (red dashed rule), perplexity slowly increases"

#

I don't see that as being true in their data at all

#

perplexity looks clearly lower to the left of that red line for all of the models

#

not hugely, but it's still lower!

fallen spear Nov 3, 2023, 3:03 PM

#

fallen spear In fact, cleaner to show inflection:

i mean this is from the same data

#

and the csv above it

dawn vine Nov 3, 2023, 3:04 PM

#

yeah, im talking about the ratio not the ff size which is what your plot shows

fallen spear Nov 3, 2023, 3:04 PM

#

i am trapped on mobile but i think the only one that gets worse before crossing ratio of 1 is the biggest one

#

yeah you have to go get d_model to figure where it should be on that plot

#

for each curve

#

can also check the csv for it

#

i am ignoring every metric besides validation ppl tbh

dawn vine Nov 3, 2023, 3:06 PM

#

oh i didnt expand the csv, only saw first few rows... looking now

fallen spear Nov 3, 2023, 3:07 PM

#

i guess the middle size gets slightly worse

#

but they also … add four layers

dawn vine Nov 3, 2023, 3:09 PM

#

wow my eyeballing was almost exactly right

#

best ratio for the biggest ones was around 2.66

#

smallest one very hard to tell from the data

#

but likely somewhere between 0.7 and 1.4

#

as mentioned, this is directly in opposition to my test results last night

#

which showed 1.0 as being useful when increasing layers accordingly

#

(w caveat of slowdown in training)

fallen spear Nov 3, 2023, 3:14 PM

#

i would suspect hparam issue for depth

#

but also, that’s a different and weaker hypothesis to test than that hidden dim is useless or only marginally useful

dawn vine Nov 3, 2023, 3:14 PM

#

i just ran more tests w/ different LR for the deep model.. limited evidence but lower LR was not better

dawn vine Nov 3, 2023, 3:15 PM

#

fallen spear but also, that’s a different and weaker hypothesis to test than that hidden dim ...

yeah it may be only marginally useful above 1.0

#

but in reality I'm not sure I can train fast enough with deeper models to make it 'worthwhile' to go to 1.0

fallen spear Nov 3, 2023, 3:16 PM

#

dawn vine yeah it may be only marginally useful above 1.0

i’d want to establish exactly how marginal and then where else equal params can go, tbh depth seems like the worst contender

#

specifically bc it harms both training and inference time like that, yeah

dawn vine Nov 3, 2023, 3:17 PM

#

got other ideas besides increasing d_model? that didn't pan out for me

fallen spear Nov 3, 2023, 3:17 PM

#

switch activations to something fancier

#

add attn heads

#

too involved to test on a lark, but: add attn sink tokens

dawn vine Nov 3, 2023, 3:18 PM

#

yeah increased attn size/heads could be helpful

fallen spear Nov 3, 2023, 3:18 PM

#

which doesn’t add params but does increase flops

dawn vine Nov 3, 2023, 3:19 PM

#

i will try more heads (so bigger total attn size)

fallen spear Nov 3, 2023, 3:19 PM

#

dawn vine got other ideas besides increasing d_model? that didn't pan out for me

MoE also

#

can do moe with four experts at ratio 1 for same cost

#

eight experts if only the down projection is an expert

#

i guess i am assuming no params in routing function or that such params are negligible

dawn vine Nov 3, 2023, 3:21 PM

#

i cant try MoE easily, and personally I'm more interested in practical ways to increase function of the base transformer than adding different architectures like that into the mix

#

but i can try heads/attention sizing easily

#

that's not to say MoE isn't a valid idea

fallen spear Nov 3, 2023, 3:21 PM

#

both moes and factorization are hobby horses i’d want to keep as last resorts

#

bc they don’t test the hypothesis cleanly

#

entirely separate cans of worms

fallen spear Nov 3, 2023, 3:23 PM

#

fallen spear both moes and factorization are hobby horses i’d want to keep as last resorts

eg another indirect test is to try to replace well trained up projections in existing llms with similar ones of ratio 1

#

but much like quantization that could only work after training is done

#

so i’m not chasing it because it doesn’t test the hypothesis meaningfully

#

oh you could also have up to four separate ff + geglu for equiparams, four times slower though

dawn vine Nov 3, 2023, 3:55 PM

#

I ran a new test with ff_ratio=1 and more attention heads instead of more layers and it did the best so far

dawn vine Nov 3, 2023, 4:32 PM

#

my next question is: if you always use ff_ratio=1, can you skip the hidden dimension entirely and just run your activation function on the input and project to output via a single linear layer?

fallen spear Nov 3, 2023, 4:36 PM

#

dawn vine my next question is: if you always use ff_ratio=1, can you skip the hidden dimen...

I would think this would harm performance, the number of nonlinearities along a path generally matters a lot

dawn vine Nov 3, 2023, 4:36 PM

#

my testing agrees with you 🙂

fallen spear Nov 3, 2023, 4:36 PM

#

... what is the most convoluted activation function

#

i think the answer is "attention", actually

fallen spear Nov 3, 2023, 4:37 PM

#

dawn vine my testing agrees with you 🙂

you can go the other way and make it two matrices of intermediate state

dawn vine Nov 3, 2023, 4:38 PM

#

fallen spear you can go the other way and make it two matrices of intermediate state

sorry, not sure I understand what you mean

fallen spear Nov 3, 2023, 4:38 PM

#

dawn vine sorry, not sure I understand what you mean

instead of having two matrices to up project and down project you can run three or four

#

which is very silly

#

and also slow

dawn vine Nov 3, 2023, 4:38 PM

#

haha

#

well, being rwkv-like, this has all included a gating on the FFN as well

#

so it sort of does

fallen spear Nov 3, 2023, 4:39 PM

#

fallen spear i think the answer is "attention", actually

if we want to follow the swiglu-esque route I am thinking of "what would eat up parameters"

#

and throwing a qkv in there would do that

#

but i think we're verging on trolling <_<

dawn vine Nov 3, 2023, 4:41 PM

#

trolling ourselves lol

#

it's not a perfect test bc of parameter count mismatch but removing the gate in exchange for another layer was not a good trade

#

(just tried that)

#

my best revised hypothesis so far, from my testing results, is that ff_ratio=1 is a good trade for more heads, but that it may or may not be actually worth it in terms of overall training time slowdown

#

that slowdown will increase as you increase context length

#

whereas a larger hidden_dim is unrelated to context length

#

as a result, adding layers could be a more worthwhile trade, but might be more likely to cause you vram problems
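
The context-length point can be sketched with a back-of-envelope FLOP count per token. This assumes standard dense attention, counts a multiply-add as 2 FLOPs, ignores the causal-mask discount, softmax, and norms, and uses a hypothetical head size of 64; none of these numbers come from the runs above.

```python
def ffn_flops_per_token(d_model: int, d_ff: int) -> int:
    # Up and down projections: two matmuls of d_model * d_ff multiply-adds each.
    return 2 * 2 * d_model * d_ff


def attn_flops_per_token(d_model: int, d_attn: int, context_len: int) -> int:
    # Q, K, V and output projections: four matmuls of d_model * d_attn multiply-adds.
    projections = 4 * 2 * d_model * d_attn
    # Scores (Q.K^T) and the weighted sum over V both scale with context length.
    mixing = 2 * 2 * context_len * d_attn
    return projections + mixing


d_model = 768
head_dim = 64  # assumed head size; d_attn = n_heads * head_dim

for context_len in (1024, 8192):
    ffn = ffn_flops_per_token(d_model, d_ff=d_model)
    attn_12_heads = attn_flops_per_token(d_model, 12 * head_dim, context_len)
    attn_24_heads = attn_flops_per_token(d_model, 24 * head_dim, context_len)
    print(context_len, ffn, attn_12_heads, attn_24_heads)
```

In this estimate the FFN term has no dependence on context length, while the attention term grows linearly with it and with the total attention width.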

fallen spear Nov 3, 2023, 5:01 PM

#

i’m gonna do a 1:1 pythia run at ratio 2 and 1 when i can

#

i might only go to a couple checkpoints on each

dawn vine Nov 3, 2023, 5:11 PM

#

fallen spear is this actually telling me the effective rank of a ff layer of this size is les...

this comment made me think about something I was already gonna try but never got around to... lora style ffn

fallen spear Nov 3, 2023, 5:14 PM

#

dawn vine this comment made me think about something I was already gonna try but never got...

someone in #research did that but i do not remember who

#

#research message

#

it was spy

#

there are other confounding variables maybe idk

dawn vine Nov 3, 2023, 5:15 PM

#

yeah not quite what I was thinking of

fallen spear Nov 3, 2023, 5:16 PM

#

dawn vine that slowdown will increase as you increase context length

wait: this doesn't apply to rwkv, does it?

dawn vine Nov 3, 2023, 5:18 PM

#

fallen spear wait: this doesn't apply to rwkv, does it?

i use traditional dense attention with some surrounding modifications in the time-mix section, but rwkv style FFN (channel mix)

#

but you're right, it wouldn't apply for rwkv

feral cairn Nov 3, 2023, 5:50 PM

#

So what's the upshot of the current tests @dawn vine @fallen spear

dawn vine Nov 3, 2023, 5:52 PM

#

feral cairn So what's the upshot of the current tests @dawn vine @fallen spear

for params=123m d_model=768, my tests showed that ff_ratio=1 is in fact better perplexity if you scale up n_layers or preferably n_heads accordingly to maintain parameter count
but the caveat is the training goes somewhat slower per token ingested and may use more vram

#

so where exactly the tradeoff becomes worthwhile is unclear, and probably dependent on a variety of factors (e.g. for attention, more heads will cost a lot for large context lengths, and for many layers vram or even convergence may become a problem)

feral cairn Nov 3, 2023, 6:00 PM

#

dawn vine for params=123m d_model=768, my tests showed that ff_ratio=1 is in fact better p...

And what activation function are you using?

dawn vine Nov 3, 2023, 6:08 PM

#

feral cairn And what activation function are you using?

rwkv style FFN w/ GELU and a sigmoid gate

fallen spear Nov 3, 2023, 6:08 PM

#

I’m looking at a straight pythia when I can set up an env which did not occur last night

dawn vine Nov 3, 2023, 6:10 PM

#

was just trying to get a feel for the outcomes so I used whatever I had handy, but ended up doing a bunch of tests once it got interesting 🤣

#

so ymmv with more traditional models or other ffn styles

fallen spear Nov 3, 2023, 6:22 PM

#

"larger d_ff does literally nothing" doesn't seem like it's doing well as a hypothesis although I'd be curious how it behaves at higher d_model especially, "putting the d_ff params somewhere else is usually better" looks pretty good to me rn

#

although, honestly: the fact that ablating d_ff only hurts it a little bit in this case is still an interesting result

#

most ablations hurt ... more than that, I think?

#

on a completed result if it's got a flat and steady ppl gap like it does in the preliminary test I'd want to calculate how much more training would close it since it is ablating approx 75% of params and that is a plausibly worthwhile tradeoff

dawn vine Nov 3, 2023, 6:32 PM

#

fallen spear on a completed result if it's got a flat and steady ppl gap like it does in the ...

my guess is that as it becomes a larger % of the model's parameters, ablating is gonna hurt a lot more

fallen spear Nov 3, 2023, 6:36 PM

#

dawn vine my guess is that as it becomes a larger % of the model's parameters, ablating is...

My argument for the reverse would be that as the d_model becomes wider state will be less bottlenecked and therefore less likely to benefit from up-projection

dawn vine Nov 3, 2023, 6:42 PM

#

this thing about the rank being only 10 is still hurting my brain

soft bobcat Nov 3, 2023, 7:14 PM

#

this is a sparsity idea inspired by the hidden dimension expansion. is there a way to be intermediate between (hidden=input dimension) and (hidden = 4x input dimension)? sure, let's suppose the hidden dimension is 4x the input dimension.

divide the hidden dimensions into 4 blocks, and label them 00 01 10 11.
divide the input dimensions into two sections: A and B.

section A outputs to 11 and 01. section B outputs to 11 and 10. block 00 receives no inputs; it's just there for notational convenience. as a consequence, each neuron outputs to half the hidden dimensions. the output connections are left fully connected for now.

#

here's the intuition: the binary code of the hidden dimensions says "what coordinates I will accept input from". 11 takes input from everything, so it's fully connected. 00 takes input from nothing, so it's not connected. the 11 block reproduces the ff_ratio=1 topology. the other dimensions are for passing through parameters without being able to process them, if that's what the extra dimensions are doing.

in the original transformer model, the purpose of those extra dimensions is probably to expand the number of nonlinearities that can be created. but maybe some of those nonlinearities don't require the full flexibility of all N dimensions; that is what the 10 and 01 blocks are for, as they do partial processing.

the division into A and B is hacky and mathematically unpleasant, so it won't do much more than passing through parameters.

the division can be extended, such as into 3-digit binary codes + sections ABC. in the limit, the construction actually makes sense both mathematically and computationally, since hidden dimensions correspond to tuples of input dimensions. but we're nowhere close to that.
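
The two-digit version of this construction can be sketched as a connectivity mask on the input projection. The layout below (block order 00, 01, 10, 11 and an even split of the input into A and B) is one assumed reading of the description, and `block_mask` is a hypothetical helper name.

```python
import numpy as np


def block_mask(d_in: int) -> np.ndarray:
    """Boolean mask of shape (d_in, 4 * d_in): True where an input feeds a hidden unit."""
    assert d_in % 2 == 0, "input is split evenly into sections A and B"
    mask = np.zeros((d_in, 4 * d_in), dtype=bool)
    section_a = slice(0, d_in // 2)
    section_b = slice(d_in // 2, d_in)
    # Hidden blocks in the order 00, 01, 10, 11, each d_in wide.
    block = {code: slice(i * d_in, (i + 1) * d_in)
             for i, code in enumerate(("00", "01", "10", "11"))}
    mask[section_a, block["01"]] = True  # A outputs to 01 ...
    mask[section_a, block["11"]] = True  # ... and to 11
    mask[section_b, block["10"]] = True  # B outputs to 10 ...
    mask[section_b, block["11"]] = True  # ... and to 11
    return mask  # block 00 receives no inputs


mask = block_mask(8)
print(mask.sum(axis=1))  # hidden units reached by each input dimension
```

Multiplying a dense (d_in, 4 * d_in) weight matrix elementwise by this mask would give each input dimension connections to exactly half of the hidden units, with the 11 block fully connected as in the ff_ratio=1 case.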

dawn vine Nov 3, 2023, 7:21 PM

#

soft bobcat this is a sparsity idea inspired by the hidden dimension expansion. is there a w...

i like the direction you're going, but I'm a little unclear on what blocks 01 and 10 help with

#

my idea was more like this:
I adjusted a ff_ratio=1 FFN to learn additive differences i.e. instead of w_out(w_in(x)) it's w_out(x + w_in(x))
that still worked pretty good!
so then lora it i.e. w_out(x + w_in_b(w_in_a(x))) where w_in_a and w_in_b bring it down to 1/4 size and back

#

or maybe skip the w_out entirely and run the activation function in the middle of the lora sandwich

#

so far the lora part hasn't panned out very well 🙂

#

but I may have bad initialization
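
A rough sketch of the three variants described in this message group, written with NumPy. Where the nonlinearity sits is not stated above, so applying GELU just before `w_out` is an assumption, as are the toy sizes; the gate used in the actual runs is omitted.

```python
import numpy as np


def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def ffn_plain(x, w_in, w_out):
    # w_out(w_in(x)) with an activation in between (placement assumed)
    return gelu(x @ w_in) @ w_out


def ffn_additive(x, w_in, w_out):
    # w_out(x + w_in(x)): w_in only has to learn a difference from the identity
    return gelu(x + x @ w_in) @ w_out


def ffn_additive_lora(x, w_in_a, w_in_b, w_out):
    # w_out(x + w_in_b(w_in_a(x))): the difference is factored through a narrow rank
    return gelu(x + (x @ w_in_a) @ w_in_b) @ w_out


d, rank = 8, 2  # toy sizes; rank = d / 4 as in the message above
rng = np.random.default_rng(0)
x = rng.standard_normal((3, d))
w_in = rng.standard_normal((d, d)) * 0.1
w_in_a = rng.standard_normal((d, rank)) * 0.1
w_in_b = rng.standard_normal((rank, d)) * 0.1
w_out = rng.standard_normal((d, d)) * 0.1

print(ffn_plain(x, w_in, w_out).shape,
      ffn_additive(x, w_in, w_out).shape,
      ffn_additive_lora(x, w_in_a, w_in_b, w_out).shape)
```

With the rank at a quarter of the width, the factored input path holds d*d/2 weights instead of d*d, so the whole block has 1.5 d^2 weights rather than 2 d^2.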

soft bobcat Nov 3, 2023, 7:46 PM

#

dawn vine i like the direction you're going, but I'm a little unclear on what blocks 01 an...

suppose there was no nonlinearity. then ff_ratio = 1 would be all that is necessary. since ff_ratio > 1 is creating benefits, that means there needs to be nonlinearities in more hyperplanes than there are input dimensions. but consider what these nonlinearities look like: most likely some are very nonlinear, and some are mostly linear. a 01 block has a reduced ability to express a nonlinearity - so it's more appropriate for a mostly linear axis.

#

so if we look at it from an expressibility point of view, if we have a mix of nonlinear and linear things, we put the linear things in the 10 and 01 buckets, which frees up the 11 buckets to express the nonlinear things. the 10/01 buckets are cheap and less powerful, good for linear things

#

and since we're in the high-dimensional regime, being expressible could mean that it's trainable too

dawn vine Nov 3, 2023, 7:49 PM

#

soft bobcat so if we look at it from an expressibility point of view, if we have a mix of no...

do you skip activation function on 01 and 10 and just run it on 11?

soft bobcat Nov 3, 2023, 7:49 PM

#

activation function on all three

#

consider what an activation function on 10 can do: its hyperplane can only handle dimensions from B. that kind of sucks. but that may be good enough for some nonlinearities, and the really hard nonlinearities train their way into being in 11 instead

fallen spear Nov 3, 2023, 9:06 PM

#

dawn vine so far the lora part hasn't panned out very well 🙂

my impression from other lora tricks during pretraining is that lora is actually kind of jank during pretraining

#

the final result for models might generally be low rank but gradients are not necessarily of low rank

shut badge Nov 4, 2023, 11:45 PM

#

So the original work here seems to claim that for a fixed parameter count you want more depth, rather than ff expansion. I was excited to try using a deeper but "thinner" model, but then I realized that the deeper + thinner model is going to be insanely more computationally expensive. Having very wide MLPs is good because it's very efficient and very parallelizable. So anyways, reducing MLP width a lot didn't even afford enough compute to add one more Attn+MLP while being compute equivalent

#

I really dislike "parameter equivalent" comparisons that ignore the compute requirement. No free lunch here.

#

If anything, my benchmarking lends a lot of credence to PaLM 1's very high 8x expansion.

fallen spear Nov 5, 2023, 2:54 AM

#

that’d be an interesting experiment too

#

but: parameter equivalent ignores compute (and anything else weird you do while doing it), compute equivalent ignores inference time (and maybe something else I am missing). ablation is nice because it lets you check exact amount of degradation along some axis

#

... actually that's an interesting perspective, computing isoflops against their param-matched thing, presumably there's a minimum for perf/flops on their graph somewhere

#

i guess your objection isn’t compute so much as wall clock time; the projection adds little or no time because it’s nicely parallel and without it the ff is a lot thinner than attn so it’s a relative dead spot in the utilization

fallen spear Nov 5, 2023, 6:06 AM

#

current status

dawn vine Nov 5, 2023, 11:56 AM

#

shut badge So the original work here seems to claim that for a fixed parameter count you wa...

How about adding more attention heads instead of layers? That was more effective for me than adding layers in my tests (but may still be expensive if you're using traditional mha and not some linear attention variant)

fallen spear Nov 5, 2023, 1:14 PM

#

dawn vine How about adding more attention heads instead of layers? That was more effective...

Probably better than the approach in the paper for the test

fallen spear Nov 5, 2023, 1:30 PM

#

but: still leaves the ff pass as an underutilized block of time in the gpu

#

i am increasingly convinced to find more ways to jam parameters into the activation function

dawn vine Nov 5, 2023, 1:32 PM

#

fallen spear i am increasingly convinced to find more ways to jam parameters into the activat...

I'm still confused about your idea on that

fallen spear Nov 5, 2023, 1:35 PM

#

dawn vine I'm still confused about your idea on that

geglu or swiglu basically trade up projection size for a more complex trainable activation, any function that takes extra params can be used to do the same but more

#

i am definitely still woolgathering for what function makes more sense though

#

basically geglu/swiglu replace an activation function of one variable with an activation function of two, if you assume up projection is the worst option you should be able to go from one big matrix to doing an element wise multiplication of four matrices with (virtually any) nonlinearity applied to them and beat the up projection as an operation

#

i am trying to consider only one index in the activation and think of functions of three or four scalar variables that have any obviously desirable or intuitive properties like how geglu/swiglu are “gating”
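
To make the one-variable versus two-variable framing concrete, here is a minimal scalar sketch of what happens at a single index of the hidden state. `gelu` and `silu` are the usual definitions, and `geglu` / `swiglu` follow the common GLU-variant form (activation on one projection, multiplied by a second linear projection). `gated3` is a purely hypothetical three-input extension added for illustration; it is not a function anyone in the thread committed to.

```python
import math


def gelu(a: float) -> float:
    """Exact GELU: a * Phi(a), with Phi the standard normal CDF."""
    return 0.5 * a * (1.0 + math.erf(a / math.sqrt(2.0)))


def silu(a: float) -> float:
    """SiLU / swish: a * sigmoid(a)."""
    return a / (1.0 + math.exp(-a))


def geglu(a: float, b: float) -> float:
    """Two-variable activation: GELU branch gated by a linear branch."""
    return gelu(a) * b


def swiglu(a: float, b: float) -> float:
    """Same shape as geglu, with SiLU on the first branch."""
    return silu(a) * b


def gated3(a: float, b: float, c: float) -> float:
    """Hypothetical three-variable extension: one nonlinear branch, two linear gates."""
    return gelu(a) * b * c
```

Here `a`, `b`, `c` stand for the values that separate up-projections produce at the same index, so every extra argument costs one more `d_model x d_model`-sized block of up-projection weights.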

dawn vine Nov 5, 2023, 1:57 PM

#

Then why not just use two ff_ratio=1 ffn w gates in a row

fallen spear Nov 5, 2023, 1:57 PM

#

dawn vine Then why not just use two ff_ratio=1 ffn w gates in a row

increases depth but doesn’t sound bad

dawn vine Nov 5, 2023, 1:58 PM

#

I guess that's still fewer parameters bc gate is only 1 way

fallen spear Nov 5, 2023, 1:58 PM

#

alternatively do (gate + gate + gate) * nongate

#

my brain wants there to be an elegant analytic thing that makes some kind of sense to eat up params

fallen spear Nov 5, 2023, 2:24 PM

#

other possibilities: maxout, multiply nonlinear gates together

feral cairn Nov 5, 2023, 2:29 PM

#

Have y'all read the muP / hyperparameter transfer work? I'm wondering if it implies we don't need to search through the whole search space for every model size

soft bobcat Nov 5, 2023, 2:32 PM

#

fallen spear i am trying to consider only one index in the activation and think of functions ...

from an expressibility point of view, if you have two activation functions that are different but equally good, then using both within a layer should be at least slightly better, as long as it's free computation-wise to divide a layer into two blocks and calculate different activation functions on each block
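
A minimal sketch of the "split the layer into two blocks" idea, assuming PyTorch and assuming ReLU and GELU as the two equally good activations (the message does not name a specific pair). The hidden state is cut in half along the feature dimension and each half gets its own elementwise activation, so the cost is essentially that of a single activation pass.

```python
import torch
import torch.nn.functional as F


def mixed_activation(h: torch.Tensor) -> torch.Tensor:
    """Apply ReLU to the first half of the features and GELU to the second half."""
    first, second = h.chunk(2, dim=-1)
    return torch.cat([F.relu(first), F.gelu(second)], dim=-1)
```

This would drop in wherever a feed-forward block currently calls a single activation on its up-projected hidden state.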

dawn vine Nov 5, 2023, 2:47 PM

#

feral cairn Have y'all read the muP / hyperparameter transfer work? I'm wondering if it impl...

I have - which hyperparameters were you thinking of that may need adjustment for this project?

fallen spear Nov 5, 2023, 3:23 PM

#

feral cairn Have y'all read the muP / hyperparameter transfer work? I'm wondering if it impl...

I have read it, my impression is that at the smaller scales we’re using things like depth are just fundamentally non transferrable because they behave differently at scale; i think i’d be more comfortable making that call for given adjustments iff those adjustments seemed to be scale invariant as we doubled total size a time or two

fallen spear Nov 5, 2023, 3:24 PM

#

soft bobcat from an expressibility point of view, if you have two activation functions that ...

it should be very close to free

fallen spear Nov 5, 2023, 3:26 PM

#

fallen spear I have read it, my impression is that at the smaller scales we’re using things l...

eg for simplest case if we had a straight line for effectiveness of ratios 1, 2, 4 across a couple scales i’d be comfortable that the ff ratio’s properties probably weren’t scale dependent

#

but: given, as a confounding variable, that they are also varying depth: that doesn’t appear to be the case for existing work

#

it is possible it’s only scale dependent at smaller scales

fallen spear Nov 5, 2023, 4:03 PM

#

fallen spear I have read it, my impression is that at the smaller scales we’re using things l...

also going to stop being an empiricist and say: i should probably check if the muP's theoretical justification for transferability would apply to d_ff ratio

fallen spear Nov 5, 2023, 4:05 PM

#

soft bobcat from an expressibility point of view, if you have two activation functions that ...

also: this is clever and if it cleanly beats existing ffs it will be amusing

#

it is so simple

soft bobcat Nov 5, 2023, 4:08 PM

#

mixture of 1D activation functions + bilinear GLU = can express any product of usual functions, as well as their sum. such as polynomial approximation, or x^2 e^-x, etc

#

plus function composition, like e^(-x^2), composing x->e^x and x->x^2

fallen spear Nov 5, 2023, 4:10 PM

#

soft bobcat plus function composition, like e^(-x^2), composing x->e^x and x->x^2

i don't see how it does composition but maybe if i squint at it longer

soft bobcat Nov 5, 2023, 4:10 PM

#

you need two layers in this case

#

honestly, it's really like a universal function approximator. take any expression, decompose it into single operations (such as by Reverse Polish Notation). then each layer computes one more term of the expression

fallen spear Nov 5, 2023, 4:12 PM

#

i can see how to do it with successive layers, i'm a little more iffy because that adds depth

soft bobcat Nov 5, 2023, 4:13 PM

#

yeah, no need to increase depth here. for a single layer, improvements should already exist

#

well, maybe. the benefit of having different activation functions in a layer is second order, which means the benefit is stronger in longer training runs than in shorter ones. but I wouldn't change the length of a training run just to inspect this change

#

here, what I mean by first order is "this neuron is outputting 5 but it should output 10", and then it moves from 5 to 10 linearly. second order means the change is through curvature rather than slope

fallen spear Nov 5, 2023, 4:18 PM

#

That makes sense to me, it is sort of interesting which mathematical points are clear and which aren't

#

... heh, I go looking at muP transfer and their primary reference is this bit of NTK: https://arxiv.org/abs/2011.14522

fallen spear Nov 5, 2023, 5:23 PM

#

and: it would have been concurrent work with gpt-neox, not sure if there's a strong case for gpt-neox to use the initialization etc it does instead of this one

dawn vine Nov 5, 2023, 8:43 PM

#

soft bobcat here, what I mean by first order is "this neuron is outputting 5 but it should o...

To clarify, does gelu(linear(x)) * sigmoid(linear_gate(x)) meet this definition for second order? And if so, are you proposing we add more terms so as to get to third or fourth order?

soft bobcat Nov 5, 2023, 8:48 PM

#

dawn vine To clarify, does gelu(linear(x)) * sigmoid(linear_gate(x)) meet this definition ...

here's an illustration of what I mean by second order: if you have points (0, 0), (1, 1), (2, 4), then a linear neuron can try to fit these points, and a quadratic neuron can fit them. the competition between these two neurons is second order. it takes time to train weight away from the linear neuron into the quadratic neuron. being second order in this context is a bad thing; it just means training is slower.

my proposal is to have a mixture of existing well-performing activation functions within a single layer (like ReLU and GLU), rather than to increase the dimension of an activation function.
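
The three-point illustration can be checked numerically. The sketch below fits the best least-squares line through (0, 0), (1, 1), (2, 4) and compares its squared error with that of the quadratic x -> x^2; the helper names are made up for this example.

```python
points = [(0.0, 0.0), (1.0, 1.0), (2.0, 4.0)]


def fit_line(pts):
    """Ordinary least-squares line through the points; returns (slope, intercept)."""
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in pts) / sum(
        (x - mean_x) ** 2 for x, _ in pts
    )
    return slope, mean_y - slope * mean_x


def sse(pts, f):
    """Sum of squared errors of f on the points."""
    return sum((f(x) - y) ** 2 for x, y in pts)


slope, intercept = fit_line(points)
linear_error = sse(points, lambda x: slope * x + intercept)  # best line still misses
quadratic_error = sse(points, lambda x: x * x)               # exact fit
```

The linear neuron gets most of the way there (slope 2, intercept -1/3) but keeps a residual error, which is the part the quadratic neuron has to slowly take over during training.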

dawn vine Nov 5, 2023, 8:50 PM

#

soft bobcat here's an illustration of what I mean by second order: if you have points (0, 0)...

Sorry, I'm basically asking if gating already accomplishes this via use of a second activation function

soft bobcat Nov 5, 2023, 8:51 PM

#

in terms of achieving diversity, yes. having two activation functions is better than one.

#

there is an unpleasant niggle there where multiplying two functions causes quadratic behavior, which causes gradient explosion (to a small extent) and hence should slow training down a bit. so gating may not be free. but tests seem to indicate it's worth it anyway
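
One rough way to see why products are harder to keep balanced: if the factors were independent standard normals (an assumption made only for this sketch, real pre-activations are neither independent nor exactly Gaussian), the product gets heavier tails with every extra factor. The kurtosis of a product of k such variables is 3^k in theory, which the sample estimate below tracks for small k.

```python
import random


def kurtosis(samples):
    """Fourth moment over squared second moment (samples are assumed zero-mean)."""
    n = len(samples)
    m2 = sum(s * s for s in samples) / n
    m4 = sum(s ** 4 for s in samples) / n
    return m4 / (m2 * m2)


def product_samples(k, n=200_000, seed=0):
    """Draw n samples of a product of k independent N(0, 1) variables."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        p = 1.0
        for _ in range(k):
            p *= rng.gauss(0.0, 1.0)
        out.append(p)
    return out


k1 = kurtosis(product_samples(1))  # theory: 3
k2 = kurtosis(product_samples(2))  # theory: 9
k3 = kurtosis(product_samples(3))  # theory: 27, sample estimate is noisy
```

Higher kurtosis means most outputs are small and a few are very large, which matches the "occasionally very big" behaviour behind the gradient concern.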

dawn vine Nov 5, 2023, 8:53 PM

#

Yeah I've seen it always be worth it

#

But if two is good, is three better?

#

Because gating is already a somewhat standard FFN improvement and maybe it works for the reason you outlined

#

Also, people already reduce the ffn width to accommodate the gate's parameter increase

#

Which is exactly the kind of thing we are discussing

soft bobcat Nov 5, 2023, 8:55 PM

#

a three-gated neuron could be better, but it's just speculation on my part. I don't have a mathematical understanding of how to balance the tiny gradient expansion issue with the improvement from diversity

dawn vine Nov 5, 2023, 8:55 PM

#

Your point about gradient explosion is well taken

#

I like the idea of allowing the ffn to simulate a more complex function tho

#

Via say a fourth order polynomial

soft bobcat Nov 5, 2023, 8:58 PM

#

conceptually, mixing one-dimension and two-dimension activation functions (especially bilinear rather than GeGLU) is very similar to having skip-connections

fallen spear Nov 7, 2023, 3:07 PM

#

turns out i had corrupt files from git lfs and the loading script was crashing silently :P

#

going to do a manual parity check once current lfs is done since apparently lfs doesn't have a parity check???

fallen spear Nov 14, 2023, 11:30 PM

#

uhhhh so i finally did the math correctly

#

if you do mxm instead of mx4m projection you save 6m^2 parameters

#

and then if you jam all those params into your up projection, your inner dimension is 6m, and you can then run a complicated nonlinearity on them to reduce the resulting state down to size m

#

giving you a d_ff to d_model of 6 if you count from before the nonlinearity

#

and 1 if you count after

#

at equiflops

#

i was wondering why palm reported a ratio of 6

dawn vine Nov 15, 2023, 12:14 AM

#

im missing something, don't really understand what parameters are going where in that

#

how are the experiments going?

fallen spear Nov 15, 2023, 2:47 AM

#

you take an input vector of size m, you run one m by 6m matrix or equivalently six separate m by m matrices

#

you perform (some nonlinearity that reduces size of vector by a factor of six) and then you apply another m by m matrix

#

you have d_ff/d_model of six if you count before the nonlinearity is applied and are at equal params to standard transformer attention

#

palm reports a ratio of six and as far as i’m aware nobody knows what nonlinearity they use or why it’s six

#

so … gonna go with that theory

#

it’s sort of a natural extension of swiglu or geglu
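
A quick tally of the weight counts being discussed, in units of m^2 where m = d_model. This counts only the up and down projection weights and ignores biases, which is an assumption about the accounting convention; the function name is made up for this sketch.

```python
from fractions import Fraction


def ff_weights_in_m2(up_ratio, down_in_ratio):
    """Up + down projection weights, in units of d_model**2 (biases ignored).

    up_ratio:      total width of the up projection(s) divided by d_model
    down_in_ratio: width fed into the down projection divided by d_model
    """
    return Fraction(up_ratio) + Fraction(down_in_ratio)


standard_4x = ff_weights_in_m2(4, 4)                             # m x 4m up, 4m x m down
geglu_8_3 = ff_weights_in_m2(2 * Fraction(8, 3), Fraction(8, 3)) # two up branches of 8/3 m
pooled_6_to_1 = ff_weights_in_m2(6, 1)                           # m x 6m up, m x m down
pooled_7_to_1 = ff_weights_in_m2(7, 1)                           # m x 7m up, m x m down
```

Under this particular tally the 6-up / 1-down layout comes to 7 m^2, slightly below the 8 m^2 of a standard 4x block or a GEGLU block at 8/3, and a 7x up-projection would match exactly. Whether the thread's "equiparams" figure uses a different convention is not stated, so treat this as a sanity check rather than a correction.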

shut badge Nov 15, 2023, 11:08 PM

#

dawn vine To clarify, does gelu(linear(x)) * sigmoid(linear_gate(x)) meet this definition ...

happy to try this or other exps if folks are curious, have the setup for it.

#

this kinda reminds me of the sigmoid gating in RWKV

fallen spear Nov 16, 2023, 12:06 AM

#

realized you could also gate the residual connection

dawn vine Nov 16, 2023, 3:42 AM

#

fallen spear realized you could also gate the residual connection

I've tried it. Was a small but measurable advantage

#

Only way it worked was t, 2-t

fallen spear Nov 16, 2023, 3:43 AM

#

Can you break down how it was gated in more detail? Not sure I get it.

dawn vine Nov 16, 2023, 3:46 AM

#

Sure.


t = residual_mix(x)
x = x * t + fn(x) * (2-t)

#

No guarantees it works in a more general setting, and at least one person tried it and claimed the gating ended up stuck at like .9 for all channels

#

But it consistently gave me a small loss benefit in training. Never looked into it enough to know if that was still true at validation or test time
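
A minimal PyTorch sketch of the `t, 2-t` gating above. The message does not say how `residual_mix` is computed, so this assumes a zero-initialised linear layer followed by `2 * sigmoid`, which keeps `t` in (0, 2) and makes the block start out as a plain `x + fn(x)` residual. The class name and that parameterisation are assumptions.

```python
import torch
from torch import nn


class GatedResidual(nn.Module):
    """Wraps a sublayer fn with the mix x * t + fn(x) * (2 - t)."""

    def __init__(self, fn: nn.Module, dim: int):
        super().__init__()
        self.fn = fn
        # Assumed gate: zero init gives sigmoid(0) = 0.5, so t starts at exactly 1.
        self.proj = nn.Linear(dim, dim)
        nn.init.zeros_(self.proj.weight)
        nn.init.zeros_(self.proj.bias)

    def residual_mix(self, x: torch.Tensor) -> torch.Tensor:
        return 2.0 * torch.sigmoid(self.proj(x))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        t = self.residual_mix(x)
        return x * t + self.fn(x) * (2.0 - t)
```

Logging the mean of `t` per channel during training would show whether the gate moves or stays parked near a constant, as in the report of it sitting around .9.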

dawn vine Nov 16, 2023, 3:58 AM

#

shut badge happy to try this or other exps if folks are curious, have the setup for it.

@fallen spear are there larger ffn exchange experiments you need done? E.g. a larger version of my swap ff_ratio for attention heads test

fallen spear Nov 16, 2023, 4:11 AM

#

dawn vine @fallen spear are there larger ffn exchange experiments you need done? E...

i have gotten myself stuck in environment hell, does it work for you if I just send FF layer variants at equiparams

#

(on the plus side i should have a really clean environment i can reuse whenever i climb out of my current hole)

fallen spear Nov 16, 2023, 6:59 AM

#

https://gist.github.com/segyges/4e8c65913df415e8c214d8e27dadbb8b
take your pick
the only ones I am sort of serious about are ProductPoolingWithGelu and YinLU

#

todo: figure out how folks are doing their residual connections, I can use some of the literally six different matrices we have to play with to gate and scale it, probably with a swiglu or geglu. will have fundamentally different call signature inasmuch as normally the FeedForward is self-contained

shut badge Nov 16, 2023, 7:20 AM

#

I can try some of these tomorrow!

fallen spear Nov 16, 2023, 7:27 AM

#

fallen spear https://gist.github.com/segyges/4e8c65913df415e8c214d8e27dadbb8b take your pick ...

actually, maybe try product pooling without the gelu too

dawn vine Nov 16, 2023, 12:39 PM

#

fallen spear https://gist.github.com/segyges/4e8c65913df415e8c214d8e27dadbb8b take your pick ...

@boreal moss already tried some variations very similar to these, but with only 4x. They worked pretty well, especially for smaller and/or non LLM models.

fallen spear Nov 16, 2023, 2:04 PM

#

the ones to beat are geglu and swiglu since they’re currently considered sota

#

my guess is going to be that several of these are better bc the params are better literally anywhere but in a down proj

#

(geglu and swiglu not represented in that gist)

fallen spear Nov 16, 2023, 3:01 PM

#

writing all those out gave me more ideas

boreal moss Nov 16, 2023, 3:57 PM

#

more than 3 parallel layers multiplied together makes it perform worse, adding any activation function anywhere makes it worse, I was testing on <1M parameter models and it was better than gelu by a large margin, doesn't look like it holds for bigger models

fallen spear Nov 16, 2023, 4:50 PM

#

boreal moss more than 3 parallel layers multiplied together makes it perform worse, adding a...

can you give the exact activations used that performed poorly once scaled at all?

#

there is a basic problem where there are too many different things you can do so specificity is helpful

soft bobcat Nov 16, 2023, 4:52 PM

#

I expect this to be significantly easier to train than YinLU: pooled = torch.sin(x_1) + torch.exp(-torch.pow(x_2, 2)) + torch.tanh(x_3) + torch.nn.functional.gelu(x_4) + x_5 * x_6. the reason why is that multiplying lots of random variables together will cause gradient imbalance (= gradient explosion, even if normalized by layer norm), since occasionally the multiplication will be very big and usually not so big.
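
For reference, a minimal sketch of how that pooled expression could sit inside a feed-forward block, assuming PyTorch and the "up-project x6, pool down to d_model, then an m by m down projection" layout described elsewhere in the thread. The class name is made up and the gist's actual module may differ.

```python
import torch
from torch import nn
import torch.nn.functional as F


class YinLU2FeedForward(nn.Module):
    """Up-project to 6 * d_model, sum five different terms, project back."""

    def __init__(self, d_model: int):
        super().__init__()
        self.up = nn.Linear(d_model, 6 * d_model)
        self.down = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Six equally sized slices of the up projection, one per input of the pooling.
        x_1, x_2, x_3, x_4, x_5, x_6 = self.up(x).chunk(6, dim=-1)
        pooled = (
            torch.sin(x_1)
            + torch.exp(-torch.pow(x_2, 2))
            + torch.tanh(x_3)
            + F.gelu(x_4)
            + x_5 * x_6
        )
        return self.down(pooled)
```

Only the last term multiplies two projections together, so at most two random quantities are ever multiplied, which is the property the message argues makes it easier to train.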

fallen spear Nov 16, 2023, 4:52 PM

#

soft bobcat I expect this to be significantly easier to train than YinLU: pooled = torch.sin...

i christen it yinlu2

soft bobcat Nov 16, 2023, 4:53 PM

#

of course, the reason why the original activation function is called "YinLU" is because Gyges deserves all the credit for the idea, so by Stigler's law, I got the title

fallen spear Nov 16, 2023, 4:54 PM

#

i propose fourierlu: sum of sin(x_1*x_n) for n from 2 to 6

#

i cannot write it right this second

fallen spear Nov 16, 2023, 4:56 PM

#

soft bobcat of course, the reason why the original activation function is called "YinLU" is ...

tbh i wonder how many examples of stigler’s law are due to name exhaustion, naming one after you was the last step before i just started concatenating silly prefixes onto each other

#

i can at least remember that yinlu is “precisely that function that was transcribed verbatim from a message by kevin yin”

#

i have no idea which is which for the others

fallen spear Nov 16, 2023, 4:59 PM

#

soft bobcat I expect this to be significantly easier to train than YinLU: pooled = torch.sin...

i suspect this is also a good test because we should see some of the terms zero out consistently by operation which will give us a good metric for which carry more signal

soft bobcat Nov 16, 2023, 5:06 PM

#

yes, inspecting the activations and L2 weights of the final network is a good idea. in my original plan, which is just a mixture of different neurons inside a layer, I would dynamically change the proportions of the activation functions to reward the ones that did better for a problem. but since all 6 activation functions are summed together in your network, their proportions are fixed to 1/6, so it's all-or-nothing

#

note that it's the nonlinearity you care about inspecting, not exactly the zeroing out: what separates one activation from another is whether the inputs cause it to travel along a wide range and express a diversity of its outputs. if the inputs stay within a small domain (like [-0.1, 0.1] for cos(x)), the activation function isn't special; it looks just like a parabola over that stretch, just like any other activation function

fallen spear Nov 16, 2023, 5:40 PM

#

fallen spear i propose fourierlu: sum of sin(x_1*x_n) for n from 2 to 6

in retrospect it's not a linear unit, so call it "fourier pooling" or something

fallen spear Nov 16, 2023, 5:48 PM

#

fallen spear i propose fourierlu: sum of sin(x_1*x_n) for n from 2 to 6

also i think this removes the need for layer norm because the output is bounded to a maximum of 5

#

i guess unless "collapse of activations to zero" is one of the collapse modes
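
A small scalar sketch of the proposed pooling, sum of sin(x_1 * x_n) for n from 2 to 6, using only the standard library. It checks the bound of 5 on random inputs (the input range is arbitrary) and also shows the degenerate case where x_1 is exactly zero.

```python
import math
import random


def fourier_pool(x):
    """x holds the six pre-activations (x_1, ..., x_6) for one output index."""
    x_1, rest = x[0], x[1:]
    return sum(math.sin(x_1 * x_n) for x_n in rest)


rng = random.Random(0)
values = [
    fourier_pool([rng.uniform(-10.0, 10.0) for _ in range(6)]) for _ in range(1000)
]
largest = max(abs(v) for v in values)      # never exceeds 5: five sines, each in [-1, 1]
dead = fourier_pool([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])  # x_1 = 0 zeroes every term
```

The bound holds for any inputs, but when x_1 is zero the output is zero no matter what the other five values are.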

soft bobcat Nov 16, 2023, 5:53 PM

#

fallen spear also i think this removes the need for layer norm because the output is bounded ...

I agree, no layer norm needed

#

performance of fourierlu will probably suck though, for the same reason that sigmoid fell out of favor

#

there's an alternative Fourier formulation, where you take x_1 and x_2 as a complex number x_1 + i x_2, and same for x_3 and x_4, and x_5 and x_6. so it becomes a complex exponential e^(z_1 z_2) + e^(z_1 z_3). however, e^(x) may be a very crazy activation function, and it also produces two outputs instead of one. although exp is the more natural fourier structure, unless exp(x) behaves well (I doubt it), it will also not work

fallen spear Nov 16, 2023, 5:57 PM

#

soft bobcat there's an alternative Fourier formulation, where you take x_1 and x_2 as a comp...

you can take angle or magnitude for the resulting complex to make it scalar

#

but then i think maybe it reduces to an inner product

fallen spear Nov 16, 2023, 5:59 PM

#

soft bobcat performance of fourierlu will probably suck though, for the same reason that sig...

i am not sure this applies btw specifically because sigmoid gradient approaches zero as x goes to infinity

soft bobcat Nov 16, 2023, 6:00 PM

#

fallen spear i am not sure this applies btw specifically because sigmoid gradient approaches z...

ok, that's a good point. in terms of activation functions, all intuition is thrown away and we are reduced to guessing, so the sigmoid lesson is sufficiently dissimilar that it doesn't transfer

fallen spear Nov 16, 2023, 6:01 PM

#

i think also you can rescale your linear layer for cases where you end up below or above 2pi?

soft bobcat Nov 16, 2023, 6:03 PM

#

fallen spear i think also you can rescale your linear layer for cases where you end up below ...

what are the weights and biases for the case sin(x_1 x_2)? is it sin((w_1 x_1 + b_1)(w_2 x_2 + b_2))?

fallen spear Nov 16, 2023, 6:04 PM

#

soft bobcat what are the weights and biases for the case sin(x_1 x_2)? is it sin((w_1 x_1 + ...

yes

#

My concern would be numerical stability if any of the X end up very large, having perhaps gone through many cycles of pi

soft bobcat Nov 16, 2023, 6:06 PM

#

with this set of weights and biases, you can't shift by 2 pi anywhere

#

sin((w_1 x_1 + b_1)(w_2 x_2 + b_2) + 2pi) gives the same value, but there isn't a way to merge the 2pi into any of the constants

#

whereas if we had sin((w_1 x_1 + b_1) + 2pi), we could write b = b_1 + 2pi and do a shift
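
The difference between the two cases can be checked with a few numbers. In the sketch below the inputs and weights are arbitrary illustrative values; adding 2pi to the bias leaves the single-linear form unchanged but changes the product form.

```python
import math

TWO_PI = 2.0 * math.pi


def single(x, w, b):
    """sin(w x + b): a 2*pi shift can be absorbed into the bias."""
    return math.sin(w * x + b)


def product(x_1, x_2, w_1, b_1, w_2, b_2):
    """sin((w_1 x_1 + b_1)(w_2 x_2 + b_2)): no constant absorbs a 2*pi shift."""
    return math.sin((w_1 * x_1 + b_1) * (w_2 * x_2 + b_2))


x_1, x_2 = 0.3, 0.7
w_1, b_1, w_2, b_2 = 1.1, 0.2, 0.9, -0.4

single_gap = abs(single(x_1, w_1, b_1) - single(x_1, w_1, b_1 + TWO_PI))
product_gap = abs(
    product(x_1, x_2, w_1, b_1, w_2, b_2)
    - product(x_1, x_2, w_1, b_1 + TWO_PI, w_2, b_2)
)
```

In the product form the shift gets multiplied by the other factor, so it only cancels for special values of that factor.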

fallen spear Nov 16, 2023, 6:08 PM

#

huh, maybe it would need normalization of input then

#

“move normalization into the ff layer” was not on my bingo card

soft bobcat Nov 16, 2023, 6:09 PM

#

I don't see how the input could get large either. it's a sum of many sines (all capped at 1), times a dense matrix. so the input x will only get as large as the terms in the dense matrix

#

oh wait, the input is from the attention layer, not another fourierlu layer

fallen spear Nov 16, 2023, 6:10 PM

#

well, attn + linear

#

i would suspect that empirically with reasonable init and reasonable lr it will not be a problem

#

but it's hard to rule out completely

soft bobcat Nov 16, 2023, 6:11 PM

#

my knowledge of norm functions is also pure guessing, and my intuition doesn't work

#

sometimes batch norm is needed, sometimes layer norm, etc

fallen spear Nov 16, 2023, 6:11 PM

#

my other concern is zeroing out inputs, especially if x_1 manages to become zero

#

since sin(0) is 0

#

i dimly suspect slight nudge of noise might be a good idea

soft bobcat Nov 16, 2023, 6:12 PM

#

you could change sin to cos. maybe it will make a difference, maybe not

fallen spear Nov 16, 2023, 6:12 PM

#

i think same problem, at input 0 you have constant output and no gradient

soft bobcat Nov 16, 2023, 6:12 PM

#

right, cos is even worse

#

so sin(x_1 x_2) is effectively bilinear when x_1 or x_2 is near 0

fallen spear Nov 16, 2023, 6:13 PM

#

i think in some high level way they are the same problem and have the same solution, which I might be predisposed to see as gauss for fern reasons

soft bobcat Nov 16, 2023, 6:14 PM

#

nah: sine is bilinear when x_1 or x_2 is near 0. but cosine is fixed to 1 when either x_1 or x_2 is near 0 and the non-zero variable has no effect. it's a quadratic vs linear problem
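
A numeric check of the sine versus cosine point, using the analytic partial derivatives with respect to x_2 and an arbitrary small value for x_1.

```python
import math


def grad_sin_wrt_x2(x_1, x_2):
    """d/dx_2 of sin(x_1 * x_2) = x_1 * cos(x_1 * x_2)."""
    return x_1 * math.cos(x_1 * x_2)


def grad_cos_wrt_x2(x_1, x_2):
    """d/dx_2 of cos(x_1 * x_2) = -x_1 * sin(x_1 * x_2)."""
    return -x_1 * math.sin(x_1 * x_2)


x_1, x_2 = 1e-3, 2.0
sin_value = math.sin(x_1 * x_2)        # close to x_1 * x_2, i.e. bilinear
cos_value = math.cos(x_1 * x_2)        # close to 1 regardless of x_2
sin_grad = grad_sin_wrt_x2(x_1, x_2)   # first order in x_1
cos_grad = grad_cos_wrt_x2(x_1, x_2)   # second order in x_1
```

With x_1 near zero the sine still passes a gradient proportional to x_1 through to x_2, while the cosine's gradient shrinks like x_1 squared.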

fallen spear Nov 16, 2023, 6:14 PM

#

i read like four papers about trigonometric activations and cannot remember if they addressed this specific thing which seems very prominent when just looking at it mathematically

#

https://arxiv.org/abs/2006.09661 this one, I think, was the best mathematical treatment

#

they're pretty opinionated about initializations so worth thinking of

soft bobcat Nov 16, 2023, 6:31 PM

#

this paper illustrates how ReLU can only represent low-frequency details efficiently

fallen spear Nov 16, 2023, 6:59 PM

#

updated, with more bad ideas: https://gist.github.com/segyges/4e8c65913df415e8c214d8e27dadbb8b

#

i think all of these are interesting from InnerProductPooling on down

shut badge Nov 16, 2023, 10:01 PM

#

Any prioritization of which ones you think are most interesting?

#

That's a lot of ideas!

#

I'll start with the first one, except i plan to use an expansion of 4*4/3 instead of 6 to keep it consistent with my baseline.

fallen spear Nov 16, 2023, 10:12 PM

#

6 should be equiparams with a standard ratio of four

#

basically: the extra params are coming out of the down projection

#

i guess we could also push the extra params into the down projection

shut badge Nov 16, 2023, 10:14 PM

#

my baseline is geglu which was equiparams with gelu using expansion of 4

#

oh, wait, yea I see 🙂

fallen spear Nov 16, 2023, 10:14 PM

#

yes, equiparams with that in this regime should be 6 up and 1 down

soft bobcat Nov 16, 2023, 10:15 PM

#

is there a linear layer before AveragePoolingWithGelu?

fallen spear Nov 16, 2023, 10:15 PM

#

soft bobcat is there a linear layer before AveragePoolingWithGelu?

check top module in the file for intended impl, everything gets up projected x6

fallen spear Nov 16, 2023, 10:16 PM

#

shut badge I'll start with the first one, except i plan to use an expansion of 4*4/3 instea...

for my money: inner product pool, the geglu derivatives towards the end, the fourier pool are the most interesting

#

a significant fraction of the activations in the middle of the file are “why not”

#

and being unable to come up with why not

#

anything that multiplies in more than three things is almost certainly a bad idea

#

the max and average pool towards the beginning are almost certainly not good but might still beat geglu

soft bobcat Nov 16, 2023, 10:20 PM

#

but this is an LLM; how could average pooling possibly help? adjacent neurons are not related to each other

fallen spear Nov 16, 2023, 10:22 PM

#

my rationale is basically that geglu does not make any sense either and anything which pulls params away from the width of the down/up proj is probably an improvement by default

#

i don’t see a good, rational reason for avg pool to work

fallen spear Nov 16, 2023, 10:24 PM

#

fallen spear my rationale is basically that geglu does not make any sense either and anything...

i should say: away from the width of the activations

shut badge Nov 16, 2023, 10:25 PM

#

I'll try YinLU, inner product pool, fourier pool, and one of the geglus first

fallen spear Nov 16, 2023, 10:25 PM

#

shut badge I'll try YinLU, inner product pool, fourier pool, and one of the geglus first

start with yinlu2 probably

#

since it’s additive instead of multiplicative it seems less likely to explode

shut badge Nov 16, 2023, 10:26 PM

#

and here's my training codebase just for reference, the baseline is the "ibt" model: https://github.com/jonmorton/tart

soft bobcat Nov 16, 2023, 10:27 PM

#

average pooling does not change what is expressible; it only changes the training dynamics. whatever function is expressed with dense matrix->avg pool->GeLU can also be expressed by (dense matrix with avg pool baked in)->GeLU. similarly, you can reverse the transformation with (dense matrix with avg pool inverse)->avg pool->GeLU to produce just (dense matrix) -> GeLU. that means AveragePoolingWithGelu is effectively just a GeLU layer. in practice, the training will be different since it's imposing an inductive bias on training of "nearby neurons have some similarity", but that's not an inductive bias that is appropriate in the setting
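
The "avg pool baked in" step can be verified directly, since average pooling over groups of adjacent neurons is itself a linear map. A minimal PyTorch sketch, with the sizes (d_model = 8, groups of 6) chosen only for illustration:

```python
import torch
from torch import nn

torch.manual_seed(0)
d_model, k = 8, 6
up = nn.Linear(d_model, k * d_model)
x = torch.randn(4, d_model)

with torch.no_grad():
    # Path 1: dense up-projection, then average each group of k adjacent neurons.
    pooled = up(x).reshape(-1, d_model, k).mean(dim=-1)

    # Path 2: average the corresponding rows of the weight and bias ahead of time.
    folded_weight = up.weight.reshape(d_model, k, d_model).mean(dim=1)
    folded_bias = up.bias.reshape(d_model, k).mean(dim=1)
    folded = x @ folded_weight.T + folded_bias
```

Both paths feed identical values into the GeLU, so any difference between the two layers has to come from training dynamics rather than from what they can express.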

#

I would also have to think about the training dynamics; it's possible that the gradient undoes the avg pool, but I have no intuition about that

fallen spear Nov 16, 2023, 10:29 PM

#

soft bobcat average pooling does not change what is expressible; it only changes the trainin...

i think the same argument can be made for up projection; in principle up-projection x4 and down-projection x4 doesn't seem like it should increase expressivity, in practice it does help at least a little because it's easier for the nonlinearity to "find" good values in the 4x larger activation, apparently?

shut badge Nov 16, 2023, 10:29 PM

#

fallen spear since it’s additive instead of multiplicative it seems less likely to explode

yann lecun said one of his biggest lessons from the success of transformers is the power of multiplicative interactions though XD

fallen spear Nov 16, 2023, 10:30 PM

#

shut badge yann lecun said one of his biggest lessons from the success of transformers is t...

sure, just, six of them in one thing seems like maybe a bit much

soft bobcat Nov 16, 2023, 10:30 PM

#

up-projection does improve expressivity of the nonlinearity. for example, consider a 1D input, then a super-wide hidden dimension with ReLU activations. the hidden dimension lets you fit a piecewise linear function, and the number of kinks you're allowed in the piecewise linear fit is the dimension
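
A small standard-library sketch of the 1D example: a single hidden layer of ReLU units on a scalar input, with random illustrative weights. Each unit contributes at most one breakpoint, at x = -b / w, so the number of kinks is capped by the hidden dimension.

```python
import random


def make_net(hidden, seed=0):
    """Random 1D -> hidden -> 1D ReLU net as a list of (w, b, v) triples."""
    rng = random.Random(seed)
    return [
        (rng.uniform(-1, 1), rng.uniform(-1, 1), rng.uniform(-1, 1))
        for _ in range(hidden)
    ]


def f(net, x):
    """Network output: sum of v * relu(w * x + b) over the hidden units."""
    return sum(v * max(0.0, w * x + b) for w, b, v in net)


def kinks(net):
    """Sorted breakpoints, one per hidden unit with nonzero input weight."""
    return sorted({-b / w for w, b, _ in net if w != 0.0})


net = make_net(hidden=8)
breakpoints = kinks(net)
```

Between two neighbouring breakpoints no unit switches on or off, so the function is exactly linear there; doubling the hidden dimension doubles the number of kinks available.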

#

or, consider Fourier activation. the larger the hidden dimension, the higher the coefficients in the sines can be, and hence the sharper the fit can be (because it can express higher frequencies)

fallen spear Nov 16, 2023, 10:33 PM

#

i am convinced that avg pool is probably not even worth testing

#

but: you can have a piecewise linear function with (say) up to 4,000 kinks, does it benefit you if you spend 4x as much compute and can in theory fit one with 16,000 kinks?

#

or, more clearly: does it benefit you a lot?

soft bobcat Nov 16, 2023, 10:34 PM

#

my intuition here is that training becomes a problem first. especially the asymptotics: fitting x^2 with a ton of kinks will be a big mess. it's simply not the appropriate activation

#

so even if you fit y = x^2 for x in [-100, 100] perfectly with 4000 kinks, you will never have the right asymptotic and then the fit will naturally become very bad at x=1000
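
The asymptotic problem is easy to reproduce with a plain piecewise-linear interpolant standing in for the ReLU fit. This is an idealised sketch: evenly spaced kinks on [-100, 100] and the last segment extended as a straight line, which is how a ReLU layer behaves once every unit is past its breakpoint.

```python
def piecewise_linear_x2(x, lo=-100.0, hi=100.0, pieces=4000):
    """Piecewise-linear interpolant of x**2 on [lo, hi], extended linearly outside."""
    step = (hi - lo) / pieces
    # Clamp the segment index so inputs outside [lo, hi] reuse the edge segment.
    i = min(max(int((x - lo) // step), 0), pieces - 1)
    x0 = lo + i * step
    x1 = x0 + step
    slope = (x1 ** 2 - x0 ** 2) / step
    return x0 ** 2 + slope * (x - x0)


inside_error = abs(piecewise_linear_x2(37.3) - 37.3 ** 2)  # tiny inside the range
outside_value = piecewise_linear_x2(1000.0)                # grows linearly, not quadratically
true_value = 1000.0 ** 2
```

Inside the fitted range the error is well under 1e-3, while at x = 1000 the extrapolated value is less than a fifth of the true one, and adding more kinks inside the range does not change that.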

fallen spear Nov 16, 2023, 10:35 PM

#

https://arxiv.org/abs/2002.05202v1 I read this in a good amount of detail and the indication seems to be that basically anything other than straight up and down projection is preferable

#

but: it limits itself to nonlinearities of two variables, from which we get the now-sota geglu/swiglu

soft bobcat Nov 16, 2023, 10:36 PM

#

I also share that takeaway, but be careful with the statistics. the p-values are probably not even 0.05

fallen spear Nov 16, 2023, 10:36 PM

#

i don't think we can even meaningfully calculate a p-value on them

#

kat specifically says that her experience has been that geglu beats swiglu even though that paper has swiglu win

soft bobcat Nov 16, 2023, 10:37 PM

#

sure you can, there are two distributions: 1-variable and 2-variable activations. you can calculate standard deviations, and there's a statistical tool that lets you test if two sample distributions have different means
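
A minimal sketch of that kind of test using only the standard library. It computes Welch's t statistic and degrees of freedom, which is an assumption about which variant is meant. The sample values below are toy numbers for checking the arithmetic, not results from the paper.

```python
import math
import statistics


def welch_t(a, b):
    """Welch's unequal-variance t statistic and approximate degrees of freedom."""
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    se2_a, se2_b = var_a / len(a), var_b / len(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(se2_a + se2_b)
    dof = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(a) - 1) + se2_b ** 2 / (len(b) - 1)
    )
    return t, dof


# Toy inputs only: one list per group of activation functions.
t_stat, dof = welch_t([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
```

Turning the statistic into a p-value needs the t-distribution CDF, which the standard library does not provide; an online calculator or a stats package covers that step.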

soft bobcat Nov 16, 2023, 10:37 PM

#

fallen spear kat specifically says that her experience has been that geglu beats swiglu even ...

also my experience, but I think it's just noise

#

https://www.graphpad.com/quickcalcs/ttest1/

Screenshot_2023-11-16_at_14-39-49_T_test_calculator.png

#

so according to normal statistical standards, this paper doesn't meet the bar for significance. it's still a good paper because of the cost of the experiments

fallen spear Nov 16, 2023, 10:45 PM

#

so to reiterate motivation: based on papers linked up top with a fair degree of apparent reproducibility, pulling params out of ff up/down and putting them into depth generally wins at scales of I think 50-million-ish params, undertested (imho) regimes are: simply ablating those params outright and checking what the impact is as scale increases, using weird pooling-ish activation functions that extend on the regime used by geglu/swiglu (as stuffed into that gist), pulling the d_ff params and using them to increase d_model, stuffing those params into extra attention heads, and (occurred-to-me-today): using some of those params to gate the residual connection

#

weird pooling functions are just kind of the most compelling because it's straightforward to test with and also it scratches the itch to do weird math

#

i guess: and if it works it presents the same "clear win" criteria that swiglu/geglu do where in principle you aren't trading off anything

#

also, at scale the up/down proj dominates model RAM footprint and it seems incredibly silly to me if they are not serving a very good purpose

#

the idea that llama is a 70b model and could profitably be a 20b model at negligible loss and there isn't meaningful empirical testing to say otherwise kind of offends me

fallen spear Nov 16, 2023, 10:52 PM

#

soft bobcat so according to normal statistical standards, this paper doesn't meet the bar fo...

i would say it just needs more trials but each trial costs a fortune at scale and scaling annihilates so many results from smaller models

#

also, the fact that palm has no published crunchy technical details but reports a d_ff/d_model of six makes me suspect they are doing something like this to do a more complex nonlinearity that ultimately reduces the activation size to d_model, because six is the ratio you'd get if you tried to do this and remain at equiparams

#

which i figured out ... a day or two ago, even though I'd been obsessing about this for some time

#

so i would actually guess that google internal research has already done this and found it to beat swiglu/geglu

#

i have resisted the urge to tag the google/deepmind folks in here to ask them

#

it seems like that would be impolite

soft bobcat Nov 16, 2023, 11:04 PM

#

fallen spear also, the fact that palm has no published crunchy technical details but reports ...

I agree. I expect some private research companies are already using mixture-of-activations, and your version of it seems no worse than other versions, given that intuition about activations is noisy and difficult

fallen spear Nov 16, 2023, 11:06 PM

#

it is frustrating that there is no mathematically obvious way to do it, which puts us into the "there are really an infinite number of six-to-one functions" regime

fallen spear Nov 16, 2023, 11:08 PM

#

shut badge and here's my training codebase just for reference, the baseline is the "ibt" mo...

also this looks substantially similar to what I was starting to write yesterday and i am unsure if i am going to just clone it tbh, probably not

fallen spear Nov 16, 2023, 11:22 PM

#

fallen spear but: you can have a piecewise linear function with (say) up to 4,000 kinks, does...

actually to extend this: every indication of the rank of transformer up/down proj matrices I have seen indicates that they are of extremely low rank

#

like, one of the linked papers has them at ten by their measuring method

#

it seems bizarre that an m by 4m matrix, when trained, has an effective rank of ten if the projection to 4m is doing anything important

boreal moss Nov 16, 2023, 11:43 PM

#

"ff1"
fc1 = nn.Linear(64, 256)
fc2 = nn.Linear(256, 64)
x = F.gelu(fc1(x))
x = fc2(x)


"ff2"
fc1 = nn.Linear(64, 128)
fc2 = nn.Linear(64, 128)
fc3 = nn.Linear(64, 128)
fc4 = nn.Linear(128, 64)
x = fc1(x) * fc2(x) * fc3(x)
x = fc4(x)


"ff3"
fc1 = nn.Linear(64, 128)
fc2 = nn.Linear(64, 128)
fc3 = nn.Linear(64, 128)
fc4 = nn.Linear(128, 64)
x = fc1(x) * fc2(x) * F.gelu(fc3(x))
x = fc4(x)
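
For reference, a quick parameter count of the three blocks above. It assumes every `nn.Linear` keeps its default bias, and it is only arithmetic on the shapes as posted, not a claim about how the actual runs were configured. The helper name `linear_params` is made up for this sketch.

```python
def linear_params(n_in: int, n_out: int, bias: bool = True) -> int:
    """Parameter count of one nn.Linear(n_in, n_out)-style layer."""
    return n_in * n_out + (n_out if bias else 0)

# ff1: 64 -> 256 -> 64 with a GELU in between
ff1 = linear_params(64, 256) + linear_params(256, 64)

# ff2 and ff3: three 64 -> 128 branches multiplied together, then 128 -> 64
ff2 = 3 * linear_params(64, 128) + linear_params(128, 64)
ff3 = ff2  # same layers; only the GELU on the third branch differs

print(ff1, ff2, ff3)
```

Under that assumption the three variants differ by well under one percent in parameter count, so the comparison is close to iso-parameter even though ff2 and ff3 use half the hidden width of ff1.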

fallen spear Nov 16, 2023, 11:54 PM

#

fc2 in ff1 looks backwards to me, i am guessing it is meant to be baseline gelu ff block?

fallen spear Nov 16, 2023, 11:55 PM

#

boreal moss ```py "ff1" fc1 = nn.Linear(64, 256) fc2 = nn.Linear(256, 64) x = F.gelu(fc1(x))...

a pretty sexy graph though

soft bobcat Nov 16, 2023, 11:58 PM

#

ff3 improving more than ff2 at the far end shows a behavior that should be pretty typical: the gelu breaks symmetry, so it has more expressivity, but training that expressivity is second-order rather than first-order. so the benefit from symmetry breakage appears only after a longer period of training

fallen spear Nov 17, 2023, 12:00 AM

#

that gives me an itch to put gelu gates on the fourier pool

soft bobcat Nov 17, 2023, 12:00 AM

#

I also notice from this chart that all the training samples are loaded in the same order for each training run, which is ok

boreal moss Nov 17, 2023, 12:00 AM

#

yes

#

also ff2 and ff3 curves are a little more smooth than ff1

soft bobcat Nov 17, 2023, 12:02 AM

#

boreal moss also ff2 and ff3 curves are a little more smooth than ff1

smoothness as in less variance in the loss function? I don't see that in the chart

boreal moss Nov 17, 2023, 12:03 AM

#

I see it, but to be sure I would have to actually calculate it 🤣

fallen spear Nov 17, 2023, 12:03 AM

#

i added a gated fourier pool to the gist just for fun

#

could also put the gate around the entire thing though

boreal moss Nov 17, 2023, 12:06 AM

#

this is a very special case with small ffns: it was a 1D causal convolutional network with 8 layers, and the task was character-level next-token prediction on tiny stories

fallen spear Nov 17, 2023, 12:07 AM

#

fallen spear could also put the gate around the entire thing though

i added a second gated fourier pool

boreal moss Nov 17, 2023, 12:08 AM

#

but my point is that you all should try those multiplicative ffns without any additional activation functions

fallen spear Nov 17, 2023, 12:09 AM

#

with the exception of outright max/avg pooling layers i am kind of agnostic about which of these functions make any sense, i think it mostly makes sense to test those that are analogous to existing workable functions and/or that have good diversity with each other

#

"no actual activation function" definitely fits the bill

#

but: there's a whole body of stuff about how small scale stuff benefits from modifications that are neutral or bad at higher scale

boreal moss Nov 17, 2023, 12:10 AM

#

yes

#

from my experience it looks like at very small scale more expressive activation functions can make a huge difference, on the order of adding 25% more parameters, and at larger scale the same modifications make almost no difference, as if big models have no use for more expressive activation functions

fallen spear Nov 17, 2023, 12:18 AM

#

the currently-sota swiglu/geglu stuff is, as has been pointed out, relatively statistically marginal

shut badge Nov 17, 2023, 12:25 AM

#

Alright I had to redo the baseline training but now I've started the first exp, you can see the progress at https://wandb.ai/jonm/MLPs

#

Since avg pool looks like it's not going to do as well, will probably cut it off early.

soft bobcat Nov 17, 2023, 12:41 AM

#

yes, there's no reason to believe that avg pool will do well

fallen spear Nov 17, 2023, 12:44 AM

#

i am surprised it works at all

#

... also surprised that yinlu looks so nice so far

shut badge Nov 17, 2023, 12:45 AM

#

👀

#

Should have implemented eval perplexity though, token accuracy is probably a bad metric

#

either way the val loss should be instructive

fallen spear Nov 17, 2023, 12:48 AM

#

was gonna say, unless you are planning on running it rather long and it's rather big they shouldn't differ a lot

#

trajectories tend to be the same unless data is so small you can overfit or training is so good you grok, I think?

shut badge Nov 17, 2023, 12:49 AM

#

Yeah, I mean, sometimes I have seen things do slowly converge to a better result, so I prefer to wait a bit longer. But these are fairly short runs overall as far as typical llm training goes

#

Will try fourier next

fallen spear Nov 17, 2023, 12:51 AM

#

fourier is my special boy and the one i will be the most sad about if it doesn't work

#

i have no good reason to believe it will work, and this feeling is entirely irrational

soft bobcat Nov 17, 2023, 2:25 AM

#

YinLU2 starts at really high loss. I'm pretty sure the reason is the torch.exp(-torch.pow(x_2, 2)) term, which is 1 at x=0

soft bobcat Nov 17, 2023, 2:49 AM

#

I think that means that y = 0 must be true when x = 0. either the NN intrinsically wants this property, or initialization wants this property. so YinLU2 is bad, and it would make more sense to use "pooled = torch.sin(x_1) + x_2 + torch.tanh(x_3) + torch.nn.functional.gelu(x_4) + x_5 * x_6", where the Gaussian is replaced by a simple no-activation passthrough

#

alternatively, torch.exp(-torch.pow(x_2, 2)) - 1 satisfies y = 0 when x = 0
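
A small sketch of the f(0) point above, in plain Python. It assumes the pooled function has the shape quoted in the previous message, with the Gaussian term sitting where `x_2` is in that formula; the exact YinLU2 definition lives in the gist and is not reproduced here, so the three function names below are illustrative only.

```python
import math

def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def pooled_gaussian(x1, x2, x3, x4, x5, x6):
    # Variant with the raw Gaussian term, which equals 1 at x2 = 0.
    return math.sin(x1) + math.exp(-x2 ** 2) + math.tanh(x3) + gelu(x4) + x5 * x6

def pooled_shifted(x1, x2, x3, x4, x5, x6):
    # Same, but with the Gaussian shifted down by 1 so it vanishes at 0.
    return math.sin(x1) + (math.exp(-x2 ** 2) - 1.0) + math.tanh(x3) + gelu(x4) + x5 * x6

def pooled_passthrough(x1, x2, x3, x4, x5, x6):
    # Gaussian replaced by a plain linear passthrough of x2.
    return math.sin(x1) + x2 + math.tanh(x3) + gelu(x4) + x5 * x6

zeros = (0.0,) * 6
print(pooled_gaussian(*zeros), pooled_shifted(*zeros), pooled_passthrough(*zeros))
```

Only the first variant is nonzero at the origin; both proposed replacements map an all-zero input to zero.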

fallen spear Nov 17, 2023, 3:57 AM

#

yinlu’s 3 and 4 are born

soft bobcat Nov 17, 2023, 4:03 AM

#

one interesting result from the tests is how well YinLU 1 trained in the initial stages. it looks like layer norm lets you get away with quite a bit of multiplication

fallen spear Nov 17, 2023, 4:09 AM

#

tbh enough of them are doing well enough that i suspect we’re not getting stellar signal, which is a positive result but doesn’t narrow our options a ton

soft bobcat Nov 17, 2023, 4:10 AM

#

which graph are you focusing on? I'm using loss/val

fallen spear Nov 17, 2023, 4:14 AM

#

assuming my slight colorblindness doesn’t hurt me too much here, we get: baseline, trigeglu, inner, yinlu2

#

that yinlu2 has such an obvious improvement and is potentially good for analysis makes it the nicest prospect at the moment i think

#

i will ignore this and spend all my time trying to figure out how to make sin activation happen

soft bobcat Nov 17, 2023, 4:18 AM

#

fallen spear assuming my slight colorblindness doesn’t hurt me too much here, we get: baselin...

yinlu 1 was stopped pretty early, so it doesn't show up in this list. its performance is basically the same as baseline, up to noise

fallen spear Nov 17, 2023, 4:19 AM

#

soft bobcat yinlu 1 was stopped pretty early, so it doesn't show up in this list. its perfor...

… fairly absurd

#

huh. it does. wild

#

i think it’s the only one that stacks products that high, no?

soft bobcat Nov 17, 2023, 4:21 AM

#

of the ones tested, yes

fallen spear Nov 17, 2023, 4:22 AM

#

my guess would be that it’s either one of the specific activations or the act of doing so many products, the diversity of functions is neat but does not seem like it can possibly be optimal

soft bobcat Nov 17, 2023, 4:22 AM

#

it can't be a specific activation; activations can only suck but they can't be amazing

fallen spear Nov 17, 2023, 4:22 AM

#

then, at least early, the cumulative product is very good

#

a very silly result

#

i am starting to wonder if there is a level of ridiculousness that cannot possibly be a good idea

#

sum of power set of all products is clearly too ridiculous

#

in principle it needs to be an O(n) op

soft bobcat Nov 17, 2023, 4:26 AM

#

power set of products of a, b, c is just (a+1)(b+1)(c+1)

fallen spear Nov 17, 2023, 4:26 AM

#

… someday I will be good at math

#

So we can actually just do that

soft bobcat Nov 17, 2023, 4:27 AM

#

learning from the Gaussian issue, the correct activations should be instead "(a+1)(b+1)(c+1)-1" and "abc", where abc is from trombocyt

#

the difference between these two is that the first formula (with 3 inputs) is ~~quadratic~~ linear when a, b, c are near 0. and the second formula is cubic

soft bobcat Nov 17, 2023, 4:29 AM

#

soft bobcat the difference between these two is that the first formula (with 3 inputs) is ~~...

more easily, you can see this if there are only two input variables: (a+1)(b+1) - 1 = ab + a + b, and note that a and b are linear. so even if a is 0, the b term will still cause changes in output
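
The identity above can be checked numerically. This sketch enumerates every subset explicitly (the expensive way) and compares it with the O(n) closed form; the function names are made up for illustration. It also shows the near-zero behaviour: `(a+1)(b+1)(c+1) - 1` is dominated by the linear term, while `abc` is cubic.

```python
from itertools import combinations
from math import prod

def power_set_product_sum(values):
    """Sum of the products of every subset of `values` (empty subset -> 1)."""
    return sum(
        prod(subset)
        for r in range(len(values) + 1)
        for subset in combinations(values, r)
    )

def closed_form(values):
    """O(n) equivalent: product of (v + 1)."""
    return prod(v + 1 for v in values)

a, b, c = 0.5, -2.0, 3.0
assert abs(power_set_product_sum([a, b, c]) - closed_form([a, b, c])) < 1e-12

# Near zero, (a+1)(b+1)(c+1) - 1 behaves like a + b + c,
# while the plain product abc almost vanishes.
eps = 1e-3
print(closed_form([eps, eps, eps]) - 1, eps ** 3)
```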

boreal moss Nov 17, 2023, 4:30 AM

#

fc1 = nn.Linear(64, 64)
fc2 = nn.Linear(64, 64)
fc3 = nn.Linear(64, 64)
fc4 = nn.Linear(64, 64)
fcout = nn.Linear(256, 64)

x1 = fc1(x)*fc2(x)*fc3(x)
x2 = fc1(x)*fc2(x)*fc4(x)
x3 = fc1(x)*fc3(x)*fc4(x)
x4 = fc2(x)*fc3(x)*fc4(x)
x = fcout(torch.cat((x1, x2, x3, x4), dim=-1))

this weirdo is just a little better in small scale than "tri" linear

fallen spear Nov 17, 2023, 4:33 AM

#

i remain somewhat vexed that there is no obviously correct way to pile together params in this way

#

i guess i can find another good defense of stacking as many products together as possible

#

matmul already does addition

#

any function you can represent by addition is already well represented

shut badge Nov 17, 2023, 5:11 AM

#

soft bobcat I think that means that y = 0 must be true when x = 0. either the NN intrinsical...

Kicked this off as YinLU3

fallen spear Nov 17, 2023, 5:12 AM

#

shut badge Kicked this off as YinLU3

as modified from yinlu2, correct?

shut badge Nov 17, 2023, 5:12 AM

#

Generally seems some stuff is about as good as geglu. This is a strong baseline already (already better than vanilla GPT2 by a good margin), so not bad!

shut badge Nov 17, 2023, 5:12 AM

#

fallen spear as modified from yinlu2, correct?

yes

fallen spear Nov 17, 2023, 5:13 AM

#

shut badge yes

the first gist is too long so I am stuffing my extra brainworms into another one

#

it will have that and two more yinlus

#

and a power-product thing and two more fouriers

#

because ??? reasons

shut badge Nov 17, 2023, 5:13 AM

#

If there's anything from the original list i should try as well lmk

fallen spear Nov 17, 2023, 5:13 AM

#

i feel like the trigeglu and inner product are trying to tell me something but i don't know what

#

the other geglu derivatives might give us some clue about what is working and why

#

and, I guess, product pool, let me check to make sure I wrote it reasonably

fallen spear Nov 17, 2023, 5:14 AM

#

fallen spear i feel like the trigeglu and inner product are trying to tell me something but i...

something other than "products are good"

shut badge Nov 17, 2023, 5:15 AM

#

Seems the loss still starts pretty high with YinLU3

fallen spear Nov 17, 2023, 5:15 AM

#

fallen spear the other geglu derivatives might give us some clue about what is working and wh...

they basically vary: whether there is a quadratic term inside the gelu activation or multiplied into it

soft bobcat Nov 17, 2023, 5:15 AM

#

right, the failure of this activation function isn't because of f(0) = 1, it's something else

fallen spear Nov 17, 2023, 5:16 AM

#

https://gist.github.com/segyges/20bb00cf43fe617c2fdb7c6465698853
gist number two

soft bobcat Nov 17, 2023, 5:16 AM

#

this activation function is just bad somehow

shut badge Nov 17, 2023, 5:17 AM

#

sin, tanh, and gelu all have different output ranges, does it make sense to add them together?

soft bobcat Nov 17, 2023, 5:17 AM

#

it does, but my guess is that sin and tanh are so bad that they don't make sense even in a mixture

shut badge Nov 17, 2023, 5:17 AM

#

er, well sin/tanh and gelu do

soft bobcat Nov 17, 2023, 5:18 AM

#

so forget about YinLU4, it's also almost certainly terrible

fallen spear Nov 17, 2023, 5:18 AM

#

i suspect baselining against product pool makes sense since it just forgoes the actual functions and does multiplication

#

... and powerproductpool

#

"eliminate effects besides the product operation", more or less

soft bobcat Nov 17, 2023, 5:21 AM

#

shut badge Seems the loss still starts pretty high with YinLU3

and YinLU3 can be stopped too, its loss is unlikely to ever be competitive

fallen spear Nov 17, 2023, 5:23 AM

#

we could maybe make a later note to ablate it to verify which of the components is harming it

#

on the "activation functions can't be awesome, they can only suck" theory

shut badge Nov 17, 2023, 5:23 AM

#

This seems like something that could be done through neural architecture search (not that I have the compute resources for that or anything)

fallen spear Nov 17, 2023, 5:24 AM

#

... hardmaru did something like that

#

I am coming at this from the "what sort of seems to make sense to me mathematically" direction but that is possibly the wrong direction

shut badge Nov 17, 2023, 5:25 AM

#

https://www.nature.com/articles/s41598-021-96723-8

Nature

Universal activation function for machine learning

Scientific Reports - Universal activation function for machine learning

#

perhaps i should try this too as another baseline

fallen spear Nov 17, 2023, 5:26 AM

#

five trainable parameters is a good number for us

soft bobcat Nov 17, 2023, 5:26 AM

#

them freezing the UAF parameters in the middle is very strange

#

"This is done to reduce the over-fitting of the model and to prevent training instability", which means they saw training instability. why would this instability occur?

fallen spear Nov 17, 2023, 5:27 AM

#

what are they actually training and does it even have normalization on it

#

NNs are naturally unstable, instability in a sandbox problem tells us very little

#

the VGG 8 layer CNN

#

yeah

#

don't worry about their experiments imho, their math looks incredibly clever though

#

it worked once in a tiny sandbox and the math in principle approximates any activation, good enough for me

#

... grad students at uvic

shut badge Nov 17, 2023, 5:31 AM

#

        act = (
            torch.log(1 + torch.exp(x_1 * (x + x_2) * x_3 * x**2))
            - torch.log(1 + torch.exp(x_4 * (x - x_2)))
            + x_5 * x_6
        )

fallen spear Nov 17, 2023, 5:31 AM

#

i think you can just use the sixth param as the input to the activation

shut badge Nov 17, 2023, 5:31 AM

#

o yea

fallen spear Nov 17, 2023, 5:31 AM

#

but i do not yet understand their math

fallen spear Nov 17, 2023, 5:32 AM

#

fallen spear ... grad students at uvic

my first guess was "someone in science who is very smart but not in ML" but students in uni without serious AI research program checks out

#

"we did something incredibly clever on a zero resource budget" thank god for resource constraints thank you so much

#

yeah, this is a much smarter approach to the problem than "make up equations and see if one of them eventually sticks"

shut badge Nov 17, 2023, 5:34 AM

#

And yeah no need to worry about training stability I think, pre-norm transformers are incredibly stable at this scale

#

famous last words though

fallen spear Nov 17, 2023, 5:35 AM

#

i believe firmly in jinxing myself as hard as possible any time i feel the impulse

#

are you really living if everything you say doesn't sound like famous last words

shut badge Nov 17, 2023, 5:36 AM

#

another note is that all the linear layers are initialized with orthogonal matrices. I could imagine with some of these activation functions that you might want to initialize each (dim, dim) chunk of the (dim, dim*6) matrix differently depending on how each of the 6 components is used in the activation

fallen spear Nov 17, 2023, 5:38 AM

#

every time i contemplate the possible effects of initialization on these oddball activations it makes me twitch

#

is a good future todo though

fallen spear Nov 17, 2023, 5:38 AM

#

fallen spear every time i contemplate the possible effects of initialization on these oddball...

if anyone was thinking "is this because you like trig functions and the one paper said to initialize specifically for trig functions" they get a gold star

#

i like waves okay

#

... it is possible some of the failed experiments can be salvaged with initializations specific to them but there is no strong reason to choose any of them especially to do this with

fallen spear Nov 17, 2023, 5:49 AM

#

fallen spear yeah, this is a much smarter approach to the problem than "make up equations and...

i am pretty sure i understand it now and i am envious that i did not write it

#

it looks like it died

#

which is a shame, because it is so clever

#

if there are any of these worth salvaging, that one is worth salvaging

#

actually I think that's a NaN which is an odd result, we might need to add an epsilon to something

shut badge Nov 17, 2023, 6:21 AM

#

lol, so that UAF did actually diverge. definitely jinxed it

fallen spear Nov 17, 2023, 7:23 AM

#

i think i see at least part of the issue

#

taking x_1 as the input, should be:

act = torch.log(1 + torch.exp(x_2*(x_1 + x_3) + x_4*torch.pow(x_1, 2))) - torch.log(1 + torch.exp(x_5*(x_1 - x_3))) + x_6

#

had an extra multiplication in the first exponent which should have been an addition, may have made it prone to diverging

#

or maybe it's diverging anyway because exponents be like that
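
A minimal sketch of why the exponentials can blow up, and one common way around it. It assumes the UAF has the form ln(1 + e^(A(x+B) + Cx^2)) - ln(1 + e^(D(x-B))) + E with five trainable scalars; the names A to E and the helper functions are illustrative and not taken from the gist. Plain Python raises on overflow, whereas in floating-point tensor code the same overflow shows up as inf, and inf minus inf is NaN.

```python
import math

def softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def naive_softplus(z: float) -> float:
    """Direct log(1 + exp(z)); overflows for large z."""
    return math.log(1.0 + math.exp(z))

def uaf(x: float, A: float, B: float, C: float, D: float, E: float) -> float:
    """UAF-style activation built from two stable softplus terms (assumed form)."""
    return softplus(A * (x + B) + C * x * x) - softplus(D * (x - B)) + E

# Large pre-activations overflow the naive form but not the stable one.
try:
    naive_softplus(1000.0)
except OverflowError:
    print("naive log(1 + exp(z)) overflowed at z = 1000")
print(softplus(1000.0))

# With A=1, B=0, C=0, D=-1, E=0 this reduces to softplus(x) - softplus(-x) = x.
print(uaf(2.0, A=1.0, B=0.0, C=0.0, D=-1.0, E=0.0))
```

Swapping the raw `log(1 + exp(...))` for a stable softplus would not rule out divergence from the quadratic term growing during training, but it does remove the overflow path.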

#

who can say

#

not entirely clear it makes sense to have a trainable activation function per index instead of a trainable activation function for the entire layer

#

but 🤷

dawn vine Nov 17, 2023, 3:18 PM

#

Lol so complicated! Why not just make it resemble a 4 term polynomial like out(a(x)v(x) + b(x)v(x)^2 + c(x)v(x)^3 + d(x)v(x)^4) where abcdv and out are your six nn.linears

boreal moss Nov 17, 2023, 3:37 PM

#

I was playing with weird parametric activation functions; performance was highly dependent on initialization. Those were non-monotonic functions, and they just got stuck if, on the path from the initial parameters to the optimal parameters, the function had to change its number of minima and maxima

fallen spear Nov 17, 2023, 3:57 PM

#

dawn vine Lol so complicated! Why not just make it resemble a 4 term polynomial like out(a...

one of them does this or almost this already

#

it’s not great

#

worth noting because i am sort of stumped: our constraints are total trainable params of 16*d_model^2 and two sequential matmuls + i guess some wiggle time equivalent to elementwise operations, it doesn’t necessarily need to be strictly a trainable linear up projection to 6 times as large then dim reducing nonlinearity to 1 then matmul

#

i am sort of thinking of doing the up projection first non-trainably and then down-proj

soft bobcat Nov 17, 2023, 4:03 PM

#

has PReLU ever worked in real life? (i.e. in practical models)

dawn vine Nov 17, 2023, 4:09 PM

#

fallen spear one of them does this or almost this already

yeah I see PolyLU but you didn't have the output projection, which is probably super important

fallen spear Nov 17, 2023, 4:10 PM

#

worth testing

dawn vine Nov 17, 2023, 4:10 PM

#

another option is to vary the activation function by channel/segment, which can give almost the identical polynomial result with way less work

fallen spear Nov 17, 2023, 4:11 PM

#

soft bobcat has PReLU ever worked in real life? (i.e. in practical models)

i have no idea

fallen spear Nov 17, 2023, 4:11 PM

#

dawn vine another option is to vary the activation function by channel/segment, which can ...

i don’t understand this one

dawn vine Nov 17, 2023, 4:13 PM

#

fallen spear i don’t understand this one

well the polynomial is multiplying by v(x)^n... so if you proj_in then have each quarter of that be multiplied by v(x)^1, v(x)^2, etc. then proj_out

#

the result is that the initial proj_in chooses what gets treated with which exponent

#

and the final proj_out can simulate the addition of the components of the polynomial

#

you can even vary the percentage dedicated to each exponent (including exponent zero)

#

c = self.coefficients # just some trainable parameters, these don't need to vary based on x
v = Wvariables(x)
y = torch.cat([c[0].expand_as(v), c[1]*v, c[2]*(v**2), c[3]*(v**3), c[4]*(v**4)], dim=-1)  # etc. for higher powers
return Wout(y)

fallen spear Nov 17, 2023, 4:18 PM

#

oh i get it

#

that’s clever

dawn vine Nov 17, 2023, 4:19 PM

#

probably way better usage of rank

fallen spear Nov 17, 2023, 4:19 PM

#

i guess: in principle it doesn’t matter at all if we have a trainable vector in the layer

#

because it is so small compared to a linear layer

dawn vine Nov 17, 2023, 4:20 PM

#

well im just trying to let it train in whatever function approximators it likes using as few parameters as possible

fallen spear Nov 17, 2023, 4:20 PM

#

sure, but what is v coming from?

#

my first impulse is “just make it a trainable vector”

dawn vine Nov 17, 2023, 4:21 PM

#

yes those W are trainable weights

#

who knows what it wants to do hehe

dawn vine Nov 17, 2023, 4:37 PM

#

i changed it a bunch above

#

basic idea is we don't need to spend 4*n^2 just generating coefficients for polynomials, since each set of possible coefficients just represents a different function

#

actually, they don't have to come from x at all

#

so uh isn't this just essentially the simplest possible trainable activation function lol

fallen spear Nov 17, 2023, 10:42 PM

#

possibly

fallen spear Nov 18, 2023, 4:41 PM

#

I’m out for most of the next week probably fwiw

exotic musk Nov 22, 2023, 6:14 AM

#

dawn vine ```py c = self.coefficients # just some trainable parameters, these don't need t...

I have seen similar idea in model approximation field. But they merely fit a polynomial function to activations (e.g. GeLU). I think this idea is worth a try

dawn vine Nov 22, 2023, 6:15 AM

#

exotic musk I have seen similar idea in model approximation field. But they merely fit a pol...

cool, trying it right now 🙂

exotic musk Nov 22, 2023, 6:15 AM

#

oh, it seems a long time! sorry.

exotic musk Nov 22, 2023, 6:17 AM

#

dawn vine cool, trying it right now 🙂

the main problem comes from converging. If you have some results, i 'd like to see the training loss empties

dawn vine Nov 22, 2023, 6:18 AM

#

will let ya know in a few minutes hopefully i'll have some results... just was too busy to try til just now

dawn vine Nov 22, 2023, 6:41 AM

#

exotic musk the main problem comes from converging. If you have some results, i 'd like to s...

im using coefficients inits 0,1,1,0,0 so it starts out as x+x^2 and changes from there

#

saw some folks using that as a replacement for relu

#

seems to work slightly worse than my existing activation but broadly similar

#

this is the code I tried:

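import torch
from torch import nn

class LearnedPolynomial(nn.Module):
    """Per-channel trainable degree-4 polynomial activation."""
    def __init__(self, dim:int):
        super().__init__()
        # coefficients start at 0,1,1,0,0 so the activation begins as x + x^2
        self.c0 = nn.Parameter(torch.zeros(dim))
        self.c1 = nn.Parameter(torch.ones(dim))
        self.c2 = nn.Parameter(torch.ones(dim))
        self.c3 = nn.Parameter(torch.zeros(dim))
        self.c4 = nn.Parameter(torch.zeros(dim))
    def forward(self, x):
        # x has shape (..., dim); each channel gets its own coefficients
        y = self.c0 + self.c1*x + self.c2*(x**2) + self.c3*(x**3) + self.c4*(x**4)
        return y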

soft bobcat Nov 22, 2023, 6:47 AM

#

typo on x**3 next to c4

dawn vine Nov 22, 2023, 6:48 AM

#

good catch

#

rerunning with that bugfix

#

first test was almost indistinguishable from my 'standard' rwkv-style channel mix relu^2 activation (this is on a model using traditional attention tho)

soft bobcat Nov 22, 2023, 6:50 AM

#

my guess is y = ~~GLU~~GeLU(old y formula inside) will do a bit better. the logic is to create a flattening at the negative end. but my idea could be nonsense

dawn vine Nov 22, 2023, 6:51 AM

#

the concat?

soft bobcat Nov 22, 2023, 6:51 AM

#

dawn vine Nov 22, 2023, 6:51 AM

#

ah

#

WOW its winning now that i fixed the bug!

#

good catch indeed

#

cant believe this worked, even limitedly

soft bobcat Nov 22, 2023, 6:53 AM

#

another possible reason for GeLU: it increases expressibility by one degree of freedom. for the raw polynomial, if you scale all the input weights and decrease all the output weights, then the overall net is the same. so some of the parameters are redundant.

caveat: ReLU has this same issue and everybody liked ReLU for a long time
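
A small numeric sketch of the scale redundancy mentioned above, for a single scalar channel. The scale factor `s` and the helper names are hypothetical. It shows that an input rescale can be absorbed by the polynomial's coefficients, and by an output rescale for ReLU, but not by an output rescale for GeLU. It does not settle whether wrapping the polynomial in GeLU actually helps.

```python
import math

def poly(coeffs, x):
    """Evaluate c0 + c1*x + c2*x^2 + ... for one scalar x."""
    return sum(c * x ** k for k, c in enumerate(coeffs))

def relu(z):
    return max(z, 0.0)

def gelu(z):
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))

coeffs = [0.0, 1.0, 1.0, 0.0, 0.0]  # the x + x^2 initialization
s = 4.0                              # hypothetical rescaling of the input weights
x = 0.5

# Polynomial: scaling the input by s is undone by dividing coefficient k by s**k.
absorbed = [c / s ** k for k, c in enumerate(coeffs)]
poly_gap = abs(poly(coeffs, x) - poly(absorbed, s * x))

# ReLU: scaling the input by s is undone by dividing the output by s.
relu_gap = abs(relu(x) - relu(s * x) / s)

# GeLU: the same output rescale does not undo the input rescale.
gelu_gap = abs(gelu(x) - gelu(s * x) / s)

print(poly_gap, relu_gap, gelu_gap)
```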

#

crap, I meant gelu instead of glu: the gaussian error linear unit

exotic musk Nov 22, 2023, 6:56 AM

#

@dawn vine Just replace the GeLU with GeLU(LearnedPolynomial(1)(\cdot)) in the transformer, train it simply, and let's have a look!

#

init with 0,1,0,0,0 so it makes no change to the initial results

dawn vine Nov 22, 2023, 6:58 AM

#

well this is all confounded by me using rwkv style ffn at the moment
i can run others ez, say llama2 or whatever, but i dont have the baselines already run

#

to be clear, the rwkv one works better in my experience

exotic musk Nov 22, 2023, 6:59 AM

#

rwkv is OK. no significant difference, I think

dawn vine Nov 22, 2023, 6:59 AM

#

well it normally uses relu^2 activation

#

anyway i can try gelu around it

#

it started out strong without gelu and ended up being almost identical to the classic relu^2 version over time

exotic musk Nov 22, 2023, 7:03 AM

#

oh... relu^2 is relu(relu()), or max(0,x^2)? I have not seen it before

soft bobcat Nov 22, 2023, 7:04 AM

#

max(0, x)^2
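
To make the three readings of "relu^2" from the question concrete, here is a tiny comparison in plain Python (function names are illustrative). Only the first one is the reading confirmed above.

```python
def relu(x: float) -> float:
    return max(0.0, x)

def relu_squared(x: float) -> float:
    """max(0, x)^2: zero for negative inputs, quadratic for positive ones."""
    return relu(x) ** 2

for x in (-2.0, 0.5, 3.0):
    # columns: input, max(0, x)^2, relu(relu(x)), max(0, x^2)
    print(x, relu_squared(x), relu(relu(x)), max(0.0, x ** 2))
```

`relu(relu(x))` is just `relu(x)`, and `max(0, x^2)` is just `x^2`, so neither matches the squared-ReLU activation.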

dawn vine Nov 22, 2023, 7:05 AM

#

gelupoly winning, but i forgot to zero out the x^2 term

exotic musk Nov 22, 2023, 7:08 AM

#

curious about the results, and the learned values from c0 to c4

dawn vine Nov 22, 2023, 7:08 AM

#

exotic musk curious about the results, and the learned values from c0 to c4

the whole idea is they may vary per channel!

#

so may be extremely hard to visualize

soft bobcat Nov 22, 2023, 7:09 AM

#

it can still be useful to do statistics: if nothing else, it helps initialization

thick briar Nov 22, 2023, 7:09 AM

#

this is fascinating 👀

dawn vine Nov 22, 2023, 7:09 AM

#

the other idea here was that this kind of activation may allow us to ditch the expansion/contraction in the FFN

#

because it lets it model things that otherwise required that

thick briar Nov 22, 2023, 7:09 AM

#

dawn vine so may be extremely hard to visualize

maybe a 2d graph (x channel, y c*)?

dawn vine Nov 22, 2023, 7:10 AM

#

the expressive power is really why I thought this might be a useful way to reduce FFN size

soft bobcat Nov 22, 2023, 7:10 AM

#

soft bobcat it can still be useful to do statistics: if nothing else, it helps initializatio...

by these statistics, I just mean simple bar charts, one for each c_n

exotic musk Nov 22, 2023, 7:11 AM

#

dawn vine the whole idea is they may vary per channel!

yeah. It might be interesting if we plot GeLU(Poly(x)) and have a look at how it differs from GeLU()

soft bobcat Nov 22, 2023, 7:11 AM

#

dawn vine the expressive power is really why I thought this might be a useful way to reduc...

I believe in this. gating methods and Gyges's various methods all require a huge number of weights per input. but simple trainable activations capture much of that without the parameter explosion

dawn vine Nov 22, 2023, 7:11 AM

#

exactly!

#

i gotta go to sleep, but I can pick this up again tomorrowish

exotic musk Nov 22, 2023, 7:13 AM

#

good night

dawn vine Nov 22, 2023, 7:20 AM

#

haha straight gelu won by a bit

#

gnite!

exotic musk Nov 23, 2023, 12:56 AM

#

dawn vine haha straight gelu won by a bit

thonk

dawn vine Nov 23, 2023, 6:26 PM

#

I did some followup experiments, nothing really new to report... but I did try a very interesting FFN that didn't do well:
(this is just a sketch of it)

    self.coefficients = nn.Parameter(torch.linspace(0,1,D))
    def forward(self, x : Tensor):
        return self.w_shrink(torch.cat([self.coefficients.expand_as(x), x, x**2, x**3], dim=-1))

#

the idea was along the lines of my original thought, which is that the projection should be able to mix various polynomials out of the ingredients in that concatenated set of learned coefficients and x raised to various powers

fallen spear Nov 23, 2023, 6:35 PM

#

the only one in the gist that i have a lot of hope in presently is powerproductpool, maybe add a gelu gate to it

#

https://gist.github.com/segyges/20bb00cf43fe617c2fdb7c6465698853#file-ff_modifications_2-py-L58 this one

fallen spear Dec 11, 2023, 8:23 PM

#

on reflection so I don't stuff this into research:

#

it's possible arithmetic intensity on the up projection is low if it triggers swaps so that's another axis of tradeoff

#

worth tracking, anyway

fallen spear Dec 11, 2023, 8:26 PM

#

fallen spear it's possible arithmetic intensity on the up projection is low if it triggers sw...

that is to say: it is possible that even though the feedforward looks, mathematically, like it is a low-memory operation compared to attention (which has three entire matrices), so your utilization will be closer to 100% if you make your feedforward stage wider, the up-projection to x4 will trigger cache misses and so eat up a bunch of time swapping stuff into and out of vram when an x1 would not

#

so you'd gain on wall-clock time substantially when gearing it lower

soft bobcat Dec 17, 2023, 5:23 AM

#

I was thinking about effective rank, because if FF layers are undergoing rank collapse, might as well add -(effective rank) to the loss. but I think it's not effective at capturing rank collapse, because it's dominated by the single highest singular value.

#

if one singular value is bigger than all the others, the effective rank will be low. but I doubt that matters

#

the reason I became interested in rank collapse is this paper, which says attention-only transformers, without FF, collapse to rank 1: https://arxiv.org/abs/2103.03404

arXiv.org

Attention is Not All You Need: Pure Attention Loses Rank Doubly Exp...

Attention-based architectures have become ubiquitous in machine learning, yet our understanding of the reasons for their effectiveness remains limited. This work proposes a new way to understand self-attention networks: we show that their output can be decomposed into a sum of smaller terms, each involving the operation of a sequence of attentio...

#

another paper tries to rearrange the attention and feedforwards, and I'm considering how the two papers interact

dawn vine Dec 17, 2023, 7:08 AM

#

How about vs where they're the same unified layer, e.g. mamba

fallen spear Dec 17, 2023, 7:09 AM

#

soft bobcat I was thinking about effective rank, because if FF layers are undergoing rank co...

how is effective rank measured that leads to this?

#

iirc there are a couple of ways to do it, none of them are actually svd

soft bobcat Dec 17, 2023, 7:11 AM

#

fallen spear oh, references: https://arxiv.org/abs/2001.08361 https://arxiv.org/abs/2310.1995...

same way as in this paper

fallen spear Dec 17, 2023, 7:11 AM

#

i confess to not having gone as deeply into that as i would like to have

soft bobcat Dec 17, 2023, 7:11 AM

#

dawn vine How about vs where they're the same unified layer, e.g. mamba

I don't know the mamba architecture yet, so I can't answer

#

dawn vine Dec 17, 2023, 7:12 AM

#

Mamba didn't invent it, but basically they expand 2x then use that like v then gate and contract
So the attention equivalent is done at 2x
But residual is still 1x
No separate ffn

soft bobcat Dec 17, 2023, 7:13 AM

#

oh hmm, I read it wrong. I thought it was dividing by the first eigenvalue

fallen spear Dec 17, 2023, 7:15 AM

#

soft bobcat

oh, you actually have ieee access?

soft bobcat Dec 17, 2023, 7:16 AM

#

fallen spear oh, you actually have ieee access?

https://core.ac.uk/download/pdf/147929764.pdf

fallen spear Dec 17, 2023, 7:18 AM

#

soft bobcat https://core.ac.uk/download/pdf/147929764.pdf

thanks

soft bobcat Dec 17, 2023, 7:20 AM

#

it appears erank just always gives low numbers. here's an example: I specify an infinite list of eigenvalues 1/x^2. that seems perfectly reasonable and our matrix isn't collapsed or anything

#

the sum of all these eigenvalues is pi^2/6, so I divide by this.

#

then I calculate the Shannon entropy: https://www.wolframalpha.com/input?i=sum+of+-1%2Fx^2+*+(6+%2F+pi^2)+*+ln(1%2Fx^2+*+(6+%2F+pi^2))+from+1+to+infinity

#

and the effective rank I get is 4.94

#

so this innocuous distribution of eigenvalues has very low effective rank
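
A minimal sketch of the calculation above, assuming effective rank means exp of the Shannon entropy of the eigenvalues after they are scaled to sum to 1 (the definition described in this thread). The infinite list is cut off at a hypothetical `n_terms`, so the result is approximate; it lands close to 5, in the same low range as the figure quoted above.

```python
import math

def effective_rank(eigenvalues):
    """exp(Shannon entropy) of the eigenvalues after scaling them to sum to 1."""
    total = sum(eigenvalues)
    entropy = 0.0
    for value in eigenvalues:
        p = value / total
        if p > 0.0:
            entropy -= p * math.log(p)
    return math.exp(entropy)

# Assumption: truncate the infinite 1/k^2 spectrum at n_terms entries.
n_terms = 100_000
spectrum = [1.0 / (k * k) for k in range(1, n_terms + 1)]
erank = effective_rank(spectrum)
top_share = spectrum[0] / sum(spectrum)  # share held by the largest eigenvalue
print(f"effective rank ~ {erank:.2f}, top eigenvalue share ~ {top_share:.2f}")
```

A flat spectrum is a useful sanity check: ten equal eigenvalues give an effective rank of 10.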

fallen spear Dec 17, 2023, 7:22 AM

#

weird definition

soft bobcat Dec 17, 2023, 7:23 AM

#

I actually like it other than the fact that it seems to have low informational value

#

it scales the eigenvalues so they sum to 1. then checks the Shannon entropy. all of these are natural operations

#

I think it's just really sensitive to the highest eigenvalues. because in my pi^2/6 example, the first eigenvalue, 1, takes 60% of the whole distribution

fallen spear Dec 17, 2023, 8:18 AM

#

i will have to think about it, I am not entirely sure how much sense it makes to measure those things

soft bobcat Dec 17, 2023, 4:55 PM

#

I decided it does make sense. if the input is a random vector, the effective rank should give you a good idea of how many ranks you need

#

so for a matrix with effective rank 10, using a 100-dimensional low-rank approximation should work ok

#

it's only if the input vector is non-random (such as if it avoids the eigenvector with the high eigenvalue) that effective rank breaks down

fallen spear Dec 17, 2023, 5:40 PM

#

soft bobcat so for a matrix with effective rank 10, using a 100-dimensional low-rank approxi...

i think this needs to be tested empirically

#

there should be some function which takes a matrix and its effective rank and gives back a low rank matrix that is a lossless approximation

soft bobcat Dec 17, 2023, 5:49 PM

#

my guess is it's already been tested and didn't work, so now I'm thinking about why

soft bobcat Dec 17, 2023, 5:49 PM

#

fallen spear there should be some function which takes a matrix and its effective rank and gi...

this won't work: in the 1/n^2 case, you need the whole infinite-dimensional matrix, to be lossless

fallen spear Dec 17, 2023, 6:18 PM

#

soft bobcat this won't work: in the 1/n^2 case, you need the whole infinite-dimensional matr...

because effective rank != rank?

#

would this work if we had a trainable eff_rank size matrix matmul'd into a static actual_rank size orthonormal matrix

#

(or: some function of eff_rank size)

soft bobcat Dec 17, 2023, 10:25 PM

#

fallen spear because effective rank != rank?

right. effective rank is only an approximation. so it tells you something like "you'll capture 99% of the eigenvalues" but you'll never get all of them
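
To make the "you'll capture 99% of the eigenvalues" point concrete for the 1/n^2 example, here is a rough sketch that counts how many leading eigenvalues are needed to reach a given share of the total. Reading "capture" as "share of the sum of eigenvalues" is an assumption, and `components_for_mass` is a made-up helper name.

```python
import math

def components_for_mass(target_fraction):
    """Smallest n such that the top-n eigenvalues 1/k^2 hold target_fraction of the total."""
    total = math.pi ** 2 / 6  # exact sum of 1/k^2 over all k >= 1
    running = 0.0
    n = 0
    while running < target_fraction * total:
        n += 1
        running += 1.0 / (n * n)
    return n

n_half = components_for_mass(0.50)
n_99 = components_for_mass(0.99)
print(n_half, n_99)
```

The count for 99% comes out roughly an order of magnitude above the effective rank of this spectrum, and it keeps growing as the target approaches 100%, which matches the point that no finite truncation is lossless here.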

soft bobcat Dec 17, 2023, 10:25 PM

#

fallen spear would this work if we had a trainable eff_rank size matrix matmul'd into a stati...

what I expect is that people have tried low-rank approximations to the low-effective-rank matrices, but that they don't work

soft bobcat Dec 17, 2023, 10:35 PM

#

soft bobcat what I expect is that people have tried low-rank approximations to the low-effec...

oh, that's us. we tried this

#

so even though the effective rank is small, the whole matrix matters

fallen spear Dec 17, 2023, 10:58 PM

#

soft bobcat oh, that's us. we tried this

wait when did we try this

soft bobcat Dec 17, 2023, 10:59 PM

#

this was the point of this channel

#

low effective rank suggests that what is needed is a very complex activation function, but not a large dimension

#

Smerky tried that. no improvements

soft bobcat Dec 17, 2023, 11:01 PM

#

soft bobcat low effective rank suggests that what is needed is a very complex activation fun...

GeGLU is exactly that

fallen spear Dec 17, 2023, 11:02 PM

#

oh, yes, that

#

i think it needs more trying

#

but also i am being picky about my env

soft bobcat Dec 17, 2023, 11:03 PM

#

nanoGPT might work, but I haven't tried, it's on my backlog

fallen spear Dec 17, 2023, 11:16 PM

#

i will probably hack and slash a few repos together into something i feel good about and that fits gracefully on my local

#

i would do pythia but trying to build it gave me hives

#

will probably mimic existing pythia model size/settings tho

dawn vine Dec 18, 2023, 2:57 AM

#

fallen spear i will probably hack and slash a few repos together into something i feel good a...

You're welcome to use my training repo (that I keep delaying open sourcing) if you want something that's easy to configure and supports various transformer types. It also can run remotely on vast or wherever bc it streams the pile

fallen spear Dec 19, 2023, 5:44 AM

#

dawn vine You're welcome to use my training repo (that I keep delaying open sourcing) if y...

I am fiddling with hardware at the moment but would love to look at the code as reference at a minimum

dawn vine Dec 19, 2023, 6:22 AM

#

oh btw I also just got a new model working that's relevant here bc it uses a variation on Gated Attention Unit (GAU) 2202.10447
which is a combination of FFN and Attention into a single layer with only a 2x expansion... uses fewer params so you can double the layer count

#

took some fiddling, but now I have it really performing well (versus equivalents w/ separate 3x expansion FFNs)

dawn vine Dec 19, 2023, 11:41 PM

#

having some significant success even with 1.5x expansion with this method where you get to increase the number of layers

#

makes me wonder if multiple layers in a row of smaller expansion FFNs would have been a more effective use of parameters than traditional transformer alternation of wider ffn and attn

fallen spear Dec 20, 2023, 1:59 AM

#

you have to do them sequentially though, no?

#

i ask because: equiflops, but increases depth of computation and therefore time, maybe

#

in general depth has always been better than width; the question is whether there's a regime with no tradeoff

fallen spear Dec 20, 2023, 2:47 AM

#

i think i am personally very convinced that parameters are better off literally anywhere but in the ff

thick briar Dec 20, 2023, 8:09 AM

#

fallen spear you have to do them sequentially though, no?

The sequentiality is what provides the expressive power though

#

Have you seen eqbench? It's a new benchmark for emotional comprehension and intelligence. The Mixtral model, which otherwise matches or surpasses similarly sized models in benchmarks like MMLU, etc., performs barely better than the Mistral models 8x smaller than it

#

I hypothesize that this is due to the lack of depth. 32 layers is just not enough for complex emotional understanding.

#

As you said, depth is better than width in pretty much every way, and while deeper models are harder to train, they're also more parameter-efficient. I believe the next open-source models will be much deeper than the ones that we already have.

#

As an aside, another advantage of scaling depth instead of width is that it gives SSM/other efficient LM architectures a larger hidden state than they would have otherwise.

dawn vine Dec 20, 2023, 7:32 PM

#

thick briar As an aside, another advantage of scaling depth instead of width is that it give...

interesting point, and I think that goes for transformers as well - you just recalc it or cache it

dawn vine Dec 20, 2023, 7:35 PM

#

fallen spear i think i am personally very convinced that parameters are better off literally ...

maybe standard ff is just the dumbest possible thing that implements what we want, and we need better tech for channel mixing just like we have better tech for time mixing (called attention and its variants)

#

meaning that we can't get rid of channel mixing, but we can do it with a much less wasteful algorithm

#

that's my hypothesis for why moving parameters away from it has seemed helpful to date

fallen spear Dec 20, 2023, 7:41 PM

#

dawn vine maybe standard ff is just the dumbest possible thing that implements what we wan...

... that's an interesting thought, what are all the time mixing variants currently out there?

#

i am not up to date on the semi-RNNs bouncing around. if any of their mixer steps can plausibly be done with a low parameter budget, one of their functions works as what we're calling an "activation function"

#

and also a half baked theory: it is possible that sensitivity to d_ff/d_model depends heavily on initialization. bad initializations will benefit more from the upscale, because it is more likely that some of the parameters are closer to a good initialization since there are more of them. a good initialization may benefit less

boreal moss Dec 23, 2023, 3:39 PM

#

Most likely most of what ffn does is just memorizing stuff, so I doubt that in language models anything will be **much** better than MLP

fallen spear Dec 23, 2023, 6:36 PM

#

boreal moss Most likely most of what ffn does is just memorizing stuff, so I doubt that in l...

anything which is expressible with a 4x up projection could be expressed otherwise since 75% of the up projection will be linearly dependent on the other 25%

#

it seems almost optimally inefficient to me

fallen spear Dec 23, 2023, 6:38 PM

#

fallen spear anything which is expressible with a 4x up projection could be expressed otherwi...

i look forward to hearing of a mathematical nicety that makes this less true

soft bobcat Dec 23, 2023, 6:42 PM

#

if the nonlinearity is fixed as e^(i c x) for trainable weight c, then 4x up projection gets you 4x the spectrum. although a trainable nonlinearity could probably accomplish similar things

fallen spear Dec 23, 2023, 7:00 PM

#

soft bobcat if the nonlinearity is fixed as e^(i c x) for trainable weight c, then 4x up pro...

if you feel inclined: we have, as our ideal scenario, a scale up by two. exactly one half of the outputs before nonlinearity are rescales of the other half. we get extra signal because every case where nonlinearity behavior varies based upon the rescale gives us an extra divide between linear regions in the represented function

#

is that accurate

soft bobcat Dec 23, 2023, 7:01 PM

#

yes. it's like the second half is a rotation, then a nonlinearity in the rotated direction

boreal moss Dec 23, 2023, 7:01 PM

#

linear 1x and on that 4x shift/scale and nonlinearity

fallen spear Dec 23, 2023, 7:03 PM

#

soft bobcat yes. it's like the second half is a rotation, then a nonlinearity in the rotated...

if we take relu as our ideal activation i don't see how this works

#

if your value is below zero so is its rescale

#

you get no information from knowing that x is 3 and 3x is 9

#

you should have the same number of linear regions

soft bobcat Dec 23, 2023, 7:05 PM

#

I'm not sure what you mean by value being below 0, because it's all pre-activation. you can take as an example d_model = 1, d_FF = 2, ReLU activation. then you get two kinks. if d_FF = 1, you get one kink

boreal moss Dec 23, 2023, 7:05 PM

#

Just like 4 different parametric activation functions per every neuron

#

In some cases it will be more efficient but it is some kind of inductive bias

fallen spear Dec 23, 2023, 7:08 PM

#

soft bobcat I'm not sure what you mean by value being below 0, because it's all pre-activati...

we have some FF + activation; the FF is FF(x), the activation is G(x), and their composition is G(FF(x)) which I will also call H(x) for disambiguation. what we care about is the expressivity of H(x)

#

no, I am thinking of the input

#

oh, i will correct

#

so if we assume x is a vector of breadth (concretely) 4, and our FF is a 16x4, 12 of the values of FF(x) are linearly dependent on the other 4

#

if G is relu, we should have the same number of linear regions as if FF was 4x4

#

because a given value of FF(x) will be above 0 if and only if all of its rescales are also

#

so H(x) is not more expressive, there is no set of things in H(x) with 16x4 FF that couldn't be represented with a 4x4

soft bobcat Dec 23, 2023, 7:13 PM

#

are you considering FF to not have weights?

fallen spear Dec 23, 2023, 7:13 PM

#

no, it has weights

#

it is a normal boring very vanilla FF

#

just: it has to be collinear because it is of rank 4

soft bobcat Dec 23, 2023, 7:13 PM

#

ReLU(1+x), ReLU(2+x), ReLU(3+x)...

fallen spear Dec 23, 2023, 7:13 PM

#

oh, you mean bias

boreal moss Dec 23, 2023, 7:13 PM

#

And what happens when you add bias

soft bobcat Dec 23, 2023, 7:14 PM

#

sorry, yes, I meant bias. but note that if one of the inputs is 1, it's equivalent to having bias
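
A small sketch of the d_model = 1 case discussed above, counting kinks (slope changes) of a sum of ReLU units on a grid. The weights, biases, and grid are made-up illustration values. Without bias every unit is a rescale of x and they all bend at 0; with the biases 1, 2, 3 each unit bends somewhere different.

```python
def relu(z):
    return z if z > 0.0 else 0.0

def hidden_sum(x, weights, biases):
    # H(x) for d_model = 1: sum of ReLU(w_i * x + b_i) over the hidden units
    return sum(relu(w * x + b) for w, b in zip(weights, biases))

def count_kinks(weights, biases, lo=-8.0, hi=8.0, step=0.25):
    """Count grid points where the slope changes (nonzero second difference)."""
    n = int(round((hi - lo) / step))
    xs = [lo + i * step for i in range(n + 1)]
    ys = [hidden_sum(x, weights, biases) for x in xs]
    kinks = 0
    for i in range(1, n):
        second_diff = ys[i + 1] - 2.0 * ys[i] + ys[i - 1]
        if abs(second_diff) > 1e-9:
            kinks += 1
    return kinks

# Three rescales of x, no bias: every unit switches on at x = 0.
no_bias = count_kinks([1.0, 2.0, 3.0], [0.0, 0.0, 0.0])
# ReLU(1+x), ReLU(2+x), ReLU(3+x): the units switch on at x = -1, -2, -3.
with_bias = count_kinks([1.0, 1.0, 1.0], [1.0, 2.0, 3.0])
print(no_bias, with_bias)
```

The grid is chosen so that each kink falls exactly on a grid point; a kink between grid points would show up as two adjacent nonzero second differences and would need merging.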

fallen spear Dec 23, 2023, 7:14 PM

#

okay, bias means i am wrong. i do peripherally remember that people were removing bias from their networks though

boreal moss Dec 23, 2023, 7:14 PM

#

Yes and it still works

soft bobcat Dec 23, 2023, 7:14 PM

#

so bias and no bias are basically the same, if you allow the NN to optionally decide to fix one x_i to always 1

boreal moss Dec 23, 2023, 7:14 PM

#

Weird

#

Oh right

#

You just need one source of constant value on the input of the network and FFNs can emulate bias

fallen spear Dec 23, 2023, 7:16 PM

#

i had a previous theory that it might make sense to reduce activation width and just give it a set of constants

#

this came specifically from the dettmers thing about outliers

#

where it basically used outlier weights as constants

#

but that is tangential; thank you for refining my mathematical intuition for this one

fallen spear Dec 23, 2023, 7:22 PM

#

fallen spear i had a previous theory that it might make sense to reduce activation width and ...

(basically: every possible power of two for your precision)

#

intuition was that if networks were doing a lot of work to create constants you should give them constants so they could not

boreal moss Dec 23, 2023, 7:23 PM

#

That's called bias 🤣

fallen spear Dec 23, 2023, 7:23 PM

#

boreal moss That's called bias 🤣

sure, i don't think that bias has been done

#

trainable bias can in theory do this but in practice has to work to do it

#

when it is so simple that it is basically a noop

#

i don't know what portion of grad has to be oriented to creating a +50 or +100 weight in a well-trained llm but it has to be some significant fraction of it if it's perpetually pushing that value up and keeping it from going down

#

bias neurons like this didn't show huge gains in a previous era but previous eras didn't have well-trained networks spontaneously deciding a single index had to always be unbelievably high

boreal moss Dec 23, 2023, 7:28 PM

#

Does the scale of the value make a difference in the context of optimizers like Adam?

soft bobcat Dec 23, 2023, 7:29 PM

#

you can just relax the weight decay for biases if you need a huge one. if you init huge biases, they will be huge, but in the wrong direction

fallen spear Dec 23, 2023, 7:30 PM

#

soft bobcat you can just relax the weight decay for biases if you need a huge one. if you in...

i don't see why it doesn't make sense to just give the network values equal to each allowed power of two to calculate with for when it wants a hard bias input

#

it can ignore them if it doesn't want them

#

it consumes a negligible portion of width, e.g. 17 values for float32

#

for clarity: this is a bias activation, not a bias that ever gets applied. it's a constant value that is always there

#

so every FF would go from eg 1024x1024 to 1024x1007 and then you'd concat in your biases

soft bobcat Dec 23, 2023, 7:33 PM

#

so the middle neurons of the FF layer are like {H(x), 1, 2, 4, 8...}?

fallen spear Dec 23, 2023, 7:33 PM

#

yes, and similarly for negative powers of two

#

this is completely tangential to the previous thing

#

just thinking about using inputs as constants reminded me of it

soft bobcat Dec 23, 2023, 7:33 PM

#

1, 2, 4, 8 are all equivalent as inputs; all they change is scale of gradients and weight decay

fallen spear Dec 23, 2023, 7:34 PM

#

specifically allowing linear combinations of them makes all numbers representable in the next output

soft bobcat Dec 23, 2023, 7:34 PM

#

but there are weights multiplied to these biases. so 1w = 2 (w/2) = 4(w/4) = ...
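
A toy sketch of the 1w = 2(w/2) = 4(w/4) point: any constant can produce the same contribution with a rescaled weight, but the gradient with respect to that weight scales with the constant. Plain SGD with no weight decay is assumed; with Adam the per-parameter step is normalized, so the picture changes. All names and values are illustrative.

```python
def contribution(constant, weight):
    # what the next layer receives from a constant activation times its weight
    return constant * weight

def sgd_step_change(constant, upstream_grad, lr):
    # d(contribution)/d(weight) = constant, so one SGD step moves the weight by
    # -lr * upstream_grad * constant, and the contribution by constant times that.
    weight_delta = -lr * upstream_grad * constant
    return constant * weight_delta

constants = (1.0, 2.0, 4.0, 8.0)
target = 3.0
same_outputs = [contribution(c, target / c) for c in constants]
changes = {c: sgd_step_change(c, upstream_grad=1.0, lr=0.01) for c in constants}
print(same_outputs)
print(changes)
```

The per-step change in the contribution grows with the square of the constant, which is one way to see why a very large constant such as 2^30 would swamp training under this assumption.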

fallen spear Dec 23, 2023, 7:35 PM

#

yes but weights are trainable and don't like to fix directly on integers or powers of two or any such direct numerical thing

soft bobcat Dec 23, 2023, 7:36 PM

#

so what you mean is {H(x), 1, 2, 4, 8...} and then not multiply the constants by weights?

fallen spear Dec 23, 2023, 7:37 PM

#

the next layer can use it for whatever it wants to use it for, my first guess was that since outlier weights seemed designed to create freakishly large activations to use to zero out inputs ... sec, cite: https://timdettmers.com/2022/08/17/llm-int8-and-emergent-features/

#

it made sense to just always have an available arbitrary-sized activation

#

and then if we were doing one because the network needed one for a specific numerical reason it made sense to try to do a number of them

#

not because I have any operation I specifically want to enable, because in principle you can do different operations with a different set of constants

#

because "some numerical operation that networks may find useful might become computable, or more easily computable, if they have a set of constants across scales"

fallen spear Dec 23, 2023, 7:40 PM

#

fallen spear it made sense to just always have an available arbitrary-sized activation

(my first thought was just INT_MAX because, again, it enables this zeroing operation)

soft bobcat Dec 23, 2023, 7:41 PM

#

reading through Dettmers's post, these constants will not accomplish the same thing. because his outliers move to different dimensions over mini-batches

#

so they have actual signal

#

when you create these constants, you are "making them available" but what you're really doing is bypassing weight decay and gradient scaling for these particular dimensions

fallen spear Dec 23, 2023, 7:42 PM

#

his outliers eventually fix to single indices

soft bobcat Dec 23, 2023, 7:42 PM

#

in particular, if you have a 2^30 constant, the rest of your network will never train

soft bobcat Dec 23, 2023, 7:42 PM

#

fallen spear his outliers eventually fix to single indices

but they don't remain constant

#

assuming he means "residual stream" by "hidden state", my guess is that he's using pre-norm transformers so the purpose of these outliers is to change the scale of the output of layer normalization
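
A rough sketch of the mechanism guessed at above: one large entry in the vector fed to layer normalization shrinks the spread of every other entry in the output. Plain LayerNorm without learned gain or bias is assumed, and the vector values are made up.

```python
import math

def layer_norm(vector, eps=1e-5):
    """Plain LayerNorm without learned gain or bias."""
    mean = sum(vector) / len(vector)
    var = sum((v - mean) ** 2 for v in vector) / len(vector)
    return [(v - mean) / math.sqrt(var + eps) for v in vector]

def spread(values):
    return max(values) - min(values)

signal = [0.5, -1.0, 1.5, -0.5, 1.0, -1.5, 0.25, -0.25]
plain = layer_norm(signal + [0.0])          # last slot holds no outlier
with_outlier = layer_norm(signal + [60.0])  # last slot holds a large outlier

# Compare only the non-outlier entries.
spread_plain = spread(plain[:-1])
spread_outlier = spread(with_outlier[:-1])
print(spread_plain, spread_outlier)
```

Under these assumptions the size of the outlier acts as a volume knob on everything else that passes through the normalization.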

fallen spear Dec 23, 2023, 7:48 PM

#

neat

fallen spear Dec 23, 2023, 7:49 PM

#

soft bobcat if the nonlinearity is fixed as e^(i c x) for trainable weight c, then 4x up pro...

e^(icx) is so elegant that i want it to work as a nonlinearity by the way

dawn vine Dec 23, 2023, 9:31 PM

#

boreal moss Most likely most of what ffn does is just memorizing stuff, so I doubt that in l...

MLP is generally great, it's just costly... maybe there is something less expensive that does nearly as well even for this 'memorization' style task?

boreal moss Dec 23, 2023, 9:34 PM

#

maybe the question of how compressible a trained mlp is should be answered first?

fallen spear Dec 23, 2023, 9:42 PM

#

i think the general answer is that they are absurdly compressible once trained but this does not mean they can be compressed when training

boreal moss Dec 23, 2023, 9:43 PM

#

what makes you think that?

fallen spear Dec 23, 2023, 9:44 PM

#

you can sparsify and quantize networks with relatively little performance loss

#

i would say "no performance loss" but people have apparently stopped caring if their compression techniques were lossy

#

not very long ago people still did that

#

and it generally worked

boreal moss Dec 23, 2023, 9:47 PM

#

quantization yes, but that is a more general thing that suggests that we use unnecessarily high precision in general, sparsifying I don't think so

fallen spear Dec 23, 2023, 9:48 PM

#

pretraining in lower precision doesn't work

boreal moss Dec 23, 2023, 9:48 PM

#

sparsifying is more like quantization of some number of weights to 1 bit

fallen spear Dec 23, 2023, 9:49 PM

#

sparsifying is throwing out those weights completely

boreal moss Dec 23, 2023, 9:49 PM

#

no

#

don't you need to know which weights? 🙂

fallen spear Dec 23, 2023, 9:50 PM

#

not at runtime, no

boreal moss Dec 23, 2023, 9:51 PM

#

you don't understand what I'm saying or are you talking about some kind of structured sparsifying?

fallen spear Dec 23, 2023, 9:52 PM

#

boreal moss you don't understand what I'm saying or are you talking about some kind of struct...

boreal moss Dec 23, 2023, 9:56 PM

#

so a subset of the weight matrix is zeros, so those don't need to be multiplied, but you still need to store a binary matrix marking which ones are zeros
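
A back-of-envelope sketch of the storage point: with unstructured sparsity you keep the nonzero values plus a 1-bit-per-entry mask. The fp16 weights, the bitmask layout, and the matrix shape are all assumptions for illustration; real sparse formats (index lists, block or N:M patterns) have different overheads.

```python
def dense_bytes(rows, cols, bytes_per_weight=2):
    return rows * cols * bytes_per_weight

def masked_sparse_bytes(rows, cols, density, bytes_per_weight=2):
    """Nonzero values plus a 1-bit-per-entry mask marking which entries are zero."""
    nonzeros = int(rows * cols * density)
    mask_bytes = (rows * cols + 7) // 8  # ceiling division: bits -> bytes
    return nonzeros * bytes_per_weight + mask_bytes

rows, cols = 1024, 4096
dense = dense_bytes(rows, cols)
half = masked_sparse_bytes(rows, cols, density=0.5)
tenth = masked_sparse_bytes(rows, cols, density=0.1)
mask_only = masked_sparse_bytes(rows, cols, density=0.0)
print(dense, half, tenth, mask_only)
```

The mask alone costs one sixteenth of the dense fp16 matrix here, which fits the "quantizing some weights to 1 bit" reading: a pruned weight still costs a bit unless the sparsity pattern has structure.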

thick briar Dec 24, 2023, 4:37 AM

#

How I understand the FFN is a key-value store. q (the input) is matched with k (up_proj) to produce the "attention" scores that determine which items in v (down_proj) are added to the residual stream.

The LLM stores patterns in k, and what it should think if those patterns are matched against in v.

This interpretation makes me suspect that k is overparameterized. While there are probably some patterns that are dependent on all dimensions of the input sequence, it seems more likely that the majority of patterns are not. We could have multiple separate up-projections, some dependent on some dimensions of the input sequence and some dependent on all dimensions.
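
The key-value reading above can be sketched as follows. This is an illustration only: `keys` stands for the rows of up_proj, `values` for the rows of down_proj, GeLU is assumed as the activation, and the tiny matrices are made up. Unlike attention there is no softmax, so scores are not normalized across keys.

```python
import math

def gelu(z):
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))

def ffn_as_key_value(x, keys, values):
    """x: input vector; keys[i]: row i of up_proj; values[i]: row i of down_proj."""
    out = [0.0] * len(values[0])
    for key, value in zip(keys, values):
        score = gelu(sum(k * xi for k, xi in zip(key, x)))  # how strongly x matches this key
        for j, v in enumerate(value):
            out[j] += score * v  # add the matched value, weighted by its score
    return out

keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
out = ffn_as_key_value([3.0, 0.0], keys, values)
print(out)
```

In this toy case only the first key lines up with the input, so almost all of the output comes from the first value row. Restricting each key to a subset of input dimensions would correspond to zeroing most entries of each `keys` row.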

#

We might also consider how LLMs might be forced to store knowledge. If FFNs are a key-value store, there are only so many keys to match against. Likely, the LLM divides knowledge into categories and assigns each category a key-value pair. When the key detects that the input is looking for knowledge from that category, it spits out all the knowledge it has about that category into the residual stream for future access.

#

This seems extraordinarily inefficient. A better way might be to have an extremely large number of keys that match with a subset of the dimensions of the input sequence, coupled with an extremely large number of values that output onto a subset of the residual stream. This would allow for the LLM to implement fine-grained knowledge storage and retrieval.

soft bobcat Dec 24, 2023, 4:48 AM

#

thick briar We might also consider how LLMs might be forced to store knowledge. If FFNs are ...

my project is somewhat similar but more mathematically oriented: [a manifold is an atlas of charts]. then, a token is a collection of {low-dimensional vector, number that specifies an array index}. the array index is a lookup into a table of charts. charts have matrices which specify how vectors in two charts interact with each other, and these matrices form a graph

#

if you don't understand the part in [ ], you can skip it

#

actually maybe not; the idea looks nonsensical unless you have intuition about how manifolds should behave

boreal moss Dec 24, 2023, 4:45 PM

#

thick briar This seems extraordinarily inefficient. A better way might be to have an extreme...

if every key is from a different subset it is just sparsifying and sparsifying is just a special case of quantization. if those subsets are relatively small (so most of the entries in the matrix are zeros), then kolmogorov complexity has to be low, but inference on gpus will not benefit from that. only if we can somehow find some kind of structure in that sparse matrix can we leverage it and design a more efficient architecture with the right inductive biases

soft bobcat Dec 24, 2023, 5:53 PM

#

boreal moss if every key is from a different subset it is just sparsifying and sparsifying i...

"block sparse" is what I'm using, but it's more complicated than sparsifying existing dense networks

fallen spear Dec 24, 2023, 5:57 PM

#

boreal moss so a subset of the weight matrix is zeros, so those don't need to be multiplied, b...

i realized later that this was correct, that said i think sparse matmul gets more efficient for this when very sparse (eg when you zero entire rows or columns)

#

and iirc you can usually sparsify so far you have to be zeroing out entire rows

fallen spear Dec 24, 2023, 5:58 PM

#

soft bobcat actually maybe not; the idea looks nonsensical unless you have intuition about h...

can confirm that i don't understand it

fallen spear Dec 24, 2023, 5:59 PM

#

fallen spear and iirc you can usually sparsify so far you have to be zeroing out entire rows

or blocks i guess

soft bobcat Dec 24, 2023, 6:07 PM

#

fallen spear can confirm that i don't understand it

here's a non-mathy explanation: a residual stream needs to represent every possible concept that might want to be represented, simultaneously. so if we want to represent a car in the residual stream, we also have to represent that car's relationship to Paris - zero. and its eloquence when giving a speech - zero. most concepts simply aren't related. so it'd be better if we could have a low-dimensional vector, specifically for car-related things, and not have to store all the unrelated parts

#

but if we do this, then our low dimensional vectors no longer have any obvious relationship to other low dimensional vectors. we must store those relationships between pairs of related concepts. so if we have cars and doors, and we want to do Q and K stuff with them - in the regular transformer, we apply Q and K to the giant-dimensional vector. here, we must store a transition matrix that relates cars and doors to each other

fallen spear Dec 24, 2023, 6:15 PM

#

soft bobcat but if we do this, then our low dimensional vectors no longer have any obvious r...

this makes sense to me, thank you

boreal moss Jan 5, 2024, 4:14 PM

#

@fallen spear I didn't read the whole thread but I see that you were using the graph from https://arxiv.org/pdf/2310.19956.pdf that suggests that for small models the standard mlp ratio of 4 is too big, bad and wrong. this is complete BS: their 41M model with mlp ratio=4 is a two-layer model, and that is the reason it is junk, not the mlp ratio. for mlp ratio=1 it has 4 layers, which is enough to work okish, and that's why it is better. I also think that all their 41M models are highly suboptimal because they just set the model dim wrong for 41M parameters.

soft bobcat Jan 6, 2024, 11:29 PM

#

the reason swiglu underperforms geglu seems to simply be the scale of x. if you use swiglu(1.702x)/1.702, it matches geglu. and likewise, if you use swiglu(x/1.702)*1.702, it does a lot worse https://wandb.ai/ad8e/tinystories3?workspace=user-ad8e

#

however, I have to go play tennis and will be back in a few hours

#

that probably means init is really important, and that I have to figure out muP things
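
One way to read the 1.702 factor mentioned above: x * sigmoid(1.702x) is the standard sigmoid approximation to GeLU, so rescaling the Swish/SiLU gate this way makes it nearly coincide with the GeLU gate. The sketch below compares only the gate nonlinearities on a grid; it says nothing about training behavior, and the grid range is an arbitrary choice.

```python
import math

def gelu(x):
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def silu(x):
    return x / (1.0 + math.exp(-x))

def scaled_silu(x, scale=1.702):
    # silu(scale * x) / scale == x * sigmoid(scale * x)
    return silu(scale * x) / scale

grid = [i / 100.0 for i in range(-600, 601)]
gap_plain = max(abs(silu(x) - gelu(x)) for x in grid)
gap_scaled = max(abs(scaled_silu(x) - gelu(x)) for x in grid)
print(f"max |silu - gelu| = {gap_plain:.3f}, max |scaled silu - gelu| = {gap_scaled:.3f}")
```

The rescaled gate stays much closer to GeLU than plain SiLU does, which is consistent with the observation that the difference between the two comes down to the scale of x.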

soft bobcat Jan 7, 2024, 7:40 AM

#

most of the variants on a(x1) * b(x2) * c(x3) perform basically the same, for functions a, b, c

#

they are all much better than regular GeLU and all have comparable performance to each other and GeGLU, as long as a b c are ok-ish

#

for example, exp cannot be any of the three

soft bobcat Jan 7, 2024, 8:52 AM

#

ReLU^2 is giving pretty bad results. I think because the gradient is not capped, and instead grows as |x|

#

logsigmoid(x) * sin(x) also has bad results and shares this gradient issue
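
A quick numerical look at the gradient claim, using central finite differences. This is a sketch under the assumption that "gradient" here means the derivative of the activation with respect to its input; the sample points are arbitrary, and logsigmoid(x) * sin(x) is probed at negative x, where logsigmoid is roughly linear.

```python
import math

def gelu(x):
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def relu_squared(x):
    return x * x if x > 0.0 else 0.0

def logsigmoid(x):
    # numerically stable log(sigmoid(x))
    return -math.log1p(math.exp(-x)) if x >= 0.0 else x - math.log1p(math.exp(x))

def logsigmoid_sin(x):
    return logsigmoid(x) * math.sin(x)

def slope(f, x, h=1e-4):
    # central finite difference approximation of f'(x)
    return (f(x + h) - f(x - h)) / (2.0 * h)

for x in (1.0, 10.0, 50.0):
    print(x, round(slope(gelu, x), 3), round(slope(relu_squared, x), 3),
          round(slope(logsigmoid_sin, -x), 3))
```

GeLU's slope settles near 1 for large inputs, while the slope of ReLU^2 is 2x and the slope of logsigmoid(x) * sin(x) oscillates with an envelope that grows like |x| on the negative side.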

soft bobcat Jan 7, 2024, 7:28 PM

#

testing results on small models:

High confidence:
- multiplying multiple activation functions is better than not multiplying. that means these things all work and are much better than GeLU:
  - GeLU(x1) * x2 = GeGLU
  - GeLU(x1) * GeLU(x2)
  - GeLU(x1) * sinc(x2)
  - GeLU(x1) * ReLU^2(x2)
  - GeLU(x1) * ELU(x2)
  - x1 * logsigmoid(x2) * sin(x2)
- you can go up to 3 and the performance is the same, up to noise:
  - tanh * x2 * logsigmoid
  - gelu * x2 * elu
  - GeLU * linear * linear

Medium confidence:
- scale of the gradient of an activation function needs to not grow too fast
- exp is unsuitable as an activation function, though different inits may change things
- quadratic (ReLU^2) may build up gradient issues. this may be bad at higher scales. same with logsigmoid(x) * sin(x). multiplying two quadratics may cause issues

#

theory speculation, unconfirmed by experiment:
symmetry seems to be bad. so GeLU(x1) * GeLU(x2) is worse than GeLU(x1) * x2. and x1 * x2 is worse than GeGLU
linear also has symmetry. maybe there would be a better activation function than linear, in GeGLU
perhaps only one of the activation functions should have a 0-zone
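
For reference, the a(x1) * b(x2) family listed above can be sketched as an elementwise product of two halves of the up-projection output. The helper `gated` and the sample vectors are illustrative assumptions, not the code used for the runs.

```python
import math

def gelu(x):
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def relu_squared(x):
    return x * x if x > 0.0 else 0.0

def identity(x):
    return x

def gated(x1, x2, act_a, act_b):
    """Elementwise a(x1) * b(x2) for two equal-length halves of the up-projection."""
    return [act_a(u) * act_b(v) for u, v in zip(x1, x2)]

x1 = [-1.0, 0.5, 2.0]
x2 = [0.3, -2.0, 1.0]
geglu = gated(x1, x2, gelu, identity)           # GeLU(x1) * x2
gelu_gelu = gated(x1, x2, gelu, gelu)           # GeLU(x1) * GeLU(x2)
gelu_relu2 = gated(x1, x2, gelu, relu_squared)  # GeLU(x1) * ReLU^2(x2)
print(geglu, gelu_gelu, gelu_relu2)
```

Three-way products follow the same pattern with a third half and a third function; the down-projection is applied to the product afterwards.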

soft bobcat Jan 7, 2024, 10:08 PM

#

my current best activation function is tanh(x1) * x2^2

#

it does a little better than GeGLU, but the improvement is much less than the improvement of GeGLU > GeLU

soft bobcat Jan 7, 2024, 10:52 PM

#

https://wandb.ai/ad8e/tinystories3/reports/GeGLU-vs-tanh-quad-vs-GeLU-linear-quad--Vmlldzo2NDQyNTA1

W&B

GeGLU vs tanh quad vs GeLU * linear-quad

Publish your model insights with interactive plots for performance metrics, predictions, and hyperparameters. Made by Kevin Yin using Weights & Biases

dawn vine Jan 8, 2024, 1:05 AM

#

a lot of the time this stuff doesn't hold up on larger models, annoyingly

#

@boreal moss tried a ton of stuff and found some good ones for small models tho

soft bobcat Jan 8, 2024, 1:24 AM

#

my models are 10M, trained for 50M chars. the benefits show early in training and then are marginal later, but persist. they're above the noise floor but they're not that important

#

I am a bit skeptical of cubic functions right now because functions with higher gradients seem to show instability

boreal moss Jan 8, 2024, 3:48 AM

#

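@soft bobcat Yes, but when I added a "very leaky" tanh to a "trilinear" ffn, it looks like it tolerates higher learning rates

# in __init__
self.fc1 = nn.Linear(dim, ffdim * 3)  # one up-projection, split into three chunks below
self.fc2 = nn.Linear(ffdim, dim)

# in forward
x = self.fc1(x)
x = torch.tanh(x) + 0.1 * x  # "very leaky" tanh: tanh plus a 0.1 * x linear term
x = torch.chunk(x, 3, -1)
x = self.fc2(x[0] * x[1] * x[2])  # multiply the three chunks, then project back to dim

I didn't test this extensively, just ran one test and compared to no activation function for dim=256 ffdim=512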

soft bobcat Jan 8, 2024, 3:50 AM

#

sounds reasonable, the tanh should restrict the gradients to a reasonable range

#

the three-part FFNs I tested were no better than GeGLU; mostly, they were equivalent. maybe I didn't look in the right places

soft bobcat Jan 10, 2024, 8:19 PM

#

this probably explains why the effective rank was so low

soft bobcat Jan 10, 2024, 11:16 PM

#

changing GeLU(x) -> GeLU(x) * x performs equivalently to GeGLU = GeLU(x1) * x2 https://wandb.ai/ad8e/tinystories single activation?workspace=user-ad8e

soft bobcat Jan 11, 2024, 1:47 AM

#

SE Gyges has been gone for a week now

icy sentinel Jan 11, 2024, 3:03 AM

#

soft bobcat SE Gyges has been gone for a week now

i hear discord gets really salty if you do a chargeback on them

#

you might want to tag him on twitter if it's important, otherwise will be back when back

icy sentinel Jan 11, 2024, 3:06 AM

#

soft bobcat that probably means init is really important, and that I have to figure out muP ...

if you want to do some clever math, consider assuming that the existing initialization is muP (even if not) and modifying the init accordingly when you modify the activation

#

what "accordingly" means I have no idea

soft bobcat Jan 11, 2024, 3:06 AM

#

I mentioned Gyges a few times to send him info, which can be found by Ctrl+F mentions of his name

icy sentinel Jan 11, 2024, 3:06 AM

#

Appreciate it 🙂

soft bobcat Jan 11, 2024, 3:07 AM

#

all initializations for a fixed model are muP, in the old meaning of muP. because muP did not give any indication of correct inits, and only said how to transfer inits from one scale to another

#

the new muP paper does give math on how to scale params, and I'm working it out on my model now

icy sentinel Jan 11, 2024, 3:09 AM

#

i would guess that for very simple modifications (ie reducing layer width) you can simply rescale to maintain the infinite-width limit

#

for less simple ones like geglu god have mercy on your soul

soft bobcat Jan 11, 2024, 3:10 AM

#

I think all the transformer activation functions are pretty simple now

#

muP only tells you how to scale. it's mostly independent of your activation function

#

and what you need to figure out for the activation function is what the multiplier for that function should be. which is guesswork

#

at least, for transformers this is true. for a series of FC layers without normalization, it would be more difficult
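
The message above calls the per-activation multiplier guesswork. One common starting heuristic (an assumption here, not something proposed in the thread) is to pick the multiplier that restores a unit second moment when the input is standard normal, in the spirit of Kaiming-style gains. A small sketch using numerical integration:

```python
import math

def second_moment(f, lo=-8.0, hi=8.0, n=16000):
    """Trapezoid estimate of E[f(z)^2] for z ~ N(0, 1)."""
    h = (hi - lo) / n
    total = 0.0
    for k in range(n + 1):
        z = lo + k * h
        weight = 0.5 if k in (0, n) else 1.0  # trapezoid end-point weights
        total += weight * f(z) ** 2 * math.exp(-0.5 * z * z)
    return total * h / math.sqrt(2.0 * math.pi)

def gain(f):
    """Multiplier g such that E[(g * f(z))^2] = 1 (a heuristic, not a muP result)."""
    return 1.0 / math.sqrt(second_moment(f))

def relu(z):
    return max(z, 0.0)

def gelu(z):
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))

for name, f in [("identity", lambda z: z), ("relu", relu),
                ("gelu", gelu), ("tanh", math.tanh)]:
    print(f"{name:8s} gain ~ {gain(f):.3f}")
```

For gated activations the input distribution of the product is no longer Gaussian, so this would only be a first guess to tune from.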

icy sentinel Jan 12, 2024, 2:05 PM

#

i think i finally have an environment i approve of as a reproducible lab for a/b tests and all i had to do was build a machine, rework the dockerization of gpt-neox, and make a new branch off of v1.0 of it

icy sentinel Jan 12, 2024, 2:28 PM

#

icy sentinel i think i finally have an environment i approve of as a reproducible lab for a/b...

my standards for reproducibility are possibly unreasonably high

icy sentinel Jan 12, 2024, 3:04 PM

#

boreal moss @fallen spear I didn't read the whole thread but I see that you were usi...

this table is also here: https://arxiv.org/abs/2001.08361

arXiv.org

Scaling Laws for Neural Language Models

We study empirical scaling laws for language model performance on the cross-entropy loss. The loss scales as a power-law with model size, dataset size, and the amount of compute used for training, with some trends spanning more than seven orders of magnitude. Other architectural details such as network width or depth have minimal effects within ...

#

if anyone has straight tested this at higher sizes I have not seen it

#

now that i have an environment i approve of i think i will a/b test at ratio 4, 2, 1, with and without scaling the initialization

boreal moss Jan 12, 2024, 4:14 PM

#

this one shows a similar minimum and also proves my point: here ratio=0.5 is worse than ratio=4

icy sentinel Jan 12, 2024, 6:37 PM

#

boreal moss this one shows similar minimum and simultaneously proves my point, here ratio=0....

ratios below 1 are always bad, and this seems uncontroversial; if there is a clear demonstration that ratios above 4 are especially good I have not seen it

#

it is a little bit frustrating that there is relatively little good testing on this count but the information we do have does not seem to indicate strongly that the up projection has a very large effect; it has an effect, but it is not large, and allocating the same parameters to either depth or attention heads seems to outperform

dawn vine Jan 12, 2024, 6:47 PM

#

icy sentinel it is a *little bit* frustrating that there is relatively little good testing on...

have you seen the mamba architecture? single block type does both FFN and Linear Attention
ratio is 2... but they do the attention in the expanded dimensionality and use twice the layers (giving them twice the attention)
you could try it with 1x expansion instead and 4x layers and compare

icy sentinel Jan 12, 2024, 6:49 PM

#

dawn vine have you seen the mamba architecture? single block type does both FFN and Linear...

if i switch to looking at mamba arch i will have to tweak my environment until i am convinced it gives me a clean a/b test of a hypothesis with it and this will make me very sad

#

or at least be time consuming

#

it is a good idea though

#

i have "understand rwkv and mamba" on my todo somewhere

dawn vine Jan 12, 2024, 6:52 PM

#

what are you comparing currently? pythia?

icy sentinel Jan 12, 2024, 6:53 PM

#

yes

#

and by comparing i mostly mean "contributing code to gpt-neox"

dawn vine Jan 12, 2024, 7:19 PM

#

what size model? blink tends to chide me that I have to test on L32D2048 or it doesnt always hold up for larger models

#

i've 'discovered' amazing things that work great on smaller models like L12D768 but die at 400m tok trained on larger ones

#

this kind of scale effect may be especially relevant here for FFN sizing

#

its pretty annoying to train a model that large tho 😦
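
For a sense of the scale gap being described, here is a rough non-embedding parameter count. It assumes that "L12D768" and "L32D2048" mean 12 layers at d_model=768 and 32 layers at d_model=2048, that attention uses four square projections, and that the FFN has two matrices with no biases; the function name is hypothetical.

```python
def nonembed_params(n_layer: int, d_model: int, ffn_ratio: float = 4.0) -> int:
    """Approximate non-embedding parameter count of a plain transformer."""
    d_ff = int(ffn_ratio * d_model)
    attn = 4 * d_model * d_model   # Q, K, V and output projections
    ffn = 2 * d_model * d_ff       # up and down projections
    return n_layer * (attn + ffn)

for n_layer, d_model in [(12, 768), (32, 2048)]:
    for ratio in (4, 2, 1):
        n = nonembed_params(n_layer, d_model, ratio)
        print(f"L{n_layer}D{d_model} ratio={ratio}: {n / 1e6:.1f}M params")
```

At ratio 4 this reduces to the familiar 12 * n_layer * d_model^2 estimate, which puts the two sizes roughly a factor of 19 apart.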

icy sentinel Jan 12, 2024, 8:29 PM

#

yeah i am trying to make sure my setup has a 1:1 larger-scale analogue so it's easier to bump upwards

#

effectively: rerunning same analysis is a matter of swapping out like three params and running it on a real node

soft bobcat Jan 12, 2024, 9:11 PM

#

icy sentinel yeah i am trying to make sure my setup has a 1:1 larger-scale analogue so it's e...

time to read muP! https://arxiv.org/abs/2310.17813

#

(I'm still figuring out depth scaling though)

icy sentinel Jan 12, 2024, 9:13 PM

#

soft bobcat time to read muP! https://arxiv.org/abs/2310.17813

ngl i was just going to blindly jump pythia sizes

#

if you're figuring out mup perfectly gpt-neox needs its mupperizer fixed

icy sentinel Jan 12, 2024, 9:13 PM

#

soft bobcat time to read muP! https://arxiv.org/abs/2310.17813

oh, but also: if i scale from ratio of 4 to 2 or 4 to 1, what is the right correction to apply to the init

#

if i'm doing it nice and manually

soft bobcat Jan 12, 2024, 9:16 PM

#

icy sentinel oh, but also: if i scale from ratio of 4 to 2 or 4 to 1, what is the right corre...

this is the initialization and LR I'm using; it takes in i and o, the input and output dimensions

📎 mup_init.py
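
The attachment itself is not reproduced in this log. As a stand-in, here is a sketch of the per-layer scaling rule as I read it from the spectral-condition paper linked above (arXiv 2310.17813): init std proportional to (1/sqrt(i)) * min(1, sqrt(o/i)) and learning rate proportional to o/i. The function name, the base_lr argument, and the omitted constant factors are assumptions, so treat it as a guess at the shape of such a helper, not as the contents of mup_init.py.

```python
import math

def spectral_init(i: int, o: int, base_lr: float = 1.0):
    """Return (init_std, lr) for a weight matrix mapping i inputs to o outputs.

    Constants are dropped; only the dependence on i and o is modelled.
    """
    std = (1.0 / math.sqrt(i)) * min(1.0, math.sqrt(o / i))
    lr = base_lr * o / i
    return std, lr

# How the FFN matrices would be rescaled when the ratio goes 4 -> 2 -> 1.
d_model = 1024
for ratio in (4, 2, 1):
    d_ff = ratio * d_model
    up_std, up_lr = spectral_init(d_model, d_ff)
    down_std, down_lr = spectral_init(d_ff, d_model)
    print(f"ratio={ratio}: up std={up_std:.5f} lr x{up_lr:g} | "
          f"down std={down_std:.5f} lr x{down_lr:g}")
```

Under this rule the up projection's init std does not change with the ratio (its fan-in is d_model), while the down projection's init std and learning rate both shrink as d_ff grows.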

#

currently I'm waiting on my boss to see how much I'm allowed to talk about

#

gpt-neox is using the old mup repo so it will never be fixed unless someone PRs an implementation of the new spectral muP paper

#

the old mup repo is useless

winter grotto Jan 14, 2024, 6:33 AM

#

icy sentinel ratios below 1 are always bad, and this seems uncontroversial; if there is a cle...

deepseek MoE seems to do fine with them

#

i wonder why

wind wadi Jan 14, 2024, 9:49 AM

#

winter grotto deepseek MoE seems to do fine with them

Probably the MoE part, yeah? Though you're still limiting max rank of the model to the (smaller) expert size. Even if a single expert is low rank the sum of experts can still be high rank?

winter grotto Jan 14, 2024, 10:05 AM

#

wind wadi Probably the MoE part, yeah? Though you're still limiting max rank of the model ...

i... don't feel like it works that way? you're only summing the outputs of the ffn, not the hidden states

wind wadi Jan 14, 2024, 12:29 PM

#

The sum of two linear maps can have rank up to the sum of both maps' ranks (rank(A+B) <= rank(A) + rank(B)), and with K experts + top K sampling you're giving the model more chances (& incentive?) to separate signals from individual experts and produce a higher rank map per MoE layer. If the MoE layer projects two inputs from different time steps into two different subspaces, then the attention mechanism could restore information lost by the low rank projection(?). So I guess you're not specifically bounded by the smaller expert dimension? I'm probably wrong somewhere; I need to sleep.
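
The rank argument can be checked on toy matrices. The sketch below treats each "expert" as a purely linear rank-1 map, which is a simplifying assumption (real experts are nonlinear and routed per token); it shows that a sum of low-rank maps can exceed either rank, and also that this is an upper bound and not a guarantee.

```python
from fractions import Fraction

def rank(m):
    """Matrix rank by exact Gaussian elimination over the rationals."""
    rows = [[Fraction(v) for v in row] for row in m]
    r = 0
    for c in range(len(rows[0])):
        pivot = next((i for i in range(r, len(rows)) if rows[i][c] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            factor = rows[i][c] / rows[r][c]
            rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
        if r == len(rows):
            break
    return r

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def neg(a):
    return [[-x for x in row] for row in a]

A = [[1, 0, 0], [0, 0, 0], [0, 0, 0]]  # rank-1 "expert" writing to subspace 1
B = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]  # rank-1 "expert" writing to subspace 2

print("rank(A), rank(B):", rank(A), rank(B))
print("rank(A + B):     ", rank(add(A, B)))       # experts in different subspaces
print("rank(A + A):     ", rank(add(A, A)))       # same subspace, no gain
print("rank(A + (-A)):  ", rank(add(A, neg(A))))  # cancellation
```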

wind wadi Jan 14, 2024, 12:37 PM

#

winter grotto i... don't feel like it works that way? you're only summing the outputs of the f...

you're only summing the outputs of the ffn, not the hidden states
I think I need more explanation on this (bc I'm oblivious); I probably addressed the wrong issue

winter grotto Jan 14, 2024, 2:48 PM

#

you're probably correct i just don't know how linear algebra works

icy sentinel Jan 14, 2024, 3:40 PM

#

winter grotto you're probably correct i just don't know how linear algebra works

have you not done a crash course that makes it to matrix rank yet

icy sentinel Jan 14, 2024, 3:47 PM

#

winter grotto deepseek MoE seems to do fine with them

i had not read this previously

#

this is roughly the same trick as multihash routing just without the hash

#

you define your standard sized ffn as many smaller ones

#

it is equivalent

#

you do however get to do routing many times

#

in the limit you define experts of size one matrix row each and route separately for each

#

it is a good trick and i have been trying not to think about it bc there are too many things you can do with MoE

boreal moss Jan 14, 2024, 6:15 PM

#

I have to ask, what exactly are we searching for here now?

icy sentinel Jan 14, 2024, 6:59 PM

#

boreal moss I have to ask, what exactly we are searching for now here?

previous testing was all some combination of things which altered the d_ff/d_model ratio at equiparams a la geglu

#

will probably do that again

#

i personally still want to directly a/b test a simple ablation on the ratio

#

i just got lost for a prolonged period when trying to ensure i had a reproducible setup

#

my standard for 'reproducible' comes from maintaining build systems and is somewhat more stringent than is considered normal or sane in ml

#

or in maintaining build systems

boreal moss Jan 14, 2024, 7:04 PM

#

your point was to find the most parameter-efficient ratio, right? I think the problem is that you can't really decouple dff/dmodel from other things, so you can't really measure what you want to measure

icy sentinel Jan 14, 2024, 7:05 PM

#

i can literally just reduce the degree of up projection and it's a standalone change

#

but to stay at equiparams, yeah, you have to fiddle the entire ff setup, again similar to geglu, if you want to keep things "the same" while fiddling activations

boreal moss Jan 15, 2024, 12:17 AM

#

icy sentinel but to stay at equiparams, yeah, you have to fiddle the entire ff setup, again s...

There may be a useful trick for changing the dim of the model without changing the dim of the attn layer: compute the attn layer on only part of the tensor and let the rest pass straight through. This is used in some CNNs for other reasons and works just fine, so it should work on a transformer. Example:
for ratio=4 you use model_d=1024 ff_d=4096 and a normal attention layer
for ratio=1 you use model_d=2048 ff_d=2048 and put half of the tensor through the attention layer while the other half goes straight through.
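
A quick arithmetic check that the two configurations in this example land on the same per-layer parameter count. It assumes attention is four square projections over whatever slice it sees and that the FFN is two matrices without biases; the helper names are hypothetical.

```python
def attn_params(d_attn: int) -> int:
    """Q, K, V and output projections over a d_attn-wide slice."""
    return 4 * d_attn * d_attn

def ffn_params(d_model: int, d_ff: int) -> int:
    """Up and down projections of a plain (non-GLU) FFN."""
    return 2 * d_model * d_ff

# ratio=4: attention sees the full 1024-wide residual stream.
baseline = attn_params(1024) + ffn_params(1024, 4096)

# ratio=1: residual stream is 2048 wide, attention sees only half of it.
partial_attn = attn_params(2048 // 2) + ffn_params(2048, 2048)

print("ratio=4, full attention:", baseline)
print("ratio=1, half attention:", partial_attn)
```

So under these assumptions the trick changes the ratio while holding both the attention parameters and the FFN parameters fixed.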

icy sentinel Jan 15, 2024, 12:18 AM

#

boreal moss There may be a useful trick on how to change the dim of the model without changi...

oh that's neat

#

i think we only need that if we are moving params out of the ff into the d_model and we want to leave the attn alone

#

but noted

boreal moss Jan 21, 2024, 11:03 PM

#

so, if a 1x ratio is okish for a standard FFN, bilinear types with the same parameter count will have a 0.66x ratio. are they still better, or does it break at this point? has anyone tested this case?
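
The 0.66x figure comes from counting matrices: a plain FFN has two d_model x d_ff matrices, a GLU-style (bilinear) FFN has three, so matching parameters means scaling d_ff by 2/3. A small sketch, with hypothetical helper names and biases ignored:

```python
from fractions import Fraction

def ffn_params(d_model: int, d_ff: int) -> int:
    """Plain FFN: up and down projections."""
    return 2 * d_model * d_ff

def glu_ffn_params(d_model: int, d_ff: int) -> int:
    """GLU-style FFN: gate, value and down projections."""
    return 3 * d_model * d_ff

def glu_equiparam_ratio(ratio) -> Fraction:
    """d_ff/d_model for a GLU FFN with the same parameter count as a plain FFN."""
    return Fraction(2, 3) * ratio

for ratio in (4, 2, 1):
    print(f"plain ratio {ratio} -> GLU ratio {float(glu_equiparam_ratio(ratio)):.3f}")

# Concrete check at d_model=768: plain d_ff=768 matches GLU d_ff=512.
assert ffn_params(768, 768) == glu_ffn_params(768, 512)
```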

dawn vine Jan 22, 2024, 4:38 AM

#

since the theory here is that ffn_ratio is an inefficient use of parameters, do you guys have opinions on DeepSeekMoE? it adds together the results of a ton of different low ffn_ratio networks (selected as 'fine grained expert segments') to create a high ffn_ratio network akin to a traditional high ffn_ratio expert/FFN

#

the idea being that a single FFN w/ ffn_ratio=4 is the same as the sum of 8 smaller FFNs w/ ffn_ratio=0.5
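
The equivalence claimed here can be verified directly for a plain FFN with an elementwise activation: slicing the hidden units into 8 groups and summing the 8 outputs reproduces the full FFN. The sketch below uses tiny integer weights and ReLU so the comparison is exact; the sizes and weight values are made up for illustration, biases are omitted, and every expert is active with weight 1 (no routing).

```python
def relu(v):
    return [max(x, 0) for x in v]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def ffn(w_up, w_down, x):
    """w_up has d_ff rows of length d_model; w_down has d_model rows of length d_ff."""
    return matvec(w_down, relu(matvec(w_up, x)))

d_model, d_ff, n_experts = 2, 8, 8  # full ratio 4, each expert ratio 0.5
w_up = [[(3 * i + 5 * j) % 7 - 3 for j in range(d_model)] for i in range(d_ff)]
w_down = [[(2 * i + 3 * j) % 5 - 2 for j in range(d_ff)] for i in range(d_model)]
x = [1, -2]

full = ffn(w_up, w_down, x)

chunk = d_ff // n_experts
parts = []
for e in range(n_experts):
    cols = slice(e * chunk, (e + 1) * chunk)
    up_e = w_up[cols]                        # this expert's hidden units
    down_e = [row[cols] for row in w_down]   # matching columns of the down projection
    parts.append(ffn(up_e, down_e, x))

summed = [sum(col) for col in zip(*parts)]
assert summed == full
print("full FFN:", full, "| sum of experts:", summed)
```

Once a router selects only the top-k experts or applies non-unit gate weights, the sum is no longer the same function as the dense FFN, which is where the two designs start to differ.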

dawn vine Jan 22, 2024, 5:03 AM

#

also, @fallen spear have you considered increasing d_model but reducing both FFN and attention dimensions as a potentially maximally effective use of parameters?

fallen spear Jan 22, 2024, 5:04 AM

#

dawn vine also, @fallen spear have you considered increasing d_model but reducing ...

i think i would want to trade off d_model and attention separately

#

attn is its own can of worms

dawn vine Jan 22, 2024, 5:04 AM

#

yeah but maybe a huge embedding size is whats really important

fallen spear Jan 22, 2024, 5:04 AM

#

it is definitely one of the main important things

dawn vine Jan 22, 2024, 5:05 AM

#

i'm doing a lot of work with MoE at the moment so I'm thinking about this stuff a lot since its inherently very FFN centric usually

fallen spear Jan 22, 2024, 5:05 AM

#

that is part of what led me to being this deranged about FFNs, yeah

dawn vine Jan 22, 2024, 5:06 AM

#

do u have opinions about what I was saying about deepseekMoE above?

fallen spear Jan 22, 2024, 5:06 AM

#

i have a pre-existing obsession with "hash routing", which should probably not be called that, and one of the things they did in that paper is in the same vein as deepseekmoe

#

@dawn vine #off-topic message previous rant here

#

i had another one where i was speculating about routing with an lsh, as i sometimes do and should really get around to testing

#

ah, it was in https://discord.com/channels/729741769192767510/1138564813996437705

dawn vine Jan 22, 2024, 5:12 AM

#

fallen spear ah, it was in https://discord.com/channels/729741769192767510/113856481399643770...

that second one is something @boreal moss asked me to try in my MoE experiments just the other day 🙂 i havent yet tho

fallen spear Jan 22, 2024, 5:13 AM

#

dawn vine that second one is something @boreal moss asked me to try in my MoE ex...

since apparently at least three people have had this idea if you do not do it now you will get scooped

dawn vine Jan 22, 2024, 5:14 AM

#

im fine with getting scooped, but i will also try it 😉

fallen spear Jan 22, 2024, 5:14 AM

#

it seems like that should work, and also splitting into successively smaller and smaller merge-able experts a la deepseek/multihash should work

#

and they are both free

#

ie, they do not make the model more expensive in any meaningful way

dawn vine Jan 22, 2024, 5:14 AM

#

theres literally no reason it shouldnt work as well as layers, since at worst its just the same

fallen spear Jan 22, 2024, 5:15 AM

#

routing fails to train

#

for both

#

assuming routing is well-behaved, both approaches are strictly better than layers ordinarily are. maybe cache behavior is worse though

#

you cannot anticipate which layer is next

#

but assuming routing does not degenerate and cache behavior does not kill you they both seem like sure things

dawn vine Jan 22, 2024, 5:16 AM

#

this is all for a bolt-on to RWKV and I have the interesting problem that token-shift is not convenient for MoE, so for now I'm using untokenshifted additional experts and considering the base rwkv chanmix as a kind of 'shared expert'

fallen spear Jan 22, 2024, 5:17 AM

#

i know what some of those words are

dawn vine Jan 22, 2024, 5:19 AM

#

I will rephrase: this is all for a bolt-on to RWKV, which uses a non-standard FFN, so I'm just keeping the non-standard FFN as is and adding (literally adding the results) of extra experts
