Sparse Coding | EleutherAI | Page 5

shell mural Aug 14, 2023, 1:25 PM

#

I dont think similarity between bases means anything if they're for different layers. Different layer different distribution

#

Yea this seems reasonable to me

bitter turtle Aug 14, 2023, 1:27 PM

#

I thought you were looking for similarity between the bases for different layers; I think I'm very confused at a high level here about

what you are trying to do
how you are trying to do it

shell mural Aug 14, 2023, 1:30 PM

#

My bad for poor communication lol

bitter turtle Aug 14, 2023, 1:30 PM

#

no problem

shell mural Aug 14, 2023, 1:33 PM

#

At the highest level I'm aiming for much stronger circuit analysis

#

And strong analysis of complex circuits

#

What this might look like is you fix some dataset/distribution of data, or maybe some benchmark/eval

#

We can use various interpretability techniques to narrow down what attention heads and MLPs are involved, make hypotheses for what functional purpose they serve, maybe do some causal scrubbing

#

And then great we have a giant computational graph that represents the circuit and appears to be the main stuff that matters for the behavior of interest

#

We can probably even use dictionary learning to make the graph and circuit even better

#

Logan did it before

#

But a missing component I'm seeing in interpretability is the ability to analyze (and utilize!) the edges of that graph. Like we know that sure the output of this plugs into the input of that, but what is that input/output

#

Well, skipping over more refined stuff like attention head outputs, we can probe into the residual stream at any point and use dictionary learning to decompose it into sparse combinations of features that sometimes seem to have actual meaning to them

#

What I see in the channel currently is that we can actually understand the residual stream pretty damn well. This seems pretty huge to me

bitter turtle Aug 14, 2023, 1:48 PM

#

Ok, what are those same questions but within the scope of sparse coding specifically?

shell mural Aug 14, 2023, 1:50 PM

#

I want to use sparse coding on the residual stream before and after a transformer block, to better understand what the transformer block is doing

#

I would rather work in
A -> Attn -> A -> MLP -> A
Rather than
B -> Attn -> C -> MLP -> D
Where A,B,C,D are dictionaries for the residual stream.
A is trained on data from different spots
B,C,D are localized to that spot

bitter turtle Aug 14, 2023, 1:57 PM

#

I guess my confusion is that doesn't that implicitly make the same assumptions as this?

shell mural Aug 14, 2023, 1:59 PM

#

Hmm. Maybe? 😅

#

I'm boarding a flight now, I'll have time to think about this more

#

But I think you're right lol. I feel pretty dumb now

shell mural Aug 14, 2023, 2:24 PM

#

Nvm back in my camp. no i don't think it implicitly makes those assumptions, though we may end up having less interpretable individual features. Hmmm

bitter turtle Aug 14, 2023, 2:45 PM

#

shell mural Nvm back in my camp. no i don't think it implicitly makes those assumptions, tho...

Excited to hear your conclusions on this, I'm doing inter-layer stuff this week

pallid current Aug 14, 2023, 6:23 PM

#

question: do you think that for all the autointerp stuff we should run it on a different chunk of data to the ones that the models/decompositions have been trained on?

#

at the moment we use chunk 0 to train all of the decompositions and then also for all of the interp stuff

hallow wyvern Aug 14, 2023, 6:28 PM

#

BTW with very minimal modification to your code, I was able to train an autoencoder on roneneldan/tinystories-33m, and it finds meaningful features. Behold, the mojibake feature:

#

I'll clean up the changes tonight and send in a PR to support that model and to describe the process in the readme, I don't think it'll conflict with any of the stuff you've done recently

Edit: OK, there was more cleanup than I thought. Making PR tomorrow.

pallid current Aug 14, 2023, 6:31 PM

#

have also asked lee, looks like he's online

pallid current Aug 14, 2023, 7:16 PM

#

hallow wyvern BTW with very minimal modification to your code, I was able to train an autoenco...

ah very nice!

#

wanted to check whether we tend to find directions which match up with either embed or unembed matrices, but if i've not made any mistakes then seems like there's pretty much none of that, which is honestly very surprising to me

#

#

layer patterns seem right

#

i thought r32 layer 0, which is the case where it benefits from the large ratio the most, might be really latching onto tokens
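A minimal sketch of the embed/unembed check described above: for each dictionary feature, take the largest cosine similarity to any row of a reference matrix (for example the unembedding). The function names and the toy vectors are illustrative assumptions, not the project's code, and it is written in plain Python so it runs without dependencies.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def max_cosine_per_feature(dictionary, reference):
    """For each dictionary feature, the largest cosine similarity to any reference row."""
    return [max(cosine(f, r) for r in reference) for f in dictionary]


# Toy data (hypothetical): two learned features and two "unembedding" rows.
dictionary = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
unembed_rows = [[2.0, 0.0, 0.0], [0.0, 0.0, 3.0]]
scores = max_cosine_per_feature(dictionary, unembed_rows)
```

A histogram of `scores` compared against the same statistic for random directions would show whether features latch onto token directions more than chance.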

keen pivot Aug 14, 2023, 7:55 PM

#

pallid current

Oh cool! It is quite above random, though I really do expect layer 5 to have way higher cs than layer 4 in the unembed. Maybe something going on w/ the mean, since the last layer learns less features than the other layers (ie has more dead features).

#

I do expect unembedding dims to be a linear combination of dictionary features (in the last layer), and to privilege more common tokens than less common.

keen pivot Aug 14, 2023, 8:00 PM

#

hallow wyvern BTW with very minimal modification to your code, I was able to train an autoenco...

I'd expect ablating the direction to cause the next few tokens to be predicted worse.

#

Also, there's minimal code to be able to search for features that activate in the last token position of some custom text in my notebook if you've seen it.

bitter turtle Aug 14, 2023, 8:41 PM

#

pallid current at the moment we use chunk 0 to train all of the decompositions and then also fo...

Yes, probably. I wrote a little util for activation generation today, I'll add a feature for this and do a PR.

pallid current Aug 14, 2023, 9:17 PM

#

bitter turtle Yes, probably. I wrote a little util for activation generation today, I'll add a...

wait whats the diff btwn that and the activation generation funcs that already exist, either for train and interpret?

#

i should be able to do this today but i feel pretty awful atm so if you did thatd be great

pallid current Aug 14, 2023, 9:29 PM

#

keen pivot Oh cool! It is quite above random, though I really do expect layer 5 to have way...

yeah i wouldnt trust the results up at 16 - 32x, some of them aren't converged for sure

pallid current Aug 14, 2023, 9:41 PM

#

pallid current i should be able to do this today but i feel pretty awful atm so if you did that...

ok am doing this now, saving what would be the 101st chunk of the pile as a test set for this stuff

bitter turtle Aug 14, 2023, 9:46 PM

#

pallid current wait whats the diff btwn that and the activation generation funcs that already e...

No this is just a command line util for doing that + also generating multiple layers activations at once

bitter turtle Aug 14, 2023, 9:46 PM

#

pallid current ok am doing this now, saving what would be the 101th chunk of the pile as a test...

Oh lol

pallid current Aug 14, 2023, 9:47 PM

#

bitter turtle No this is just a command line util for doing that + also generating multiple la...

look at make_one_chunk_per_layer in standard metrics

#

no parallelization tbf so it could be way faster

keen pivot Aug 14, 2023, 10:02 PM

#

@pallid current

pallid current Aug 14, 2023, 10:13 PM

#

what's the actual task that this data comes from?

pallid current Aug 14, 2023, 11:02 PM

#

bitter turtle Oh lol

wait sorry id been totally stupid, the interp stuff uses openwebtext by default so it's already off the training distribution

bitter turtle Aug 14, 2023, 11:18 PM

#

pallid current what's the actual task that this data comes from?

Pronoun prediction, dumb hacked together task, need a big dataset to do a solid test for sure

pallid current Aug 14, 2023, 11:48 PM

#

early stages in doing the big auto interp results but early returns are pretty based

#

#

lot of recent anxiety about ica quelled by this image lol

pallid current Aug 15, 2023, 1:55 AM

#

general picture emerging is that winrate of sparse coding vs all baselines on top or top-random is very high, but a lot lower on random-only, where ica-top-k is quite competitive and sometimes our dicts just do quite poorly

rancid summit Aug 15, 2023, 5:56 AM

#

what's the rough normalized MSE / sparsity you're getting in gpt2sm?

pallid current Aug 15, 2023, 5:59 AM

#

hey, these are the graphs we're getting in the residual stream for pythia70M, we dont have gpt2sm results to hand i dont think

rancid summit Aug 15, 2023, 6:19 AM

#

hmm how are you calculating unexplained variance? (or is that the same thing as normalized MSE)?

#

(normalized MSE: MSE / ((target - target.mean())**2).mean())

pallid current Aug 15, 2023, 6:20 AM

#

rancid summit hmm how are you calculating unexplained variance? (or is that the same thing as ...

yep that's it,
residuals = (batch - x_hat).pow(2).mean()
total = (batch - batch.mean(dim=0)).pow(2).mean()
return residuals / total
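The same calculation as a self-contained function, as a sketch: it mirrors the snippet above (per-dimension mean over the batch, mean squared residual divided by mean squared deviation), but in plain Python on nested lists so it runs without torch. The function name is an assumption.

```python
def fraction_variance_unexplained(batch, x_hat):
    """Mean squared reconstruction error divided by the variance of the batch.

    batch and x_hat are lists of rows (one row per sample), all the same width.
    The mean is taken per dimension across the batch, as in batch.mean(dim=0).
    """
    n, d = len(batch), len(batch[0])
    col_mean = [sum(row[j] for row in batch) / n for j in range(d)]
    residuals = sum(
        (x - y) ** 2 for row, rec in zip(batch, x_hat) for x, y in zip(row, rec)
    ) / (n * d)
    total = sum(
        (row[j] - col_mean[j]) ** 2 for row in batch for j in range(d)
    ) / (n * d)
    return residuals / total


batch = [[0.0, 0.0], [2.0, 4.0]]
perfect = fraction_variance_unexplained(batch, batch)
mean_only = fraction_variance_unexplained(batch, [[1.0, 2.0], [1.0, 2.0]])
```

A perfect reconstruction gives 0, and always predicting the batch mean gives 1, which is a useful anchor when judging whether a value like 1e-3 is low.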

rancid summit Aug 15, 2023, 6:20 AM

#

also the points near 0 are a bit hard to tell, is there a log plot?

rancid summit Aug 15, 2023, 6:20 AM

#

pallid current yep that's it, ``` residuals = (batch - x_hat).pow(2).mean() total = (bat...

cool

#

oh also how big is the latent here

pallid current Aug 15, 2023, 6:21 AM

#

rancid summit also the points near 0 are a bit hard to tell, is there a log plot?

dont have a log plot to hand, can make one tomorrow easily if you're interested. latent dim is 512

rancid summit Aug 15, 2023, 6:22 AM

#

no worries no need to go out of your way

pallid current Aug 15, 2023, 6:23 AM

#

we've generally focused on regime of 100 or fewer active features, though the autointerp metrics are surprisingly not that strongly correlated with l1 coef

rancid summit Aug 15, 2023, 6:23 AM

#

yeah just trying to get a sense of what reasonable normalized MSE scores are

#

because the anthropic toy model results were like 1e-3 and I was like "woah that's pretty low"

pallid current Aug 15, 2023, 6:24 AM

#

rancid summit because the anthropic toy model results were like 1e-3 and I was like "woah that...

is that public?

rancid summit Aug 15, 2023, 6:25 AM

#

yeah this was in one of the recent updates

#

the plots with the bounce

pallid current Aug 15, 2023, 6:28 AM

#

i think we've kind of stopped looking for an exact right like dict size or l1 value like they saw with that bounce, like maybe there is some way but with the llms/autoencoders we're using it seems to always be pretty smooth

rancid summit Aug 15, 2023, 6:28 AM

#

I see

#

another Q: any intuition on where on the sparsity/reconstruction tradeoff you want to be?

#

(context: I'm currently doing something like the kurtosis based autoencoder thing I described last time, and sparsity is controlled with a very weird set of hparams (as opposed to just tuning L1), so I haven't been paying it much attention. but maybe I should)

pallid current Aug 15, 2023, 6:31 AM

#

the way i was hoping to answer it was to use the autointerp scores, using random scoring, to quantify the overall % of the variance of the layer which is captured by the explanation

rancid summit Aug 15, 2023, 6:31 AM

#

I see

pallid current Aug 15, 2023, 6:32 AM

#

basically weighting interp scores by feature variance. issue is that if there's signal in the interp scores of different dict sizes and l1 values (within a reasonable range) then it's pretty slight
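One way to read "weighting interp scores by feature variance" is a variance-weighted average of the per-feature scores. The sketch below assumes exactly that; the normalization and the names are guesses for illustration, not the metric actually used.

```python
def variance(xs):
    """Population variance of a list of numbers."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


def variance_weighted_score(activations_per_feature, interp_scores):
    """Average the interp scores, weighting each feature by its activation variance."""
    weights = [variance(acts) for acts in activations_per_feature]
    return sum(w * s for w, s in zip(weights, interp_scores)) / sum(weights)


# Toy data (hypothetical): feature 1 has 3x the variance of feature 0.
activations = [[0.0, 0.0, 2.0, 2.0], [0.0, 0.0, 0.0, 4.0]]
interp_scores = [0.8, 0.4]
overall = variance_weighted_score(activations, interp_scores)
```

With these toy numbers the high-variance, poorly explained feature drags the overall score down to 0.5, below the unweighted mean of 0.6.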

#

also the performance of these dicts is strongest on top or top-random scoring, the random performance is a little disappointing

rancid summit Aug 15, 2023, 6:34 AM

#

maybe worth trying random-among-activating?

#

like random but throw out anything that the relu clamps to 0

pallid current Aug 15, 2023, 6:35 AM

#

oh interesting, yeah we're screening sentences for nonzero variance but i guess you mean not including zeros in the correlation calculation?
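A sketch of the "random-among-activating" idea as discussed here: compute the correlation between true and simulated activations only over tokens where the true activation is above zero. The helper names are assumptions, and the real autointerp scoring may differ in detail.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; NaN if either input has zero variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / math.sqrt(vx * vy)


def correlation_among_activating(true_acts, simulated_acts):
    """Correlation restricted to tokens where the feature actually fired (true act > 0)."""
    pairs = [(t, s) for t, s in zip(true_acts, simulated_acts) if t > 0]
    if len(pairs) < 2:
        return float("nan")
    ts, ss = zip(*pairs)
    return pearson(ts, ss)


# Toy data (hypothetical): the simulator is right where the feature fires, noisy elsewhere.
true_acts = [0.0, 0.0, 0.0, 1.0, 2.0, 3.0]
simulated = [5.0, 0.0, 9.0, 1.0, 2.0, 3.0]
full_corr = pearson(true_acts, simulated)
active_corr = correlation_among_activating(true_acts, simulated)
```

The two numbers answer different questions: the restricted score ignores false positives on inactive tokens, so it is probably best reported alongside the full correlation rather than instead of it.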

#

might also be hurt by having to use 3.5 as the simulator, when i've looked at the simulations they're quite painfully dumb often

rancid summit Aug 15, 2023, 6:36 AM

#

4 is better but also pretty dumb

#

tbh I don't actually know what would be a really good metric

#

also I'm surprised your dictionary is so small

#

and the sparsity is not that high as a fraction of the dictionary

#

does larger dictionary+more sparse help?

#

in my experiments I've been looking at really big really sparse autoencoders (like 100 active/10k dictionary)

pallid current Aug 15, 2023, 6:42 AM

#

yeah i agree it's a bit surprising how well the small dicts work, i think there's some reason that sparse coding seems to work that we dont fully understand/doesn't match our intuitions - though we do go up to 32x ratio = 16k feats, or 72k feats for MLP (though those were probably a mistake 😅). i think it might be that the larger ones take longer to converge and also that they struggle to learn closely related features

rancid summit Aug 15, 2023, 6:43 AM

#

interesting

#

wdym by struggling to learn closely related features?

pallid current Aug 15, 2023, 6:43 AM

#

you start to get lots of features dying by that stage. we talked about trying reinitialization methods where if a feature hardly ever activates you reinitialize it, randomly or with some residual vector or something but we haven't put time into it
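A rough sketch of the random variant of that reinitialization idea, which the message says was discussed but not implemented. The frequency threshold, the function name, and the choice of a random unit vector are all assumptions for illustration.

```python
import math
import random


def reinit_dead_features(decoder, activation_counts, n_samples,
                         min_frequency=1e-4, rng=None):
    """Replace rarely-active dictionary rows with fresh random unit vectors.

    decoder: list of feature vectors (modified in place).
    activation_counts[i]: number of samples on which feature i was nonzero.
    Returns the indices that were reinitialized.
    """
    rng = rng or random.Random(0)
    dim = len(decoder[0])
    reinitialized = []
    for i, count in enumerate(activation_counts):
        if count / n_samples < min_frequency:
            v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
            norm = math.sqrt(sum(x * x for x in v))
            decoder[i] = [x / norm for x in v]
            reinitialized.append(i)
    return reinitialized


# Toy data (hypothetical): features 1 and 2 almost never fire over 100k samples.
decoder = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
dead = reinit_dead_features(decoder, [500, 0, 3], n_samples=100_000)
```

Reinitializing with a poorly reconstructed residual vector, the other option mentioned, would swap the random draw for a sample of `batch - x_hat`.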

rancid summit Aug 15, 2023, 6:43 AM

#

ah I see

pallid current Aug 15, 2023, 6:44 AM

#

rancid summit wdym by struggling to learn closely related features?

just an unsubstantiated intuition that, because of the l1 penalty, when the features are closely packed it's harder for a dict element to find a new feature rather than the degenerate 'just turn off' solution

rancid summit Aug 15, 2023, 6:45 AM

#

got it

pallid current Aug 15, 2023, 6:46 AM

#

hopefully will publish current results to get some interest and then try to dig into really understanding what's going on a bit more

rancid summit Aug 15, 2023, 6:46 AM

#

makes sense

#

excited for the paper!

pallid current Aug 15, 2023, 6:46 AM

#

thanks! would love to get your feedback on a draft in a week or two if you've got time

rancid summit Aug 15, 2023, 6:47 AM

#

yeah will def take a look

#

are you planning to run any gpt2sm results?

pallid current Aug 15, 2023, 6:54 AM

#

yeah the code is set up to be able to run it, might even have some dicts floating around already. will def run a proper set before we publish. any metrics you'd be particularly interested in?

#

i dont think we'll have the credits to do that much autointerp on it (unless we got a load more haha 👀), would def be interesting to be able to directly compare with your original results but would need gpt-4 logprobs for that

pallid current Aug 15, 2023, 6:55 AM

#

rancid summit in my experiments I've been looking at really big really sparse autoencoders (li...

oh btw what's the reason you chose 10k feats/100 sparsity?

rancid summit Aug 15, 2023, 6:55 AM

#

pallid current oh btw what's the reason you chose 10k feats/100 sparsity?

oh well I'm trying many different feature counts

#

and 100 active is not something I can easily control directly

#

there are like 4 knobs that might in theory affect the number active? but I've never actually tried to make active go up or down

pallid current Aug 15, 2023, 6:56 AM

#

rancid summit and 100 active is not something I can easily control directly

yeah fair, but is there some signal about what number of feats might be appropriate? (meaning the 10k not the 100)

rancid summit Aug 15, 2023, 6:56 AM

#

hmm

#

I've just been making the dictionary bigger and bigger lol

#

my intuition is that there must surely be lots of features that don't activate 99% of the time

#

as opposed to only somewhat more features than model dim that don't activate 70% of the time

#

unfortunately making the dictionary bigger doesn't always improve loss in my setup unless you tune some knobs

pallid current Aug 15, 2023, 6:59 AM

#

yeah i agree there must in some sense be an incredibly long tail of feats, though i think that in order for the model to be able to actually work with that amount of superposition it must translate dataset features into a latent space which allows a lot of compression

#

and in that case it's unclear to me what number of feats really means

rancid summit Aug 15, 2023, 7:00 AM

#

why would that be necessary?

#

you can still pick out each feature if you want to use it

#

most things just won't operate on most features but that's fine

pallid current Aug 15, 2023, 7:01 AM

#

rancid summit you can still pick out each feature if you want to use it

yeah treating each feature on its own you can isolate arbitrarily many separate feats

#

but we can tell from the non-sparsity of the neuron basis that it's not actually picking out just tiny parts of the residual stream to act on

#

so there must be some degree of similarity between how it treats similar parts of the residual stream which increases as features get closer

rancid summit Aug 15, 2023, 7:03 AM

#

pallid current but we can tell from the non-sparsity of the neuron basis that it's not actually...

hmm not sure this implies that

pallid current Aug 15, 2023, 7:03 AM

#

i think you should be able to get some kind of interesting bound on superposition by making sure that the noise doesn't grow exponentially as you do sequential computation on the features but i've not yet had the chance to really try and flesh it out

rancid summit Aug 15, 2023, 7:04 AM

#

maybe the MLP uses multiple neurons to add lots of bends to a function that operates on one feature

#

maybe the MLP is actually doing some crazy non neuron basis aligned computation

#

I don't see reason to expect the MLP basis to be anything too sane

pallid current Aug 15, 2023, 7:05 AM

#

rancid summit maybe the MLP is actually doing some crazy non neuron basis aligned computation

hmmm i think thats kinda possible but high levels of superposition make it incredibly difficult for it to be robust, at least as i understand it

#

like the position of the nonlinearity will change a lot depending on which other features are active which makes it super hard to build a complex nonlinear func

#

i ran some mlp tests recently to try and gauge to what extent you still get a robustly nonlinear response curve when features were distributed across multiple neurons and it seems that it gets pretty linear after you start to spread your mlp features over just a handful of neurons

#

but then clearly we do seem to see feats across neurons, from sparse probing and sparse coding etc

#

so im pretty confused basically

bitter turtle Aug 15, 2023, 7:10 AM

#

pallid current i ran some mlp tests recently to try and gauge to what extent you still get a ro...

What were these tests?

pallid current Aug 15, 2023, 7:13 AM

#

bitter turtle What were these tests?

generating a synthetic dataset as we would for toy models but constraining each feature to be across no more than n dimensions, and then for each feature, reading it directly with a linear layer + GELU and seeing whether you still got a real nonlinearity

#

so with n=1 you just see standard GELU

#

but as you increase n you start to see a pretty flat response curve

#

this is, for 200 dim space, n=1, 200 feats; n=3 500 feats; n=10 1000 feats

#

bump at 2 is an artefact of how i do sparsity
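A sketch of the test as described in these messages, not the code from the gist shared later: each synthetic feature is a unit vector supported on n dimensions, inputs are activation-weighted sums of features, and a feature is read out with its own direction followed by GELU. The function names, the Gaussian coefficients, and using exactly n dimensions per feature are assumptions.

```python
import math
import random


def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def make_features(n_feats, dim, n_support, rng):
    """Random unit-norm feature directions, each nonzero on n_support dimensions."""
    feats = []
    for _ in range(n_feats):
        v = [0.0] * dim
        for j in rng.sample(range(dim), n_support):
            v[j] = rng.gauss(0.0, 1.0)
        norm = math.sqrt(sum(x * x for x in v))
        feats.append([x / norm for x in v])
    return feats


def readout(feats, acts, j):
    """GELU(f_j . x), where x = sum_i acts[i] * feats[i] is the superposed input."""
    dim = len(feats[0])
    x = [sum(a * f[k] for a, f in zip(acts, feats)) for k in range(dim)]
    return gelu(sum(fj * xk for fj, xk in zip(feats[j], x)))


# n=1 with disjoint supports: the readout sees only its own feature.
one_hot = [[1.0, 0.0], [0.0, 1.0]]
clean = readout(one_hot, [0.5, -2.0], 0)

# Overlapping supports: the other feature shifts where the nonlinearity sits.
s = math.sqrt(0.5)
overlapping = [[1.0, 0.0], [s, s]]
shifted = readout(overlapping, [0.5, 1.0], 0)
```

Averaging `readout` over many sparse random `acts` at each fixed value of `acts[j]` would trace out the response curve being discussed.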

bitter turtle Aug 15, 2023, 7:18 AM

#

Ok, what's the 'linear layer' in the 'linear layer + GELU' here

pallid current Aug 15, 2023, 7:19 AM

#

bitter turtle Ok, what's the 'linear layer' in the 'linear layer + GELU' here

linear layer is the transpose of the feature generation matrix, so each entry in the linear layer is the coefficients of one of the features

bitter turtle Aug 15, 2023, 7:19 AM

#

Oh ok

pallid current Aug 15, 2023, 7:22 AM

#

hope that makes sense lol, can explain in the morn, am off to bed now

bitter turtle Aug 15, 2023, 7:22 AM

#

pallid current hope that makes sense lol, can explain in the morn, am off to bed now

Not really but cool

#

see you tomorrow/at meeting

pallid current Aug 15, 2023, 7:24 AM

#

feel like ive never managed to explain it well which might mean im chatting shit 😅 , see ya

bitter turtle Aug 15, 2023, 7:24 AM

#

~~I'm just quite confused as to why you think this shows anything about MLP structure ig~~ I misunderstood maybe, is the structure superposed features -> GELU -> linear filter, or superposed features -> linear filter -> GELU?

pallid current Aug 15, 2023, 8:46 PM

#

bitter turtle ~~I'm just quite confused as to why you think this shows anything about MLP stru...

maybe the latter? not sure what you mean by linear filter

#

its v short, will upload as a gist

#

https://gist.github.com/HoagyC/79bab2eea8a1f572e7dc5ed4e0428556

Testing when superposed features are still able to elicit a non-linear response from a neuron-wise nonlinearity. - linearity_test.py

bitter turtle Aug 15, 2023, 9:07 PM

#

cool will read later

bitter turtle Aug 15, 2023, 9:14 PM

#

pallid current https://gist.github.com/HoagyC/79bab2eea8a1f572e7dc5ed4e0428556

I guess I would be convinced if it was a learned encoding not a random one.

pallid current Aug 15, 2023, 9:25 PM

#

yeah thats definitely a big weakness but its also kinda generous in terms of like, only distributed across a few feats, not crazy ratios or anything. would be interesting to compare neat geometric patterns or learned feats

#

but i do think its useful as a counterweight to just like, johnson-lindenstrauss therefore exponentially many feats in a layer, like you cant just approach it naively and retain a meaningful nonlinearity, (unless ive misunderstood somehow)

#

should probably just throw sparse learned feats or something similar into it and see what comes out

pallid current Aug 16, 2023, 3:19 AM

#

ok its very annoying that we didnt realise this earlier but there's a huge difference between tied and untied dictionaries in the MLP

#

at least by n feats active

#

e.g. compare number / % active for layer 2, second one is untied

keen pivot Aug 16, 2023, 3:48 AM

#

pallid current e.g. compare number / % active for layer 2, second one is untied

This is middle-MLP, right?

pallid current Aug 16, 2023, 6:04 AM

#

keen pivot This is middle-MLP, right?

yep

bitter turtle Aug 16, 2023, 9:29 AM

#

Ah

pallid current Aug 16, 2023, 7:17 PM

#

mmcs for untied on MLP is pretty terrible which somewhat explains the bad results

#

doesnt explain why we're getting those bad mmcs scores though! what's changed??

#

@keen pivot have you been doing any MLP training runs recently on the old code? wanna compare hparams

keen pivot Aug 16, 2023, 7:46 PM

#

pallid current @keen pivot have you been doing any MLP training runs recently on the ...

I still have the old code up to compare. Which ones?
LR = 1e-3

pallid current Aug 16, 2023, 7:47 PM

#

what does mmcs look like over epochs?

#

hmm yeah i can try with 1e-3

keen pivot Aug 16, 2023, 7:47 PM

#

I've only got MLP_out.

#

But I got awful MCS hist doing 3e-4, and just got much better doing 1e-3

pallid current Aug 16, 2023, 7:48 PM

#

run just now?

keen pivot Aug 16, 2023, 7:48 PM

#

Yep

#

20 chunks

#

10 chunks

keen pivot Aug 16, 2023, 7:50 PM

#

keen pivot 10 chunks

The above is 1e-3. Doing 3e-4 for 10 chunks, it produced one like this (peaks around 0.4), but maybe 30 above 0.9

bitter turtle Aug 16, 2023, 7:59 PM

#

hhhhhhhhhhhhhhuh

#

that's incredibly weird and I don't understand that at all.

#

So I switched to full-rank ablations on a whim and we still beat LEACE at earlier layers in the model? Need to reconfigure feature selection procedure to count and compare with different database sizes, but it looks like we are finding a single direction to do full-rank ablations on better than LEACE

pallid current Aug 16, 2023, 8:04 PM

#

bitter turtle So I switched to full-rank ablations on a whim and we still beat LEACE at earlie...

have you looked at some basic interp on the feature you're ablating just to get a sanity check?

bitter turtle Aug 16, 2023, 8:05 PM

#

yeah, logan did, 722 (main one) activates on female pronouns or something, which is slightly confusing? going to check if it's gaming the metric somehow

#

perfectly plausible that it doesn't make much sense to do this kind of thing on a model with such low baseline prediction accuracy

pallid current Aug 16, 2023, 8:06 PM

#

bitter turtle yeah, logan did, 722 (main one) activates on female pronouns or something, which...

ok well yeah that sounds about right

keen pivot Aug 16, 2023, 8:07 PM

#

bitter turtle yeah, logan did, 722 (main one) activates on female pronouns or something, which...

I do think female pronouns sounds more relevant than "capital letter Q" feature (just to give an example)

bitter turtle Aug 16, 2023, 8:07 PM

#

oh, for sure

keen pivot Aug 16, 2023, 8:07 PM

#

Not too difficult to just grab a middle layer of Pythia 1.4b to train a dict real quick & re-do results, if you want?

bitter turtle Aug 16, 2023, 8:08 PM

#

yeah will do soonish

bitter turtle Aug 16, 2023, 8:10 PM

#

keen pivot Not too difficult to just grab a middle layer of Pythia 1.4b to train a dict rea...

what l1 values/dict sizes worked well for that?

keen pivot Aug 16, 2023, 8:13 PM

#

bitter turtle what l1 values/dict sizes worked well for that?

I'd expect 0.001 & 4x-8x. Would need to check my run from back then.

bitter turtle Aug 16, 2023, 8:14 PM

#

👍

keen pivot Aug 16, 2023, 8:15 PM

#

@pallid current , Is there an easy way for me to load in the dataset of the first chunk (specifically what PCA/ICA were trained on)? I don't want just the layer activations, but the original text/tokens

pallid current Aug 16, 2023, 8:20 PM

#

hmmm not like suuper easy, depends what exactly you need. in activation_dataset.py you can use make_sentence_dataset to get you a big load of sentences from the beginning of the pile. you can tokenize each sentence and calc how many tokens will go into 1 chunk (2GB / 4 bytes i think) and just stop there and that's your data
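The chunk arithmetic above can be sketched as follows. This assumes a chunk is 2 GiB of 4-byte floats and that each token contributes one activation vector of the layer's width (512 for the Pythia-70M residual stream); both the exact chunk size and the helper names are assumptions, since the message itself hedges with "i think".

```python
def tokens_per_chunk(chunk_bytes, activation_width, bytes_per_value=4):
    """How many token activations fit in one chunk of the given size."""
    return chunk_bytes // (activation_width * bytes_per_value)


def take_until_budget(token_counts, budget):
    """Count how many sentences fit, in order, without exceeding the token budget.

    token_counts: number of tokens in each sentence, in dataset order.
    Returns (number of sentences kept, total tokens kept).
    """
    total, kept = 0, 0
    for n in token_counts:
        if total + n > budget:
            break
        total += n
        kept += 1
    return kept, total


n_tokens = tokens_per_chunk(2 * 2**30, 512)
kept, used = take_until_budget([4, 5, 6], budget=10)
```

Under these assumptions one chunk holds 2**20 (about a million) tokens. This only approximates the training data if sentences are truncated or packed differently when activations are generated.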

#

note that ICA is trained for 1 chunk but PCA batched across 10

bitter turtle Aug 16, 2023, 8:21 PM

#

wait one second logan I have a util for this standby pushing now

pallid current Aug 16, 2023, 8:21 PM

#

if you want a more exact idea of what goes in you'd need to modify chunk and tokenize to return sentences

bitter turtle Aug 16, 2023, 8:22 PM

#

on my branch I edited activation_dataset.py and added a script to generate chunks

keen pivot Aug 16, 2023, 8:23 PM

#

bitter turtle 👍

I checked. I did
layer=15,
l1 = 1e-3 (w/ 12k, I predict sparsity of ~50-60)
dict=6x is probably best.
n_chunks: probably 20 (I did 10, but was still converging on the 8x dict)

tied, lr=1e-3, though I also had a bias for the decoder.

bitter turtle Aug 16, 2023, 8:23 PM

#

I expect not having decoder bias is ~fine, should be ~centered at 0

keen pivot Aug 16, 2023, 8:23 PM

#

https://wandb.ai/sparse_coding/sparse coding/groups/EleutherAI%2Fpythia-1.4b-deduped_15_0712-040209-EleutherAI%2Fpythia-1.4b-deduped-15_graphs/workspace?workspace=user-elriggs

If you can see this.

bitter turtle Aug 16, 2023, 8:44 PM

#

keen pivot The above is 1e-3. Doing 3e-4 for 10 chunks, it produced one like this (peaks ar...

what batch sizes are these?

pallid current Aug 16, 2023, 8:44 PM

#

mmcs looks good!

bitter turtle Aug 16, 2023, 9:02 PM

#

for what?

pallid current Aug 16, 2023, 9:03 PM

#

^ this image if this is to me

bitter turtle Aug 16, 2023, 9:36 PM

#

ah, sure

#

@pallid current did a PR, this one is probably a bit messy and awful to work through sorry, since I rewrote some datagen code to make it work with more models & added a util for saving activations

keen pivot Aug 16, 2023, 9:50 PM

#

bitter turtle what batch sizes are these?

--n_chunks=20 --layer=15 --mini_runs=10 --batch_size=1024

#

From the overview button (on the top-left of wandb)

#

Though I would like to note that I trained it for 10 mini_runs over 20 chunks. So 200 chunks!

bitter turtle Aug 16, 2023, 9:52 PM

#

Oh jesus

keen pivot Aug 16, 2023, 9:56 PM

#

https://wandb.ai/sparse_coding/sparse coding/groups/EleutherAI%2Fpythia-70m-deduped_2_0720-192301-EleutherAI%2Fpythia-70m-deduped-2_graphs/workspace?workspace=user-elriggs

#

This one is for sure tied ae, and still converging (when looking at MMCS) after 50 chunks

#

This is an error on my part, because I mixed up the "n_chunks" & "mini-runs" part. I do think dictionaries trained under 10 chunks (which is what we've normally done) are undertrained for larger dicts

#

This is for 10 chunks (top) and 5 chunks (bottom)

#

And this is for 50 chunks (top) and 45 chunks (bottom):

bitter turtle Aug 16, 2023, 10:03 PM

#

bitter turtle @pallid current did a PR, this one is probably a bit messy and awful to wo...

you might want to move the activation width stuff out of utils as it requires transformerlens which takes a while to load

#

pod shitting itself rn; what's the data in 3 and 5?

pallid current Aug 16, 2023, 10:08 PM

#

dunno abt 3/5 (tho think logan's using 3), i'm running an 8gpu sweep for checking higher lr low batch

#

can shut it off in a sec

#

@keen pivot logan can you run something like that but with MLP please?

#

my small batchsize/higher lr 10 chunk untied mlp didnt see much of a shift, v poor mmcs above like 3e-4

keen pivot Aug 16, 2023, 10:20 PM

#

bitter turtle pod shitting itself rn; what's the data in 3 and 5?

I'm probably 3 & 5. Also training a new 160m

keen pivot Aug 16, 2023, 10:20 PM

#

pallid current @keen pivot logan can you run something like that but with MLP please?

MLP-middle? Tied or untied?

pallid current Aug 16, 2023, 10:20 PM

#

probably both? yeah postnonlin

#

untied if you can only do 1

keen pivot Aug 16, 2023, 10:21 PM

#

@pallid current, the pca_topk is size 1k, and ica_topk is size 500 in the mnt/.../baselines/ folder. Is there a same-sized version?

keen pivot Aug 16, 2023, 10:22 PM

#

pallid current probably both? yeah postnonlin

on the old code?

bitter turtle Aug 16, 2023, 10:22 PM

#

pallid current my small batchsize/higher lr 10 chunk untied mlp didnt see much of a shift, v po...

wait, worse MMCS as lr increases?

pallid current Aug 16, 2023, 10:22 PM

#

keen pivot on the old code?

yeah, really want to be able to recreate whatever worked before

pallid current Aug 16, 2023, 10:22 PM

#

bitter turtle wait, worse MMCS as lr increases?

no, as l1a increases, soz

bitter turtle Aug 16, 2023, 10:22 PM

#

oh right

keen pivot Aug 16, 2023, 10:23 PM

#

pallid current yeah, really want to be able to recreate whatever worked before

like layer 2, Pythia 70m?

pallid current Aug 16, 2023, 10:23 PM

#

yeah ideally

bitter turtle Aug 16, 2023, 10:26 PM

#

pallid current yeah, really want to be able to recreate whatever worked before

what was the MCS tradeoff stuff for MLP on old code?

pallid current Aug 16, 2023, 10:27 PM

#

tbh i dont remember the tradeoffs in detail, but generally mcs could be solidly high at the normal l1 ranges

bitter turtle Aug 16, 2023, 10:29 PM

#

incredibly confused

pallid current Aug 16, 2023, 10:30 PM

#

bitter turtle incredibly confused

me too

keen pivot Aug 16, 2023, 10:31 PM

#

pallid current yeah ideally

Running! Currently downloading pile. Can send the aws when it's ready

#

Also, I had to change a lot of stuff to make it work, so like 50% I screwed up and it's running something we don't want!

keen pivot Aug 16, 2023, 10:32 PM

#

keen pivot @pallid current, the pca_topk is size 1k, and ica_topk is size 500 in the ...

@pallid current, any thoughts?

bitter turtle Aug 16, 2023, 10:33 PM

#

pallid current me too

I vaguely remember having much lower l1 values, do I remember this correctly?

pallid current Aug 16, 2023, 10:36 PM

#

keen pivot @pallid current, any thoughts?

oh sorry, yeah this is the PCA top_k encoder splitting it into positive and negative features as separate things

#

whereas the ICA one doesnt do that

#

agree that shouldnt be a discrepancy
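
A minimal sketch of why the sign split doubles the feature count. This is not the project's actual encoder: `topk_encode`, its arguments, and the toy directions are hypothetical names chosen for illustration, and plain Python lists stand in for tensors.

```python
def topk_encode(x, directions, k, split_sign):
    """Keep the k largest-magnitude projections of x onto `directions`.

    With split_sign=True every direction d becomes two non-negative
    features (+d and -d), so the code has 2 * len(directions) entries.
    """
    proj = [sum(xi * di for xi, di in zip(x, d)) for d in directions]
    if split_sign:
        # Positive parts first, then negative parts, each clipped at zero.
        feats = [max(p, 0.0) for p in proj] + [max(-p, 0.0) for p in proj]
    else:
        feats = proj
    order = sorted(range(len(feats)), key=lambda i: abs(feats[i]), reverse=True)
    keep = set(order[:k])
    return [f if i in keep else 0.0 for i, f in enumerate(feats)]


dirs = [[1.0, 0.0], [0.0, 1.0]]
x = [3.0, -2.0]
signed = topk_encode(x, dirs, k=2, split_sign=False)   # 2 features
split = topk_encode(x, dirs, k=2, split_sign=True)     # 4 features
```

Under this assumption, a split encoder over the same basis reports twice as many features as an unsplit one, which would be consistent with the 1k versus 500 sizes mentioned above.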

pallid current Aug 16, 2023, 10:37 PM

#

keen pivot Also, I had to change a lot of stuff to make it work, so like 50% I screwed up a...

what kinda things needed to change?

keen pivot Aug 16, 2023, 10:48 PM

#

pallid current what kinda things needed to change?

Like tied to untied, and setting the hyperparams.

pallid current Aug 16, 2023, 11:31 PM

#

finally got round to properly suppressing those aten warnings, sooo much more satisfying to train now chadhuber

pallid current Aug 17, 2023, 1:32 AM

#

ok we've basically waaaay undertrained our mlp dicts

#

yellow being 110 chunk-epochs, green and purple being the original 10

keen pivot Aug 17, 2023, 1:42 AM

#

pallid current ok we've basically waaaay undertrained our mlp dicts

Whoooo!

#

Do you still need some trained MLP ones from the old code? (I've got a run here, but I'm unsure on the hyperparams. I did get between 20-100 sparsity though) https://wandb.ai/sparse_coding/sparse coding/groups/EleutherAI%2Fpythia-70m-deduped_2_0817-004629-EleutherAI%2Fpythia-70m-deduped-2_graphs/workspace?workspace=user-elriggs

pallid current Aug 17, 2023, 3:13 AM

#

mlp l1 curve every 10 epochs:

#

#

perf still pretty bad tbh

pallid current Aug 17, 2023, 3:16 AM

#

keen pivot Do you still need some trained MLP ones from the old code? (I've got a run here,...

cheers, from what im seeing the much longer run is sufficient to explain the difference but def good to be able to compare to the old stuff to make sure

keen pivot Aug 17, 2023, 1:45 PM

#

pallid current cheers, from what im seeing the much longer run is sufficient to explain the dif...

Ya, I was mostly unsure cause the MMCS was much lower than the residual stream at the same number of chunks.

May be explained by the tied vs untied though (I remember tied having better MCS given same chunks trained)
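
For readers following the tied versus untied distinction, here is a minimal sketch of a sparse autoencoder loss under the common formulation (ReLU encoder, reconstruction error plus an L1 penalty on the code). The exact loss used in the project's code is not shown in this log, so the function name `sae_loss`, the argument layout, and the use of mean squared error are assumptions.

```python
def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def transpose(W):
    return [list(col) for col in zip(*W)]


def sae_loss(x, enc, bias, l1_alpha, dec=None):
    """Reconstruction MSE plus l1_alpha * |code|_1.

    enc has shape (n_feats, d). If dec is None the decoder is tied,
    i.e. it reuses enc transposed; otherwise dec has shape (d, n_feats).
    """
    pre = [p + b for p, b in zip(matvec(enc, x), bias)]
    code = [max(p, 0.0) for p in pre]                 # ReLU
    dec_matrix = transpose(enc) if dec is None else dec
    x_hat = matvec(dec_matrix, code)
    mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    return mse + l1_alpha * sum(abs(c) for c in code)


enc = [[1.0, 0.0], [0.0, 1.0]]
tied = sae_loss([1.0, 2.0], enc, [0.0, 0.0], l1_alpha=0.1)
untied = sae_loss([1.0, 2.0], enc, [0.0, 0.0], l1_alpha=0.1,
                  dec=[[0.0, 0.0], [0.0, 0.0]])
```

The only difference between the two calls is whether the decoder has its own weights, which is the switch referred to as "tied to untied" earlier in the thread.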

fiery tangle Aug 17, 2023, 2:21 PM

#

Hey, exciting work! I’m trying to run this notebook from your LW post. Is the auto encoder ('/root/sparse_coding/auto_encoders.pkl’) available on GitHub / HF?

Google Colaboratory

(tentatively) Found 600+ Monosemantic Features in a Small LM Using ...

Using a sparse autoencoder, I present evidence that the resulting decoder (aka "dictionary") learned 600+ features for Pythia-70M layer_2's mid-MLP (…

keen pivot Aug 17, 2023, 3:04 PM

#

fiery tangle Hey, exciting work! I’m trying to run this [notebook](https://colab.research.goo...

https://huggingface.co/Elriggs/autoencoder_layer_2_pythia70M_5_epochs/tree/main

#

Should work! Let me know if you run into issues.

keen pivot Aug 17, 2023, 4:05 PM

#

Confirmed: PCA top component is the outlier dimension

#

Guys. PCA_topk & ICA_topk suck soooo much. Like I can't even get a decent one for the apostrophe one.

keen pivot Aug 17, 2023, 4:46 PM

#

#

This is the neuron basis. And I'm zooming in after 5.11 basically:

#

This is the best one I could find! Out of searching top-10.

keen pivot Aug 17, 2023, 6:40 PM

#

#

BAM: Our dicts rock

#

Note: Going larger l1 (closer to the "all dead features" solution), there weren't any features found that weren't the outlier dimension.

Going smaller l1 (closer to the "identity/polysemantic" solution), there weren't any features found that didn't also activate for >10% of the dataset.

bitter turtle Aug 17, 2023, 7:20 PM

#

Explanation of why I'm no longer comparing to LEACE directly since @pallid current asked:
Given a set of labelled points X with classes Z, LEACE guarantees that no linear classifier can predict the Z_i from the X_i well. This is fine and good if your X_i have shape seq_len×d_activation, but if you erase on datapoints X_i of size d_activation then you don't guarantee that the model, which essentially sees (seq_len, d_activation), can't implement a linear classifier between them (and in fact it can and does).

This is a problem for direct comparisons as

it's not reasonable to compare LEACE KL-divergence OOD (e.g. on the pile); plausibly still useful to measure on-distribution
ablating a single feature direction is equivalent to a rank-seq_len ablation in the seq_len×d_activation space, so the ablations aren't the same

More generally I think we are aiming at much more general activation engineering with sparse feature dictionaries and it's not clear how to measure this
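
A minimal sketch of the rank point above, assuming that "ablating a feature direction" means projecting that direction out of the activation at every sequence position. The helper name `ablate_direction` is hypothetical and plain Python lists stand in for tensors.

```python
import math


def ablate_direction(acts, f):
    """Project direction f out of each position's activation vector.

    acts is a list of seq_len vectors of size d_activation. Every position
    loses one direction, so viewed in the flattened seq_len x d_activation
    space the edit removes seq_len directions, not one.
    """
    norm = math.sqrt(sum(v * v for v in f))
    u = [v / norm for v in f]
    out = []
    for x in acts:
        coeff = sum(xi * ui for xi, ui in zip(x, u))
        out.append([xi - coeff * ui for xi, ui in zip(x, u)])
    return out


acts = [[1.0, 2.0], [3.0, 4.0]]          # seq_len = 2, d_activation = 2
ablated = ablate_direction(acts, [0.0, 2.0])
```

After the call, every position has zero component along the chosen direction, which is what makes the edit rank-seq_len when the whole sequence is treated as one datapoint.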

pallid current Aug 17, 2023, 7:21 PM

#

keen pivot

yessss so glad to see this, nice one!

pallid current Aug 17, 2023, 7:22 PM

#

bitter turtle Explanation of why I'm no longer comparing to LEACE directly since <@56694680502...

so is this why LEACE isn't bang on 0.5 in the comparison graphs you posted yesterday?

bitter turtle Aug 17, 2023, 7:23 PM

#

well, even with the proper ablations it's not perfect at low levels. At layer 6 on pythia-1.4b it gets something like 0.56 which is still quite bad and worse than ours

#

I mean, I think I could still compare KL divergence on-distribution, might be something there

#

Like, even if our ablations are rank-k if we get a lower KL it's useful
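
A minimal sketch of the KL comparison being proposed, assuming it means the KL divergence from the unedited model's next-token distribution to the edited model's, averaged over positions. The function names are hypothetical and the distributions are toy values, not results.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same vocab."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mean_kl(original, edited):
    """Average KL over positions; each argument is a list of distributions."""
    return sum(kl_divergence(p, q) for p, q in zip(original, edited)) / len(original)


original = [[0.5, 0.5], [0.9, 0.1]]
edited = [[0.25, 0.75], [0.9, 0.1]]
score = mean_kl(original, edited)   # lower means the edit disturbed the model less
```

Two ablation methods can then be compared on the same text by which one leaves `mean_kl` lower while still removing the concept.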

pallid current Aug 17, 2023, 7:25 PM

#

bitter turtle well, even with the proper ablations it's not perfect at low levels. At layer 6 ...

wait so is that a proper comparison? like even if it can edit seq_len x activation_d and we just edit a single feature or a few, we can remove downstream perf

bitter turtle Aug 17, 2023, 7:25 PM

#

I think so yeah

#

I'll probably write up those results

#

I'll be surprised if we do get better KL divergence because that would have weird implications for models using data mostly linearly or not. LEACE guarantees minimal edit for no linear classification under any inner product norm, including the one 'expected change to KL div per unit shift in this axis' although realistically it's probably horribly nonlinear

#

Yeah nvm that it's going to be horribly nonlinear

pallid current Aug 17, 2023, 7:33 PM

#

right as i understand LEACE it removes the ability to predict from the activations at that layer, but assuming that there's non-linear work done between that layer and the output, its plausible that you can improve on LEACE in terms of whether there's like gender differences in the output

#

but then i suppose you could say even if there's non-linear computation ongoing between layer and output, if at any residual layer the information is stored linearly then linear concept erasure should be able to remove it

keen pivot Aug 17, 2023, 7:37 PM

#

@bronze wraith

pallid current Aug 17, 2023, 7:38 PM

#

keen pivot @bronze wraith

hahahaha how did i not see that before

keen pivot Aug 17, 2023, 7:43 PM

#

@bronze wraith

bitter turtle Aug 17, 2023, 7:51 PM

#

pallid current right as i understand LEACE it removes the ability to predict from the activatio...

yeah I'm pretty confused as to why rank-one ablations with our dicts perform so well; perhaps we shift the distribution of the text less so some backup nonlinear weird system doesn't contrib much to the predictions while in LEACE the ablation induces a significant distributional shift so the model uses some nonlinear prediction method

#

kind of want to see what we can do with a synthetic labelled dataset generated by e.g. GPT-3.5

#

not sure how you'd show very general erasure though

glass tinsel Aug 17, 2023, 8:12 PM

#

@bitter turtle I have very little idea what this experiment actually is so you'll have to explain it to me from the beginning

bitter turtle Aug 17, 2023, 8:12 PM

#

that is entirely fair

bitter turtle Aug 17, 2023, 8:13 PM

#

glass tinsel @bitter turtle I have very little idea what this experiment actually is s...

I think generally I was misapplying LEACE horribly and so any results I had previously are bad and should be ignored, I'll get back to you when I have actual results

glass tinsel Aug 17, 2023, 8:16 PM

#

oh ok lol

#

also, do you have access to concept labels at inference time in this experiment?

glass tinsel Aug 17, 2023, 8:32 PM

#

glass tinsel also, do you have access to concept labels at inference time in this experiment?

If so, you should use OracleFitter / OracleEraser which was in the main branch for a while; I just now pushed a v0.2 PyPI release which has it

#

Oracle LEACE drops the assumption that the erasure function has to be an affine transformation, and for each component of the representation in any arbitrarily selected orthonormal basis, directly minimizes the squared distance between the original and the scrubbed value

#

If that still doesn't work you could try Quadratic LEACE, which I've mostly implemented but haven't merged into main yet. It's definitely going to be a less surgical edit but it ensures that no linear or quadratic classifier can extract any info about the concept.

#

Your use case might help me decide a couple details about how to implement QuadraticFitter and QuadraticEraser. Quadratic LEACE is an oracle method, so it requires concept labels at inference time, and it also has this weird property where you have to like "dispatch" each individual data point to a different affine transform depending on its concept label, which is hard to do in an efficient vectorized way. You can do it quasi efficiently with like torch.unique but I'm still unsure about whether to do this batching/preprocessing step on every call to QuadraticEraser.forward or force the user to do the preprocessing all at once or smth
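
A minimal sketch of the "dispatch" pattern described above: group data points by concept label, then apply each group's own affine transform. This is not the library's API; `apply_per_class_affine` and the `transforms` mapping are hypothetical, and a pure-Python dictionary grouping stands in for the `torch.unique`-style batching mentioned in the message.

```python
from collections import defaultdict


def apply_per_class_affine(xs, labels, transforms):
    """Apply A_z @ x + b_z to each x, where z is that point's concept label.

    transforms maps label -> (A, b). Indices are grouped by label first so
    each transform is looked up once per class rather than once per point.
    """
    groups = defaultdict(list)
    for idx, z in enumerate(labels):
        groups[z].append(idx)

    out = [None] * len(xs)
    for z, idxs in groups.items():
        A, b = transforms[z]
        for idx in idxs:
            x = xs[idx]
            out[idx] = [sum(a * xi for a, xi in zip(row, x)) + bi
                        for row, bi in zip(A, b)]
    return out


transforms = {
    0: ([[1, 0], [0, 1]], [0, 0]),   # identity for class 0
    1: ([[0, 0], [0, 0]], [1, 1]),   # collapse class 1 to a constant
}
result = apply_per_class_affine([[1, 2], [3, 4], [5, 6]], [0, 1, 0], transforms)
```

The design question raised in the message is whether the grouping step runs on every forward call or is done once up front; in this sketch it runs on every call.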

bitter turtle Aug 17, 2023, 9:23 PM

#

Wasn't planning to give it access to labels at inference time/use oracle erasers, but maybe later on I will.

glass tinsel Aug 17, 2023, 9:26 PM

#

ok

pallid current Aug 18, 2023, 12:24 AM

#

made some big summary graphs for residual stream

#

dip in quality seen from some of the 16 and 32 ratio dicts indicates they might be a bit undertrained

glass tinsel Aug 18, 2023, 2:11 AM

#

Have you guys tried end-to-end training the dictionary to minimize loss btw? I suggested this to @keen pivot and he said it was part of the plan

keen pivot Aug 18, 2023, 2:13 AM

#

glass tinsel Have you guys tried end-to-end training the dictionary to minimize loss btw? I s...

Definitely on my future todo and Aidan mentioned it first I believe. It might not make the end-of-the-month to-do’s & results, but it’d be nice!

keen pivot Aug 18, 2023, 5:21 AM

#

Okay, just couldn't fall asleep, so I checked some perplexity stuff. Currently getting quite good perplexity diff relative to what I got earlier, even on only 10 chunks (like what?). Gonna run a few tests to check it out & see if I'm seeing straight.

bitter turtle Aug 18, 2023, 7:16 AM

#

keen pivot Okay, just couldn't fall asleep, so I checked some perplexity stuff. Currently g...

What's this test?

bitter turtle Aug 18, 2023, 10:52 AM

#

bitter turtle I'll be surprised if we do get better KL divergence because that would have weir...

as expected we don't have better KL div on-distribution

#

currently slightly unsure what activation editing directions to pursue, could look at

ability for activation editing with feature dictionaries to generalise to multiple tasks (not sure what this would look like from a testing perspective)
comparison to activation editing strategies like nulling the mean-diff vector
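
A minimal sketch of the mean-diff baseline named in the second item, assuming it means taking the difference between the two classes' mean activations and projecting that direction out. The helper names are hypothetical and the activations are toy values.

```python
def mean_diff_direction(acts_a, acts_b):
    """Difference between the mean activation of class A and class B."""
    dim = len(acts_a[0])
    mean_a = [sum(x[i] for x in acts_a) / len(acts_a) for i in range(dim)]
    mean_b = [sum(x[i] for x in acts_b) / len(acts_b) for i in range(dim)]
    return [a - b for a, b in zip(mean_a, mean_b)]


def null_direction(x, d):
    """Remove the component of x along d (d does not need to be unit norm)."""
    coeff = sum(xi * di for xi, di in zip(x, d)) / sum(di * di for di in d)
    return [xi - coeff * di for xi, di in zip(x, d)]


d = mean_diff_direction([[1.0, 0.0], [3.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]])
edited = null_direction([5.0, 7.0], d)
```

Comparing this against ablating a single dictionary feature would use the same projection step, only with a different choice of direction.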

#

I expect general problems with robustness because it's not even clear what it means to do distributionally-robust activation editing at this point

bitter turtle Aug 18, 2023, 12:44 PM

#

I think what I'll end up doing for the paper is more like what @keen pivot's doing at the moment, I'll park more ambitious things for another time

bitter turtle Aug 18, 2023, 1:54 PM

#

Still got these results although need proper L1=0 run for baselining @pallid current

keen pivot Aug 18, 2023, 2:37 PM

#

bitter turtle I think what I'll end up doing for the paper is more like what <@360082080975290...

Yooo, still got that weight based MLP residual thing right?

bitter turtle Aug 18, 2023, 2:38 PM

#

yeah still have that to do

#

Not entirely sure what I should be doing with that, can't see any immediate inroads to distilling the relationships between features

keen pivot Aug 18, 2023, 2:39 PM

#

bitter turtle Not entirely sure what I should be doing with that, can't see any immediate inro...

Multiply the features by MLP-in and compare to the dictionary at MLP-in

bitter turtle Aug 18, 2023, 2:40 PM

#

not sure that's necessarily a useful distillation bc obviously it's very dependent on the nonlinearity

keen pivot Aug 18, 2023, 2:40 PM

#

Sure, but it might work?

bitter turtle Aug 18, 2023, 2:41 PM

#

sure

keen pivot Aug 18, 2023, 2:41 PM

#

I tried from layer 4 to 5 and it sucked, but this might work

#

You could even try applying the layernorm and GeLU before doing MCS to see if it’s meaningful.

bitter turtle Aug 18, 2023, 2:44 PM

#

doubtful

keen pivot Aug 18, 2023, 2:44 PM

#

In my heart-of-hearts I believe it may work best by integrating one in-distribution text, but that's no longer just weight-based

bitter turtle Aug 18, 2023, 2:44 PM

#

yes I think the focus on purely weight-based analysis is somewhat misguided

keen pivot Aug 18, 2023, 2:45 PM

#

But it would be so cool

keen pivot Aug 18, 2023, 3:30 PM

#

Ya, this is basically it. I was only able to get to 30 perplexity by going 6x. 25 is the dream. You could get there by going polysemantic, but no point at that point.

#

Would be interested to see doing KL-div training at this point!

bitter turtle Aug 18, 2023, 3:36 PM

#

For sure can do those graphs shortly

#

What dicts are those?

frank vortex Aug 18, 2023, 3:52 PM

#

check it out i found a math feature

#

I was trying to find features that align with themselves more after they are transformed by a head

#

that is, a feature that has high cos_sim(W_OV f, f), where f is a learnt dictionary feature and W_OV is a transformation from residual stream to residual stream

#

For some reason, features that are sorted in this way tend to be highly interpretable (there seems to be a context)
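
A minimal sketch of the sorting criterion described above, assuming W_OV is given as a square residual-to-residual matrix and each dictionary feature is a vector in the residual stream. The function names are hypothetical and the matrices are toy values.

```python
import math


def matvec(W, v):
    return [sum(w * vi for w, vi in zip(row, v)) for row in W]


def cos_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def self_alignment(W_ov, features):
    """cos_sim(W_OV f, f) for every feature f."""
    return [cos_sim(matvec(W_ov, f), f) for f in features]


def rank_by_self_alignment(W_ov, features):
    """Feature indices sorted from most to least self-aligned."""
    scores = self_alignment(W_ov, features)
    return sorted(range(len(features)), key=lambda i: scores[i], reverse=True)


features = [[3.0, 4.0], [1.0, 0.0]]
rotation = [[0.0, -1.0], [1.0, 0.0]]     # maps every vector to an orthogonal one
scores = self_alignment(rotation, features)
```

As a sanity check on the metric, an identity matrix gives every feature a score of 1 and a 90-degree rotation gives every feature a score of 0, so the score on its own says how much the head writes back along the direction it reads from.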

bitter turtle Aug 18, 2023, 4:00 PM

#

@keen pivot the cosine sims of MLP-out don't look to be especially interpretable, this is a histogram of the gini indexes of MLP_dict @ resid_dict, i.e. a measure of how spread out the effect of one MLP-out direction activating should be; ideally if it were interpretable the effect would be sparse and we'd see this in the plot
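
A minimal sketch of the Gini measure described above, under two assumptions not stated in the message: dictionary rows are feature directions, and the Gini index is taken over the absolute similarities between one upstream feature and every downstream feature. The function names are hypothetical.

```python
def gini(values):
    """Gini index of the absolute values: 0 is uniform, (n-1)/n is one-hot."""
    xs = sorted(abs(v) for v in values)
    n = len(xs)
    total = sum(xs)
    if total == 0:
        return 0.0
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return 2.0 * weighted / (n * total) - (n + 1) / n


def gini_per_upstream_feature(upstream, downstream):
    """One Gini index per upstream feature, over its dot products downstream."""
    scores = []
    for u in upstream:
        sims = [sum(a * b for a, b in zip(u, d)) for d in downstream]
        scores.append(gini(sims))
    return scores


uniform = gini([1.0, 1.0, 1.0, 1.0])     # effect spread evenly
one_hot = gini([0.0, 0.0, 0.0, 1.0])     # effect concentrated on one feature
```

A histogram of `gini_per_upstream_feature` skewed towards high values would correspond to the sparse, interpretable case described in the message.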

#

hists of cosine sim

#

I'm honestly thinking i've got the wrong dicts 🤔

#

it is mlp_out_l2 that flows into residual_l2 right @pallid current?

keen pivot Aug 18, 2023, 4:06 PM

#

bitter turtle it is mlp_out_l2 that flows into residual_l2 right @pallid current?

It should be! We're doing post_residual for residual which is at the end of the layer.

bitter turtle Aug 18, 2023, 4:10 PM

#

ok so it works with residual thingies in each layer with REALLY HIGH l1 values (like the highest we use). 1e-3 doesn't seem to have this.

#

this is between l1 and l2 btw will check others soon

#

@keen pivot didn't you do something similar to this I can't remember

keen pivot Aug 18, 2023, 4:12 PM

#

keen pivot Ya, this is basically it. I was only able to get to 30 perplexity by going 6x. 2...

Okay, so I'm seeing a research direction here (for later). We can develop a measure of monosemanticity by doing several token-level histograms for lower & lower l1 values. They eventually just suck so much, that you say "that's bad!" and move on. Then we have our target KL & do the KL divergence thing.

keen pivot Aug 18, 2023, 4:13 PM

#

bitter turtle @keen pivot didn't you do something similar to this I can't remember

Nope! Or I can't remember either then

bitter turtle Aug 18, 2023, 4:14 PM

#

keen pivot Okay, so I'm seeing a research direction here (for later). We can develop a meas...

yeah this seems fair for the earlier layers

keen pivot Aug 18, 2023, 4:15 PM

#

bitter turtle yeah this seems fair for the earlier layers

Why earlier layers specifically?

bitter turtle Aug 18, 2023, 4:15 PM

#

token-level stuff

keen pivot Aug 18, 2023, 4:16 PM

#

bitter turtle token-level stuff

I've seen it in layer 5/6, but maybe there's more token-level features earlier?

bitter turtle Aug 18, 2023, 4:17 PM

#

keen pivot I've seen it in layer 5/6, but maybe there's more token-level features earlier?

oh I guess I was talking more generally

keen pivot Aug 18, 2023, 4:17 PM

#

I'd expect more re-tokenization/bigram-esque stuff in earlier layers

#

Yep yep. A statistical argument makes sense.

bitter turtle Aug 18, 2023, 4:17 PM

#

l3-l4

#

seems better for larger dictionaries (still resid-to-resid, but this time r=32)

keen pivot Aug 18, 2023, 4:32 PM

#

frank vortex check it out i found a math feature

"Do some interpretability stuff" at the bottom 😆

keen pivot Aug 18, 2023, 4:35 PM

#

keen pivot "Do some interpretability stuff" at the bottom 😆

Looks cool! I'm confused what the OV self-multiplication is supposed to do, but you're noticing they're more interpretable. Have you tried looking at like 10 features that don't align w/ any W_OV (of any head in that layer)?

#

Also is this the same layer's W_OV or the next layer's?

keen pivot Aug 18, 2023, 4:37 PM

#

bitter turtle l3-l4

Ya, I don't know. The l1=0.01 is so dead though.

bitter turtle Aug 18, 2023, 4:37 PM

#

ikr

keen pivot Aug 18, 2023, 4:37 PM

#

bitter turtle seems better for larger dictionaries (still resid-to-resid, but this time r=32)

This is interesting at least

#

Which models are you loading in? Like the directory?

bitter turtle Aug 18, 2023, 4:38 PM

#

bigrun0308/...

#

uh l3 l4 ratio 32 residual

keen pivot Aug 18, 2023, 4:38 PM

#

Could you try tiedlong_residual... in Hoagy's?

bitter turtle Aug 18, 2023, 4:38 PM

#

/mnt/ssd-cluster/bigrun0308/tied_residual_l3_r32/_9/learned_dicts.pt
/mnt/ssd-cluster/bigrun0308/tied_residual_l4_r32/_9/learned_dicts.pt

bitter turtle Aug 18, 2023, 4:39 PM

#

keen pivot Could you try tiedlong_residual... in Hoagy's?

this mlp or residual

#

is that in hoagy's?

#

one sec

#

oh cool

keen pivot Aug 18, 2023, 4:39 PM

#

Residual in Hoagy's base directory.

#

Nice! So just trained more or something

bitter turtle Aug 18, 2023, 4:45 PM

#

keen pivot Residual in Hoagy's base directory.

there aren't any dicts in there?

#

@pallid current

keen pivot Aug 18, 2023, 4:55 PM

#

bitter turtle there aren't any dicts in there?

Every 20 dicts

#

sparse_coding_hoagy/tiedlong_tied_residual_l4_r4/_80/learned_dicts.p

#

Except the last one for whatever reason.

bitter turtle Aug 18, 2023, 4:56 PM

#

ah

#

ignore the title: this is MLP out r8 for l3 at high l1; much better

bitter turtle Aug 18, 2023, 5:04 PM

#

keen pivot ```sparse_coding_hoagy/tiedlong_tied_residual_l4_r4/_80/learned_dicts.p```

yep these are a lot better

#

also not seeing much difference for higher l1 which is a something sign

#

bitter turtle Aug 18, 2023, 5:31 PM

#

@pallid current do you have zero l1 baselines?

pallid current Aug 18, 2023, 5:32 PM

#

morning everyone 🙂

pallid current Aug 18, 2023, 5:32 PM

#

bitter turtle @pallid current do you have zero l1 baselines?

i've run them overnight for mlp

bitter turtle Aug 18, 2023, 5:32 PM

#

ah, what about resid

pallid current Aug 18, 2023, 5:32 PM

#

not for residual, at least across the model yet, soz

bitter turtle Aug 18, 2023, 5:33 PM

#

aok

pallid current Aug 18, 2023, 5:39 PM

#

run died because i had one dict ratio as an int not float lmao [0.25, 0.5, 1.0, 2.0, 4.0, 8.0, 16]:

#

but only for untied 🤔 🤔

#

might not be that tbf

frank vortex Aug 18, 2023, 5:47 PM

#

keen pivot Looks cool! I'm confused what the OV self-multiplication is supposed to do, but ...

I need to work more on the interpretability. Here is what I have observed: cos_sim(W_OV f_i, f_i) tends to be greater than cos_sim(W_OV f_i, f_j) for j not equal to i. Assuming that features correspond to meaningful directions in the residual stream, a feature transformed by an attention head tends to align more with itself than with other features. My hunch was that W_OV is more like compressing features so that they can be recovered (the recovered feature aligns with the original feature) rather than a self-multiplication. Also, since W_OV is responsible for copying, features that align with themselves might be the ones being copied.

#

This shows that the OV is responsible for extracting the attribute for some tokens

#

https://arxiv.org/pdf/2304.14767.pdf

#

Attributes, the way they are defined there, are taken to be the context for a token, which is very similar to the feature that a token relates to. That is just a guess, I am unsure about this reasoning

#

I suppose one intuition or idea is that from the mathematical framework for transformer circuits paper, they put plots of the W_OV matrix eigenvalues. they draw blobs around the eigenvalues that cluster in the positive real direction; having a high CS of (W_OV f, f) means that feature is somewhere in that cluster (maybe not aligning perfectly with an eigenvector but thats ok) and with this library we can interpret it

keen pivot Aug 18, 2023, 6:06 PM

#

frank vortex I need to work out more on the interpretability: Here is what I have observed co...

I would think a vector doesn’t align with another feature (i and j) because they’re different features to begin with.

You’d get the same behavior with W_OV= identity matrix.

#

Thinking more, W_OV from residual to residual is like reading from this direction and writing to that direction.

So here, you’re finding directions where the OV circuit reading from that direction means it writes back to that same direction.

#

I wonder how you could connect this to real examples with sequence position.

Like suppose these features do copy information from one sequence position to another. Can we see specific examples where that happens at specific sequence positions?

#

@bitter turtle for the record, I don’t quite understand the Gini coefficient stuff interpretation, and may be too tired to understand atm. But to give a take:

We want to know some statistic of how much one dictionary matches with another. For MLP-out, we expect lots of matches, much more than random, especially since it’s just an addition with attention.

keen pivot Aug 18, 2023, 6:23 PM

#

keen pivot Ya, this is basically it. I was only able to get to 30 perplexity by going 6x. 2...

@pallid current, a better version of this graph would show ratios 1-6, and I predict there's improvements up to ratio 6 for perplexity.

bitter turtle Aug 18, 2023, 6:29 PM

#

keen pivot @bitter turtle for the record, I don’t quite understand the Gini coeffici...

I agree, I'll just throw down my ideas so far:

we want a given upstream feature to have a sparse impact on downstream features, i.e. only impact a few and not all features uniformly (as random would) -- hence gini coeff stuff
MCS is probably not a great indicator here. Many low cosine similarity values probably indicate randomness/unconnectedness. Having a CS of about 0.4 seems to indicate a degree of sparsity in terms of what features impact what other features
we can probably also look at the covariance/correlation instead of the cosine sim (I would do this, but for some reason I am getting like 1800 dead features :/ )

keen pivot Aug 18, 2023, 6:30 PM

#

bitter turtle I agree, I'll just throw down my ideas so far: - we want a given upstream featur...

Why not look at a feature, figure out what it does, and see if weight-based connections gives you any predictive ability on that specific feature?

bitter turtle Aug 18, 2023, 6:31 PM

#

I want to sort out the covariance stuff first

#

Also have you got the histogram thing integrated with the new setup?

#

If not send over the code anyway and I'll give a go on converting it

shell mural Aug 18, 2023, 6:44 PM

#

keen pivot I wonder how you could connect this to real examples with sequence position. Li...

yep thats something im interested in. well originally i was thinking: can we find in a somewhat unsupervised fashion, prompts that result in activating these OV directions

#

but also how to track when information is flowing from one feature to another along a qk pair

keen pivot Aug 18, 2023, 7:46 PM

#

shell mural but also how to track when information is flowing from one feature to another al...

Is that some conceptual work to figure this out, or just try the first 3 experiments that come to mind and see what sticks?

#

I just trained the 160m & sent it off to Lucia & Lovis. Whooo! My first Trello to-do done. WHooOOOooOOOooOOOoooo

shell mural Aug 18, 2023, 7:47 PM

#

keen pivot Is that some conceptual work to figure this out, or just try the first 3 experim...

a healthy mix of both

#

inspired from a few papers i read recently

keen pivot Aug 18, 2023, 7:50 PM

#

Nice! Definitely feel free to post intermediate results & half-baked hypotheses here:)

shell mural Aug 18, 2023, 7:52 PM

#

frank vortex I need to work out more on the interpretability: Here is what I have observed co...

like this was a hacky idea we threw together, the idea being that if we do for some reason find some semblance of copying behavior then it might be indicative of richer more abstract features, rather than something token-level. that would be harder to prove but we can just chuck it in and see what we get, so we did

keen pivot Aug 18, 2023, 7:54 PM

#

shell mural like this was a hacky idea we threw together, the idea being that if we do for s...

What if you did that, but just on dictionary learned on attn_out?

shell mural Aug 18, 2023, 8:01 PM

#

what, as in compute (f_i, W_OV^-1 f_j)?

keen pivot Aug 18, 2023, 8:14 PM

#

No. F_i is from a dictionary trained solely on the output of attn_out. Then you'll get non-token-level features for sure.

shell mural Aug 18, 2023, 8:15 PM

#

oh yeah

#

narmeen has just informed me that we already have those dictionaries. so i guess we should look at them lol

keen pivot Aug 18, 2023, 8:46 PM

#

lololo

pallid current Aug 19, 2023, 12:48 AM

#

been struggling with the autointerp on mlps all day. very mixed results on the longer runs, usually doing quite well but num dead feats is high even with l1 at literally zero

keen pivot Aug 19, 2023, 2:21 AM

#

pallid current been struggling with the autointerp on mlps all day. very mixed results on the l...

Weeeeiiird

bitter turtle Aug 19, 2023, 10:12 AM

#

pallid current been struggling with the autointerp on mlps all day. very mixed results on the l...

This maybe makes sense if it's learning an orthogonal basis/double that basis

#

Like I wouldn't expect any dead features up to ratio 2 but maybe from 2< I would
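
A minimal sketch of the dead-feature count under discussion, assuming "dead" means a dictionary feature whose activation never exceeds a threshold on a sample of data. The function name and the `eps` threshold are hypothetical; the project's own definition is not shown in this log.

```python
def dead_feature_fraction(codes, eps=0.0):
    """Fraction of features that never activate above eps.

    codes is a list of feature-activation vectors, one per datapoint.
    """
    n_feats = len(codes[0])
    alive = [False] * n_feats
    for code in codes:
        for j, c in enumerate(code):
            if abs(c) > eps:
                alive[j] = True
    return alive.count(False) / n_feats


codes = [[0.0, 1.0, 0.0],
         [0.0, 2.0, 0.5]]
fraction = dead_feature_fraction(codes)   # feature 0 never fires
```

Tracking this fraction against dictionary ratio is one way to check the guess above that dead features should only start appearing beyond ratio 2.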

fading valley Aug 21, 2023, 7:56 AM

#

Would appreciate it if someone could provide feedback on this idea

For the purpose of an enumerative safety result for reward models learned during RLHF:

extract n parameters for which the largest updates occurred during RLHF
transplant them to the base model to confirm their applicability in reducing loss on some measure that quantifies the success of your RLHF
(assuming the success of the last step:) obtain a representation exhibiting less superposition of some layer like in https://www.lesswrong.com/posts/wqRqb7h6ZC48iDgfK/tentatively-found-600-monosemantic-features-in-a-small-lm for both the base model and the RLHFd model (assumedly the difference should at least partially encode information about the learned reward model)
try to train an autoencoder to reconstruct activation vectors for the RLHFd model when given activation vectors from the base model as input in order to quantify how good the reconstructions from the previous step were (better reconstructions should be better at predicting the behavior of the RLHFd model assuming the autoencoder is able to properly reconstruct inputs for just the base and RLHF models separately)

Ideally resulting in an understanding of which parameter shifts caused different features to affect the outputs of the inspected layer
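
The first two steps of the proposal can be sketched on a toy model where each named parameter is a single float. This is only an illustration of the selection and transplant logic: the function names and the flat name-to-value dictionaries are hypothetical, and a real model would compare tensors rather than scalars.

```python
def top_n_updated(base, tuned, n):
    """Names of the n parameters that changed most (by absolute difference)."""
    deltas = {name: abs(tuned[name] - base[name]) for name in base}
    return sorted(deltas, key=deltas.get, reverse=True)[:n]


def transplant(base, tuned, names):
    """Copy the selected tuned parameters into a copy of the base model."""
    patched = dict(base)
    for name in names:
        patched[name] = tuned[name]
    return patched


base = {"a": 0.0, "b": 1.0, "c": 2.0}
tuned = {"a": 0.1, "b": 3.0, "c": 1.5}
selected = top_n_updated(base, tuned, n=2)
patched = transplant(base, tuned, selected)
```

The patched model would then be scored on whatever measure quantifies RLHF success, which is the check described in the second step.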

prime obsidian Aug 21, 2023, 8:15 AM

#

@keen pivot Hi! What marc/er mentioned is the project I told you I was working on when we met IRL

bitter turtle Aug 21, 2023, 1:06 PM

#

fading valley Would appreciate it if someone could provide feedback on this idea For the purp...

My initial take is that this would require improvements in the sparse autoencoders we are currently using; we are not currently able to accurately reconstruct inputs to the degree I'd think would be needed for this kind of experiment. I'm also not sure why the final step helps quantify goodness of reconstruction for this, could you elaborate maybe? I'm also not sure how you would be able to usefully extract information out of the difference between the decompositions learned in step 3. Excited to hear more!

keen pivot Aug 21, 2023, 1:59 PM

#

prime obsidian @keen pivot Hi! What marc/er mentioned is the project I told you I was...

Lol, I was about to connect you two!

keen pivot Aug 21, 2023, 2:03 PM

#

fading valley Would appreciate it if someone could provide feedback on this idea For the purp...

It may make sense to just learn a dictionary on the reward model’s layers (and find features that lead to low and high reward). Does this make sense?

Additionally, you’ll have the RLHF model do better than the base model on some metric.

If you have dicts for both, then you can find the features that are responsible for one doing better than another.

bitter turtle Aug 21, 2023, 2:11 PM

#

keen pivot It may make sense to just learn a dictionary on the reward model’s layers (and f...

How would you use dict differences in the third paragraph?

keen pivot Aug 21, 2023, 2:13 PM

#

bitter turtle How would you use dict differences in the third paragraph?

You find the features that are useful for a given task for each dict, then see which ones are different (ie MCS, qualitative analysis, input/outputs?)

bitter turtle Aug 21, 2023, 2:17 PM

#

I guess I'm just not sure how robust our learned dictionaries are to like initialisation state etc.

keen pivot Aug 21, 2023, 2:59 PM

#

bitter turtle I guess I'm just not sure how robust our learned dictionaries are to like initia...

Ya that's good. So doing that experiment too would be good (ie train two dicts of the same size & check their different features on the same task)

fiery tangle Aug 21, 2023, 3:52 PM

#

Hey, is there a notebook for this LW post available anywhere?

Really Strong Features Found in Residual Stream — LessWrong

[I'm writing this quickly because the results are really strong. I still need to do due diligence & compare to baselines, but it's really exciting!] …

keen pivot Aug 21, 2023, 4:05 PM

#

fiery tangle Hey, is there a notebook for this [LW post](https://www.lesswrong.com/posts/Q76C...

There's an older version of code here: https://github.com/loganriggs/sparse_coding/blob/main/interpreting_sparse_dictionaries.ipynb

w/ the dictionary here: https://huggingface.co/Elriggs/autoencoder_layer_2_pythia70M_5_epochs/tree/main

#

Then the latest stuff:
https://github.com/loganriggs/sparse_coding/blob/main/lucia_and_lovis.ipynb

W/ the dictionaries for pythia-160m also in that directory

keen pivot Aug 21, 2023, 4:33 PM

#

Yoooo, ablating the apostrophe feature mostly only affects the "s". (Note: I specifically searched for a feature that activated on the sentence: |Then we went to Dave'|, which is the possessive apostrophe, rather than other contractions)

keen pivot Aug 21, 2023, 9:07 PM

#

When working w/ dicts across layers, I noticed the dicts seem:

monotonically decreasing in MMCS (& # of features above 0.9)
monotonically increasing in peak MMCS w/ respect to l1

pallid current Aug 21, 2023, 9:09 PM

#

layer 5 really seems to be something very different and less suited to sparse coding, always looks v different

#

like the graphs though!

keen pivot Aug 21, 2023, 9:10 PM

#

pallid current layer 5 really seems to be something very different and less suited to sparse co...

Layer 5 is so weird

pallid current Aug 21, 2023, 9:11 PM

#

prob be good to log the xaxis

keen pivot Aug 21, 2023, 9:13 PM

#

It's weird that last layer peaks when first layer drops. Would need to run on another model before making any conclusions

#

Oh, to clarify: This is MMCS from dicts with their l1, um, "neighbor". So 3e-3 & 4e-3, and 4e-3 & 5e-3.
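
For reference, a minimal sketch of how MMCS (mean max cosine similarity) between two dictionaries could be computed. It assumes each dictionary is a matrix whose rows are feature directions; the function name `mmcs` and the 0.9 threshold argument are illustrative and not taken from the project codebase.

```python
import numpy as np

def mmcs(dict_a, dict_b, threshold=0.9):
    """Mean max cosine similarity from dict_a to dict_b.

    dict_a, dict_b: arrays of shape (n_features, d_model), rows are features.
    Returns (mean of best-match cosines, number of features above threshold).
    """
    # Normalise rows so that dot products are cosine similarities.
    a = dict_a / np.linalg.norm(dict_a, axis=1, keepdims=True)
    b = dict_b / np.linalg.norm(dict_b, axis=1, keepdims=True)
    cos = a @ b.T                   # (n_a, n_b) pairwise cosines
    max_cos = cos.max(axis=1)       # best match in dict_b for each feature of dict_a
    return float(max_cos.mean()), int((max_cos > threshold).sum())
```

Comparing "l1 neighbours" would then just mean calling this on dictionaries trained with adjacent l1 values. Note the measure is not symmetric: `mmcs(a, b)` and `mmcs(b, a)` can differ when the dictionaries have different sizes.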

keen pivot Aug 21, 2023, 9:15 PM

#

pallid current prob be good to log the xaxis

This is log, but on a smaller dict (last one was ratio 8, this is ratio 1)

pallid current Aug 22, 2023, 1:31 AM

#

doing some graphs of % features active for the appendix

bitter turtle Aug 22, 2023, 1:52 AM

#

What L1 values are we using for MLP?

#

And what do the reconstruction-sparsity plots for MLP look like?

bitter turtle Aug 22, 2023, 1:53 AM

#

pallid current doing some graphs of % features active for the appendix

You can maybe factor the legend out

pallid current Aug 22, 2023, 2:54 AM

#

bitter turtle What L1 values are we using for MLP?

for mlp we've generally been using a sweep of 8 l1 values from 1e-4 to 1e-2, though as you can see there's no real need to go above 1e-3

pallid current Aug 22, 2023, 2:55 AM

#

bitter turtle And what do the reconstruction-sparsity plots for MLP look like?

tied:

#

untied:

fading valley Aug 22, 2023, 7:52 AM

#

bitter turtle My initial take is that this would require improvements in the sparse autoencode...

The purpose of the final step is to be able to benchmark the accuracy of your interpretation of the reward model. Presumably, if you're able to predict the activations of the RLHFd model using just the base model and an autoencoder trained to reconstruct RLHFd-model activations from base-model activation vectors, your method has done a good job

fading valley Aug 22, 2023, 7:55 AM

#

keen pivot It may make sense to just learn a dictionary on the reward model’s layers (and f...

This sounds good, I'm mostly just looking to reduce noise by narrowing our scope to include mostly components of the model involved in reward modelling

bitter turtle Aug 22, 2023, 9:09 AM

#

fading valley The purpose of the final step is to be able to benchmark the accuracy of your in...

Yes, but I'm not sure how possible this is. It would be interesting if possible, but again 🤷

#

Like, have people even successfully done base model activations -> RLHFd model activations with an autoencoder? The final step seems to rely on this being possible.

fading valley Aug 22, 2023, 9:15 AM

#

bitter turtle Like, have people even successfully done base model activations -> RLHFd model a...

Not from what I've seen. If it's not possible it won't have been that large of a time investment (I would imagine) and might be useful later if sparse coding continues to be pursued

bitter turtle Aug 22, 2023, 9:15 AM

#

Sure

bitter turtle Aug 22, 2023, 11:27 AM

#

various activation editing techniques across depth; for the first few layers there are single features we can ablate that basically solve the task, but in later layers there aren't

#

the most reasonable explanation I can come up with for this is that since we are ablating sparsely-activating directions, our edits are more 'in-distribution' than e.g. LEACE's edits at earlier layers

#

oh, 'mean' here is 'ablate along the difference-in-means direction when considering all token positions as unique datapoints' which is like laughably untrue but hahahaha idk any other techniques that operate on a per-token-position basis
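
A minimal sketch of the 'mean' baseline as described above, assuming activations are stacked as rows with every token position treated as its own datapoint. The function names are hypothetical, and this version zeroes the component along the direction, which may differ from how the actual experiment handled it.

```python
import numpy as np

def diff_in_means_direction(acts_a, acts_b):
    """Unit vector pointing from the mean of class b to the mean of class a.

    acts_a, acts_b: arrays of shape (n_tokens, d_model).
    """
    d = acts_a.mean(axis=0) - acts_b.mean(axis=0)
    return d / np.linalg.norm(d)

def ablate_direction(acts, direction):
    """Remove the component of each activation along a unit direction."""
    coeffs = acts @ direction                # (n_tokens,) projection coefficients
    return acts - np.outer(coeffs, direction)
```

After ablation every activation has zero projection onto the direction, which is a much blunter edit than ablating a single sparsely-activating dictionary feature.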

bitter turtle Aug 22, 2023, 1:46 PM

#

histograms for various best dict features for ablating for pronoun prediction

keen pivot Aug 22, 2023, 2:48 PM

#

bitter turtle histograms for various best dict features for ablating for pronoun prediction

Good that these make sense for the early layers where it works!

One concern w/ layer 3 & 4 (which isn't much of a concern because they're bad) is that they might be outlier features, where ablating them makes a lot of tasks worse (one indication is that the top tokens are sort of sorted by token frequency, but you could just do the "visualize feature" function to see if it activates for the first token & first delimiter).

bitter turtle Aug 22, 2023, 3:19 PM

#

yeah, I can ~tell which ones they are by their activation magnitude, I'll check for you

#

Ok, initially it doesn't seem like they are

keen pivot Aug 22, 2023, 3:48 PM

#

bitter turtle yeah, I can ~tell which ones they are by their activation magnitude, I'll check ...

Ya the activation mag is a tell, though I have seen some positional features with lower mag, I think?

bitter turtle Aug 22, 2023, 10:11 PM

#

I think for future activation editing investigations (inc potentially investigations to work into this current paper if we still have time after I get back from holiday), it'd be really useful to have a sweep at 4xdict size and some l1 value done of pythia-410m or something of equivalent scale, so I might set one off shortlyish; @keen pivot @pallid current do you have any similar requirements/wants wrt sweeps of larger models?

pallid current Aug 22, 2023, 10:20 PM

#

i dont have any particular needs but agree that sweeps of larger models would be good

#

my main ask would be to include some quite large dict sizes and make sure to train for a long time to make sure the bigger ones aren't undertrained, to see if we can get a sense for where diminishing returns to size comes in for larger models

keen pivot Aug 22, 2023, 11:15 PM

#

pallid current my main ask would be to include some quite large dict sizes and make sure to tra...

Like train for 100 chunks, save every 5?

pallid current Aug 22, 2023, 11:17 PM

#

keen pivot Like train for 100 chunks, save every 5?

yeah i would want to see the equivalent of 20-30 chunks for 70m so yeah probably up to 100 accounting for larger activations and maybe slower convergence, and making sure that we're tracking mmcs and loss/sparsity over time to see if we're missing out

pallid current Aug 23, 2023, 7:26 AM

#

been testing some variants of forcing the mlp directions to strictly be in the positive quadrant (and bumping up all the inputs by min(gelu))

#

loss curves are an absolute rollercoaster, and generally terrible in terms of convergence speed but i do kinda suspect there's something there

#

notably, that purple line is 2x overcomplete, competitive in loss (eventually, after being about 10,000x worse for the first 50 epochs 😅) with normal runs, and maintains about 100% live features

#

v possible its learning the identity or some degen solution tho

pallid current Aug 23, 2023, 9:07 PM

#

turns out it was learning some weird degen solution but one which meant that all the autointerp came out as 'newlines/periods', at least for the nonzero l1 runs

#

much to ponder, but leaving it for now

keen pivot Aug 24, 2023, 2:22 PM

#

pallid current turns out it was learning some weird degen solution but one which meant that all...

Newlines/periods is like the outlier dimension for Pythia models

bitter turtle Aug 25, 2023, 4:43 PM

#

@cosmic yarrow that kind of thing is absolutely something I'm interested in doing, feel free to discuss in here

bitter turtle Aug 26, 2023, 3:14 PM

#

bitter turtle @cosmic yarrow that kind of thing is absolutely something I'm interested ...

(wait, do people receive pings if they are in threads?)

cosmic yarrow Aug 26, 2023, 3:27 PM

#

bitter turtle (wait, do people receive pings if they are in threads?)

I didn't receive a ping but very glad I checked this thread! I currently have a decent amount of free time and would be happy to help in whatever way.

cosmic yarrow Aug 26, 2023, 3:42 PM

#

cosmic yarrow I didn't receive a ping but very glad I checked this thread! I currently have a ...

I've been adapting Yun's code for this, but if you are thinking of having these experiments as part of the existing codebase for this project I'd be happy to ditch my codebase and work on a fork or contribute to your efforts.

bitter turtle Aug 26, 2023, 4:07 PM

#

I mean I'm kind of dissatisfied with the current codebase as-is, I'd like to clean it up/redesign things to have more flexibility in general. I'm also on holiday for about two weeks at the moment, so I won't be working on it in that time. It probably makes sense to integrate it with the current codebase though, yeah

#

I think in general I at least am kind of unsure what will happen after we finalise the current paper

cosmic yarrow Aug 26, 2023, 5:01 PM

#

bitter turtle I think in general I at least am kind of unsure what will happen after we finali...

Good to know! I'll check back in in two weeks and we can discuss more then—or whenever you're ready. In the meantime I'll continue refactoring the code I have so it's in a better place to integrate, in case that's what ends up happening. Have a nice vacation!

pallid current Aug 26, 2023, 9:58 PM

#

hey @cosmic yarrow, what kind of experiments are you thinking about doing?

bitter turtle Aug 27, 2023, 7:29 AM

#

It was looking into reproducing/extending Yun's paper more, either by doing multiple layer dictionaries or using the FISTA/solve dict/basically K-SVD iterative method described in the paper.

gilded merlin Aug 28, 2023, 2:17 PM

#

How do I get involved in this project?

cosmic yarrow Aug 28, 2023, 3:01 PM

#

pallid current hey @cosmic yarrow, what kind of experiments are you thinking about doing...

My overall goal is to see if some of the circuit tracing/causal tracing methods can be adapted to explore these dictionaries. The first thing I wanted to do is extend Yun's method by also training a dictionary for the mid-residual stream, right after attention. And then, using the COUNTERFACT dataset, adapt methods like Geva et al https://arxiv.org/pdf/2304.14767.pdf for tracing information flow across the sparsified layers. This is all very hand-wavy for now, apologies if it doesn't make sense.

shell mural Aug 28, 2023, 3:06 PM

#

oh hey. me and a couple others are thinking about the same stuff. we're diving into mor geva's work rn

#

me, @hallow wyvern, @onyx compass, @frank vortex

#

im setting up notebooks for causal intervention techniques w/ counterfact dataset with narmeen, firstuserhere is getting a head start on geva's work, faulsname spent time a few weeks back getting familiar with the dictionary learning codebase

#

we should chat

cosmic yarrow Aug 28, 2023, 3:09 PM

#

shell mural we should chat

Yes, definitely, would love to be involved in this effort!

bitter turtle Aug 28, 2023, 3:13 PM

#

shell mural im setting up notebooks for causal intervention techniques w/ counterfact datase...

I've got some causal ablation type stuff on my fork of the GitHub repo if that's helpful for you guys. TL;DR of as far as I got is that

our features are decent for e.g. concept erasure at early layers (on pythia-70m)
we can identify like a single feature which is responsible for IOI (by which I mean 'if you change this feature on the corrupted activation to its activation on the clean data you get like 60% of the way to the behaviour on the clean data')
you almost certainly want to use a dictionary set for a model better than pythia-70m lol
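
The single-feature patching described in the IOI point can be sketched roughly as below. This assumes a tied ReLU autoencoder (encoder weights equal to the dictionary `D`, with rows as features, plus a bias `b`); `encode` and `patch_feature` are illustrative names, not functions from the repo.

```python
import numpy as np

def encode(x, D, b):
    """Tied ReLU encoder: feature activations for activations x of shape (batch, d_model)."""
    return np.maximum(x @ D.T + b, 0.0)

def patch_feature(x_corrupt, x_clean, D, b, idx):
    """Set feature `idx` on the corrupted activations to its clean-run value.

    The edit is applied as a shift along the decoder direction D[idx], so the
    part of x_corrupt that the dictionary fails to reconstruct is left untouched.
    """
    delta = encode(x_clean, D, b)[:, idx] - encode(x_corrupt, D, b)[:, idx]
    return x_corrupt + delta[:, None] * D[idx]
```

Editing via a shift rather than re-decoding the full sparse code is one way to sidestep imperfect reconstructions, at the cost of no longer being an exact edit in feature space when the dictionary directions are not orthogonal.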

#

Also if I were to do it again I would mean-center the activations for every layer

#

You also probably don't want to directly convert ACDC, the number of graph edges is absolutely absurd for even moderate dictionary sizes

#

Oh! Also I ran into a bunch of annoyances with imperfect reconstructions @cosmic yarrow absolutely foresee FISTA at least partially solving those. I would train with FISTA like Yun et al if I were to do it again

cosmic yarrow Aug 28, 2023, 3:21 PM

#

bitter turtle Oh! Also I ran into a bunch of annoyances with imperfect reconstructions <@84608...

Sorry for not remembering, but did you all work with methods that allow for the dictionary dimension to exceed the hidden dim?

gilded merlin Aug 28, 2023, 3:39 PM

#

someone share the github link of this project, i have done some work in interpretability area, the project seems interesting and i want to work on it if there are some ideas to be tested

keen pivot Aug 28, 2023, 4:20 PM

#

gilded merlin someone share the github link of this project, i have done some work in interpre...

https://github.com/HoagyC/sparse_coding

#

Though we should update the readme.

#

I have loads of ideas to test! Let me get to the office and I can send a list.

gilded merlin Aug 28, 2023, 4:23 PM

#

okies, meanwhile i'll get the hang of the project repository

bitter turtle Aug 28, 2023, 4:44 PM

#

cosmic yarrow Sorry for not remembering, but did you all work with methods that allow for the ...

Yes, we used ratios from 0.5-32x the number of hidden dimensions. However, we used a kind-of-shit linear autoencoder which was kind of overzealous with eliminating noise, and so we got imperfect reconstructions

#

We got these kinds of curves in terms of activations-per-example against reconstruction loss.

#

FISTA should converge to better (more exact) solutions in the highly sparse regime, is my thinking

#

You can squint Very Hard and see our encoders as kind of being almost a single iteration of (L)ISTA, to give you an idea of how 'good' our encoders are
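
To make the comparison concrete, here is a minimal FISTA sketch for inferring sparse codes against a fixed dictionary. It assumes the standard lasso objective 0.5 * ||x - c D||^2 + lam * ||c||_1 with rows of `D` as features and signed codes; it is not Yun et al.'s implementation, and the names are illustrative.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def fista(x, D, lam, n_iter=100):
    """Sparse code c for a single activation x (shape (d_model,)) given D (n_feats, d_model)."""
    L = np.linalg.norm(D, ord=2) ** 2      # Lipschitz constant of the gradient
    c = np.zeros(D.shape[0])
    y, t = c.copy(), 1.0
    for _ in range(n_iter):
        grad = (y @ D - x) @ D.T           # gradient of the squared-error term at y
        c_next = soft_threshold(y - grad / L, lam / L)
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = c_next + ((t - 1.0) / t_next) * (c_next - c)   # momentum step
        c, t = c_next, t_next
    return c
```

Starting from zero, the first iteration computes `soft_threshold(x @ D.T / L, lam / L)`, i.e. a linear map followed by a thresholding nonlinearity. That is the sense in which a tied linear-plus-ReLU encoder with a negative bias resembles a single (one-sided) ISTA step.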

cosmic yarrow Aug 28, 2023, 4:48 PM

#

bitter turtle You can squint Very Hard and see our encoders as kind of being almost a single i...

yeah, makes sense that running several optimization steps of ISTA or FISTA would help

keen pivot Aug 28, 2023, 5:15 PM

#

cosmic yarrow yeah, makes sense that running several optimization steps of ISTA or FISTA woul...

I bet against this, but would love for someone to run the experiment!

bitter turtle Aug 28, 2023, 5:17 PM

#

keen pivot I bet against this, but would love for someone to run the experiment!

Interesting, what's your thought process?

keen pivot Aug 28, 2023, 5:18 PM

#

bitter turtle Interesting, what's your thought process?

Didn’t you do a multi layer LISTA?

bitter turtle Aug 28, 2023, 5:18 PM

#

I think that failed because of accursed convergence/not enough iterations etc etc.

keen pivot Aug 28, 2023, 5:19 PM

#

bitter turtle I think that failed because of accursed convergence/not enough iterations etc et...

Oh, then I no longer bet against this! Lol

bitter turtle Aug 28, 2023, 5:19 PM

#

I think FISTA with our current dicts would be good, less sure (but still pretty sure) about dicts trained with FISTA.

#

Oh lol

keen pivot Aug 28, 2023, 5:20 PM

#

bitter turtle I think FISTA with _our current dicts_ would be good, less sure (but still prett...

Initializing with our current dicts?

#

I was kind of expecting doing the KL thing to close the gap (at least for functional equivalence), but a better solver would be complementary to this.

bitter turtle Aug 28, 2023, 5:26 PM

#

keen pivot Initializing with our current dicts?

No, even with just using FISTA to generate the sparse codes

keen pivot Aug 28, 2023, 5:27 PM

#

bitter turtle No, even with just using FISTA to generate the sparse codes

And still just a linear layer for the decoder? I saw Yun do some hessian thing for that part as well

pallid current Aug 28, 2023, 5:27 PM

#

i still do have some feeling that our autoencoder methods should be helpful in that they should more closely track which features can be easily pulled out by a linear layer. though tbh the residual stream has so much capacity that it's quite likely this is barely an issue

bitter turtle Aug 28, 2023, 5:27 PM

#

keen pivot I was kind of expecting doing the KL thing to close the gap (at least for functi...

Hmm, I'm not so sure actually! I think it might close the gap slightly, in the lens of KL div, but I think there'll be some leftover

bitter turtle Aug 28, 2023, 5:28 PM

#

pallid current i still do have some feeling that our autoencoder methods should be helpful in t...

I'm kind of viewing it as more 'we are using sparsity to disentangle the latents' rather than anything mechanistic ATM tbh

bitter turtle Aug 28, 2023, 5:28 PM

#

keen pivot And still just a linear layer for the decoder? I saw Yun do some hessian thing f...

Yeah, just linear decoder, what was the 'Hessian thing'?

bitter turtle Aug 28, 2023, 5:31 PM

#

pallid current i still do have some feeling that our autoencoder methods should be helpful in t...

Yeah I also think the algo for feature extraction, if that is indeed what's happening in MLPs, could be slightly different. You could condition on certain features being 'off' rather than just eliminating noise with a bias, which might partially explain why untied might work better (do we see this? No idea, can't remember, maybe a little in certain circumstances)

pallid current Aug 28, 2023, 5:35 PM

#

btw one thing that i think we've underexplored so far is whether there's actually a maximum number of features that we tend to find. like we often see that with mlp we don't see the largest ratio having the largest number of active features. curious if that's eventually also the case with residual stream at some size, like it seems d(active_feats)/d(num_feats) is declining at 32x, some layers more than others, but we havent checked if it ever hits 0 for residual stream

#

running 64 and 96 on gpt2sm to test this, though should prob go back to pythia70m because the extra dims hurt a lot for this

keen pivot Aug 28, 2023, 5:39 PM

#

bitter turtle Yeah, just linear decoder, what was the 'Hessian thing'?

This code: https://github.com/zeyuyun1/TransformerVis/blob/main/sparsify_PyTorch.py#L16

Which is used here on the dictionary (ie decoder): https://github.com/zeyuyun1/TransformerVis/blob/main/train.py#L172

bitter turtle Aug 28, 2023, 5:49 PM

#

Oh, I think that's just how they optimise the dictionary.

keen pivot Aug 28, 2023, 5:52 PM

#

@gilded merlin, just my general list of things-to-do:

1. Learn circuits for many target tasks: adversarial examples, chess/othello, in-context learning, deception, truthfulness, sycophancy, etc.

   Like the dictionaries aren't perfect atm, but trying to extract circuits from real things will still inform what heuristics to use (then better dictionaries can be slotted in later).

2. Better dictionaries: FISTA stuff above (& I want to chat w/ some Harvard people here who do dictionary learning as their research once our paper's on arxiv) and KL-divergence penalty (this is what I'm doing this week).
3. Activation engineering using our learned dictionary features
4. Better automatic circuit detection: (A) how to go forward (e.g. layer 3->4), (B) connect w/ dictionaries learned on MLP-out & Attn-out, (C) weight-based connections (ie features in residual connect to the features in MLP, and weights connect them. Can you predict this from just the weights?)
5. Connecting circuits learned w/ datapoints. How do datapoints lead to learned features/dictionaries? This could even be paired up w/ developing Deep Learning theories since it's easier to develop theories when you have specific examples of circuit-formation.
6. Refactoring code for my manual interp stuff & make a standalone colab notebook that can load in a dictionary from e.g. hugging face
7. Optimization help: code in perplexity check every N batches, fix wandb display (or some equivalent), have a changing l1 value to specify a set sparsity (ie features/token)
8. Updated Github w/ minimal code to run on a new model & look at it in detail.

keen pivot Aug 28, 2023, 5:53 PM

#

bitter turtle Oh, I think that's just how they optimise the dictionary.

Optimize the dictionary for what? (I'm assuming reconstruction)

#

For this week, I'm going to get that KL thing working & get the perplexity check coded up as well.

gilded merlin Aug 28, 2023, 5:59 PM

#

@keen pivot can the experiment in the repository be easily run on a colab?

keen pivot Aug 28, 2023, 6:00 PM

#

gilded merlin @keen pivot does the experiment in the repository can be easily run on...

For training a dictionary, no.

gilded merlin Aug 28, 2023, 6:00 PM

#

which of the above tasks can be done on colab?

keen pivot Aug 28, 2023, 6:00 PM

#

It was possible at one point, and it's on the to-do to make them so

#

A colab notebook is going to not be so great because you need other files.

#

I think (6) is the one for this then

gilded merlin Aug 28, 2023, 6:03 PM

#

keen pivot It was possible at one point, and it's on the to-do to make them so

yeah i know, but do you need a gpu for this?

keen pivot Aug 28, 2023, 6:04 PM

#

gilded merlin yeah i know, but do you need a gpu for this?

You very much need a GPU for it to go reasonably, and the Colab one should work

#

But the current repo is optimized for like multiple GPUs and having the repo loaded, which is doable to put in a colab

#

But will be difficult for pushing useful PRs

#

You could convert this file: https://github.com/loganriggs/sparse_coding/blob/main/lucia_and_lovis.ipynb

To a notebook. You need to have the dictionary loaded from pythia160m, and the autoencoders folder from: https://github.com/HoagyC/sparse_coding

and I'm unsure how to import that to a notebook.

gilded merlin Aug 28, 2023, 6:07 PM

#

so task 6 is for ease of demonstration basically

keen pivot Aug 28, 2023, 6:07 PM

#

Yep

#

I will say vast.ai is quite cheap compute & only takes a day or so to learn.

#

It would help!

gilded merlin Aug 28, 2023, 6:11 PM

#

Will start on that, if any problems come up i will ping you

#

https://arxiv.org/abs/2307.01201

arXiv.org

Schema-learning and rebinding as mechanisms of in-context learning ...

In-context learning (ICL) is one of the most powerful and most unexpected
capabilities to emerge in recent transformer-based large language models
(LLMs). Yet the mechanisms that underlie it are poorly understood. In this
paper, we demonstrate that comparable ICL capabilities can be acquired by an
alternative sequence prediction learning method ...

#

Have you seen this paper? In the above list you mentioned learning circuits for target tasks, e.g. in-context learning

#

this paper might give some insight or food for thought regarding in-context learning, i read it recently

bitter turtle Aug 28, 2023, 10:55 PM

#

pallid current running 64 and 96 on gpt2sm to test this, though should prob go back to pythia70...

When will this be done/how did it go?

pallid current Aug 28, 2023, 10:59 PM

#

finished now (for a couple of layers), havent had a chance to look at results yet, will v soon tho, currently getting the proper results for the correlation btwn interp scores and e.g. kurtosis and skew. took longer than expected cos needed a batched version of the moment calculations which was a little awkward
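
One way a batched moment calculation like this could look, assuming per-feature activations arrive in chunks of shape (batch, n_feats). The class name is hypothetical, and accumulating raw power sums is only one option (it is simple but can lose precision when means are large relative to the spread).

```python
import numpy as np

class StreamingMoments:
    """Accumulates per-feature power sums so skew and kurtosis can be computed across chunks."""

    def __init__(self, n_feats):
        self.n = 0
        self.sums = np.zeros((4, n_feats))   # rows: sum(x), sum(x^2), sum(x^3), sum(x^4)

    def update(self, acts):
        self.n += acts.shape[0]
        for k in range(4):
            self.sums[k] += (acts ** (k + 1)).sum(axis=0)

    def skew_kurtosis(self):
        m1, m2, m3, m4 = self.sums / self.n  # raw moments E[x^k]
        var = m2 - m1 ** 2
        mu3 = m3 - 3 * m1 * m2 + 2 * m1 ** 3                    # third central moment
        mu4 = m4 - 4 * m1 * m3 + 6 * m1 ** 2 * m2 - 3 * m1 ** 4  # fourth central moment
        # Features with zero variance (e.g. dead features) give nan here.
        return mu3 / var ** 1.5, mu4 / var ** 2
```

The kurtosis returned is the non-excess (Pearson) version, so a Gaussian would score 3.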

pallid current Aug 28, 2023, 11:33 PM

#

now running results, running it for lots of different chunks so its mega slow, results in about an hr

pallid current Aug 29, 2023, 12:10 AM

#

hmm the number of feats just keeps growing

#

will switch to pythia 70M and keep cranking it up

#

might have to keep turning the batches up as well, this is 32 chunks, and the time taken to converge to this shape gets higher as you increase the size

#

worried that the 10chunk 32x dicts were a bit undertrained

#

now setting off 32, 64, 128 and 256 on pythia70M for 64 chunks. time might be way too long and might have to settle for a single layer, results are generally v consistent across non final layers

#

hmm eta like 5 days :(( will restrict to layer 2

#

still will have to wait like 2 days for long enough 256x results

#

remind me on wednesday morn to check the 256x results lol, other ones will be done by morn

bitter turtle Aug 29, 2023, 5:43 PM

#

pallid current remind me on wednesday morn to check the 256x results lol, other ones will be do...

Holy shit pahaha; we should look into ways to optimise this for scaling purposes maybe

pallid current Aug 29, 2023, 6:58 PM

#

got some level of results from the larger dict sizes. not seeing any reduction in the ability of the model to incorporate more and more features, even as the % of active features keeps falling. these results are a bit odd in that they dont show as much of a dropoff towards high or low values, which we see in both the gpt2sm results just above and also in the previous pythia70m runs.

#

256 is almost certainly undertrained (still running) this is after 32 chunks

bitter turtle Aug 29, 2023, 7:05 PM

#

How do they look on the reconstruction-sparsity plot

pallid current Aug 29, 2023, 7:19 PM

#

bitter turtle How do they look on the reconstruction-sparsity plot

OOM trying to answer this lmao

bitter turtle Aug 29, 2023, 7:31 PM

#

Eel

#

Eek

#

Take the mean over multiple batches or something?

pallid current Aug 29, 2023, 8:17 PM

#

sorry went for lunch but i can just take the sample size down

#

results are really annoying though bc they dont match up with the previous small batch runs

#

pain to see but there's no improvement at these sizes. you can see that 32-128 are on top of each other and 256 is undertrained

keen pivot Aug 29, 2023, 8:27 PM

#

Is that l1 in the legend? Edit: Wait it can't be cause we have different sparsities. What's in the legend? @pallid current

pallid current Aug 29, 2023, 8:37 PM

#

really dont understand why sparsity isnt improving

#

legend is ratio and the total number of feats

#

from 1k to 131k

bitter turtle Aug 29, 2023, 8:39 PM

#

pallid current pain to see but there's no improvement at these sizes. you can see that 32-128 a...

I really feel like we will see vastly different results if we use better encoders (like e.g. FISTA) and we'll get better translation from dict size to FVU/sparsity; I think maybe part of what's happening is that our current models need to be more zealous with their cutoff the more directions we have, and so they don't necessarily improve FVU.
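
For anyone reading along, the two axes being discussed can be sketched as follows: FVU (fraction of variance unexplained) and the mean number of active features per token. The function names are illustrative and the exact definitions used in the project plots may differ.

```python
import numpy as np

def fvu(x, x_hat):
    """Fraction of variance unexplained: 0 is perfect, 1 matches predicting the mean."""
    resid = ((x - x_hat) ** 2).sum()
    total = ((x - x.mean(axis=0)) ** 2).sum()
    return float(resid / total)

def mean_active_features(codes, eps=0.0):
    """Average number of features with activation above eps per token (an L0 sparsity measure)."""
    return float((codes > eps).sum(axis=1).mean())
```

If larger dictionaries land on the same point in this plane, that matches the "32-128 are on top of each other" observation: neither FVU nor features per token has improved.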

#

Are these tied or untied?

bitter turtle Aug 29, 2023, 8:41 PM

#

bitter turtle I really feel like we will see vastly different results if we use better encoder...

Concretely I'd expect to see some increase in bias amount the larger we go, and also see better FVU-sparsity given dict size with untied models because then they can do stuff like 'activate if this other feature is not active'

pallid current Aug 29, 2023, 8:45 PM

#

bitter turtle Are these tied or untied?

tied

pallid current Aug 29, 2023, 8:51 PM

#

bitter turtle I really feel like we will see vastly different results if we use better encoder...

ok well if thats your guess we'd better start testing that - i thought you had some early results from FISTA a couple of months ago and it was underwhelming

bitter turtle Aug 29, 2023, 8:51 PM

#

That was LISTA and that's because it converged badly

#

Haven't actually tried regular non-neural-net FISTA

pallid current Aug 29, 2023, 8:52 PM

#

i dont understand why these larger models - with many more active features! - dont increase the bias and become more specific though

#

tfw i dont have a username for yann.lecun.com https://yann.lecun.com/exdb/publis/pdf/gregor-icml-10.pdf

bitter turtle Aug 29, 2023, 8:53 PM

#

Well, I guess if they did too much they'd have really shit reconstruction, because we use a ReLU for thingy

pallid current Aug 29, 2023, 8:56 PM

#

yeah but in that case we don't have a mechanism for why they would even bother turning on more features. like if it's not improving reconstruction loss, and there aren't fewer features active per token, then that's all of the losses so what's the value in these extra feats?

keen pivot Aug 29, 2023, 8:57 PM

#

pallid current yeah but in that case we don't have a mechanism for why they would even bother t...

This is where the clustering experiment would be useful

bitter turtle Aug 29, 2023, 8:58 PM

#

pallid current yeah but in that case we don't have a mechanism for why they would even bother t...

Initialisation or something? Since the bias is always initialised to about 0 it might be closer to initialisation to spread everything out

#

Also think we should try this type of activation out further

bitter turtle Aug 29, 2023, 8:59 PM

#

pallid current tfw i dont have a username for yann.lecun.com https://yann.lecun.com/exdb/publis...

Smh what the hell

pallid current Aug 29, 2023, 9:01 PM

#

bitter turtle Also think we should try this type of activation out further

so like adding back in the bias but with a bit of additional smoothness? makes sense to me

bitter turtle Aug 29, 2023, 9:07 PM

#

Ye exactly

#

I think I have a thing on my branch/maybe your branch called `Thresholding` or something

keen pivot Aug 29, 2023, 9:15 PM

#

The KL stuff is going well. I think we can tradeoff some reconstruction loss for KL/perplexity, which I think is what we really want (ie features that have intuitive, causal effects)
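
A rough sketch of what a KL term like this might look like, under the assumption that it compares the model's next-token distribution on the original activations with the distribution obtained after splicing in the autoencoder's reconstruction. The function names and the way the terms are combined in `total_loss` are hypothetical, not the actual training code.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def kl_divergence(logits_orig, logits_recon):
    """KL(p_orig || p_recon) per position, computed from logits."""
    logp = log_softmax(logits_orig)
    logq = log_softmax(logits_recon)
    return (np.exp(logp) * (logp - logq)).sum(axis=-1)

def total_loss(recon_mse, l1_norm, kl, l1_coef, kl_coef):
    """Reconstruction + sparsity + KL penalty; the coefficients set the tradeoff."""
    return recon_mse + l1_coef * l1_norm + kl_coef * kl
```

Raising `kl_coef` trades some reconstruction error for closer agreement with the model's output behaviour, which is the tradeoff described above.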

bitter turtle Aug 29, 2023, 9:16 PM

#

bitter turtle Initialisation or something? Since the bias is always initialised to about 0 it ...

I guess to test this you could do runs with x biases initialised at 0 and the rest at higher values or something and see how it changes the final no. of active features

pallid current Aug 29, 2023, 10:00 PM

#

bitter turtle I think I have a thing on my branch/maybe your branch called `Thresholding` or s...

oh that rings a bell. i also have a bias-added-back (without smoothing) lying around somewhere, i dont think properly tested

keen pivot Aug 29, 2023, 10:17 PM

#

Note: I need to add to my future work list: monosemanticity metric.

If we had a dataset that had very basic features (maybe just token level) and made a metric on how monosemantic it is (maybe defined by a weighted histogram measure?), this may also inform how monosemantic other types of features are.

It would also be a cheap test for checking if our dictionaries at different sparsities/hyperparams are more monosemantic
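
One purely illustrative reading of the "weighted histogram" idea at the token level: build an activation-weighted histogram over tokens for a feature and report how much of the mass sits on the single most common token. This is an assumed definition for discussion, not an established metric, and `top_token_mass` is a made-up name.

```python
from collections import defaultdict

def top_token_mass(tokens, activations):
    """Share of a feature's total activation mass that falls on its top token.

    tokens: sequence of token strings; activations: matching feature activations.
    Returns a value in (0, 1], or None if the feature never fires.
    """
    hist = defaultdict(float)
    for tok, act in zip(tokens, activations):
        if act > 0:
            hist[tok] += act         # weight each token by how strongly the feature fires
    total = sum(hist.values())
    if total == 0:
        return None                  # dead feature on this dataset
    return max(hist.values()) / total
```

A score of 1.0 would mean the feature only ever fires on one token. Averaging this over features gives a cheap number to compare dictionaries at different sparsities, though it would penalise features that are monosemantic at a level above single tokens.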

pallid current Aug 29, 2023, 10:40 PM

#

made a few notes from the textbook im reading now of things particularly relevant to sparse coding, welcome to read here, super messy, will prob keep adding to it https://docs.google.com/document/d/1MzSS2EFXtva5uTWxl7KGjSs_t3Z5mv7yoM2bMsZjATs/edit?usp=sharing

Google Docs

high dim models textbook notes

Notes for sparse coding on High Dimensional Data Analysis with Low Dimensional Models by John Wright and Yi Ma Qu, Zhai, Li et al - Analysis of the Optimization Landscapes for Overcomplete Representation Learning They study sparse overcomplete dictionary learning, show that problems can be formu...

bitter turtle Aug 29, 2023, 11:12 PM

#

I think I also want to scale up the concept erasure stuff to a more capable model, do you think we could do a run on pythia-410m? At maybe like 4x dim size and a few L1 levels over all layers?

#

Or, if not all layers, maybe every other one

pallid current Aug 29, 2023, 11:41 PM

#

bitter turtle I think I also want to scale up the concept erasure stuff to a more capable mode...

yeah def doable, most of the gpus are free atm, will set off

pallid current Aug 30, 2023, 12:29 AM

#

ok currently running 8 l1 sweep of pyth410, 80 chunks, layers [0,2,4,6,8,10,12]

keen pivot Aug 30, 2023, 1:54 PM

#

pallid current made a few notes from the textbook im reading now of things particularly relevan...

Oh this is cool! Thanks for sharing

bitter turtle Aug 31, 2023, 8:02 AM

#

I think maybe we should check out using VAEs (or most likely, something similar with also-useful subcomponents) for this, like we might be able to achieve more robust/in-distribution activation engineering by sampling from e.g. the latent distribution conditioned on the 'deceptiveness' direction being higher than some amount (obviously an oversimplification, you'd also want to preserve other properties of an activation if you were to edit it, and you'd probably find directions in latent-space to preserve semi-autonomously)

(Also have some friends in Bristol that have maybe been wanting to work with this for a bit, @thorny cypress and @coarse flint)

#

I can't actually think of that many benefits that doing a search over conditioned latent space has over searching over e.g. sphered data for a point with a certain magnitude in a certain direction and minimum distance to some other target point like you might do in concept editing

#

Unless, like, "covariance doesn't capture sparsity well" or something

bitter turtle Aug 31, 2023, 9:05 AM

#

Lol #1146607658179252286; Time to read all papers posted in this channel ever

bitter turtle Aug 31, 2023, 12:14 PM

#

Right it seems the exact thing you want here is the normalising flow autoencoder that @opal basin mentioned ages ago.

opal basin Aug 31, 2023, 12:14 PM

#

ohh?

#

i never did try guided sampling with it

#

but it might work!

#

also you can try training a diffusion model in a normal autoencoder/low-beta VAE latent space to sample from it and guiding the diffusion model sampling process to update the prior from the diffusion model with the evidence from the criterion, guided sampling totally works for diffusion

glass tinsel Aug 31, 2023, 6:27 PM

#

wait did you guys do the end to end training yet

glass tinsel Aug 31, 2023, 6:28 PM

#

bitter turtle Right it seems the exact thing you want here is the normalising flow autoencoder...

have you seen the BERTflow paper https://arxiv.org/abs/2011.05864

arXiv.org

On the Sentence Embeddings from Pre-trained Language Models

Pre-trained contextual representations like BERT have achieved great success
in natural language processing. However, the sentence embeddings from the
pre-trained language models without fine-tuning have been found to poorly
capture semantic meaning of sentences. In this paper, we argue that the
semantic information in the BERT embeddings is not...

bitter turtle Aug 31, 2023, 6:29 PM

#

glass tinsel wait did you guys do the end to end training yet

@keen pivot was working on it. In a week or so when I get back from holiday I want to do more with better autoencoders since ours are quite inaccurate at the moment.

bitter turtle Aug 31, 2023, 6:29 PM

#

glass tinsel have you seen the BERTflow paper https://arxiv.org/abs/2011.05864

No, interesting

keen pivot Aug 31, 2023, 6:58 PM

#

glass tinsel wait did you guys do the end to end training yet

Yep! Very basic result:
The dictionaries have much better perplexity for the same sparsity (though at the cost of reconstruction loss), but don't 100% match the original model's perplexity.

glass tinsel Aug 31, 2023, 6:58 PM

#

who cares about reconstruction loss 😛

keen pivot Aug 31, 2023, 6:58 PM

#

I haven't checked the individual features that are different between them.

#

One general problem for our work is determining the "goodness" of the model.

I'm currently working on an automated monosemanticity metric for both the input activations & output. It's only on single-token features, but monosemantic on single-token-level features may imply monosemantic on other features.

I'm hoping this helps us actually measure what dictionaries are "better" or not.

bitter turtle Aug 31, 2023, 7:13 PM

#

keen pivot Yep! Very basic result: The dictionaries have much better perplexity for the sam...

Hehe I'm still predicting significant perf improvements (on both perplexity and reconstruction loss) when training better autoencoders, ours aren't even capable of accurately reconstructing data at the moment. I do worry that we'll get some accursed less-monosemantic solution if we switch autoencoder though, maybe something about the current setup incentivises monosemanticity compared to using FISTA for a given sparsity level

keen pivot Aug 31, 2023, 8:07 PM

#

bitter turtle Hehe I'm still predicting significant perf improvements (on both perplexity and ...

I think if you get better perplexity, reconstruction, AND similar sparsity, I expect it to be monosemantic.

bitter turtle Aug 31, 2023, 8:36 PM

#

Yes it's kind of a small worry, but I'm just slightly skeptical of the 'sparsity induces monosemanticity' story ATM I guess, partially because of the fact that bigger dicts don't work as well as you might expect ( @pallid current have you compared 'mean autointerp' or some weighting of that between dictionary sizes?)

pallid current Aug 31, 2023, 8:38 PM

#

yeah on the pythia70M i ran autointerp over different dict sizes, it's in the draft appendix atm

#

general pattern was for the first couple of layers, interp scores didnt change with dict size and for middle-latish the interp generally got worse

#

which is correlated with there being clearer improvements in sparsity/fvu when increasing dict size for those early layers, while the improvement is minimal from about layer 2

keen pivot Aug 31, 2023, 9:10 PM

#

bitter turtle Yes it's kind of a small worry, but I'm just _slightly_ skeptical of the 'sparsi...

Can you explain the connection with bigger dicts bad -> sparsity induces monosemanticity?

And to be clear, my claim is that for the same reconstruction loss & perplexity, a sparser solution is more monosemantic.

bitter turtle Aug 31, 2023, 9:29 PM

#

I now retract that, I got confused.

rancid summit Aug 31, 2023, 9:37 PM

#

are there plots of the pre activation distribution of values for some arbitrary autoencoder neuron, along with the negative biases (ie where the ReLU truncates it)?
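
Rough sketch of the numbers such a plot would need, on made-up data. Everything here is an assumption for illustration: `x` is random stand-in data, `w` is a random unit-norm encoder row, and `b` is a hypothetical learned negative bias. The only real content is that ReLU(pre_bias + b) is non-zero exactly when pre_bias > -b, so -b is where the truncation line would go on the histogram.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 32
x = rng.normal(size=(10_000, d_model))  # stand-in for model activations
w = rng.normal(size=d_model)
w /= np.linalg.norm(w)                  # one encoder row, normalised to unit length
b = -1.5                                # hypothetical learned negative bias

pre_bias = x @ w                        # encoder output before the bias is added
threshold = -b                          # ReLU(pre_bias + b) > 0  <=>  pre_bias > -b

# Histogram of pre-bias values; the threshold marks where the ReLU cuts it off.
counts, edges = np.histogram(pre_bias, bins=50)
frac_active = float((pre_bias > threshold).mean())
print(f"threshold={threshold:.2f}, fraction of tokens active={frac_active:.3f}")
```

With real dictionaries you would swap in actual activations and one row of the encoder, then draw a vertical line at `threshold` on the histogram.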

bitter turtle Aug 31, 2023, 9:38 PM

#

Not currently, we should definitely do that.

#

OpenAI has a different approach for finding directions that also used that visualisation, it'd be interesting to compare the location of our negative bias to theirs, however they set it.

bitter turtle Aug 31, 2023, 9:46 PM

#

keen pivot Can you explain the connection with bigger dicts bad -> sparsity induces monosem...

Ok, solid claims

I think some of the 'sparsity induces monosemanticity' phenomenon is 'our measures of monosemanticity are slightly gameable by sparse activations'
I think that switching to FISTA will initially reveal solutions with higher sparsity and reconstruction accuracy, but lower monosemanticity, because FISTA is just a much better 'encoder' (similarly to the topk-pca thing from before)

bitter turtle Aug 31, 2023, 9:53 PM

#

rancid summit are there plots of the pre activation distribution of values for some arbitrary ...

Ffs why are you Leo gao I thought you were two different people, very sorry for confusing!!

rancid summit Aug 31, 2023, 9:54 PM

#

no worries

keen pivot Aug 31, 2023, 9:55 PM

#

First attempt at a monosemanticity measure. The peak is at 3e-4, which doesn't seem right to me. Though I think I'm too hungry to explain what experiment I did specifically

pallid current Aug 31, 2023, 9:59 PM

#

token entropy or something?

keen pivot Aug 31, 2023, 10:02 PM

#

(Okay, getting food in 5 minutes)

I'm measuring monosemanticity on a token level, and looking at the features that activate for single tokens only (e.g. periods, newlines, commas, etc). I assume that all LLMs will dedicate features to these, and even if you scale up, you'll just get feature splitting (i.e. a feature that activated for all periods now splits into two features that activate on subsets of periods).

So I find these features & count the number of times they activate on that single token, divided by the total number of non-zero activations. Weighted means I account for how much that feature activates (e.g. "." with activation 8).
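
A minimal sketch of that score for one feature, assuming you already have the token at each position and the feature's activation there. The function name and the toy data are made up for illustration, and this is not the actual experiment code.

```python
from collections import defaultdict

def token_monosemanticity(tokens, activations, weighted=False):
    """Return (top_token, score) for a single feature.

    tokens      : list of token strings, one per position
    activations : list of floats, the feature's activation at each position
    weighted    : if True, sum activation values instead of counting positions
    """
    mass = defaultdict(float)
    for tok, act in zip(tokens, activations):
        if act > 0:  # only non-zero activations count
            mass[tok] += act if weighted else 1.0
    total = sum(mass.values())
    if total == 0:
        return None, 0.0  # dead feature: it never activates
    top = max(mass, key=mass.get)
    return top, mass[top] / total

tokens = [".", ".", ",", "the", ".", "\n"]
acts = [8.0, 6.0, 1.0, 0.0, 2.0, 1.0]
print(token_monosemanticity(tokens, acts))                 # 3 of 5 active positions are "."
print(token_monosemanticity(tokens, acts, weighted=True))  # 16 of 18 activation mass is on "."
```

A score near 1 means nearly all of the feature's activations (or activation mass, in the weighted case) land on one token.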

keen pivot Aug 31, 2023, 11:21 PM

#

Okay, for next todo's:

Verify these single-token features are indeed monosemantic in the low-l1 regime (in case of bugs in code)
Check other features to see if they're monosemantic across l1's

Would be nice: have guaranteed features in a slightly more complex model (like TRACR code in superposition decomposed), then we can check if those features are monosemantic for a given l1 value.

pallid current Sep 1, 2023, 4:21 PM

#

got centered data running, trying an r4 l3 run at the moment

polar violet Sep 1, 2023, 4:21 PM

#

logan was v cool irl

#

if people are curious

#

v chadgoose energy

bitter turtle Sep 1, 2023, 4:33 PM

#

lmao

bitter turtle Sep 1, 2023, 4:36 PM

#

pallid current got centered data running, trying an r4 l3 run at the moment

How are you doing this btw

pallid current Sep 1, 2023, 4:37 PM

#

im centering it when running setup_data. also normalizing variance. currently the means/stds aren't saved anywhere which needs to change but they're just the means and stds of the first chunk

bitter turtle Sep 1, 2023, 4:39 PM

#

As in, proper sphering/whitening? Are you decorrelating the data as well @pallid current?

pallid current Sep 1, 2023, 4:47 PM

#

bitter turtle As in, proper sphering/whitening? Are you decorrelating the data as well <@56694...

no, just subtract mean and divide by std

bitter turtle Sep 1, 2023, 5:10 PM

#

Hmm

#

I think you should probably be decorrelating as well. You can use the BatchedPCA to implement efficientish sphering
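
For reference, a small NumPy sketch of the difference between "subtract mean and divide by std" and full sphering. This uses a plain eigendecomposition of the covariance rather than the repo's BatchedPCA (whose interface isn't shown here), and the toy data and mixing matrix are invented for illustration.

```python
import numpy as np

def standardize(x):
    """Per-dimension: subtract the mean and divide by the std (no decorrelation)."""
    return (x - x.mean(axis=0)) / x.std(axis=0)

def whiten(x, eps=1e-8):
    """PCA sphering: centre, rotate onto covariance eigenvectors, rescale to unit variance."""
    centered = x - x.mean(axis=0)
    cov = centered.T @ centered / len(x)
    eigvals, eigvecs = np.linalg.eigh(cov)
    return (centered @ eigvecs) / np.sqrt(eigvals + eps)

rng = np.random.default_rng(0)
# Toy correlated "activations": independent noise mixed by a fixed matrix, plus an offset.
mix = np.array([[1.0, 0.8, 0.0],
                [0.0, 1.0, 0.5],
                [0.0, 0.0, 1.0]])
x = rng.normal(size=(5000, 3)) @ mix + 5.0

cov_std = np.cov(standardize(x), rowvar=False)
cov_white = np.cov(whiten(x), rowvar=False)
print(np.round(cov_std, 2))    # unit diagonal, but off-diagonal correlations remain
print(np.round(cov_white, 2))  # approximately the identity
```

Standardizing fixes each coordinate's scale separately, so the result depends on the basis; whitening removes the correlations as well.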

pallid current Sep 1, 2023, 5:16 PM

#

i'm open to trying it but don't understand the transformation well enough to have a feel for how that would interact with the features

#

gonna have to mostly pass it onto you, i'm on holiday from tomorrow and its MATS final presentations today

keen pivot Sep 1, 2023, 5:59 PM

#

keen pivot Okay, for next todo's: 1. Verify these single-token features are indeed monosema...

It is true that these high sparsity (e.g. 350/500 d_model) dictionaries have single-token level monosemantic features. It is also true that many, many other features are polysemantic. I've now got a few more ideas:

Different monosemantic datasets - build a dataset of a feature that's more complex than single-token level. single-token level may be low-hanging fruit for the model (especially since I'm doing quite common words). So doing more complex features may be a better measure of monosemanticity in general.
Measuring how much a feature is just copying a dimension in the residual stream - residual stream is (mostly) polysemantic. So we can figure out how much a feature's [activations/variation] can be explained by 1 neuron basis element (I feel like there's an established way of doing this, but not familiar with it). Additionally, instead of looking at 1 feature at a time, we could look at specific datapoints & see if their encoding is more like an identity or not.
For a given sparsity, S, that means a token has S features activating on average. We could qualitatively examine a few example sentences to find the features that are most reconstructing it. e.g. For the sentence " Of the 5 donuts, he ate all 5", we could find all features that activate for the last token (i.e. " 5"), and make statements like "30% of reconstruction is 'single digits' feature, 25% is 'repeated token', etc". Better dictionaries will just make more sense here.
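
One possible way to get the "X% of reconstruction" numbers from the last idea, sketched under an assumption: each feature's share is taken to be the projection of its contribution onto the full reconstruction, normalised so the shares sum to 1. That definition, the function name, and the toy dictionary are illustrative choices, not an established method.

```python
import numpy as np

def reconstruction_shares(codes, dictionary):
    """Share of the reconstruction attributable to each feature for one token.

    codes      : (n_features,) sparse feature activations
    dictionary : (n_features, d_model) decoder directions
    """
    contributions = codes[:, None] * dictionary  # each feature's vector contribution
    x_hat = contributions.sum(axis=0)            # full reconstruction
    # Project each contribution onto x_hat; the shares sum to 1 by construction.
    return contributions @ x_hat / (x_hat @ x_hat)

dictionary = np.array([[1.0, 0.0],
                       [0.0, 1.0],
                       [1.0, 1.0]])
codes = np.array([3.0, 0.0, 1.0])
shares = reconstruction_shares(codes, dictionary)
print(shares)  # 12/17 for feature 0, 0 for the inactive feature, 5/17 for feature 2
```

Shares can come out negative when a feature points against the overall reconstruction, which is worth keeping in mind before reading them as percentages.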

bitter turtle Sep 1, 2023, 8:11 PM

#

pallid current i'm open to trying it but don't understand the transformation well enough to hav...

Like, normalising variance component-wise is definitely not a basis-invariant operation and I guess I don't want to induce any privileged basis on the activations

pallid current Sep 1, 2023, 8:12 PM

#

hmm dyou reckon we should either just mean center or fully whiten then? i was gonna just center to begin but then i thought about those outlier dimensions and added the stds

bitter turtle Sep 1, 2023, 8:13 PM

#

I mean I'd probably want to try both tbh

pallid current Sep 1, 2023, 8:14 PM

#

also the run failed at like 64 epochs for some reason but there's dicts in /mnt/ssd-cluster/pythia70m_centered/.../_63 if anyone wants to check on them

bitter turtle Sep 1, 2023, 8:15 PM

#

Could you run one on just mean-centered data before you go on holiday I guess?

pallid current Sep 1, 2023, 8:15 PM

#

yup can do

bitter turtle Sep 1, 2023, 8:15 PM

#

pallid current also the run failed at like 64 epochs for some reason but there's dicts in `/mnt...

Lol how come

pallid current Sep 1, 2023, 8:16 PM

#

not sure, no proper error, it just kinda hung, said something about a wandb network error but not sure if that's symptom or cause

pallid current Sep 1, 2023, 8:16 PM

#

bitter turtle Could you run one on just mean-centered data before you go on holiday I guess?

yeah i can do that, though rn it's slightly slow to run any decent analysis on it with new chunks etc, you'll prob have to do that

pallid current Sep 1, 2023, 8:33 PM

#

speaking of slow feedback loops, did anyone look at the gpt-2-small results? they look really good!

#

clear gains up to 96x ratio, and look at the y-axis - it starts at 0.02!

bitter turtle Sep 1, 2023, 8:34 PM

#

What the fuck

#

This shit needs reporting and investigating what

pallid current Sep 1, 2023, 8:35 PM

#

i know! im pretty shocked

#

need to get the interp on it asap

#

cant believe im gonna go on holiday to NYC instead of grinding on this 😭

bitter turtle Sep 1, 2023, 8:36 PM

#

Initial hypothesis

pythia is incredibly uncentered, gpt2small isn't maybe? This seems easy to test, just look at the means of everything

#

I remember Neel Nanda saying something about this, this seems low-cost-to-test-and-potentially-very-high-value

pallid current Sep 1, 2023, 8:37 PM

#

yeah hang on wasn't there some plot of the centeredness of different models

#

cant immediately find anything for pythia, gpt2 looks well centered https://www.lesswrong.com/posts/BMghmAxYxeSdAteDc/an-exploration-of-gpt-2-s-embedding-weights

An exploration of GPT-2's embedding weights — LessWrong

I wrote this doc in December 2021, while working at Redwood Research. It summarizes a handful of observations about GPT-2-small's weights -- mostly t…

bitter turtle Sep 1, 2023, 8:39 PM

#

I vaguely remember looking at pythia s and it wasn't, might be hallucinating tho

#

I'm having strong words with my past self if the whole problem was mean centeredness

pallid current Sep 1, 2023, 8:44 PM

#

yeah me too this is a big mad

#

also looks like the mlp results are helped by it

bitter turtle Sep 1, 2023, 8:44 PM

#

That doesn't even make any sense 😕

pallid current Sep 1, 2023, 8:44 PM

#

this is latest result, 4x ratio

bitter turtle Sep 1, 2023, 8:45 PM

#

Time to rewrite everything

pallid current Sep 1, 2023, 8:45 PM

#

this is previous, mixed ratios

bitter turtle Sep 1, 2023, 8:45 PM

#

How long you in nyc for

pallid current Sep 1, 2023, 8:45 PM

#

not as stark but still big diff

#

10 days

bitter turtle Sep 1, 2023, 8:46 PM

#

A rewrite before ICLR is possible then maybe

pallid current Sep 1, 2023, 8:46 PM

#

yes for sure

#

we know what to do

bitter turtle Sep 1, 2023, 8:50 PM

#

Phew

bitter turtle Sep 1, 2023, 9:00 PM

#

pallid current also looks like the mlp results are helped by it

I predict these directions are less useful, meaning less monosemantic, otherwise my mental model for features in MLPs is broken, but tbh I didn't trust it before anyway

pallid current Sep 1, 2023, 9:13 PM

#

Yeah agreed from a theoretical perspective I don't understand it in the mlp, except maybe just to say that the mlp may be used to perform arbitrary function approx that isn't very tied to the neurons themselves, and this still exhibits a sparse structure

#

But I think that's mostly already true for our normal sc on mlps

#

Will have to actually interp and see

#

set off a run at ratio [4,8,16,32] residual stream centered all layers

#

still unit variance but not decorrelated (soz)

bitter turtle Sep 1, 2023, 9:34 PM

#

I'm kinda worried that doing unit variance without decorrelating will squish important info and maybe harm perf possibly idk

pallid current Sep 1, 2023, 9:55 PM

#

will have a quick look after the first model is done and see if it looks decent, will cancel and rerun without altering variance if not, otherwise will try set off in a couple days or maybe airport lol

bitter turtle Sep 1, 2023, 10:00 PM

#

Enjoy NYC btw

plucky bay Sep 1, 2023, 10:42 PM

#

bitter turtle Initial hypothesis - pythia is incredibly uncentered, gpt2small isn't maybe? This...

idly wondering whether

GPT-J and the Pythia-v0 models are also uncentered
whether gpt-neo models are centered

because something that pops out to me is the initialization scheme used for gpt-j and Pythia (section 2.1.3 of the neox-20b paper) is different than standard. this is just a drive-by observation, feel free to disregard 😅

bitter turtle Sep 1, 2023, 11:04 PM

#

plucky bay idly wondering whether 1. GPT-J and the Pythia-v0 models are also uncentered 2. ...

Ah, no that maybe rings a bell 🤔 might well be the reason, we shall see!

pallid current Sep 2, 2023, 5:14 PM

#

ok got some results back from large centered runs (including unit var, not fully whitened) on pythia70m, first impressions are that it's not doing anything too crazy

#

even on l5 surprisingly

#

i think the actual takeaway from the last day or two might be that gpt2sm >> pythia70m

#

plane is boarding now no time to test whitening but set off a run without zerovar just to compare

#

priority should be to run more tests on those large gpt2sm dicts tho imo

bitter turtle Sep 2, 2023, 7:04 PM

#

Second hypothesis: something something outliers. Do they differ significantly between the different models?

keen pivot Sep 2, 2023, 7:13 PM

#

pallid current clear gains up to 96x ratio, and look at the y-axis - it *starts* at 0.02!

Wait, wtf. I can look at these, and check their perplexity

keen pivot Sep 2, 2023, 7:13 PM

#

bitter turtle Second hypothesis: something something outliers. Do they differ significantly be...

Hypothesis on the difference in FVU between Pythia and GPT small?

#

They both have outliers, but I think Pythia also had them for the first delimiter (eg period/newline) but not GPT small

bitter turtle Sep 2, 2023, 7:30 PM

#

What about their magnitude

#

I guess I'd also be interested in FVU against sphered when trained on sphered

pallid current Sep 2, 2023, 7:45 PM

#

WAIT ughhh deleted the graphs above they are bs, i didnt pass the argument to the HF activation function only the baukit one 🤦‍♂️

pallid current Sep 2, 2023, 8:57 PM

#

norm(mean) by layer:

#

still worried i've done something wrong somehow but redone it with centered data (no unit variance this time) and not seeing any diff in l5 perf

keen pivot Sep 3, 2023, 9:17 PM

#

pallid current clear gains up to 96x ratio, and look at the y-axis - it *starts* at 0.02!

I'm not getting crazy low reconstruction loss to match this graph

bitter turtle Sep 3, 2023, 9:18 PM

#

Maybe crisis of confusion averted

#

Still think that philosophically speaking we should be centering/allowing learned center points but initialise to mean-center

keen pivot Sep 3, 2023, 9:35 PM

#

I did reconstruction, not FVU, but they should be similar

pallid current Sep 3, 2023, 9:35 PM

#

@keen pivot hahaha wait how? I can't go back and check the code rn but i dont even know what bug would get the fvu wrong

#

Unless the fvu isn't normalized properly when switching to gpt2sm?

#

Tho I think the most interesting thing was the fact that 96x was better than 64x, do you still see that?

keen pivot Sep 3, 2023, 9:37 PM

#

pallid current Tho I think the most interesting thing was the fact that 96x was better than 64x...

I have not noticed more scaling is better for perplexity at least!

bitter turtle Sep 3, 2023, 9:38 PM

#

pallid current Tho I think the most interesting thing was the fact that 96x was better than 64x...

Wasn't that also true for certain layers of pythia though?

keen pivot Sep 3, 2023, 9:38 PM

#

Oh, I'm not testing on data it was trained on, so maybe!

pallid current Sep 3, 2023, 9:38 PM

#

Hmm yeah layer 0 and 1 so yeah not that surprising for layer 2 gpt2sm tbf

keen pivot Sep 3, 2023, 9:44 PM

#

I'm doing layer 4

keen pivot Sep 3, 2023, 10:19 PM

#

pallid current Hmm yeah layer 0 and 1 so yeah not that surprising for layer 2 gpt2sm tbf

So the original image was just layers 0 & 1?

pallid current Sep 3, 2023, 10:34 PM

#

Original was 2 I'm pretty sure, there's only 1 layer which has the big dicts i think?

keen pivot Sep 3, 2023, 11:12 PM

#

pallid current Original was 2 I'm pretty sure, there's only 1 layer which has the big dicts i t...

Layers 2/4/6/8 all have it!

#

Not the best graph, but here's perplexities for layer 4

#

Oh note: I don't think the first column is original, because it shouldn't change I think (unless I'm shuffling data)

pallid current Sep 4, 2023, 12:01 AM

#

Wait so did you try to replicate my original L2 Graph?

keen pivot Sep 4, 2023, 3:38 AM

#

pallid current Wait so did you try to replicate my original L2 Graph?

Nope! It’s just what I had running and it finished running.

#

I’m hoping tomorrow we’ll get it figured out

pallid current Sep 4, 2023, 3:39 AM

#

keen pivot I’m hoping tomorrow we’ll get it figured out

Nice, sorry to drop a bomb and then go on hol lol

keen pivot Sep 4, 2023, 4:30 PM

#

Figured it out: gpt2 has large activations, so variance is large, so FVU is smaller (since FVU = MSE/Var)

bitter turtle Sep 4, 2023, 4:40 PM

#

keen pivot Figured it out: gpt2 has large activations, so variance is large, so FVU is smal...

Surely the MSE would be similarly scaled though? What is 'median activation' here exactly? Median activation norm?

keen pivot Sep 4, 2023, 4:41 PM

#

I just got the batch at layer N, and took the median

bitter turtle Sep 4, 2023, 4:41 PM

#

Elementwise?

keen pivot Sep 4, 2023, 4:41 PM

#

bitter turtle Surely the MSE would be similarly scaled though? What is 'median activation' her...

I don't think there was much difference in MSE between Pythia & gpt2 small. Can check

keen pivot Sep 4, 2023, 4:42 PM

#

bitter turtle Elementwise?

Median of everything, lol

bitter turtle Sep 4, 2023, 4:42 PM

#

Ok, so I'm quite confused as to why that's on the graph? Like, what are you trying to show with it? (Just slightly confused 😅)

keen pivot Sep 4, 2023, 4:43 PM

#

Just general statistics. I don't know what I'm doing

#

But ya, I think max-activation -> high variance is the thing

bitter turtle Sep 4, 2023, 4:44 PM

#

keen pivot I don't think there was much difference in MSE between Pythia & gpt2 small. Can ...

I'd be very surprised if our autoencoders were significantly not scale-invariant, unless something something initialization magnitude or whatever.

#

Like, maybe get a batch of pythia data, scale so that mean pythia activation mag = mean gpt2small activation mag, and see what happens to the FVU
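
Quick numerical check of the scale argument, using the FVU = MSE/Var definition from earlier in the thread. The exact way variance is computed here (per-dimension, around the mean) is an assumption, and `x_hat` is just noisy stand-in data rather than a real autoencoder output. Whether a trained autoencoder's reconstructions actually scale along with its inputs is the empirical question the proposed experiment would answer.

```python
import numpy as np

def fvu(x, x_hat):
    """Fraction of variance unexplained: MSE divided by the total variance of x."""
    mse = np.mean((x - x_hat) ** 2)
    var = np.mean((x - x.mean(axis=0)) ** 2)
    return mse / var

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 16))
x_hat = x + 0.1 * rng.normal(size=x.shape)  # stand-in for a reconstruction

base = fvu(x, x_hat)
scaled = fvu(70 * x, 70 * x_hat)  # same relative error at 70x the magnitude
print(base, scaled)               # the two values agree up to floating-point error
```

Scaling by c multiplies both MSE and Var by c², so a 70x variance gap on its own leaves FVU unchanged when the relative reconstruction error is the same.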

keen pivot Sep 4, 2023, 4:45 PM

#

Like the difference between a variance of 35 & 0.5 is 70x, which I think fully explains the FVU difference

keen pivot Sep 4, 2023, 4:46 PM

#

bitter turtle Like, maybe get a batch of pythia data, scale so that mean pythia activation mag...

I think just dividing by 70 in this case is also the correct thing, going by my above message

bitter turtle Sep 4, 2023, 4:46 PM

#

keen pivot Like the difference between a variance of 35 & 0.5 is 70x, which I think fully e...

I'm really confused as to how that could be the case.

bitter turtle Sep 4, 2023, 4:46 PM

#

keen pivot I think just dividing by 70 in this case is *also* correct thing from my above m...

Uh, sure, try that

keen pivot Sep 4, 2023, 4:57 PM

#

bitter turtle Uh, sure, try that

bitter turtle Sep 4, 2023, 4:58 PM

#

What the hell

keen pivot Sep 4, 2023, 4:58 PM

#

You're right. Hoagy's results are more like FVU/20

bitter turtle Sep 4, 2023, 4:58 PM

#

Oh I misread this

#

Could you plot the unscaled things as well?

#

Wait, are you retraining?

keen pivot Sep 4, 2023, 4:59 PM

#

This is just the ratio 6 pythia one in /mnt

bitter turtle Sep 4, 2023, 5:00 PM

#

Wait, what are the lines here?

keen pivot Sep 4, 2023, 5:00 PM

#

FVU of pythia-70m layer 2

bitter turtle Sep 4, 2023, 5:00 PM

#

But what are the different lines?

keen pivot Sep 4, 2023, 5:01 PM

#

FVU, FVU/20, FVU/70

bitter turtle Sep 4, 2023, 5:02 PM

#

Ok, so I think maybe you misunderstood what I meant;
I'm confused as to why FVU would change so much with activation mag, so I wanted to test this by training autoencoders on scaled activations but the same underlying dataset, and see if anything changes

keen pivot Sep 4, 2023, 5:02 PM

#

I should be able to plot MSE from the pythia one & gpt2 small & gpt2 small will be 3x better.

keen pivot Sep 4, 2023, 5:03 PM

#

bitter turtle Ok, so I think maybe you misunderstood what I meant; I'm confused as to why FVU ...

Wait, why? isn't FVU calculated by MSE/var & var is affected by large values?

bitter turtle Sep 4, 2023, 5:03 PM

#

bitter turtle Ok, so I think maybe you misunderstood what I meant; I'm confused as to why FVU ...

To test the hypothesis that the change in magnitude is the thing that is causing the error change, and not anything else structural

bitter turtle Sep 4, 2023, 5:03 PM

#

keen pivot Wait, why? isn't FVU calculated by MSE/var & var is affected by large values?

But so is MSE

keen pivot Sep 4, 2023, 5:03 PM

#

Oh, I think I have access to the scaled Pythia ones

keen pivot Sep 4, 2023, 5:04 PM

#

bitter turtle But so is MSE

Not if the large values are explained by outlier dimensions which the dictionary is super good at capturing

bitter turtle Sep 4, 2023, 5:04 PM

#

keen pivot Oh, I think I have access to the scaled Pythia ones

I think it's probably better to scale by average magnitude tbh, since this scaling is not rotation invariant

bitter turtle Sep 4, 2023, 5:05 PM

#

keen pivot Not if the large values are explained by outlier dimensions which the dictionary...

Ok, so

I was counting this as a structural difference (i.e., more of the norm is in the outlier dimension)
could we look at the outlier dimensions?

#

Sorry for the confusion 😅

keen pivot Sep 4, 2023, 5:06 PM

#

bitter turtle I think it's probably better to scale by average magnitude tbh, since the scalin...

So the pythia-70m_centered in /mnt is bad because it's not scaled correctly?

keen pivot Sep 4, 2023, 5:06 PM

#

bitter turtle Ok, so - I was counting this as a structural difference (i.e., more of the norm ...

What statistic on outlier dimensions?

bitter turtle Sep 4, 2023, 5:07 PM

#

keen pivot What statistic on outlier dimensions?

Hmm 🤔 maybe norm of max dimension/norm of activation, or look at the distributions for dimensions in the normal basis?
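
A sketch of the first statistic, read as |largest-magnitude coordinate| divided by the L2 norm of the whole activation vector, computed per token in the residual basis. The function name and toy array are illustrative only.

```python
import numpy as np

def max_dim_fraction(acts):
    """Per-row ratio of the largest absolute coordinate to the row's L2 norm.

    acts : (n_tokens, d_model) activations; rows must be non-zero.
    Values near 1 mean a single dimension carries almost all of the norm.
    """
    return np.abs(acts).max(axis=1) / np.linalg.norm(acts, axis=1)

acts = np.array([[3.0, 4.0, 0.0],
                 [0.0, 0.0, 10.0],
                 [1.0, 1.0, 1.0]])
ratios = max_dim_fraction(acts)
print(ratios)  # 4/5, 1, and 1/sqrt(3) for the three rows
```

Comparing the distribution of this ratio between the two models would show whether one of them puts more of its norm in a single outlier dimension.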

bitter turtle Sep 4, 2023, 5:08 PM

#

keen pivot So the pythia-70m_centered in /mnt is bad because it's not scaled correctly?

Yeah, it wasn't decorrelated, so it could have squished things strangely compared to the outlier. However, I think this is maybe fine since the outliers are mostly basis-aligned?

keen pivot Sep 4, 2023, 5:09 PM

#

bitter turtle Yeah, it wasn't uncorrelated, so it could have squished things strangely compare...

It's pretty cheap to look at it now and do stuff with it.

#

I'm not confident in me quickly setting off a run making it uncorrelated

bitter turtle Sep 4, 2023, 5:12 PM

#

I'm back home this evening, I could set one off maybe

keen pivot Sep 4, 2023, 5:13 PM

#

bitter turtle Hmm 🤔 maybe norm of max dimension/norm of activation, or look at the distributi...

Does normal basis mean just residual basis?

bitter turtle Sep 4, 2023, 5:13 PM

#

Yeah

keen pivot Sep 4, 2023, 5:15 PM

#

bitter turtle Sep 4, 2023, 5:17 PM

#

What are the axes here?

keen pivot Sep 4, 2023, 5:17 PM

#

They both have only like 7-10 extreme values.

keen pivot Sep 4, 2023, 5:17 PM

#

bitter turtle What are the axes here?

Histogram: x is activation

bitter turtle Sep 4, 2023, 5:18 PM

#

Ok! So this seems like a significant difference. Pythia's outliers are waaaay smaller in terms of standard deviations outside mean

keen pivot Sep 4, 2023, 5:19 PM

#

bitter turtle Ok! So this seems like a significant difference. Pythia's outliers are _waaaay_ ...

So the hypothesis of "Sparse autoencoders capture the outlier dimensions well, so FVU will change drastically depending on outlier dimension std"?

bitter turtle Sep 4, 2023, 5:20 PM

#

Sure, something like that.

keen pivot Sep 4, 2023, 5:20 PM

#

So no holy grail by just centering data?

bitter turtle Sep 4, 2023, 5:20 PM

#

Did you find perplexity changes significantly between gpt-2 and pythia?

bitter turtle Sep 4, 2023, 5:21 PM

#

keen pivot So no holy grail by just centering data?

No, unf I don't think

Still think we should be doing this though

keen pivot Sep 4, 2023, 5:22 PM

#

pallid current this is latest result, 4x ratio

Hoagy has an image showing some improvement doing it.

bitter turtle Sep 4, 2023, 5:22 PM

#

bitter turtle Did you find perplexity changes significantly between gpt-2 and pythia?

Trying to work out whether downgrading the importance of getting the outliers exactly right, i.e. by whitening/sphering, is a reasonable thing to do

bitter turtle Sep 4, 2023, 5:23 PM

#

keen pivot Hoagy has an image showing some improvement doing it.

Yeah I think it'll definitely improve perf, but I also don't think it fully explains (or even mostly explains, given the results you just presented) the improvement

pallid current Sep 4, 2023, 5:28 PM

#

keen pivot Hoagy has an image showing some improvement doing it.

I think given the other results which didn't seem to show much improvement we should be careful to plot these on the same graph

#

So hard to really read on diff graphs with diff axes

keen pivot Sep 4, 2023, 5:29 PM

#

pallid current I think given the other results which didn't seem to show much improvement we sh...

What are "these"?

pallid current Sep 4, 2023, 5:29 PM

#

Those fvu vs sparsity graphs you linked that I sent a couple days ago

bitter turtle Sep 4, 2023, 5:33 PM

#

I can do a comparison of KL-div under various transformations (mean-centering, sphering) on pythia when I get back

#

Broadly think KL-div is the thing we should focus on minimising
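
For concreteness, a small sketch of the quantity in question: the mean KL divergence between the original model's next-token distribution and the distribution after swapping in reconstructed activations. The direction KL(original || reconstructed), the function names, and the toy logits are assumptions for illustration.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def mean_kl(orig_logits, recon_logits):
    """Mean over positions of KL(p_orig || p_recon), in nats."""
    log_p = log_softmax(orig_logits)
    log_q = log_softmax(recon_logits)
    return float((np.exp(log_p) * (log_p - log_q)).sum(axis=-1).mean())

# Toy two-token vocabulary: original is uniform, reconstruction favours token 0 at 3:1.
orig = np.array([[0.0, 0.0]])
recon = np.array([[np.log(3.0), 0.0]])
print(mean_kl(orig, orig))   # identical distributions give zero
print(mean_kl(orig, recon))  # equals 0.5 * log(4/3)
```

Unlike reconstruction MSE, this only penalises errors that change the model's output distribution.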

keen pivot Sep 4, 2023, 5:34 PM

#

bitter turtle Broadly think KL-div is the thing we should focus on minimising

KL-div stuff worked on the gpt-2 models pretrained.

keen pivot Sep 4, 2023, 5:36 PM

#

bitter turtle Did you find perplexity changes significantly between gpt-2 and pythia?

It's a significant difference! Let me get a better graph between the two.

bitter turtle Sep 4, 2023, 5:42 PM

#

Trouble is, I also don't know what the reasonable comparison for relative perplexity diff between models is ://

keen pivot Sep 4, 2023, 5:52 PM

#

bitter turtle Trouble is, I _also_ don't know what the reasonable comparison for relative perp...

I think they're close enough in this case that you can compare approximate apples-to-apples

#

@bitter turtle, pretty big differences!

#

Oh, nevermind. I'm such a loooser. I need to compare by sparsity

bitter turtle Sep 4, 2023, 5:54 PM

#

Lol

#

Also put 'original' at the other end maybe

#

(purely aesthetic request)

keen pivot Sep 4, 2023, 5:55 PM

#

bitter turtle (purely aesthetic request)

You're just objectively correct though

bitter turtle Sep 4, 2023, 5:56 PM

#

I mean the objectively correct thing to do would be axhline or something

keen pivot Sep 4, 2023, 5:56 PM

#

Oh ya, that works too

bitter turtle Sep 4, 2023, 5:56 PM

#

keen pivot I think they're close enough in this case that you can compare approximate apple...

Hmm yeah pbbly

keen pivot Sep 4, 2023, 6:11 PM

#

Okay, I think I want to plot by both sparsity & sparsity/d_model, but I expect them to mostly be the same

#

Looks good!

#

Like very good!

#

This was run on ~250k tokens for calculating perplexity
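
For reference, a minimal sketch of what "perplexity" and "perplexity-diff" mean in this thread, assuming perplexity is the exponential of the mean negative log-likelihood over tokens. The function names and the per-token log-probs are made up for illustration and are not taken from the actual evaluation code.

```python
import math

def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood (natural log) over tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def perplexity_diff(orig_logprobs, recon_logprobs):
    """Perplexity with reconstructed activations minus perplexity of the original model."""
    return perplexity(recon_logprobs) - perplexity(orig_logprobs)

# Made-up per-token log-probs of the correct next token, for illustration only.
orig = [math.log(0.5), math.log(0.25), math.log(0.5)]
recon = [math.log(0.4), math.log(0.2), math.log(0.5)]
diff = perplexity_diff(orig, recon)  # positive when the reconstruction hurts prediction
```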

#

#

Note on KL: how is the model normally trained w/ EOT tokens? Is it masked? Does this affect how we'd want our autoencoders to have low KL-div with the model, or is this a total nothing-burger of a concern?

#

I could also look at datapoints that the reconstructed model is worse at predicting & see if there's some statistic that separates them. For example, maybe it is mostly high activating datapoints.

bitter turtle Sep 4, 2023, 7:34 PM

#

keen pivot

ok this is like the opposite of what I expected maybe? kind of implies we shouldn't be sphering things

keen pivot Sep 4, 2023, 7:45 PM

#

bitter turtle ok this is like the opposite of what I expected maybe? kind of implies we _shoul...

Can you elaborate?

bitter turtle Sep 4, 2023, 8:07 PM

#

keen pivot Can you elaborate?

If we sphere, it downweights the importance of the outlier dims wrt MSE, and maybe the above graph shows that outliers are relatively more important, since the GPT2 one preserves them better or something?
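
A rough sketch of the sphering argument, assuming ZCA whitening and purely synthetic activations with one hypothetical large-scale "outlier" dimension (dim 0). The stand-in "reconstruction" simply has the same 10% relative error in every dimension; nothing here comes from the real GPT-2 or Pythia activations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "activations": 8 dims, dim 0 is a hypothetical outlier dim with ~50x the scale.
n, d = 4096, 8
x = rng.normal(size=(n, d))
x[:, 0] *= 50.0

def sphere(x, eps=1e-6):
    """ZCA-whiten x so that its covariance is (approximately) the identity."""
    mu = x.mean(axis=0, keepdims=True)
    cov = np.cov(x - mu, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    w = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return (x - mu) @ w

def outlier_share_of_mse(x, x_hat, dim=0):
    """Fraction of the total squared error that comes from a single dimension."""
    sq_err = (x - x_hat) ** 2
    return sq_err[:, dim].sum() / sq_err.sum()

# Stand-in reconstruction with the same relative error in every dimension.
x_sph = sphere(x)
share_raw = outlier_share_of_mse(x, 0.9 * x)          # dominated by the outlier dim
share_sph = outlier_share_of_mse(x_sph, 0.9 * x_sph)  # roughly 1/d after sphering
```

In this toy setup the outlier dimension accounts for almost all of the raw MSE but only about an equal share after sphering, which is the sense in which sphering downweights it.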

#

Obvs will actually test

keen pivot Sep 4, 2023, 10:32 PM

#

bitter turtle If we sphere, it downweighs the importance of the outlier dims wrt MSE, and mayb...

I think we can see the MSE for just the outlier dimension, averaged over the batch & compare w/ GPT2 & Pythia. Though I'm unsure how to pull out just the outlier dimension component of MSE.

bitter turtle Sep 4, 2023, 10:39 PM

#

Yeah when I set it off I'm going to log that

#

Pbbly tmmrw got home too late

#

I'm going to approximate by 'FVU for largest activating component'
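
One possible way to pull out the outlier component, sketched under the assumption that FVU means residual sum of squares over total variance and that "outlier dims" are the dims with the largest mean absolute activation. Both definitions, all helper names, and the synthetic data are assumptions for illustration, not the logging that was actually run.

```python
import numpy as np

def fvu(x, x_hat):
    """Fraction of variance unexplained: residual sum of squares over total variance."""
    resid = ((x - x_hat) ** 2).sum()
    total = ((x - x.mean(axis=0, keepdims=True)) ** 2).sum()
    return resid / total

def fvu_on_dims(x, x_hat, dims):
    """FVU restricted to a subset of activation dimensions."""
    return fvu(x[:, dims], x_hat[:, dims])

def top_k_outlier_dims(x, k=2):
    """Pick the k dims with the largest mean absolute activation (one possible definition)."""
    return np.argsort(-np.abs(x).mean(axis=0))[:k]

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 16))
x[:, 3] = 5.0 * x[:, 3] + 40.0   # hypothetical one-sided positive outlier dim
x[:, 11] = 5.0 * x[:, 11] - 30.0  # hypothetical one-sided negative outlier dim
x_hat = x + 0.5 * rng.normal(size=x.shape)  # stand-in reconstruction with equal absolute noise

outlier_dims = top_k_outlier_dims(x, k=2)
other_dims = np.setdiff1d(np.arange(x.shape[1]), outlier_dims)
fvu_outliers = fvu_on_dims(x, x_hat, outlier_dims)
fvu_rest = fvu_on_dims(x, x_hat, other_dims)
```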

bitter turtle Sep 4, 2023, 11:07 PM

#

If we're looking at this from the perspective that L1 acts to disentangle latents maybe it'd be interesting to implement something like https://arxiv.org/abs/2205.05862 with a sparse autoencoder

arXiv.org

AdaVAE: Exploring Adaptive GPT-2s in Variational Auto-Encoders for ...

Variational Auto-Encoder (VAE) has become the de-facto learning paradigm in
achieving representation learning and generation for natural language at the
same time. Nevertheless, existing VAE-based language models either employ
elementary RNNs, which is not powerful to handle complex works in the
multi-task situation, or fine-tunes two pre-traine...

#

Relevant tldr diagram

bitter turtle Sep 5, 2023, 4:07 PM

#

wtf.

#

tbf I haven't tried 'KL under reconstruction' before so not sure what I should expect

#

this is also layer 5

#

so maybe thats weird

bitter turtle Sep 5, 2023, 4:38 PM

#

yeah looks like a product of being layer 5, this is layer 2

#

not seeing too much difference between centered vs not

keen pivot Sep 5, 2023, 6:16 PM

#

bitter turtle yeah looks like a product of being layer 5, this is layer 2

Okay, this is how I'm framing this:
We know gpt2 does better on perplexity for a given sparsity (and perplexity probably transfers to KL?). We could find all statistical differences between the two, and see if we could replicate one to the other (either pythia->gpt2 or vice versa).

bitter turtle Sep 5, 2023, 6:17 PM

#

not sure what you're addressing here (or what 'this' here refers to)

keen pivot Sep 5, 2023, 6:21 PM

#

bitter turtle not sure what you're addressing here (or what 'this' here refers to)

Like you & hoagy have been doing centering data in order to replicate the goodness of dicts on gpt2.

#

Since centering didn't replicate, there may be other statistics of gpt2 data that are responsible.

bitter turtle Sep 5, 2023, 6:25 PM

#

yeah im like 80% sure it's the rel size difference of the outliers

bitter turtle Sep 5, 2023, 6:25 PM

#

bitter turtle yeah im like 80% sure it's the rel size difference of the outliers

I would probably expect e.g. the sphered results for gpt-2 small and pythia to look approximately equal if this is the case

keen pivot Sep 5, 2023, 6:26 PM

#

bitter turtle yeah im like 80% sure it's the rel size difference of the outliers

Weren't you going to do the FVU on the outlier dimensions?

bitter turtle Sep 5, 2023, 6:27 PM

#

yes, haven't got around to that yet

keen pivot Sep 5, 2023, 6:27 PM

#

I think I could do one on the perplexity

#

Haven't thought through the experiment

bitter turtle Sep 5, 2023, 6:27 PM

#

I'm currently just seeing the correct L1 range for the sphered data

keen pivot Sep 5, 2023, 6:28 PM

#

keen pivot Haven't thought through the experiment

Okay, so I do reconstruction for both, but don't reconstruct the outlier dimensions, just pass those through. If perplexities are about equal then, then it's the outlier dimensions
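
The "pass those through" step can be sketched as below. This assumes activations of shape (batch, seq, d_model) and a known list of outlier dims; `carry_through` is a hypothetical helper name, and the perplexity evaluation that would consume the patched activations is not shown.

```python
import numpy as np

def carry_through(x, x_hat, dims):
    """Copy of the reconstruction x_hat with the given dims replaced by the true activations x."""
    patched = x_hat.copy()
    patched[..., dims] = x[..., dims]
    return patched

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4, 8))                  # (batch, seq, d_model), synthetic
x_hat = x + 0.1 * rng.normal(size=x.shape)      # stand-in for an autoencoder reconstruction
dims = [1, 5]                                   # hypothetical outlier dims
patched = carry_through(x, x_hat, dims)
```

Passing every dimension through recovers the original activations exactly, so the perplexity difference should go to zero in that limit, which makes a convenient sanity check.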

keen pivot Sep 5, 2023, 6:28 PM

#

bitter turtle I'm currently just seeing the correct L1 range for the sphered data

What do you mean just seeing?

#

Like that's what you're currently working on?

bitter turtle Sep 5, 2023, 6:29 PM

#

figuring out/looking for

#

no, I'm currently working on getting

keen pivot Sep 5, 2023, 6:29 PM

#

Oh, ya.

bitter turtle Sep 5, 2023, 6:29 PM

#

yep

#

lol

keen pivot Sep 5, 2023, 6:29 PM

#

You just do the sparsity?

bitter turtle Sep 5, 2023, 6:29 PM

#

wdym?

keen pivot Sep 5, 2023, 6:30 PM

#

Like select l1's to get a features/datapoint (ie sparsity) between 5 & d_model.
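
A small sketch of that selection, assuming "sparsity" means the mean number of non-zero dictionary features per datapoint (mean L0). The helper names, the L1 coefficients, and the L0 values in the example dict are invented for illustration.

```python
import numpy as np

def mean_l0(codes, tol=0.0):
    """Average number of active (non-zero) features per datapoint."""
    return (np.abs(codes) > tol).sum(axis=-1).mean()

def select_l1s(l0_by_l1, low, high):
    """Keep the L1 coefficients whose dictionaries have mean L0 within [low, high]."""
    return [l1 for l1, l0 in sorted(l0_by_l1.items()) if low <= l0 <= high]

codes = np.array([[0.0, 1.5, 0.0, 0.0],
                  [2.0, 0.0, 0.1, 0.0]])
example_l0 = mean_l0(codes)  # (1 + 2) / 2 active features per datapoint

# Made-up mapping from L1 coefficient to measured mean L0, with d_model = 512.
d_model = 512
l0_by_l1 = {1e-4: 900.0, 3e-4: 310.0, 1e-3: 42.0, 3e-3: 3.0}
kept = select_l1s(l0_by_l1, low=5, high=d_model)
```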

bitter turtle Sep 5, 2023, 6:30 PM

#

oh, yeah

#

oh, I changed something and now it's the same as the other ones, duh

#

I'm dumb

bitter turtle Sep 5, 2023, 7:14 PM

#

#

relevant info; seems about the same weirdly

#

this is layer 3

#

unsure what's going on with sphered data in the low-sparsity setting, perhaps they are undertrained

keen pivot Sep 5, 2023, 7:31 PM

#

the sphered one looks like layer 5.

bitter turtle Sep 5, 2023, 7:37 PM

#

maybe 'approximately equal' was a bit extreme 😅 but I think they look similar at least

#

I would maybe put this as "not conclusive but decent evidence that it's mostly the outliers"

#

this is gpt-2-small, running pythia-160m which should be more equivalent (in model size) now:

#

keen pivot Sep 5, 2023, 7:50 PM

#

bitter turtle this is gpt-2-small, running pythia-160m which should be more equivalent (in mod...

pythia-160m is a good choice!

#

What's the graph? You say "this is gpt-2-small, running pythia-160m", so is the graph gpt2 or pythia160m?

bitter turtle Sep 5, 2023, 7:50 PM

#

oh, this is gpt-2 small

#

pythia-160m 🤔

#

I'd say there is definitely some other structural difference here then

#

(in addition to the outliers being vastly more significant compared to the norm)

bitter turtle Sep 5, 2023, 8:31 PM

#

I don't think this is a very informative graph, but

#

gpt2-small

#

pythia

#

contribution looks ~constant by sparsity

#

basically shows that 'if you only allow the top-2 directions (the outliers), gpt-2 has better performance than pythia' which would be evidence for 'gpt-2 is better because it can proportionally represent outliers better'

#

@keen pivot

keen pivot Sep 5, 2023, 8:53 PM

#

bitter turtle @keen pivot

Would you be able to put the two on the same plot, and much different colors?

bitter turtle Sep 5, 2023, 8:54 PM

#

they have the same axes, but yees?

keen pivot Sep 5, 2023, 8:54 PM

#

Wait no!

bitter turtle Sep 5, 2023, 8:55 PM

#

????

keen pivot Sep 5, 2023, 8:55 PM

#

Okay, I see that mean centering makes it suck

#

for both

bitter turtle Sep 5, 2023, 8:55 PM

#

I think this makes sense, like outliers were typically very positive or very negative, and not both, so they can't nicely be captured by a sparse code which is mean-centered

keen pivot Sep 5, 2023, 8:56 PM

#

How does centering relate to outliers? If you center, then outliers contribute less to variance?

bitter turtle Sep 5, 2023, 8:56 PM

#

well, basically I don't think the outlier dims are mean-centered particularly.

keen pivot Sep 5, 2023, 8:56 PM

#

bitter turtle well, basically I don't think the outlier dims are mean-centered particularly.

They're not

bitter turtle Sep 5, 2023, 8:56 PM

#

well then

#

mean centering is 'wrong' for the outliers

keen pivot Sep 5, 2023, 8:57 PM

#

And doing this to other dimensions has little effect?

bitter turtle Sep 5, 2023, 8:57 PM

#

doing what to other dimensions?

keen pivot Sep 5, 2023, 8:57 PM

#

mean centering

bitter turtle Sep 5, 2023, 8:58 PM

#

seemingly overall it has little effect

#

overall I think mean centering is 'closer' to the correct value

#

hold on, median-centering might be closer to being correct

keen pivot Sep 5, 2023, 9:04 PM

#

I did try learning dictionaries that didn't include the outlier dimensions, & they didn't have much improved performance.

#

I should get on the perplexity-but-exclude-outlier-dimensions things

keen pivot Sep 5, 2023, 10:34 PM

#

No big difference when removing top-2 outlier dims

#

I mean, a big difference for the really awful l1s for pythia

keen pivot Sep 5, 2023, 11:16 PM

#

#

And zooming in:

#

#

And for sure, the perplexity-diff goes to 0 if you use my code to find and replace the top 500 outlier dims, so the code probably works.

#

So outlier dimensions don't explain the difference in perplexity

bitter turtle Sep 5, 2023, 11:20 PM

#

Could you summarise your interpretations of these plots?

keen pivot Sep 5, 2023, 11:24 PM

#

bitter turtle Could you summarise your interpretations of these plots?

Maybe the difference in perplexity is because gpt2 better reconstructs its outlier dimensions, so let's just run the perplexity-under-reconstruction both normally and when "carrying through" 2 outlier dimensions (ie just replace the reconstruction of the outlier dimensions with the actual outlier dimensions).

If the "carry through top-2 outlier dims" one makes the perplexities match more between gpt2 & pythia, then the cause is the outlier dimensions.

In the graph above, it doesn't really make a difference at all, so the outlier dimensions are not the cause

bitter turtle Sep 5, 2023, 11:30 PM

#

Hmm, I don't think GPT-2 "better reconstructs" its outliers necessarily. I'd expect e.g. FVU on only the outlier dimensions (note that this is different from what I tested before, which was 'FVU on the entire thing, but only allowing outlier features to be active') to be about the same between the models.

#

I just think that both dicts learn to predict the outliers fairly well/almost perfectly, since there's a strong incentive to do so compared to all the other dimensions. Then, GPT-2 gets better overall FVU because more of the norm is in something it learns perfectly, the outliers, hence the lack of a change in your plots.

#

@keen pivot

glass tinsel Sep 6, 2023, 3:39 AM

#

keen pivot

sorry stupid question is this using the L2 loss or the cross entropy loss? if you train the autoencoder to minimize CE it should just automatically do the right thing with the outliers right?

bitter turtle Sep 6, 2023, 7:52 AM

#

glass tinsel sorry stupid question is this using the L2 loss or the cross entropy loss? if yo...

Ah, we were trying to explain the difference in performance between our GPT2-small and Pythia models, I think the end-to-end training thing is separate.

glass tinsel Sep 6, 2023, 7:53 AM

#

ok

bitter turtle Sep 6, 2023, 10:07 AM

#

erasure across depth scores for 410m, slightly wild

#

I'm still concerned I'm using LEACE unfairly, I might try and 'fake' a more in-distribution dataset by prepending random samples of the Pile or something.

bitter turtle Sep 6, 2023, 1:08 PM

#

runs using 10-shot and 6-shot prompting respectively. (the one on the right, where LEACE is worse, is 6-shot prompting)

#

Example prompt so you get a feel for what I'm doing:

My name is Connie and I am a female. My name is Eva and I am a female. My name is Mary and I am a male. My name is Paris and I am a female. My name is Jamie and I am a female. My name is Dorothy and I am a female. My name is Edison and I am a male. My name is Alex and I am a male. My name is Maurice and I am a male. My name is Cary and I am a male. My name is Ana and I am a[completion]

#

kind of spooky dataset tbh

keen pivot Sep 6, 2023, 2:04 PM

#

bitter turtle I just think that both dicts learn to predict the outliers fairly well/almost pe...

The only thing it doesn't predict is that the perplexity is better for GPT2 dicts.

keen pivot Sep 6, 2023, 2:07 PM

#

bitter turtle Example prompt so you get a feel for what I'm doing: ``` My name is Connie and I...

Why is Mary a male? Unless this is just an example

bitter turtle Sep 6, 2023, 2:08 PM

#

what

keen pivot Sep 6, 2023, 2:08 PM

#

bitter turtle runs using 10-shot and 6-shot prompting respectively. (the one on the right, whe...

Why show the mean edit magnitude over the KL diff? I also have no idea what the mean edit magnitude means.

bitter turtle Sep 6, 2023, 2:08 PM

#

checking dataset rn

keen pivot Sep 6, 2023, 2:09 PM

#

I mean, Mary can be whoever they want to be, but like stereotypes, you know?

keen pivot Sep 6, 2023, 3:38 PM

#

@bitter turtle, making a circuit for a few layers from the feature you found for gender would be cool. Like I can manually interp the ones in layer 8 which seem to do good, & back-chain to previous layer features. I could also do this for mlp_out & attn_out.

bitter turtle Sep 6, 2023, 4:11 PM

#

kl divergences for ablations on (presumably flawed 🤔) dataset

bitter turtle Sep 6, 2023, 4:18 PM

#

keen pivot I mean, Mary can be whoever they want to be, but like stereotypes, you know?

turns out I had set the threshold for 'name commonness' to an OOM less than I meant to

#

rerunning

glass tinsel Sep 6, 2023, 7:14 PM

#

bitter turtle erasure across depth scores for 410m, slightly wild

is there some concise explanation of what this experiment is about

bitter turtle Sep 6, 2023, 7:41 PM

#

glass tinsel is there some concise explanation of what this experiment is about

"test how different erasure techniques impact model capabilities on a simple task"

#

(except doing concept erasure at a specific layer, and not concept scrubbing)

glass tinsel Sep 6, 2023, 7:42 PM

#

got it, what concept are you erasing?

bitter turtle Sep 6, 2023, 7:42 PM

#

gender prediction from name

glass tinsel Sep 6, 2023, 7:43 PM

#

got it. how does this work for the dictionary feature thing?

#

like you just locate the feature most correlated with gender?

bitter turtle Sep 6, 2023, 7:43 PM

#

no, it's stupider than that; filter for features above a freq. activation threshold, and compare their erasure ability on a test dataset
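
The filtering step might look something like this, assuming a feature counts as "active" when its code is strictly positive and that frequency is the fraction of datapoints on which it fires. The helper names, the threshold, and the tiny code matrix are illustrative; the comparison of erasure ability on the test dataset is not shown.

```python
import numpy as np

def activation_frequency(codes):
    """Fraction of datapoints on which each dictionary feature is active (code > 0)."""
    return (codes > 0).mean(axis=0)

def frequent_features(codes, min_freq):
    """Indices of features that fire on at least `min_freq` of the datapoints."""
    return np.flatnonzero(activation_frequency(codes) >= min_freq)

# Toy codes: 4 datapoints x 4 features.
codes = np.array([
    [0.0, 1.2, 0.0, 0.3],
    [0.0, 0.8, 0.0, 0.0],
    [0.5, 0.9, 0.0, 0.0],
    [0.0, 1.1, 0.0, 0.7],
])
kept = frequent_features(codes, min_freq=0.5)
```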

glass tinsel Sep 6, 2023, 7:44 PM

#

what is erasure ability? like fitting a probe on the post-erasure activations?

#

and seeing the loss

bitter turtle Sep 6, 2023, 7:44 PM

#

sorry, end-to-end model score on the test dataset

glass tinsel Sep 6, 2023, 7:52 PM

#

oh ok so the model itself is prompted to predict gender and then you also try erasing gender from the activations and see how bad that makes the model?

glass tinsel Sep 6, 2023, 7:54 PM

#

bitter turtle erasure across depth scores for 410m, slightly wild

What is "mean" here?

#

Also how do you actually perform the dictionary-based erasure

bitter turtle Sep 6, 2023, 7:57 PM

#

glass tinsel Also how do you actually perform the dictionary-based erasure

project the direction learnt by the dictionary to 0

bitter turtle Sep 6, 2023, 7:57 PM

#

glass tinsel What is "mean" here?

project difference-in-means vector to 0

glass tinsel Sep 6, 2023, 7:57 PM

#

bitter turtle project the direction learnt by the dictionary to 0

so it's an orthogonal projection?

bitter turtle Sep 6, 2023, 7:57 PM

#

yes

glass tinsel Sep 6, 2023, 7:58 PM

#

in the like, activation space

bitter turtle Sep 6, 2023, 7:58 PM

#

yes

glass tinsel Sep 6, 2023, 7:58 PM

#

not in some higher dimensional space in which the overcomplete basis is orthogonal

#

okay
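
To make the two projections concrete, here is a minimal sketch of "project a direction to 0" in activation space, used with a difference-in-means direction. It assumes binary labels and does a plain (zero-ablating, non-affine) orthogonal projection; the data and the hypothetical concept direction are synthetic, and this is not the LEACE implementation.

```python
import numpy as np

def project_out(x, direction):
    """Orthogonally project rows of x onto the subspace orthogonal to `direction`."""
    u = direction / np.linalg.norm(direction)
    return x - np.outer(x @ u, u)

def diff_in_means(x, labels):
    """Mean activation of class 1 minus mean activation of class 0."""
    return x[labels == 1].mean(axis=0) - x[labels == 0].mean(axis=0)

rng = np.random.default_rng(0)
n, d = 500, 16
labels = rng.integers(0, 2, size=n)
shift = rng.normal(size=d)                        # hypothetical concept direction
x = rng.normal(size=(n, d)) + labels[:, None] * shift

direction = diff_in_means(x, labels)
erased = project_out(x, direction)
```

After the projection the class means coincide, since the projection is linear and sends the difference-in-means vector itself to zero. The same `project_out` applies to a dictionary feature by passing its learned direction instead.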

glass tinsel Sep 6, 2023, 7:59 PM

#

bitter turtle project difference-in-means vector to 0

What LEACE does is very similar to this, as I'm sure you know. Both methods are linear projections with the same null space and only differ in their column space

glass tinsel Sep 6, 2023, 8:00 PM

#

bitter turtle erasure across depth scores for 410m, slightly wild

On the layers where the edit magnitude is higher for LEACE than for orthogonally projecting onto the difference in means direction, there must be some bug or some distribution shift because we prove that LEACE can be no worse than the orthogonal projection here

#

So I'd first want to investigate this issue

bitter turtle Sep 6, 2023, 8:01 PM

#

yes, I was very confused by that.

glass tinsel Sep 6, 2023, 8:01 PM

#

And once you've sorted that out, if LEACE still has a smaller effect on perf than orthogonal projection, then I would say that's expected

#

And the extra causal effect of orthogonal projection is due to side effects

bitter turtle Sep 6, 2023, 8:02 PM

#

glass tinsel On the layers where the edit magnitude is higher for LEACE than for orthogonally...

what do you mean by 'distribution shift' here?

glass tinsel Sep 6, 2023, 8:02 PM

#

like you fit the eraser on one distribution and apply it on another?

bitter turtle Sep 6, 2023, 8:02 PM

#

ah, ok, no that shouldn't be happening

glass tinsel Sep 6, 2023, 8:02 PM

#

ok

#

how big is the dataset

bitter turtle Sep 6, 2023, 8:03 PM

#

1000 prompts or so

glass tinsel Sep 6, 2023, 8:03 PM

#

glass tinsel What LEACE does is _very_ similar to this, as I'm sure you know. Both methods ar...

I suppose the other issue is we mean ablate rather than zero ablate

#

I would do a sanity check with setting method="orth" and affine=False on LeaceFitter because in theory that should give identical results to your Mean method

bitter turtle Sep 6, 2023, 8:05 PM

#

yep

glass tinsel Sep 6, 2023, 8:07 PM

#

I should note that while we did do one of these destructive intervention experiments in the LEACE paper

#

I think that the gold standard should really be to achieve fine grained control over model output

#

That's much harder to achieve and also more practically useful

#

In particular I think evaluating methods by how much they screw up the model is sort of perverse; in the LEACE paper we actually do the opposite (lower perplexity / KL is better) since that suggests better surgicality

#

There are lots of ways to screw up a model; trivially you can replace activations with constants or i.i.d. noise

bitter turtle Sep 6, 2023, 8:09 PM

#

glass tinsel In particular I think evaluating methods by _how much they screw up the model_ i...

I'm kind of trying to do this here

glass tinsel Sep 6, 2023, 8:09 PM

#

lower perplexity is better?

bitter turtle Sep 6, 2023, 8:09 PM

#

I'm evaluating KL divergence from the base model under the fitted intervention on a subset of the Pile
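
A sketch of the metric, assuming it is the mean over token positions of KL(base || edited) between next-token distributions computed from logits. The direction of the KL and the random logits are assumptions for illustration; getting the logits out of the base and intervened models is left out.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def mean_kl(base_logits, edited_logits):
    """Mean over tokens of KL(base || edited) between next-token distributions."""
    logp = log_softmax(base_logits)
    logq = log_softmax(edited_logits)
    return (np.exp(logp) * (logp - logq)).sum(axis=-1).mean()

rng = np.random.default_rng(0)
base = rng.normal(size=(5, 10))                    # (tokens, vocab), synthetic logits
edited = base + 0.5 * rng.normal(size=base.shape)  # stand-in for logits after an intervention
kl = mean_kl(base, edited)
```

The identity intervention gives exactly zero here, which is why this number only measures surgicality and has to be read alongside how well the concept was actually erased.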

glass tinsel Sep 6, 2023, 8:09 PM

#

mmm

bitter turtle Sep 6, 2023, 8:11 PM

#

bitter turtle kl divergences for ablations on (presumably flawed 🤔) dataset

@glass tinsel this is KL-divergence on the Pile under the learned interventions (ignore LEACE probably); the idea was that since our directions are sparsely activating, hopefully we might see that ablating them brings the activations less out-of-distribution, and so we see less KL divergence. However, it seems to be very dataset-dependent

glass tinsel Sep 6, 2023, 8:13 PM

#

hmm yeah I think you always need to be looking at both effectiveness and surgicality simultaneously

#

bc the identity intervention achieves perfect surgicality

glass tinsel Sep 6, 2023, 8:14 PM

#

glass tinsel hmm yeah I think you always need to be looking at both effectiveness and surgica...

(fwiw I think the LEACE paper should have been better at this, and I'm trying to think of ways to measure it rn)

bitter turtle Sep 6, 2023, 8:14 PM

#

to clarify I'm not that hopeful about this actually being a useful concept editing method, but was just trying to illustrate a possible application/evidence for the learned directions being 'meaningful'

glass tinsel Sep 6, 2023, 8:14 PM

#

hmm

#

See I could imagine that this might actually be a useful method

#

the end to end version

#

not the reconstruction loss version

#

bc the end to end version is taking into account which directions are most important

bitter turtle Sep 6, 2023, 8:16 PM

#

agree

#

the autoencoders are not good enough for that yet, though (they are fairly lossy), and I don't really see the benefit of this over e.g. activation editing + good understanding of transformer activation pdf for not-too-insanely-ood activation engineering

glass tinsel Sep 6, 2023, 8:19 PM

#

how are you planning to make the autoencoders better

bitter turtle Sep 6, 2023, 8:19 PM

#

I think FISTA for actually good and not terrible sparse coding given a learned dictionary is probably the safest bet.

#

you can do something like K-SVD to optimise dictionaries using FISTA as an encoder
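
For anyone unfamiliar, a bare-bones sketch of FISTA as a sparse encoder for a fixed dictionary, assuming the standard lasso objective 0.5 * ||x - D z||^2 + lam * ||z||_1 with a constant step size 1/L. The dictionary, the sparse ground-truth code, and `lam` are synthetic, and the K-SVD-style dictionary update is not shown.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def objective(x, D, z, lam):
    return 0.5 * np.sum((x - D @ z) ** 2) + lam * np.sum(np.abs(z))

def fista(x, D, lam, n_iter=500):
    """Approximately solve min_z 0.5 * ||x - D @ z||^2 + lam * ||z||_1 for a fixed D (d x k)."""
    L = np.linalg.norm(D, ord=2) ** 2  # Lipschitz constant of the gradient of the smooth term
    z = np.zeros(D.shape[1])
    y, t = z.copy(), 1.0
    for _ in range(n_iter):
        grad = D.T @ (D @ y - x)
        z_next = soft_threshold(y - grad / L, lam / L)
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = z_next + ((t - 1.0) / t_next) * (z_next - z)  # momentum (extrapolation) step
        z, t = z_next, t_next
    return z

rng = np.random.default_rng(0)
D = rng.normal(size=(32, 64))
D /= np.linalg.norm(D, axis=0, keepdims=True)   # unit-norm dictionary atoms
z_true = np.zeros(64)
z_true[[3, 17, 40]] = [1.0, -1.0, 1.0]          # synthetic 3-sparse code
x = D @ z_true
lam = 0.05
z_hat = fista(x, D, lam)
```

Unlike a single feed-forward encoder pass, this iterates to an approximate minimiser of the sparse coding objective for the given dictionary, at the cost of many matrix multiplies per datapoint.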

bitter turtle Sep 6, 2023, 9:51 PM

#

fixed the distributional shift (I was ablating from all token positions, including the prompt), and we get these results.

keen pivot Sep 6, 2023, 10:23 PM

#

bitter turtle fixed the distributional shift (I was ablating from all token positions, includi...

Is LEACE just bad here because it's only one layer, or also the few shot prompting/iid thing?

bitter turtle Sep 6, 2023, 10:25 PM

#

weird mix of reasons, mostly because it shifts activations ood

bitter turtle Sep 6, 2023, 11:52 PM

#

bitter turtle fixed the distributional shift (I was ablating from all token positions, includi...

kl for these results. holy shit I need to optimise, this took ages

glass tinsel Sep 7, 2023, 9:09 AM

#

keen pivot Is LEACE just bad here because it's only one layer, or also the few shot prompti...

fwiw it's not clear to me that these results are "bad" for LEACE— they could be interpreted as good because LEACE is being more surgical. The other methods may be harming prediction ability through "spurious" channels

#

If you want to go hardcore you could do Quadratic LEACE. I'm actually kind of curious how big the edit magnitude would be. Haven't gotten around to measuring that for the paper yet.

bitter turtle Sep 7, 2023, 11:26 AM

#

glass tinsel fwiw it's not clear to me that these results are "bad" for LEACE— they could be ...

I meant to test this via transfer to a different dataset, which is kind of like most of the point, but never got around to it

keen pivot Sep 7, 2023, 2:24 PM

#

bitter turtle I meant to test this via transfer to a different dataset, which is kind of like ...

Isn't this shown in the KL divergence though?

bitter turtle Sep 7, 2023, 2:44 PM

#

The surgicality is, yes, but you can't reasonably conclude it's not doing something spurious without transfer

keen pivot Sep 7, 2023, 3:03 PM

#

bitter turtle The surgicality is, yes, but you can't reasonably conclude it's not doing someth...

Is this KL on the gender dataset, and transfer is KL on e.g. Pile-10k?

bitter turtle Sep 7, 2023, 3:04 PM

#

no, this is KL on pile 10k, and transfer would e.g. be using the same interventions to measure perf change on a different but correlated task

keen pivot Sep 7, 2023, 3:23 PM

#

bitter turtle no, this is KL on pile 10k, and transfer would e.g. be using the same interventi...

I think you gotta look at the specific features being ablated to do better generalization. Dictionaries give you that option (since there are many features found).
