#Sparse Coding


shell mural
#

I dont think similarity between bases means anything if they're for different layers. Different layer different distribution

#

Yea this seems reasonable to me

bitter turtle
#

I thought you were looking for similarity between the bases for different layers; I think I'm very confused at a high level here about

  • what you are trying to do
  • how you are trying to do it
shell mural
#

My bad for poor communication lol

bitter turtle
#

no problem

shell mural
#

At the highest level I'm aiming for much stronger circuit analysis

#

And strong analysis of complex circuits

#

What this might look like is you fix some dataset/distribution of data, or maybe some benchmark/eval

#

We can use various interpretability techniques to narrow down what attention heads and MLPs are involved, make hypotheses for what functional purpose they serve, maybe do some causal scrubbing

#

And then great we have a giant computational graph that represents the circuit and appears to be the main stuff that matters for the behavior of interest

#

We can probably even use dictionary learning to make the graph and circuit even better

#

Logan did it before

#

But a missing component I'm seeing in interpretability is the ability to analyze (and utilize!) the edges of that graph. Like we know that sure the output of this plugs into the input of that, but what is that input/output

#

Well, skipping over more refined stuff like attention head outputs, we can probe into the residual stream at any point and use dictionary learning to decompose it into sparse combinations of features that sometimes seem to have actual meaning to them

#

What I see in the channel currently is actually we can understand the residual stream pretty damn well actually. This seems pretty huge to me

bitter turtle
#

Ok, what are those same questions but within the scope of sparse coding specifically?

shell mural
#

I want to use sparse coding on the residual stream before and after a transformer block, to better understand what the transformer block is doing

#

I would rather work in
A -> Attn -> A -> MLP -> A
Rather than
B -> Attn -> C -> MLP -> D
Where A,B,C,D are dictionaries for the residual stream.
A is trained on data from different spots
B,C,D are localized to that spot

bitter turtle
#

I guess my confusion is that doesn't that implicitly make the same assumptions as this?

shell mural
#

Hmm. Maybe? 😅

#

I'm boarding a flight now, I'll have time to think about this more

#

But I think you're right lol. I feel pretty dumb now

shell mural
#

Nvm back in my camp. no i don't think it implicitly makes those assumptions, though we may end up having less interpretable individual features. Hmmm

bitter turtle
pallid current
#

question: do you think that for all the autointerp stuff we should run it on a different chunk of data to the ones that the models/decompositions have been trained on?

#

at the moment we use chunk 0 to train all of the decompositions and then also for all of the interp stuff

hallow wyvern
#

BTW with very minimal modification to your code, I was able to train an autoencoder on roneneldan/tinystories-33m, and it finds meaningful features. Behold, the mojibake feature:

#

I'll clean up the changes tonight and send in a PR to support that model and to describe the process in the readme, I don't think it'll conflict with any of the stuff you've done recently

Edit: OK, there was more cleanup than I thought. Making PR tomorrow.

pallid current
#

have also asked lee, looks like he's online

pallid current
#

wanted to check whether we tend to find directions which match up with either embed or unembed matrices, but if i've not made any mistakes then seems like there's pretty much none of that, which is honestly very surprising to me

#

layer patterns seem right

#

i thought r32 layer 0, which is the case where it benefits from the large ratio the most, might be really latching onto tokens

keen pivot
# pallid current

Oh cool! It is quite above random, though I really do expect layer 5 to have way higher cs than layer 4 in the unembed. Maybe something going on w/ the mean, since the last layer learns less features than the other layers (ie has more dead features).

#

I do expect unembedding dims to be a linear combination of dictionary features (in the last layer), and to privilege more common tokens over less common ones.

keen pivot
#

Also, there's minimal code to be able to search for features that activate in the last token position of some custom text in my notebook if you've seen it.

bitter turtle
pallid current
#

i should be able to do this today but i feel pretty awful atm so if you did thatd be great

pallid current
pallid current
bitter turtle
pallid current
#

no parallelization tbf so it could be way faster

keen pivot
#

@pallid current

pallid current
#

what's the actual task that this data comes from?

pallid current
# bitter turtle Oh lol

wait sorry id been totally stupid, the interp stuff uses openwebtext by default so it's already off the training distribution

bitter turtle
pallid current
#

early stages in doing the big auto interp results but early returns are pretty based

#

lot of recent anxiety about ica quelled by this image lol

pallid current
#

general picture emerging is that winrate of sparse coding vs all baselines on top or top-random is very high, but a lot lower on random-only, where ica-top-k is quite competitive and sometimes our dicts just do quite poorly

rancid summit
#

what's the rough normalized MSE / sparsity you're getting in gpt2sm?

pallid current
#

hey, these are the graphs we're getting in the residual stream for pythia70M, we dont have gpt2sm results to hand i dont think

rancid summit
#

hmm how are you calculating unexplained variance? (or is that the same thing as normalized MSE)?

#

(normalized MSE: MSE / ((target - target.mean())**2).mean())
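
A rough sketch of that definition, assuming the denominator is meant to be the mean squared deviation of the target from its own mean (so the ratio reads as a fraction of variance unexplained). The function name is illustrative and not taken from either codebase:

```python
def normalized_mse(target, prediction):
    """MSE divided by the variance of the target (both flat lists of floats)."""
    n = len(target)
    mse = sum((t - p) ** 2 for t, p in zip(target, prediction)) / n
    mean_t = sum(target) / n
    variance = sum((t - mean_t) ** 2 for t in target) / n
    return mse / variance

target = [1.0, 2.0, 3.0, 4.0]
perfect = normalized_mse(target, target)            # reconstruction equals target
mean_only = normalized_mse(target, [2.5] * 4)       # always predict the target mean
print(perfect, mean_only)
```

Under this definition a perfect reconstruction scores 0 and always predicting the mean scores 1, which is why it would line up with "unexplained variance".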

pallid current
rancid summit
#

also the points near 0 are a bit hard to tell, is there a log plot?

rancid summit
#

oh also how big is the latent here

pallid current
rancid summit
#

no worries no need to go out of your way

pallid current
#

we've generally focused on regime of 100 or fewer active features, though the autointerp metrics are surprisingly not that strongly correlated with l1 coef

rancid summit
#

yeah just trying to get a sense of what reasonable normalized MSE scores are

#

because the anthropic toy model results were like 1e-3 and I was like "woah that's pretty low"

rancid summit
#

yeah this was in one of the recent updates

#

the plots with the bounce

pallid current
#

i think we've kind of stopped looking for an exact right like dict size or l1 value like they saw with that bounce, like maybe there is some way but with the llms/autoencoders we're using it seems to always be pretty smooth

rancid summit
#

I see

#

another Q: any intuition on where on the sparsity/reconstruction tradeoff you want to be?

#

(context: I'm currently doing something like the kurtosis based autoencoder thing I described last time, and sparsity is controlled with a very weird set of hparams (as opposed to just tuning L1), so I haven't been paying it much attention. but maybe I should)

pallid current
#

the way i was hoping to answer it was to use the autointerp scores, using random scoring, to quantify the overall % of the variance of the layer which is captured by the explanation

rancid summit
#

I see

pallid current
#

basically weighting interp scores by feature variance. issue is that if there's signal in the interp scores of different dict sizes and l1 values (within a reasonable range) then it's pretty slight
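
One way the weighting might look, as a minimal sketch: each feature's interp score is weighted by the variance of its activations, so the total reads as a rough share of layer variance covered by explanations. All names are hypothetical, and this is not the autointerp code itself:

```python
def variance(xs):
    """Population variance of a list of floats."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)

def variance_weighted_score(feature_acts, interp_scores):
    """Average the per-feature interp scores, weighted by activation variance."""
    weights = [variance(acts) for acts in feature_acts]
    total = sum(weights)
    return sum(w * s for w, s in zip(weights, interp_scores)) / total

# Two toy features: activations over four tokens, plus one interp score each.
feature_acts = [[0.0, 0.0, 2.0, 2.0], [0.0, 0.0, 0.0, 4.0]]
interp_scores = [0.8, 0.2]
weighted = variance_weighted_score(feature_acts, interp_scores)
print(weighted)
```

Here the second feature has three times the variance of the first, so its low score dominates the weighted result.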

#

also the performance of these dicts is strongest on top or top-random scoring, the random performance is a little disappointing

rancid summit
#

maybe worth trying random-among-activating?

#

like random but throw out anything that the relu clamps to 0
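
A sketch of what that could mean in practice, assuming scoring is a Pearson correlation between true and simulated activations: the plain version uses every token, the "among activating" version drops tokens where the true activation is zero. Function names are made up for illustration:

```python
def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5

def score_among_activating(true_acts, simulated_acts):
    """Correlate only over tokens where the feature actually fired."""
    pairs = [(t, s) for t, s in zip(true_acts, simulated_acts) if t > 0]
    if len(pairs) < 2:
        return 0.0
    return pearson([t for t, _ in pairs], [s for _, s in pairs])

true_acts = [0.0, 0.0, 0.0, 1.0, 2.0, 3.0]
simulated = [2.0, 0.0, 1.0, 1.0, 2.0, 3.0]
score_all = pearson(true_acts, simulated)
score_active = score_among_activating(true_acts, simulated)
print(score_all, score_active)
```

In this toy case the simulator is wrong only on tokens the relu clamps to 0, so the restricted score is higher than the all-token score. Real features could go either way.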

pallid current
#

oh interesting, yeah we're screening sentences for nonzero variance but i guess you mean not including zeros in the correlation calculation?

#

might also be hurt by having to use 3.5 as the simulator, when i've looked at the simulations they're quite painfully dumb often

rancid summit
#

4 is better but also pretty dumb

#

tbh I don't actually know what would be a really good metric

#

also I'm surprised your dictionary is so small

#

and the sparsity is not that high as a fraction of the dictionary

#

does larger dictionary+more sparse help?

#

in my experiments I've been looking at really big really sparse autoencoders (like 100 active/10k dictionary)

pallid current
#

yeah i agree it's a bit surprising how well the small dicts work, i think there's some reason that sparse coding seems to work that we dont fully understand/doesn't match our intuitions - though we do go up to 32x ratio = 16k feats, or 72k feats for MLP (though those were probably a mistake 😅). i think it might be that the larger ones take longer to converge and also that they struggle to learn closely related features

rancid summit
#

interesting

#

wdym by struggling to learn closely related features?

pallid current
#

you start to get lots of features dying by that stage. we talked about trying reinitialization methods where if a feature hardly ever activates you reinitialize it, randomly or with some residual vector or something but we haven't put time into it
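
Purely as a sketch of the random variant (nothing like this is in the repo as far as this thread says): count how often each feature fires, and swap any feature below a threshold rate for a fresh random unit vector. The threshold, names and data layout are all assumptions:

```python
import math
import random

def reinit_dead_features(dictionary, fire_counts, n_samples, min_rate, rng):
    """Replace rarely-firing rows of `dictionary` with random unit vectors.

    dictionary: list of feature vectors (lists of floats), edited in place
    fire_counts: how many of the n_samples inputs activated each feature
    min_rate: features firing on a smaller fraction than this count as dead
    Returns the indices that were reinitialized.
    """
    reinitialized = []
    for i, count in enumerate(fire_counts):
        if count / n_samples >= min_rate:
            continue  # feature is alive, leave it alone
        vec = [rng.gauss(0.0, 1.0) for _ in dictionary[i]]
        norm = math.sqrt(sum(v * v for v in vec))
        dictionary[i] = [v / norm for v in vec]
        reinitialized.append(i)
    return reinitialized

rng = random.Random(0)
dictionary = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
fire_counts = [500, 0, 3]  # activations seen over 10_000 samples
dead = reinit_dead_features(dictionary, fire_counts, 10_000, 1e-3, rng)
print(dead)
```

The "residual vector" variant would replace the Gaussian draw with a poorly reconstructed input, and in a real training loop the optimizer state for those rows would probably need resetting too.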

rancid summit
#

ah I see

pallid current
rancid summit
#

got it

pallid current
#

hopefully will publish current results to get some interest and then try to dig into really understanding what's going on a bit more

rancid summit
#

makes sense

#

excited for the paper!

pallid current
#

thanks! would love to get your feedback on a draft in a week or two if you've got time

rancid summit
#

yeah will def take a look

#

are you planning to run any gpt2sm results?

pallid current
#

yeah the code is set up to be able to run it, might even have some dicts floating around already. will def run a proper set before we publish. any metrics you'd be particularly interested in?

#

i dont think we'll have the credits to do that much autointerp on it (unless we got a load more haha 👀), would def be interesting to be able to directly compare with your original results but would need gpt-4 logprobs for that

pallid current
rancid summit
#

and 100 active is not something I can easily control directly

#

there are like 4 knobs that might in theory affect the number active? but I've never actually tried to make active go up or down

pallid current
rancid summit
#

hmm

#

I've just been making the dictionary bigger and bigger lol

#

my intuition is that there must surely be lots of features that don't activate 99% of the time

#

as opposed to only somewhat more features than model dim that don't activate 70% of the time

#

unfortunately making the dictionary bigger doesn't always improve loss in my setup unless you tune some knobs

pallid current
#

yeah i agree there must in some sense be an incredibly long tail of feats, though i think that in order for the model to be able to actually work with that amount of superposition it must translate dataset features into a latent space which allows a lot of compression

#

and in that case it's unclear to me what number of feats really means

rancid summit
#

why would that be necessary?

#

you can still pick out each feature if you want to use it

#

most things just won't operate on most features but that's fine

pallid current
#

but we can tell from the non-sparsity of the neuron basis that it's not actually picking out just tiny parts of the residual stream to act on

#

so there must be some degree of similarity between how it treats similar parts of the residual stream which increases as features get closer

rancid summit
pallid current
#

i think you should be able to get some kind of interesting bound on superposition by making sure that the noise doesn't grow exponentially as you do sequential computation on the features but i've not yet had the chance to really try and flesh it out

rancid summit
#

maybe the MLP uses multiple neurons to add lots of bends to a function that operates on one feature

#

maybe the MLP is actually doing some crazy non neuron basis aligned computation

#

I don't see reason to expect the MLP basis to be anything too sane

pallid current
#

like the position of the nonlinearity will change a lot depending on which other features are active which makes it super hard to build a complex nonlinear func

#

i ran some mlp tests recently to try and gauge to what extent you still get a robustly nonlinear response curve when features were distributed across multiple neurons and it seems that it gets pretty linear after you start to spread your mlp features over just a handful of neurons

#

but then clearly we do seem to see feats across neurons, from sparse probing and sparse coding etc

#

so im pretty confused basically

pallid current
# bitter turtle What were these tests?

generating a synthetic dataset as we would for toy models but constraining each feature to be across no more than n dimensions, and then for each feature, reading it directly with a linear layer + GELU and seeing whether you still got a real nonlinearity

#

so with n=1 you just see standard GELU

#

but as you increase n you start to see a pretty flat response curve

#

this is, for 200 dim space, n=1, 200 feats; n=3 500 feats; n=10 1000 feats

#

bump at 2 is an artefact of how i do sparsity

bitter turtle
#

Ok, what's the 'linear layer' in the 'linear layer + GELU' here

pallid current
bitter turtle
#

Oh ok

pallid current
#

hope that makes sense lol, can explain in the morn, am off to bed now

bitter turtle
#

see you tomorrow/at meeting

pallid current
#

feel like ive never managed to explain it well which might mean im chatting shit 😅 , see ya

bitter turtle
#

I'm just quite confused as to why you think this shows anything about MLP structure ig I misunderstood maybe, is the structure superposed features -> GELU -> linear filter, or superposed features -> linear filter -> GELU?

pallid current
#

its v short, will upload as a gist

bitter turtle
#

cool will read later

bitter turtle
pallid current
#

yeah thats definitely a big weakness but its also kinda generous in terms of like, only distributed across a few feats, not crazy ratios or anything. would be interesting to compare neat geometric patterns or learned feats

#

but i do think its useful as a counterweight to just like, johnson-lindenstrauss therefore exponentially many feats in a layer, like you cant just approach it naively and retain a meaningful nonlinearity, (unless ive misunderstood somehow)

#

should probably just throw sparse learned feats or something similar into it and see what comes out

pallid current
#

ok its very annoying that we didnt realise this earlier but there's a huge difference between tied and untied dictionaries in the MLP

#

at least by n feats active

#

e.g. compare number / % active for layer 2, second one is untied

keen pivot
pallid current
bitter turtle
#

Ah

pallid current
#

mmcs for untied on MLP is pretty terrible which somewhat explains the bad results

#

doesnt explain why we're getting those bad mmcs scores though! what's changed??
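
For anyone reading along, a minimal sketch of the metric, assuming MMCS here means mean max cosine similarity between two dictionaries: for each feature in one dict, take its best cosine match in the other, then average. This is a plain-Python illustration, not the repo's implementation:

```python
import math

def cos_sim(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def mmcs(dict_a, dict_b):
    """Mean over features in dict_a of the max cosine similarity into dict_b."""
    best = [max(cos_sim(a, b) for b in dict_b) for a in dict_a]
    return sum(best) / len(best)

dict_a = [[1.0, 0.0], [0.0, 1.0]]
dict_b = [[1.0, 0.0], [1.0, 1.0]]
score = mmcs(dict_a, dict_b)
print(score)
```

Note the measure is not symmetric: `mmcs(dict_a, dict_b)` and `mmcs(dict_b, dict_a)` can differ, so it matters which dict is treated as the reference.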

#

@keen pivot have you been doing any MLP training runs recently on the old code? wanna compare hparams

keen pivot
pallid current
#

what does mmcs look like over epochs?

#

hmm yeah i can try with 1e-3

keen pivot
#

I've only got MLP_out.

#

But I got awful MCS hist doing 3e-4, and just got much better doing 1e-3

pallid current
#

run just now?

keen pivot
#

Yep

#

20 chunks

#

10 chunks

keen pivot
# keen pivot 10 chunks

The above is 1e-3. Doing 3e-4 for 10 chunks, it produced one like this (peaks around 0.4), but maybe 30 above 0.9

bitter turtle
#

hhhhhhhhhhhhhhuh

#

that's incredibly weird and I don't understand that at all.

#

So I switched to full-rank ablations on a whim and we still beat LEACE at earlier layers in the model? Need to reconfigure feature selection procedure to count and compare with different database sizes, but it looks like we are finding a single direction to do full-rank ablations on better than LEACE

pallid current
bitter turtle
#

yeah, logan did, 722 (main one) activates on female pronouns or something, which is slightly confusing? going to check if it's gaming the metric somehow

#

perfectly plausible that it doesn't make much sense to do this kind of thing on a model with such low baseline prediction accuracy

pallid current
keen pivot
bitter turtle
#

oh, for sure

keen pivot
#

Not too difficult to just grab a middle layer of Pythia 1.4b to train a dict real quick & re-do results, if you want?

bitter turtle
#

yeah will do soonish

bitter turtle
keen pivot
bitter turtle
#

👍

keen pivot
#

@pallid current , Is there an easy way for me to load in the dataset of the first chunk (specifically what PCA/ICA were trained on)? I don't want just the layer activations, but the original text/tokens

pallid current
#

hmmm not like suuper easy, depends what exactly you need. in activation_dataset.py you can use make_sentence_dataset to get you a big load of sentences from the beginning of the pile. you can tokenize each sentence and calc how many tokens will go into 1 chunk (2GB / 4 bytes i think) and just stop there and that's your data
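
Worked arithmetic for that, under loud assumptions: a chunk is 2 GiB of float32 activations, so 2GB / 4 bytes gives the number of stored values, and dividing by the activation width gives tokens. The width of 512 is an assumed example value, so swap in whatever the layer actually uses:

```python
def tokens_per_chunk(chunk_bytes, bytes_per_value, activation_width):
    """Number of token activations that fit in one chunk of the given size."""
    n_values = chunk_bytes // bytes_per_value      # stored floats per chunk
    return n_values // activation_width            # one vector per token

chunk_bytes = 2 * 1024**3   # assumed: "2GB" means 2 GiB
n_tokens = tokens_per_chunk(chunk_bytes, 4, 512)   # float32, assumed width 512
print(n_tokens)
```

If the chunks are stored in half precision or "2GB" means 2e9 bytes, the count changes accordingly.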

#

note that ICA is trained for 1 chunk but PCA batched across 10

bitter turtle
#

wait one second logan I have a util for this standby pushing now

pallid current
#

if you want a more exact idea of what goes in you'd need to modify chunk and tokenize to return sentences

bitter turtle
#

on my branch I edited activation_dataset.py and added a script to generate chunks

keen pivot
# bitter turtle 👍

I checked. I did
layer=15,
l1 = 1e-3 (w/ 12k, I predict sparsity of ~50-60)
dict=6x is probably best.
n_chunks: probably 20 (I did 10, but was still converging on the 8x dict)

tied, lr=1e-3, though I also had a bias for the decoder.

bitter turtle
#

I expect not having decoder bias is ~fine, should be ~centered at 0

bitter turtle
pallid current
#

mmcs looks good!

bitter turtle
#

for what?

pallid current
#

^ this image if this is to me

bitter turtle
#

ah, sure

#

@pallid current did a PR, this one is probably a bit messy and awful to work through sorry, since I rewrote some datagen code to make it work with more models & added a util for saving activations

keen pivot
#

From the overview button (on the top-left of wandb)

#

Though I would like to note that I trained it for 10 mini_runs over 20 chunks. So 200 chunks!

bitter turtle
#

Oh jesus

keen pivot
#

This one is for sure tied ae, and still converging (when looking at MMCS) after 50 chunks

#

This is an error on my part, because I mixed up the "n_chunks" & "mini-runs" part. I do think dictionaries trained under 10 chunks (which is what we've normally done) are undertrained for larger dicts

#

This is for 10 chunks (top) and 5 chunks (bottom)

#

And this is for 50 chunks (top) and 45 chunks (bottom):

bitter turtle
#

pod shitting itself rn; what's the data in 3 and 5?

pallid current
#

dunno abt 3/5 (tho think logan's using 3), i'm running an 8gpu sweep for checking higher lr low batch

#

can shut it off in a sec

#

@keen pivot logan can you run something like that but with MLP please?

#

my small batchsize/higher lr 10 chunk untied mlp didnt see much of a shift, v poor mmcs above like 3e-4

keen pivot
pallid current
#

probably both? yeah postnonlin

#

untied if you can only do 1

keen pivot
#

@pallid current, the pca_topk is size 1k, and ica_topk is size 500 in the mnt/.../baselines/ folder. Is there a same-sized version?

keen pivot
bitter turtle
pallid current
pallid current
bitter turtle
#

oh right

keen pivot
pallid current
#

yeah ideally

bitter turtle
pallid current
#

tbh i dont remember the tradeoffs in detail, but generally mcs could be solidly high at the normal l1 ranges

bitter turtle
#

incredibly confused

pallid current
keen pivot
#

Also, I had to change a lot of stuff to make it work, so like 50% I screwed up and it's running something we don't want!

keen pivot
bitter turtle
pallid current
#

whereas the ICA one doesnt do that

#

agree that shouldnt be a discrepancy

pallid current
keen pivot
pallid current
#

finally got round to properly suppressing those aten warnings, sooo much more satisfying to train now chadhuber

pallid current
#

ok we've basically waaaay undertrained our mlp dicts

#

yellow being 110 chunk-epochs, green and purple being the original 10

keen pivot
pallid current
#

mlp l1 curve every 10 epochs:

#

perf still pretty bad tbh

pallid current
keen pivot
fiery tangle
keen pivot
#

Confirmed: PCA top component is the outlier dimension

#

Guys. PCA_topk & ICA_topk suck soooo much. Like I can't even get a decent one for the apostrophe one.

keen pivot
#

This is the neuron basis. And I'm zooming in after 5.11 basically:

#

This is the best one I could find! Out of searching top-10.

keen pivot
#

BAM: Our dicts rock

#

Note: Going larger l1 (closer to the "all dead features" solution), there weren't any features found that weren't the outlier dimension.

Going smaller l1 (closer to the "identity/polysemantic" solution), there weren't any features found that didn't also activate for >10% of the dataset.

bitter turtle
#

Explanation of why I'm no longer comparing to LEACE directly since @pallid current asked:
Given a set of labelled points X with classes Z, LEACE guarantees that no linear classifier can predict the Z_i from the X_i well. This is fine and good if your X_i have shape seq_len×d_activation but if you erase on datapoints X_i of size d_activation then you don't guarantee that the model, which essentially sees (seq_len, d_activation) can't implement a linear classifier between them (and in fact it can and does).

This is a problem for direct comparisons as

  • it's not reasonable to compare LEACE KL-divergence OOD (e.g. on the pile); plausibly still useful to measure on-distribution
  • ablating a single feature direction is equivalent to a rank-seq_len ablation in the seq_len×d_activation space, so the ablations aren't the same

More generally I think we are aiming at much more general activation engineering with sparse feature dictionaries and it's not clear how to measure this

pallid current
pallid current
bitter turtle
#

well, even with the proper ablations it's not perfect at low levels. At layer 6 on pythia-1.4b it gets something like 0.56 which is still quite bad and worse than ours

#

I mean, I think I could still compare KL divergence on-distribution, might be something there

#

Like, even if our ablations are rank-k if we get a lower KL it's useful

pallid current
bitter turtle
#

I think so yeah

#

I'll probably write up those results

#

I'll be surprised if we do get better KL divergence because that would have weird implications for models using data mostly linearly or not. LEACE guarantees minimal edit for no linear classification under any inner product norm, including the one 'expected change to KL div per unit shift in this axis' although realistically it's probably horribly nonlinear

#

Yeah nvm that it's going to be horribly nonlinear

pallid current
#

right as i understand LEACE it removes the ability to predict from the activations at that layer, but assuming that there's non-linear work done between that layer and the output, its plausible that you can improve on LEACE in terms of whether there's like gender differences in the output

#

but then i suppose you could say even if there's non-linear computation ongoing between layer and output, if at any residual layer the information is stored linearly then linear concept erasure should be able to remove it

keen pivot
#

@bronze wraith

pallid current
keen pivot
#

@bronze wraith

bitter turtle
#

kind of want to see what we can do with a synthetic labelled dataset generated by e.g. GPT-3.5

#

not sure how you'd show very general erasure though

glass tinsel
#

@bitter turtle I have very little idea what this experiment actually is so you'll have to explain it to me from the beginning

bitter turtle
#

that is entirely fair

bitter turtle
glass tinsel
#

oh ok lol

#

also, do you have access to concept labels at inference time in this experiment?

glass tinsel
#

Oracle LEACE drops the assumption that the erasure function has to be an affine transformation, and for each component of the representation in any arbitrarily selected orthonormal basis, directly minimizes the squared distance between the original and the scrubbed value

#

If that still doesn't work you could try Quadratic LEACE, which I've mostly implemented but haven't merged into main yet. It's definitely going to be a less surgical edit but it ensures that no linear or quadratic classifier can extract any info about the concept.

#

Your use case might help me decide a couple details about how to implement QuadraticFitter and QuadraticEraser. Quadratic LEACE is an oracle method, so it requires concept labels at inference time, and it also has this weird property where you have to like "dispatch" each individual data point to a different affine transform depending on its concept label, which is hard to do in an efficient vectorized way. You can do it quasi efficiently with like torch.unique but I'm still unsure about whether to do this batching/preprocessing step on every call to QuadraticEraser.forward or force the user to do the preprocessing all at once or smth

bitter turtle
#

Wasn't planning to give it access to labels at inference time/use oracle erasers, but maybe later on I will.

glass tinsel
#

ok

pallid current
#

made some big summary graphs for residual stream

#

dip in quality seen from some of the 16 and 32 ratio dicts indicates they might be a bit undertrained

glass tinsel
#

Have you guys tried end-to-end training the dictionary to minimize loss btw? I suggested this to @keen pivot and he said it was part of the plan

keen pivot
keen pivot
#

Okay, just couldn't fall asleep, so I checked some perplexity stuff. Currently getting quite good perplexity diff relative to what I got earlier, even on only 10 chunks (like what?). Gonna run a few tests to check it out & see if I'm seeing straight.

bitter turtle
#

currently slightly unsure what activation editing directions to pursue, could look at

  • ability for activation editing with feature dictionaries to generalise to multiple tasks (not sure what this would look like from a testing perspective)
  • comparison to activation editing strategies like nulling the mean-diff vector
#

I expect general problems with robustness because it's not even clear what it means to do distributionally-robust activation editing at this point

bitter turtle
#

I think what I'll end up doing for the paper is more like what @keen pivot's doing at the moment, I'll park more ambitious things for another time

bitter turtle
#

Still got these results although need proper L1=0 run for baselining @pallid current

keen pivot
bitter turtle
#

yeah still have that to do

#

Not entirely sure what I should be doing with that, can't see any immediate inroads to distilling the relationships between features

keen pivot
bitter turtle
#

not sure that's necessarily a useful distillation bc obviously it's very dependent on the nonlinearity

keen pivot
#

Sure, but it might work?

bitter turtle
#

sure

keen pivot
#

I tried from layer 4 to 5 and it sucked, but this might work

#

You could even try applying the layernorm and GeLU before doing MCS to see if it’s meaningful.

bitter turtle
#

doubtful

keen pivot
#

In my heart-of-hearts I believe it may work best by integrating one in-distribution text, but that's no longer just weight-based

bitter turtle
#

yes I think the focus on purely weight-based analysis is somewhat misguided

keen pivot
#

But it would be so cool

keen pivot
#

Ya, this is basically it. I was only able to get to 30 perplexity by going 6x. 25 is the dream. You could get there by going polysemantic, but no point at that point.

#

Would be interested to see doing KL-div training at this point!

bitter turtle
#

For sure can do those graphs shortly

#

What dicts are those?

frank vortex
#

check it out i found a math feature

#

I was trying to find features that align with themselves more after they are transformed by a head

#

that is, features that have high cos_sim(W_OV f, f) where f is a learnt dictionary feature and W_OV is a transformation from residual stream to residual stream
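
A toy sketch of that ranking, assuming W_OV is available as a plain square matrix acting on residual-stream vectors. The helper names and the 2x2 example are invented for illustration:

```python
import math

def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def cos_sim(a, b):
    """Cosine similarity; 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def rank_by_self_alignment(w_ov, features):
    """Sort feature indices by cos_sim(W_OV f, f), highest first."""
    scored = [(cos_sim(matvec(w_ov, f), f), i) for i, f in enumerate(features)]
    return sorted(scored, reverse=True)

w_ov = [[2.0, 0.0], [0.0, -1.0]]          # scales axis 0, flips axis 1
features = [[0.0, 1.0], [1.0, 0.0]]
ranking = rank_by_self_alignment(w_ov, features)
print(ranking)
```

Feature 1 sits on a positive-eigenvalue direction of the toy matrix and ranks first; feature 0 gets flipped and ranks last.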

#

For some reason, features that are sorted in this way tend to be highly interpretable (there seems to be a context)

bitter turtle
#

@keen pivot the cosine sims of MLP-out don't look to be especially interpretable, this is a histogram of the gini indexes of MLP_dict @ resid_dict, i.e. a measure of how spread out the effect of one MLP-out direction activating should be; ideally if it were interpretable the effect would be sparse and we'd see this in the plot
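
Roughly what the per-row computation looks like, as a sketch: take one row of the cosine-similarity matrix (one MLP-out feature against every residual feature), take absolute values, and compute a Gini coefficient. This uses the standard sorted-rank formula and may differ in detail from what produced the histogram:

```python
def gini(values):
    """Gini coefficient of the absolute values: 0 = uniform, near 1 = sparse."""
    xs = sorted(abs(v) for v in values)
    n = len(xs)
    total = sum(xs)
    if total == 0:
        return 0.0
    ranked = sum((i + 1) * x for i, x in enumerate(xs))
    return 2.0 * ranked / (n * total) - (n + 1) / n

uniform_row = [0.5, -0.5, 0.5, 0.5]   # effect spread evenly over features
sparse_row = [0.0, 0.0, 0.0, 0.9]     # effect concentrated on one feature
print(gini(uniform_row), gini(sparse_row))
```

With n entries the maximum is (n - 1) / n, so values are only comparable between dictionaries of the same size.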

#

hists of cosine sim

#

I'm honestly thinking i've got the wrong dicts 🤔

#

it is mlp_out_l2 that flows into residual_l2 right @pallid current?

keen pivot
bitter turtle
#

ok so it works with residual thingies in each layer with REALLY HIGH l1 values (like the highest we use). 1e-3 doesn't seem to have this.

#

this is between l1 and l2 btw will check others soon

#

@keen pivot didn't you do something similar to this I can't remember

keen pivot
keen pivot
bitter turtle
keen pivot
bitter turtle
#

token-level stuff

keen pivot
bitter turtle
keen pivot
#

I'd expect more re-tokenization/bigram-esque stuff in earlier layers

#

Yep yep. A statistical argument makes sense.

bitter turtle
#

seems better for larger dictionaries (still resid-to-resid, but this time r=32)

keen pivot
keen pivot
#

Also is this the same layer's W_OV or the next layer's?

keen pivot
bitter turtle
#

ikr

keen pivot
#

Which models are you loading in? Like the directory?

bitter turtle
#

bigrun0308/...

#

uh l3 l4 ratio 32 residual

keen pivot
#

Could you try tiedlong_residual... in Hoagy's?

bitter turtle
#

/mnt/ssd-cluster/bigrun0308/tied_residual_l3_r32/_9/learned_dicts.pt
/mnt/ssd-cluster/bigrun0308/tied_residual_l4_r32/_9/learned_dicts.pt

bitter turtle
#

is that in hoagy's?

#

one sec

#

oh cool

keen pivot
#

Residual in Hoagy's base directory.

#

Nice! So just trained more or something

bitter turtle
#

@pallid current

keen pivot
#

sparse_coding_hoagy/tiedlong_tied_residual_l4_r4/_80/learned_dicts.p

#

Except the last one for whatever reason.

bitter turtle
#

ah

#

ignore the title: this is MLP out r8 for l3 at high l1; much better

bitter turtle
#

also not seeing much difference for higher l1 which is a something sign

bitter turtle
#

@pallid current do you have zero l1 baselines?

pallid current
#

morning everyone 🙂

pallid current
bitter turtle
#

ah, what about resid

pallid current
#

not for residual, at least across the model yet, soz

bitter turtle
#

aok

pallid current
#

run died because i had one dict ratio as an int not float lmao [0.25, 0.5, 1.0, 2.0, 4.0, 8.0, 16]:

#

but only for untied 🤔 🤔

#

might not be that tbf

frank vortex
# keen pivot Looks cool! I'm confused what the OV self-multiplication is supposed to do, but ...

I need to work more on the interpretation. Here is what I have observed: cos_sim(W_OV f_i, f_i) tends to be greater than cos_sim(W_OV f_i, f_j) for j not equal to i. Assuming that features correspond to meaningful directions in the residual stream, a feature transformed by an attention head tends to align more with itself than with other features. My hunch was that W_OV is more like compressing features so that they can be recovered (the recovered feature aligns with the original feature) rather than a self-multiplication. Also, since W_OV is responsible for copying, features that align with themselves might be the ones that get copied.

#

This shows that the OV is responsible for extracting the attribute for some tokens

#

Attributes, the way they are defined, are taken to be the context for a token, which is very similar to the feature that a token relates to. That is just a guess, I am unsure about this reasoning

#

I suppose one intuition or idea is that from the mathematical framework for transformer circuits paper, they put plots of the W_OV matrix eigenvalues. they draw blobs around the eigenvalues that cluster in the positive real direction; having a high CS of (W_OV f, f) means that feature is somewhere in that cluster (maybe not aligning perfectly with an eigenvector but thats ok) and with this library we can interpret it

keen pivot
#

Thinking more, W_OV from residual to residual is like reading from this direction and writing to that direction.

So here, you’re finding directions that the OV circuit reading from means it writes to that same direction.
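
(A rough sketch of that check on toy data, not code from the repo: `features`, `w_ov` and `ov_self_alignment` are hypothetical names, the dictionary is random, and W_OV is assumed to act as W_OV @ f on column vectors. It compares cos_sim(W_OV f_i, f_i) against the best cos_sim(W_OV f_i, f_j) for j != i.)

```python
import numpy as np

def cosine_sim_matrix(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def ov_self_alignment(features, w_ov):
    """features: (n_feats, d_model), one dictionary direction per row.
    w_ov: (d_model, d_model), assumed to map residual -> residual.
    Returns (self_sim, best_other_sim), one value per feature."""
    transformed = features @ w_ov.T                   # row i is W_OV f_i
    sims = cosine_sim_matrix(transformed, features)   # sims[i, j] = cos(W_OV f_i, f_j)
    self_sim = np.diag(sims).copy()
    off_diag = sims.copy()
    np.fill_diagonal(off_diag, -np.inf)               # exclude j == i
    return self_sim, off_diag.max(axis=1)

rng = np.random.default_rng(0)
d_model, n_feats = 16, 8
features = rng.normal(size=(n_feats, d_model))
# Toy stand-in for W_OV: identity plus small noise, i.e. a mostly "copying" head.
w_ov = np.eye(d_model) + 0.05 * rng.normal(size=(d_model, d_model))
self_sim, best_other = ov_self_alignment(features, w_ov)
```

With a real head you would swap in the actual W_OV and a learned dictionary; the features where `self_sim` clearly beats `best_other` are the "reads from and writes to the same direction" candidates.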

#

I wonder how you could connect this to real examples with sequence position.

Like suppose these features do copy information from one sequence to another. Can we see specific examples where that happens at specific sequence positions.

#

@bitter turtle for the record, I don’t quite understand the Gini coefficient stuff interpretation, and may be too tired to understand atm. But to give a take:

We want to know some statistic of how much one dictionary matches with another. For MLP-out, we expect lots of matches, much more than random, especially since it’s just an addition with attention.

keen pivot
bitter turtle
# keen pivot <@332271551481118732> for the record, I don’t quite understand the Gini coeffici...

I agree, I'll just throw down my ideas so far:

  • we want a given upstream feature to have a sparse impact on downstream features, i.e only impact a few and not all features uniformly (as random would) -- hence gini coeff stuff
  • MCS is probably not a great indicator here. Many low cosine similarity values probably indicate randomness/unconnectedness. Having a CS of about 0.4 seems to indicate a degree of sparsity in terms of what features impact what other features
  • we can probably also look at the covariance/correlation instead of the cosine sim (I would do this, but for some reason I am getting like 1800 dead features :/ )
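
(One way to put a number on "sparse impact", as a sketch only: the Gini coefficient of the absolute similarities from one upstream feature to all downstream features. `gini` is a hypothetical helper, not from the codebase; 0 means perfectly uniform, values near 1 mean a few downstream features take almost all the weight.)

```python
def gini(values):
    """Gini coefficient of the absolute values in `values`.
    0.0 for a uniform vector, (n - 1) / n for a one-hot vector."""
    xs = sorted(abs(v) for v in values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Standard formula on ascending-sorted data with 1-based ranks.
    weighted = sum((rank + 1) * x for rank, x in enumerate(xs))
    return 2.0 * weighted / (n * total) - (n + 1) / n

uniform_row = [1.0, 1.0, 1.0, 1.0]   # feature that touches everything equally
sparse_row = [0.0, 0.0, 0.0, 1.0]    # feature that touches one downstream feature
uniform_score = gini(uniform_row)
sparse_score = gini(sparse_row)
```

Applied row by row to a cosine-similarity (or correlation) matrix between two dictionaries, this gives one sparsity score per upstream feature that can be compared against a random baseline.
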
keen pivot
bitter turtle
#

I want to sort out the covariance stuff first

#

Also have you got the histogram thing integrated with the new setup?

#

If not send over the code anyway and I'll give a go on converting it

shell mural
#

but also how to track when information is flowing from one feature to another along a qk pair

keen pivot
#

I just trained the 160m & sent it off to Lucia & Lovis. Whooo! My first Trello to-do done. WHooOOOooOOOooOOOoooo

shell mural
#

inspired from a few papers i read recently

keen pivot
#

Nice! Definitely feel free to post intermediate results & half-baked hypotheses here:)

shell mural
keen pivot
shell mural
#

what, as in compute (f_i, W_OV^-1 f_j)?

keen pivot
#

No. F_i is from a dictionary trained solely on the output of attn_out. Then you'll get non-token-level features for sure.

shell mural
#

oh yeah

#

narmeen has just informed me that we already have those dictionaries. so i guess we should look at them lol

keen pivot
#

lololo

pallid current
#

been struggling with the autointerp on mlps all day. very mixed results on the longer runs, usually doing quite well but num dead feats is high even with l1 at literally zero

bitter turtle
#

Like I wouldn't expect any dead features up to ratio 2 but maybe from 2< I would

fading valley
#

Would appreciate it if someone could provide feedback on this idea

For the purpose of an enumerative safety result for reward models learned during RLHF:

  • extract n parameters for which the largest updates occurred during RLHF
  • transplant them to the base model to confirm their applicability in reducing loss on some measure that quantifies the success of your RLHF
  • (assuming the success of the last step:) obtain a representation exhibiting less superposition of some layer like in https://www.lesswrong.com/posts/wqRqb7h6ZC48iDgfK/tentatively-found-600-monosemantic-features-in-a-small-lm for both the base model and the RLHFd model (assumedly the difference should at least partially encode information about the learned reward model)
  • try to train an autoencoder to reconstruct activation vectors for the RLHFd model when given activation vectors from the base model as input in order to quantify how good the reconstructions from the previous step were (better reconstructions should be better at predicting the behavior of the RLHFd model assuming the autoencoder is able to properly reconstruct inputs for just the base and RLHF models separately)

Ideally resulting in an understanding of which parameter shifts caused different features to affect the outputs of the inspected layer

prime obsidian
#

@keen pivot Hi! What marc/er mentioned is the project I told you I was working on when we met IRL

bitter turtle
# fading valley Would appreciate it if someone could provide feedback on this idea For the purp...

My initial take is that this would require improvements in the sparse autoencoders we are currently using, we are not currently able to accurately reconstruct inputs to the degree I'd think would be needed for this kind of experiment. I'm also not sure why the final step helps quantify goodness of reconstruction for this, could you elaborate maybe? I'm also not sure how you would be able to usefully extract information out of the difference between the decompositions learned in step 3. Excited to hear more!

keen pivot
keen pivot
bitter turtle
keen pivot
bitter turtle
#

I guess I'm just not sure how robust our learned dictionaries are to like initialisation state etc.

keen pivot
fiery tangle
keen pivot
# fiery tangle Hey, is there a notebook for this [LW post](https://www.lesswrong.com/posts/Q76C...
GitHub

Contribute to loganriggs/sparse_coding development by creating an account on GitHub.

keen pivot
#

Yoooo, ablating the apostrophe feature mostly only affects the "s". (Note: I specifically searched for a feature that activated on the sentence: |Then we went to Dave'|, which is the possessive apostrophe, rather than other contractions)

keen pivot
#

When working w/ dicts across layers, I noticed the dicts seem:

  1. monotonically decreasing in MMCS (& # of features above 0.9)
  2. monotonically increasing in peak MMCS w/ respect to l1
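
(For reference, a minimal sketch of the two statistics, assuming MMCS means "mean max cosine similarity": for each feature in one dict take its best cosine match in the other dict, then average. `mmcs` and the toy dicts are hypothetical, not the repo's implementation.)

```python
import numpy as np

def mmcs(dict_a, dict_b, threshold=0.9):
    """dict_a, dict_b: (n_feats, d_model) arrays, one feature direction per row.
    Returns (mean of per-feature max cosine sim, number of features above threshold)."""
    a = dict_a / np.linalg.norm(dict_a, axis=1, keepdims=True)
    b = dict_b / np.linalg.norm(dict_b, axis=1, keepdims=True)
    max_sims = (a @ b.T).max(axis=1)   # best match in dict_b for each row of dict_a
    return float(max_sims.mean()), int((max_sims > threshold).sum())

rng = np.random.default_rng(0)
dict_a = rng.normal(size=(6, 10))
# Same directions, shuffled and rescaled: should count as a perfect match.
dict_b = 3.0 * dict_a[rng.permutation(6)]
mean_sim, n_above = mmcs(dict_a, dict_b)
```

Note the measure is not symmetric: `mmcs(a, b)` and `mmcs(b, a)` can differ when the dicts have different sizes.
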
pallid current
#

layer 5 really seems to be something very different and less suited to sparse coding it seems, always looks v different

#

like the graphs though!

pallid current
#

prob be good to log the xaxis

keen pivot
#

It's weird that last layer peaks when first layer drops. Would need to run on another model before making any conclusions

#

Oh, to clarify: This is MMCS from dicts with their l1, um, "neighbor". So 3e-3 & 4e-3, and 4e-3 & 5e-3.

keen pivot
pallid current
#

doing some graphs of % features active for the appendix

bitter turtle
#

What L1 values are we using for MLP?

#

And what do the reconstruction-sparsity plots for MLP look like?

bitter turtle
pallid current
fading valley
fading valley
bitter turtle
#

Like, have people even successfully done base model activations -> RLHFd model activations with an autoencoder? The final step seems to rely on this being possible.

fading valley
bitter turtle
#

Sure

bitter turtle
#

various activation editing techniques across depth; for the first few layers there are single features we can ablate that basically solve the task, but in later layers there aren't

#

the most reasonable explanation I can come up for this is that since we are ablating sparsely-activating directions, our edits are more 'in-distribution' than e.g. LEACE's edits at earlier layers

#

oh, 'mean' here is 'ablate along the difference-in-means direction when considering all token positions as unique datapoints' which is like laughably untrue but hahahaha idk any other techniques that operate on a per-token-position basis
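
(Sketch of that 'mean' baseline under my reading of it: flatten every token position into its own datapoint, take the difference of the two class means, and project that direction out. Whether "ablate" meant projection or something else is an assumption; all names are hypothetical.)

```python
import numpy as np

def diff_in_means_direction(acts_a, acts_b):
    """acts_a, acts_b: (n_tokens, d_model) activations for the two classes,
    with every token position treated as an independent datapoint."""
    direction = acts_a.mean(axis=0) - acts_b.mean(axis=0)
    return direction / np.linalg.norm(direction)

def ablate_direction(acts, direction):
    """Remove the component of each activation along a unit-norm direction."""
    return acts - np.outer(acts @ direction, direction)

rng = np.random.default_rng(0)
offset = np.zeros(8)
offset[0] = 2.0                                  # toy class signal on one axis
acts_a = rng.normal(size=(50, 8)) + offset
acts_b = rng.normal(size=(50, 8)) - offset
direction = diff_in_means_direction(acts_a, acts_b)
edited = ablate_direction(acts_a, direction)
```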

bitter turtle
#

histograms for various best dict features for ablating for pronoun prediction

keen pivot
# bitter turtle histograms for various best dict features for ablating for pronoun prediction

Good that these make sense for the early layers where it works!

One concern w/ layer 3 & 4 (which isn't much of a concern because they're bad) is that they might be outlier features, which ablating them makes a lot of tasks get worse (one indication is that the top tokens are sort of sorted by token frequency, but you could just do the "visualize feature" function to see if it activates for the first token & first delimiter).

bitter turtle
#

yeah, I can ~tell which ones they are by their activation magnitude, I'll check for you

#

Ok, initially it doesn't seem like they are

keen pivot
bitter turtle
#

I think for future activation editing investigations (inc potentially investigations to work into this current paper if we still have time after I get back from holiday), it'd be really useful to have a sweep at 4xdict size and some l1 value done of pythia-410m or something of equivalent scale, so I might set one off shortlyish; @keen pivot @pallid current do you have any similar requirements/wants wrt sweeps of larger models?

pallid current
#

i dont have any particular needs but agree that sweeps of larger models would be good

#

my main ask would be to include some quite large dict sizes and make sure to train for a long time to make sure the bigger ones aren't undertrained, to see if we can get a sense for where diminishing returns to size comes in for larger models

keen pivot
pallid current
# keen pivot Like train for 100 chunks, save every 5?

yeah i would want to see the equivalent of 20-30 chunks for 70m so yeah probably up to 100 accounting for larger activations and maybe slower convergence, and making sure that we're tracking mmcs and loss/sparsity over time to see if we're missing out

pallid current
#

been testing some variants of forcing the mlp directions to strictly be in the positive quadrant (and bumping up all the inputs by min(gelu))

#

loss curves are an absolute rollercoaster, and generally terrible in terms of convergence speed but i do kinda suspect there's something there

#

notably, that purple line is 2x overcomplete, competitive in loss (eventually, after being about 10,000x worse for the first 50 epochs 😅) with normal runs, and maintains about 100% live features

#

v possible its learning the identity or some degen solution tho

pallid current
#

turns out it was learning some weird degen solution but one which meant that all the autointerp came out as 'newlines/periods', at least for the nonzero l1 runs

#

much to ponder, but leaving it for now

keen pivot
bitter turtle
#

@cosmic yarrow that kind of thing is absolutely something I'm interested in doing, feel free to discuss in here

bitter turtle
cosmic yarrow
cosmic yarrow
bitter turtle
#

I mean I'm kind of dissatisfied with the current codebase as-is, I'd like to clean it up/redesign things to have more flexibility in general. I'm also on holiday for about two weeks at the moment, so I won't be working on it in that time. It probably makes sense to integrate it with the current codebase though, yeah

#

I think in general I at least am kind of unsure what will happen after we finalise the current paper

cosmic yarrow
pallid current
#

hey @cosmic yarrow, what kind of experiments are you thinking about doing?

bitter turtle
#

It was looking into reproducing/extending Yun's paper more, either by doing multiple layer dictionaries or using the FISTA/solve dict/basically K-SVD iterative method described in the paper.

gilded merlin
#

How to get involved in this project

cosmic yarrow
# pallid current hey <@846082367974146057>, what kind of experiments are you thinking about doing...

My overall goal is to see if some of the circuit tracing/causal tracing methods can be adapted to explore these dictionaries. The first thing I wanted to do is extend Yun's method by also training a dictionary for the mid-residual stream, right after attention. And then, using the COUNTERFACT dataset, adapt methods like Geva et al https://arxiv.org/pdf/2304.14767.pdf for tracing information flow across the sparsified layers. This is all very hand-wavy for now, apologies if it doesn't make sense.

shell mural
#

oh hey. me and a couple others are thinking about the same stuff. we're diving on mor geva's work rn

#

me, @hallow wyvern, @onyx compass, @frank vortex

#

im setting up notebooks for causal intervention techniques w/ counterfact dataset with narmeen, firstuserhere is getting a head start on geva's work, faulsname spent time a few weeks back getting familiar with the dictionary learning codebase

#

we should chat

cosmic yarrow
bitter turtle
# shell mural im setting up notebooks for causal intervention techniques w/ counterfact datase...

I've got some causal ablation type stuff on my fork of the GitHub repo if that's helpful for you guys. TL;DR of as far as I got is that

  • our features are decent for e.g. concept erasure at early layers (on pythia-70m)
  • we can identify like a single feature which is responsible for IOI (by which I mean 'if you change this feature on the corrupted activation to its activation on the clean data you get like 60% of the way to the behaviour on the clean data')
  • you almost certainly want to use a dictionary set for a model better than pythia-70m lol
#

Also if I were to do it again I would mean-center the activations for every layer

#

You also probably don't want to directly convert ACDC, the amount of graph edges is absolutely absurd for even moderate dictionary sizes

#

Oh! Also I ran into a bunch of annoyances with imperfect reconstructions. @cosmic yarrow I absolutely foresee FISTA at least partially solving those. I would train with FISTA like Yun et al if I were to do it again

cosmic yarrow
gilded merlin
#

someone share the github link of this project, i have done some work in interpretability area, the project seems interesting and i want to work on it if there are some ideas to be tested

keen pivot
#

Though we should update the readme.

#

I have loads of ideas to test! Let me get to the office and I can send a list.

gilded merlin
#

okies, mean while i get a hang of the project repository

bitter turtle
#

We got these kind of curves in terms of activations-per-example against reconstruction loss.

#

FISTA should converge to better (more exact) solutions in the highly sparse regime, is my thinking

#

You can squint Very Hard and see our encoders as kind of being almost a single iteration of (L)ISTA, to give you an idea of how 'good' our encoders are
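
(A minimal FISTA sketch for the inference step, assuming the standard lasso objective 0.5*||x - c @ D||^2 + l1*||c||_1 with a fixed dictionary D. This is plain textbook FISTA on toy data, not the repo's code or the exact procedure from Yun et al; `fista`, `soft_threshold` and the toy dictionary are hypothetical names.)

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def fista(x, dictionary, l1, n_iters=200):
    """x: (d_model,) activation; dictionary: (n_feats, d_model).
    Returns sparse codes c of shape (n_feats,)."""
    # Lipschitz constant of the gradient = squared spectral norm of the dictionary.
    step = 1.0 / np.linalg.norm(dictionary, ord=2) ** 2
    c = np.zeros(dictionary.shape[0])
    y, t = c.copy(), 1.0
    for _ in range(n_iters):
        grad = (y @ dictionary - x) @ dictionary.T       # gradient of the squared error
        c_next = soft_threshold(y - step * grad, step * l1)
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = c_next + ((t - 1.0) / t_next) * (c_next - c)  # momentum step
        c, t = c_next, t_next
    return c

rng = np.random.default_rng(0)
dictionary = rng.normal(size=(32, 8))
dictionary /= np.linalg.norm(dictionary, axis=1, keepdims=True)
x = 2.0 * dictionary[3] - 1.5 * dictionary[17]           # built from two atoms
codes = fista(x, dictionary, l1=0.05)
```

The first iteration from c = 0 is just `soft_threshold(step * dictionary @ x, step * l1)`, i.e. a linear map followed by a threshold, which is the sense in which a one-layer encoder looks like a single (L)ISTA step.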

cosmic yarrow
keen pivot
bitter turtle
keen pivot
bitter turtle
#

I think that failed because of accursed convergence/not enough iterations etc etc.

keen pivot
bitter turtle
#

I think FISTA with our current dicts would be good, less sure (but still pretty sure) about dicts trained with FISTA.

#

Oh lol

keen pivot
#

I was kind of expecting doing the KL thing to close the gap (at least for functional equivalence), but a better solver would be complementary to this.

bitter turtle
keen pivot
pallid current
#

i still do have some feeling that our autoencoder methods should be helpful in that they should more closely track which features can be easily pulled out by a linear layer. though tbh the residual stream has so much capacity that it's quite likely this is barely an issue

bitter turtle
bitter turtle
bitter turtle
bitter turtle
pallid current
#

btw one thing that i think we've underexplored so far is whether there's actually a maximum number of features that we tend to find. like we often see that with mlp we don't see the largest ratio having the larger number of active features. curious if that's eventually also the case with residual stream at some size, like it seems d(active_feats)/d(num_feats) is declining at 32x, some layers more than others, but we havent checked if it ever hits 0 for residual stream

#

running 64 and 96 on gpt2sm to test this, though should prob go back to pythia70m because the extra dims hurt a lot for this

keen pivot
# bitter turtle Yeah, just linear decoder, what was the 'Hessian thing'?
GitHub

Contribute to zeyuyun1/TransformerVis development by creating an account on GitHub.

bitter turtle
#

Oh, I think that's just how they optimise the dictionary.

keen pivot
#

@gilded merlin, just my general list of things-to-do:

  1. Learn circuits for many target tasks: adversarial examples, chess/othello, in-context learning, deception, truthfulness, sycophancy, etc.

Like the dictionaries aren't perfect atm, but trying to extract circuits from real things will still inform what heuristics to use (then better dictionaries can be slotted in later).

  2. Better dictionaries: FISTA stuff above (& I want to chat w/ some Harvard people here who do dictionary learning as their research once our paper's on arxiv) and KL-divergence penalty (this is what I'm doing this week).

  3. Activation engineering using our learned dictionary features

  4. Better automatic circuit detection (A) how to go forward (e.g. layer 3->4), (B) connect w/ dictionaries learned on MLP-out & Attn-out, (C) weight based connections (ie features in residual connect to the features in MLP, and weights connect them. Can you predict this from just the weights?)

  5. Connecting circuits learned w/ datapoints. How do datapoints lead to learned features/dictionaries? This could even be paired up w/ developing Deep Learning theories since it's easier to develop theories when you have specific examples of circuit-formation.

  6. Refactoring code for my manual interp stuff & make a standalone colab notebook that can load in a dictionary from e.g. hugging face

  7. Optimization help: code in perplexity check every N batches, fix wandb display (or some equivalent), have a changing l1 value to specify a set sparsity (ie features/token)

  8. Updated Github w/ minimal code to run on a new model & look at it in details.
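
(For the "changing l1 value to specify a set sparsity" item, one very simple option is a multiplicative controller. This is a sketch under my own assumptions: `update_l1`, `rate` and the toy numbers are all hypothetical, and nothing like this is confirmed to exist in the repo.)

```python
def update_l1(l1, measured_l0, target_l0, rate=0.1, min_l1=1e-8):
    """Nudge the l1 coefficient towards a target sparsity.
    measured_l0: average number of active features per token on recent batches.
    Raises l1 when too many features fire, lowers it when too few."""
    if measured_l0 > target_l0:
        l1 *= 1.0 + rate
    elif measured_l0 < target_l0:
        l1 /= 1.0 + rate
    return max(l1, min_l1)

# Toy usage: pretend sparsity is inversely proportional to l1.
l1 = 1e-3
for _ in range(100):
    fake_measured_l0 = 0.05 / l1       # stand-in for a real measurement
    l1 = update_l1(l1, fake_measured_l0, target_l0=20.0)
```

In a real run the measurement would come from the autoencoder's activations every N batches, and `rate` would probably need to be small enough not to fight the optimizer.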

keen pivot
#

For this week, I'm going to get that KL thing working & get the perplexity check coded up as well.

gilded merlin
#

@keen pivot can the experiments in the repository be easily run on a colab?

gilded merlin
#

which of above tasks can be done on colab

keen pivot
#

It was possible at one point, and it's on the to-do to make them so

#

A colab notebook is going to not be so great because you need other files.

#

I think (6) is the one for this then

gilded merlin
keen pivot
#

But the current repo is optimized for like multiple GPU's and having the repo loaded, which is doable to put in a colab

#

But will be difficult for pushing useful PR's

#

You could convert this file: https://github.com/loganriggs/sparse_coding/blob/main/lucia_and_lovis.ipynb

To a notebook. You need to have the dictionary loaded from pythia160m, and the autoencoders folder from: https://github.com/HoagyC/sparse_coding

and I'm unsure how to import that to a notebook.

gilded merlin
#

so task 6 is for demonstration easiness basically

keen pivot
#

Yep

#

I will say vast.ai is quite cheap compute & only takes a day or so to learn.

#

It would help!

gilded merlin
#

Will start on that, if any problems come i will ping you

#

Have you seen this paper, in above list you mentioned learning circuits for target tasks i.e in-context learning

#

this paper might give some insight or food for thought regarding in-context learning, i read it recently

bitter turtle
pallid current
#

finished now (for a couple of layers), havent had a chance to look at results yet, will v soon tho, currently getting the proper results for the correlation btwn interp scores and e.g. kurtosis and skew. took longer than expected cos needed a batched version of the moment calculations which was a little awkward

pallid current
#

now running results, running it for lots of different chunks so its mega slow, results in about an hr

pallid current
#

hmm the number of feats just keeps growing

#

will switch to pythia 70M and keep cranking it up

#

might have to keep turning the batches up as well, this is 32 chunks, and the time taken to converge to this shape gets higher as you increase the size

#

worried that the 10chunk 32x dicts were a bit undertrained

#

now setting off 32, 64, 128 and 256 on pythia70M for 64 chunks. time might be way too long and might have to settle for a single layer, results are generally v consistent across non final layers

#

hmm eta like 5 days :(( will restrict to layer 2

#

still will have to wait like 2 days for long enough 256x results

#

remind me on wednesday morn to check the 256x results lol, other ones will be done by morn

bitter turtle
pallid current
#

got some level of results from the larger dict sizes. not seeing any reduction in the ability of the model to incorporate more and more features, even as the % of active features keeps falling. these results are a bit odd in that they dont show as much of a dropoff towards high or low values, which we see both in the gpt2sm run just above and in the previous pythia70m runs.

#

256 is almost certainly undertrained (still running) this is after 32 chunks

bitter turtle
#

How do they look on the reconstruction-sparsity plot

pallid current
bitter turtle
#

Eel

#

Eek

#

Take the mean over multiple batches or something?

pallid current
#

sorry went for lunch but i can just take the sample size down

#

results are really annoying though bc they dont match up with the previous small batch runs

#

pain to see but there's no improvement at these sizes. you can see that 32-128 are on top of each other and 256 is undertrained

keen pivot
#

Is that l1 in the legend? Edit: Wait it can't be cause we have different sparsities. What's in the legend? @pallid current

pallid current
#

really dont understand why sparsity isnt improving

#

legend is ratio and the total number of feats

#

from 1k to 131k

bitter turtle
#

Are these tied or untied?

bitter turtle
pallid current
pallid current
bitter turtle
#

That was LISTA and that's because it converged badly

#

Haven't actually tried regular non-neural-net FISTA

pallid current
#

i dont understand why these larger models - with many more active features! - dont increase the bias and become more specific though

bitter turtle
#

Well, I guess if they did too much they'd have really shit reconstruction, because we use a ReLU for thingy

pallid current
#

yeah but in that case we don't have a mechanism for why they would even bother turning on more features. like if it's not improving reconstruction loss, and there aren't fewer features active per token, then that's all of the losses so what's the value in these extra feats?

keen pivot
bitter turtle
#

Also think we should try this type of activation out further

pallid current
bitter turtle
#

Ye exactly

#

I think I have a thing on my branch/maybe your branch called Thresholding or something

keen pivot
#

The KL stuff is going well. I think we can tradeoff some reconstruction loss for KL/perplexity, which I think is what we really want (ie features that have intuitive, causal effects)

bitter turtle
pallid current
keen pivot
#

Note: I need to add to my future work list: monosemanticity metric.

If we had a dataset that had very basic features (maybe just token level) and made a metric on how monosemantic it is (maybe defined by a weighted histogram measure?), this may also inform how monosemantic other types of features are.

It would also be a cheap test for checking if our dictionaries at different sparsities/hyperparams are more monosemantic

pallid current
#

made a few notes from the textbook im reading now of things particularly relevant to sparse coding, welcome to read here, super messy, will prob keep adding to it https://docs.google.com/document/d/1MzSS2EFXtva5uTWxl7KGjSs_t3Z5mv7yoM2bMsZjATs/edit?usp=sharing

bitter turtle
#

I think I also want to scale up the concept erasure stuff to a more capable model, do you think we could do a run on pythia-410m? At maybe like 4x dim size and a few L1 levels over all layers?

#

Or, if not all layers, maybe every other one

pallid current
pallid current
#

ok currently running 8 l1 sweep of pyth410, 80 chunks, layers [0,2,4,6,8,10,12]

keen pivot
bitter turtle
#

I think maybe we should check out using VAEs (or most likely, something similar with also-useful subcomponents) for this, like we might be able to achieve more robust/in-distribution activation engineering by sampling from e.g. the latent distribution conditioned on the 'deceptiveness' direction being higher than some amount (obviously an oversimplification, you'd also want to preserve other properties of an activation if you were to edit it, and you'd probably find directions in latent-space to preserve semi-autonomously)

(Also have some friends in Bristol that have maybe been wanting to work with this for a bit, @thorny cypress and @coarse flint)

#

I can't actually think of that many benefits that doing a search over conditioned latent space has over searching over e.g. sphered data for a point with a certain magnitude in a certain direction and minimum distance to some other target point like you might do in concept editing

#

Unless, like, "covariance doesn't capture sparsity well" or something

bitter turtle
#

Lol #1146607658179252286; Time to read all papers posted in this channel ever

bitter turtle
#

Right it seems the exact thing you want here is the normalising flow autoencoder that @opal basin mentioned ages ago.

opal basin
#

ohh?

#

i never did try guided sampling with it

#

but it might work!

#

also you can try training a diffusion model in a normal autoencoder/low-beta VAE latent space to sample from it and guiding the diffusion model sampling process to update the prior from the diffusion model with the evidence from the criterion, guided sampling totally works for diffusion

glass tinsel
#

wait did you guys do the end to end training yet

glass tinsel
# bitter turtle Right it seems the exact thing you want here is the normalising flow autoencoder...

have you seen the BERTflow paper https://arxiv.org/abs/2011.05864

bitter turtle
keen pivot
glass tinsel
#

who cares about reconstruction loss 😛

keen pivot
#

I haven't checked the individual features that are different between them.

#

One general problem for our work is determining the "goodness" of the model.

I'm currently working on an automated monosemanticity metric for both the input activations & output. It's only on single-token features, but monosemantic on single-token-level features may imply monosemantic on other features.

I'm hoping this helps us actually measure what dictionaries are "better" or not.

bitter turtle
# keen pivot Yep! Very basic result: The dictionaries have much better perplexity for the sam...

Hehe I'm still predicting significant perf improvements (on both perplexity and reconstruction loss) when training better autoencoders, ours aren't even capable of accurately reconstructing data at the moment. I do worry that we'll get some accursed less-monosemantic solution if we switch autoencoder though, maybe something about the current setup incentivises monosemanticity compared to using FISTA for a given sparsity level

keen pivot
bitter turtle
#

Yes it's kind of a small worry, but I'm just slightly skeptical of the 'sparsity induces monosemanticity' story ATM I guess, partially because of the fact that bigger dicts don't work as well as you might expect ( @pallid current have you compared 'mean autointerp' or some weighting of that between dictionary sizes?)

pallid current
#

yeah on the pythia70M i ran autointerp over different dict sizes, it's in the draft appendix atm

#

general pattern was for the first couple of layers, interp scores didnt change with dict size and for middle-latish the interp generally got worse

#

which is correlated with there being clearer improvements in sparsity/fvu when increasing dict size for those early layers, while the improvement is minimal from about layer 2

keen pivot
bitter turtle
#

I now retract that, I got confused.

rancid summit
#

are there plots of the pre activation distribution of values for some arbitrary autoencoder neuron, along with the negative biases (ie where the ReLU truncates it)?

bitter turtle
#

Not currently, we should definitely do that.

#

OpenAI has a different approach for finding directions that also used that visualisation, it'd be interesting to compare the location of our negative bias to theirs, however they set it.

bitter turtle
# keen pivot Can you explain the connection with bigger dicts bad -> sparsity induces monosem...

Ok, solid claims

  • I think some of the 'sparsity induces monosemanticity' phenomenon is 'our measures of monosemanticity are slightly gameable by sparse activations'
  • I think that switching to FISTA will initially reveal solutions with higher sparsity and reconstruction accuracy, but lower monosemanticity, because FISTA is just a much better 'encoder' (similarly to the topk-pca thing from before)
bitter turtle
rancid summit
#

no worries

keen pivot
#

First attempt at a monosemanticity measure. The peak is at 3e-4, which doesn't seem right to me. Though I think I'm too hungry to explain what experiment I did specifically

pallid current
#

token entropy or something?

keen pivot
#

(Okay, getting food in 5 minutes)

I'm measuring monosemanticity on a token level, and looking at the features that activate for single-tokens only (e.g. periods, newlines, commas, etc). I assume that all LLMs will dedicate features to these, even if you scale, you'll get feature splitting (ie a feature that activated for all periods now splits into two features that activates for subsets of periods).

So I find these features & count the number of tokens they activate on for that single token, divided by total number of non-zero activations. Weighted means I account for how much that feature activates (e.g. "." with activation 8).
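
(As a sketch of how I read that metric, for one feature and one target token: the fraction of the feature's non-zero activations that land on the target token, plus a version weighted by activation size. `single_token_monosemanticity` and the toy data are hypothetical, not the actual experiment code.)

```python
def single_token_monosemanticity(tokens, activations, target_token):
    """tokens: list of token strings; activations: this feature's activation
    at each of those positions. Returns (unweighted, weighted) scores in [0, 1]."""
    active = [(tok, act) for tok, act in zip(tokens, activations) if act > 0]
    if not active:
        return 0.0, 0.0
    on_target = [act for tok, act in active if tok == target_token]
    unweighted = len(on_target) / len(active)                   # count-based
    weighted = sum(on_target) / sum(act for _, act in active)   # activation-mass-based
    return unweighted, weighted

tokens = [".", "the", ".", ",", "."]
activations = [8.0, 0.0, 6.0, 1.0, 5.0]
unweighted, weighted = single_token_monosemanticity(tokens, activations, ".")
```

Here the feature fires on 4 positions, 3 of them periods, so the unweighted score is 3/4; the weighted score is higher because the one off-target activation (on the comma) is small.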

keen pivot
#

Okay, for next todo's:

  1. Verify these single-token features are indeed monosemantic in the low-l1 regime (in case of bugs in code)
  2. Check other features to see if they're monosemantic across l1's

Would be nice: have guaranteed features in a slightly more complex model (like TRACR code in superposition decomposed), then we can check if those features are monosemantic for a given l1 value.

pallid current
#

got centered data running, trying an r4 l3 run at the moment

polar violet
#

logan was v cool irl

#

if people are curious

#

v chadgoose energy

bitter turtle
#

lmao

bitter turtle
pallid current
#

im centering it when running setup_data. also normalizing variance. currently the means/stds aren't saved anywhere which needs to change but they're just the means and stds of the first chunk

bitter turtle
#

As in, proper sphering/whitening? Are you decorrelating the data as well @pallid current?

pallid current
bitter turtle
#

Hmm

#

I think you should probably be decorrelating as well. You can use the BatchedPCA to implement efficientish sphering
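
(For anyone unsure what the transformation does: a minimal PCA-whitening sketch in plain NumPy on a full in-memory array. It does not use the codebase's BatchedPCA, whose interface I'm not assuming here; `fit_whitening`, `apply_whitening` and `eps` are hypothetical. After whitening, the data has zero mean and approximately identity covariance, so it is centered, variance-normalized and decorrelated.)

```python
import numpy as np

def fit_whitening(acts, eps=1e-5):
    """acts: (n_samples, d_model). Returns (mean, whitening matrix)."""
    mean = acts.mean(axis=0)
    centered = acts - mean
    cov = centered.T @ centered / (acts.shape[0] - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)         # cov is symmetric
    # Scale each eigenvector (column) by 1/sqrt(eigenvalue); eps guards tiny eigenvalues.
    whiten = eigvecs / np.sqrt(eigvals + eps)
    return mean, whiten

def apply_whitening(acts, mean, whiten):
    return (acts - mean) @ whiten

rng = np.random.default_rng(0)
# Toy correlated data: independent Gaussians pushed through a fixed mixing matrix.
mixing = np.eye(4) + 0.5 * np.triu(np.ones((4, 4)), k=1)
acts = rng.normal(size=(2000, 4)) @ mixing + 3.0
mean, whiten = fit_whitening(acts)
whitened = apply_whitening(acts, mean, whiten)
```

Features learned on whitened data live in the whitened basis, so decoder directions would need to be mapped back through the inverse transform before comparing them with residual-stream directions.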

pallid current
#

i'm open to trying it but don't understand the transformation well enough to have a feel for how that would interact with the features

#

gonna have to mostly pass it onto you, i'm on holiday from tomorrow and its MATS final presentations today

keen pivot
# keen pivot Okay, for next todo's: 1. Verify these single-token features are indeed monosema...

It is true that these high sparsity (e.g. 350/500 d_model) dictionaries have single-token level monosemantic features. It is also true that many, many other features are polysemantic. I've now got a few more ideas:

  1. Different monosemantic datasets - build a dataset of a feature that's more complex than single-token level. single-token level may be low-hanging fruit for the model (especially since I'm doing quite common words). So doing more complex features may be a better measure of monosemanticity in general.

  2. Measuring how much a feature is just copying a dimension in the residual stream - residual stream is (mostly) polysemantic. So we can figure out how much a feature's [activations/variation] can be explained by 1 neuron basis element (I feel like there's an established way of doing this, but not familiar with it). Additionally, instead of looking at 1 feature at a time, we could look at specific datapoints & see if their encoding is more like an identity or not.

  3. For a given sparsity, S, that means a token has S features activating on average. We could qualitatively examine a few example sentences to find the features that are most reconstructing it. e.g. For the sentence " Of the 5 donuts, he ate all 5", we could find all features that activate for the last token (i.e. " 5"), and make statements like "30% of reconstruction is 'single digits' feature, 25% is 'repeated token', etc". Better dictionaries will just make more sense here.
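Idea 3 could be sketched roughly as below. The assumptions here are that the reconstruction is a plain linear combination of dictionary rows (no decoder bias) and that a feature's "share" is the projection of its contribution onto the full reconstruction; both are illustrative choices, not something fixed in the discussion.

```python
import numpy as np

def reconstruction_shares(codes, dictionary):
    """Fraction of the reconstruction attributable to each feature.

    codes:      (n_features,) sparse code for one token.
    dictionary: (n_features, d_model), one feature direction per row.
    Shares sum to 1 and can be negative if a feature points against the reconstruction.
    """
    x_hat = codes @ dictionary                      # (d_model,)
    contributions = codes[:, None] * dictionary     # per-feature piece of x_hat
    return contributions @ x_hat / (x_hat @ x_hat)
```

Sorting the shares in descending order and reading off the top few features gives statements of the "30% single digits, 25% repeated token" kind.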

bitter turtle
pallid current
#

hmm dyou reckon we should either just mean center or fully whiten then? i was gonna just center to begin but then i thought about those outlier dimensions and added the stds

bitter turtle
#

I mean I'd probably want to try both tbh

pallid current
#

also the run failed at like 64 epochs for some reason but there's dicts in /mnt/ssd-cluster/pythia70m_centered/.../_63 if anyone wants to check on them

bitter turtle
#

Could you run one on just mean-centered data before you go on holiday I guess?

pallid current
#

yup can do

pallid current
#

not sure, no proper error, it just kinda hung, said something about a wandb network error but not sure if that's symptom or cause

pallid current
pallid current
#

speaking of slow feedback loops, did anyone look at the gpt-2-small results? they look really good!

#

clear gains up to 96x ratio, and look at the y-axis - it starts at 0.02!

bitter turtle
#

What the fuck

#

This shit needs reporting and investigating what

pallid current
#

i know! im pretty shocked

#

need to get the interp on it asap

#

cant believe im gonna go on holiday to NYC instead of grinding on this 😭

bitter turtle
#

Initial hypothesis

  • pythia is incredibly uncentered, gpt2small isn't maybe? This seems easy to test, just look at the means of everything
#

I remember Neel Nanda saying something about this, this seems low-cost-to-test-and-potentially-very-high-value

pallid current
#

yeah hang on wasn't there some plot of the centeredness of different models

bitter turtle
#

I vaguely remember looking at pythia s and it wasn't, might be hallucinating tho

#

I'm having strong words with my past self if the whole problem was mean centeredness

pallid current
#

yeah me too this is a big mad

#

also looks like the mlp results are helped by it

bitter turtle
#

That doesn't even make any sense 😕

pallid current
#

this is latest result, 4x ratio

bitter turtle
#

Time to rewrite everything

pallid current
#

this is previous, mixed ratios

bitter turtle
#

How long you in nyc for

pallid current
#

not as stark but still big diff

#

10 days

bitter turtle
#

A rewrite before ICLR is possible then maybe

pallid current
#

yes for sure

#

we know what to do

bitter turtle
#

Phew

bitter turtle
pallid current
#

Yeah agreed from a theoretical perspective I don't understand it in the mlp, except maybe just to say that the mlp may be used to perform arbitrary function approx that isn't very tied to the neurons themselves, and this still exhibits a sparse structure

#

But I think that's mostly already true for our normal sc on mlps

#

Will have to actually interp and see

#

set off a run at ratio [4,8,16,32] residual stream centered all layers

#

still unit variance but not decorrelated (soz)

bitter turtle
#

I'm kinda worried that doing unit variance without decorrelating will squish important info and maybe harm perf possibly idk

pallid current
#

will have a quick look after the first model is done and see if it looks decent, will cancel and rerun without altering variance if not, otherwise will try set off in a couple days or maybe airport lol

bitter turtle
#

Enjoy NYC btw

plucky bay
bitter turtle
pallid current
#

ok got some results back from large centered runs (including unit var, not fully whitened) on pythia70m, first impressions are that it's not doing anything too crazy

#

even on l5 surprisingly

#

i think the actual takeaway from the last day or two might be that gpt2sm >> pythia70m

#

plane is boarding now no time to test whitening but set off a run without zerovar just to compare

#

priority should be to run more tests on those large gpt2sm dicts tho imo

bitter turtle
#

Second hypothesis: something something outliers. Do they differ significantly between the different models?

keen pivot
keen pivot
#

They both have outliers, but I think Pythia also had them for the first delimiter (eg period/newline) but not GPT small

bitter turtle
#

What about their magnitude

#

I guess I'd also be interested in FVU against sphered when trained on sphered

pallid current
#

WAIT ughhh deleted the graphs above they are bs, i didnt pass the argument to the HF activation function only the baukit one 🤦‍♂️

pallid current
#

norm(mean) by layer:

#

still worried i've done something wrong somehow but redone it with centered data (no unit variance this time) and not seeing any diff in l5 perf

keen pivot
bitter turtle
#

Maybe crisis of confusion averted

#

Still think that philosophically speaking we should be centering/allowing learned center points but initialise to mean-center

keen pivot
#

I did reconstruction, not FVU, but they should be similar

pallid current
#

@keen pivot hahaha wait how? I can't go back and check the code rn but i dont even know what bug would get the fvu wrong

#

Unless the fvu isn't normalized properly when switching to gpt2sm?

#

Tho I think the most interesting thing was the fact that 96x was better than 64x, do you still see that?

keen pivot
bitter turtle
keen pivot
#

Oh, I'm not testing on data it was trained on, so maybe!

pallid current
#

Hmm yeah layer 0 and 1 so yeah not that surprising for layer 2 gpt2sm tbf

keen pivot
#

I'm doing layer 4

keen pivot
pallid current
#

Original was 2 I'm pretty sure, there's only 1 layer which has the big dicts i think?

keen pivot
#

Not the best graph, but here's perplexities for layer 4

#

Oh note: I don't think the first column is original, because it shouldn't change I think (unless I'm shuffling data)

pallid current
#

Wait so did you try to replicate my original L2 Graph?

keen pivot
#

I’m hoping tomorrow we’ll get it figured out

pallid current
keen pivot
#

Figured it out: gpt2 has large activations, so variance is large, so FVU is smaller (since FVU = MSE/Var)
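A small sketch of FVU = MSE/Var, assuming the variance is taken per dimension around the batch mean and then averaged (the project code may normalise differently).

```python
import numpy as np

def fvu(x, x_hat):
    """Fraction of variance unexplained for activations x and reconstructions x_hat.

    Both arrays have shape (n_samples, d_model).
    """
    mse = np.mean((x - x_hat) ** 2)
    var = np.mean((x - x.mean(axis=0)) ** 2)   # per-dimension variance, averaged
    return mse / var
```

Two things follow from the definition. If the absolute error stays the same while the activation variance grows, FVU shrinks proportionally. If activations and reconstructions are both multiplied by the same constant, FVU does not change at all, so raw activation scale on its own should not move it.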

bitter turtle
keen pivot
#

I just got the batch at layer N, and took the median

bitter turtle
#

Elementwise?

keen pivot
keen pivot
bitter turtle
#

Ok, so I'm quite confused as to why that's on the graph? Like, what are you trying to show with it? (Just slightly confused 😅)

keen pivot
#

Just general statistics. I don't know what I'm doing

#

But ya, I think max-activation -> high variance is the thing

bitter turtle
#

Like, maybe get a batch of pythia data, scale so that mean pythia activation mag = mean gpt2small activation mag, and see what happens to the FVU

keen pivot
#

Like the difference between a variance of 35 & 0.5 is 70x, which I think fully explains the FVU difference

keen pivot
bitter turtle
keen pivot
bitter turtle
#

What the hell

keen pivot
#

You're right. Hoagy's results are more like FVU/20

bitter turtle
#

Oh I misread this

#

Could you plot the unscaled things as well?

#

Wait, are you retraining?

keen pivot
#

This is just the ratio 6 pythia one in /mnt

bitter turtle
#

Wait, what are the lines here?

keen pivot
#

FVU of pythia-70m layer 2

bitter turtle
#

But what are the different lines?

keen pivot
#

FVU, FVU/20, FVU/70

bitter turtle
#

Ok, so I think maybe you misunderstood what I meant;
I'm confused as to why FVU would change so much with activation mag, so I wanted to test this by training autoencoders on scaled activations but the same underlying dataset, and see if anything changes

keen pivot
#

I should be able to plot MSE from the pythia one & gpt2 small & gpt2 small will be 3x better.

keen pivot
bitter turtle
keen pivot
#

Oh, I think I have access to the scaled Pythia ones

keen pivot
bitter turtle
bitter turtle
#

Sorry for the confusion 😅

keen pivot
keen pivot
bitter turtle
bitter turtle
keen pivot
#

I'm not confident in me quickly setting off a run making it uncorrelated

bitter turtle
#

I'm back home this evening, I could set one off maybe

keen pivot
bitter turtle
#

Yeah

keen pivot
bitter turtle
#

What are the axes here?

keen pivot
#

They both have only like 7-10 extreme values.

keen pivot
bitter turtle
#

Ok! So this seems like a significant difference. Pythia's outliers are waaaay smaller in terms of standard deviations outside mean

keen pivot
bitter turtle
#

Sure, something like that.

keen pivot
#

So no holy grail by just centering data?

bitter turtle
#

Did you find perplexity changes significantly between gpt-2 and pythia?

bitter turtle
keen pivot
bitter turtle
bitter turtle
pallid current
#

So hard to really read on diff graphs with diff axes

pallid current
#

Those fvu vs sparsity graphs you linked that I sent a couple days ago

bitter turtle
#

I can do a comparison of KL-div under various transformations (mean-centering, sphering) on pythia when I get back

#

Broadly think KL-div is the thing we should focus on minimising

keen pivot
keen pivot
bitter turtle
#

Trouble is, I also don't know what the reasonable comparison for relative perplexity diff between models is ://

keen pivot
#

@bitter turtle, pretty big differences!

#

Oh, nevermind. I'm such a loooser. I need to compare by sparsity

bitter turtle
#

Lol

#

Also put 'original' at the other end maybe

#

(purely aesthetic request)

keen pivot
bitter turtle
#

I mean the objectively correct thing to do would be axhline or something

keen pivot
#

Oh ya, that works too

keen pivot
#

Okay, I think I want to plot by both sparsity & sparsity/d_model, but I expect them to mostly be the same

#

Looks good!

#

Like very good!

#

This was run on ~250k tokens for calculating perplexity

#

Note on KL: how is the model normally trained w/ EOT tokens? Is it masked? Does this affect how we'd want our autoencoders to have low KL-div with the model, or is this a total nothing-burger of a concern?

#

I could also look at datapoints that the reconstructed model is worse at predicting & see if there's some statistic that separates them. For example, maybe it is mostly high activating datapoints.

bitter turtle
# keen pivot

ok this is like the opposite of what I expected maybe? kind of implies we shouldn't be sphering things

bitter turtle
# keen pivot Can you elaborate?

If we sphere, it downweighs the importance of the outlier dims wrt MSE, and maybe the above graph shows that outliers relatively more important since the GPT2 one preserves them better or something?

#

Obvs will actually test

keen pivot
bitter turtle
#

Yeah when I set it off I'm going to log that

#

Pbbly tmmrw got home too late

#

I'm going to approximate by 'FVU for largest activating component'

bitter turtle
#

If we're looking at this from the perspective that L1 acts to disentangle latents maybe it'd be interesting to implement something like https://arxiv.org/abs/2205.05862 with a sparse autoencoder

#

Relevant tldr diagram

bitter turtle
#

tbf I haven't tried 'KL under reconstruction' before so not sure what I should expect

#

this is also layer 5

#

so maybe thats weird

bitter turtle
#

yeah looks like a product of being layer 5, this is layer 2

#

not seeing too much difference between centered vs not

keen pivot
bitter turtle
#

not sure what you're addressing here (or what 'this' here refers to)

keen pivot
#

Since centering didn't replicate, there may be other statistics of gpt2 data that are responsible.

bitter turtle
#

yeah im like 80% sure it's the rel size difference of the outliers

bitter turtle
keen pivot
bitter turtle
#

yes, haven't got around to that yet

keen pivot
#

I think I could do one on the perplexity

#

Haven't thought through the experiment

bitter turtle
#

I'm currently just seeing the correct L1 range for the sphered data

keen pivot
keen pivot
#

Like that's what you're currently working on?

bitter turtle
#

figuring out/looking for

#

no, I'm currently working on getting

keen pivot
#

Oh, ya.

bitter turtle
#

yep

#

lol

keen pivot
#

You just do the sparsity?

bitter turtle
#

wdym?

keen pivot
#

Like select l1's to get a features/datapoint (ie sparsity) between 5 & d_model.
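A rough sketch of that selection step, assuming "sparsity" means the mean number of non-zero features per datapoint. The helper names and the dict of measured sparsities are hypothetical.

```python
import numpy as np

def mean_active_features(codes, threshold=0.0):
    """Average number of active features per datapoint for codes of shape (n_samples, n_features)."""
    return float((np.abs(codes) > threshold).sum(axis=1).mean())

def l1s_in_range(sparsity_by_l1, d_model, low=5):
    """Keep the l1 coefficients whose measured sparsity lies between `low` and d_model."""
    return [l1 for l1, s in sparsity_by_l1.items() if low <= s <= d_model]
```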

bitter turtle
#

oh, yeah

#

oh, I changed something and now it's the same as the other ones, duh

#

I'm dumb

bitter turtle
#

relevant info; seems about the same weirdly

#

this is layer 3

#

unsure what's going on with sphered data in the low-sparsity setting, perhaps they are undertrained

keen pivot
#

the sphered one looks like layer 5.

bitter turtle
#

maybe 'approximately equal' was a bit extreme 😅 but I think they look similar at least

#

I would maybe put this as "not conclusive but decent evidence that it's mostly the outliers"

#

this is gpt-2-small, running pythia-160m which should be more equivalent (in model size) now:

keen pivot
#

What's the graph? You say "this is gpt-2-small, running pythia-160m", so is the graph gpt2 or pythia160m?

bitter turtle
#

oh, this is gpt-2 small

#

pythia-160m 🤔

#

I'd say there is definitely some other structural difference here then

#

(in addition to the outliers being vastly more significant compared to the norm)

bitter turtle
#

I don't think this is a very informative graph, but

#

gpt2-small

#

pythia

#

contribution looks ~constant by sparsity

#

basically shows that 'if you only allow the top-2 directions (the outliers), gpt-2 has better performance than pythia' which would be evidence for 'gpt-2 is better because it can proportionally represent outliers better'

#

@keen pivot

keen pivot
bitter turtle
#

they have the same axes, but yees?

keen pivot
#

Wait no!

bitter turtle
#

????

keen pivot
#

Okay, I see that mean centering makes it suck

#

for both

bitter turtle
#

I think this makes sense, like outliers were typically very positive or very negative, and not both, so they can't nicely be captured by a sparse code which is mean-centered

keen pivot
#

How does centering relate to outliers? If you center, then outliers contribute less to variance?

bitter turtle
#

well, basically I don't think the outlier dims are mean-centered particularly.

bitter turtle
#

well then

#

mean centering is 'wrong' for the outliers

keen pivot
#

And doing this to other dimensions has little effect?

bitter turtle
#

doing what to other dimensions?

keen pivot
#

mean centering

bitter turtle
#

seemingly overall it has little effect

#

overall I think mean centering is 'closer' to the correct value

#

hold on, median-centering might be closer to being correct

keen pivot
#

I did try learning dictionaries that didn't include the outlier dimensions, & they didn't have much improved performance.

#

I should get on the perplexity-but-exclude-outlier-dimensions things

keen pivot
#

No big difference when removing top-2 outlier dims

#

I mean, a big difference for the really awful l1s for pythia

keen pivot
#

And zooming in:

#

And for sure, the perplexity-diff goes to 0 if you use my code and find the top 500 outlier dims, replacing them, so that probably works.

#

So outlier dimensions don't explain the difference in perplexity

bitter turtle
#

Could you summarise your interpretations of these plots?

keen pivot
# bitter turtle Could you summarise your interpretations of these plots?

Maybe the difference in perplexity is because gpt2 better reconstructs its outlier dimensions, so let's just run the perplexity-under-reconstruction both normally and when "carrying through" 2 outlier dimensions (ie just replace the reconstruction of the outlier dimensions with the actual outlier dimensions).

If the "carry through top-2 outlier dims" one makes the perplexities match more between gpt2 & pythia, then the cause is the outlier dimensions.

In the graph above, it doesn't really make a difference at all, so the outlier dimensions are not the cause
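The "carry through" step might look something like the sketch below. How the outlier dimensions were actually picked is not stated here, so choosing them by largest mean absolute activation is an assumption, and both function names are made up.

```python
import numpy as np

def top_outlier_dims(x, k=2):
    """Indices of the k dimensions with the largest mean absolute activation (assumed criterion)."""
    return np.argsort(np.abs(x).mean(axis=0))[-k:]

def carry_through(x, x_hat, dims):
    """Copy the true activations into the reconstruction on the chosen dimensions."""
    patched = x_hat.copy()
    patched[:, dims] = x[:, dims]
    return patched
```

The patched reconstruction is then fed back into the model in place of `x_hat`, and perplexity is compared against the unpatched run.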

bitter turtle
#

Hmm, I don't think GPT-2 "better reconstructs" its outliers necessarily, I'd expect e.g. FVU on only the outlier dimensions (note that this is different to what I tested before, before I tested 'FVU on the entire thing, but only allowing outlier features to be active') to be about the same between the models.

#

I just think that both dicts learn to predict the outliers fairly well/almost perfectly, since there's a strong incentive to do so compared to all the other dimensions. Then, GPT-2 gets better overall FVU because more of the norm is in something it learns perfectly, the outliers, hence the lack of a change in your plots.

#

@keen pivot

glass tinsel
# keen pivot

sorry stupid question is this using the L2 loss or the cross entropy loss? if you train the autoencoder to minimize CE it should just automatically do the right thing with the outliers right?

bitter turtle
glass tinsel
#

ok

bitter turtle
#

erasure across depth scores for 410m, slightly wild

#

I'm still concerned I'm using LEACE unfairly, I might try and 'fake' a more in-distribution dataset by prepending random samples of the Pile or something.

bitter turtle
#

runs using 10-shot and 6-shot prompting respectively. (the one on the right, where LEACE is worse, is 6-shot prompting)

#

Example prompt so you get a feel for what I'm doing:

My name is Connie and I am a female. My name is Eva and I am a female. My name is Mary and I am a male. My name is Paris and I am a female. My name is Jamie and I am a female. My name is Dorothy and I am a female. My name is Edison and I am a male. My name is Alex and I am a male. My name is Maurice and I am a male. My name is Cary and I am a male. My name is Ana and I am a[completion]
#

kind of spooky dataset tbh

keen pivot
keen pivot
bitter turtle
#

what

keen pivot
bitter turtle
#

checking dataset rn

keen pivot
#

I mean, Mary can be whoever they want to be, but like stereotypes, you know?

keen pivot
#

@bitter turtle, making a circuit for a few layers from the feature you found for gender would be cool. Like I can manually interp the ones in layer 8 which seem to do good, & back-chain to previous layer features. I could also do this for mlp_out & attn_out.

bitter turtle
#

kl divergences for ablations on (presumably flawed 🤔) dataset

bitter turtle
#

rerunning

glass tinsel
bitter turtle
#

(except doing concept erasure at a specific layer, and not concept scrubbing)

glass tinsel
#

got it, what concept are you erasing?

bitter turtle
#

gender prediction from name

glass tinsel
#

got it. how does this work for the dictionary feature thing?

#

like you just locate the feature most correlated with gender?

bitter turtle
#

no, it's stupider than that; filter for features above a freq. activation threshold, and compare their erasure ability on a test dataset

glass tinsel
#

what is erasure ability? like fitting a probe on the post-erasure activations?

#

and seeing the loss

bitter turtle
#

sorry, end-to-end model score on the test dataset

glass tinsel
#

oh ok so the model itself is prompted to predict gender and then you also try erasing gender from the activations and see how bad that makes the model?

glass tinsel
#

Also how do you actually perform the dictionary-based erasure

bitter turtle
bitter turtle
glass tinsel
bitter turtle
#

yes

glass tinsel
#

in the like, activation space

bitter turtle
#

yes

glass tinsel
#

not in some higher dimensional space in which the overcomplete basis is orthogonal

#

okay
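The exact erasure code is not shown in the thread. One reading of "in activation space" is an orthogonal projection that removes a single dictionary direction from each activation vector, sketched here under that assumption.

```python
import numpy as np

def project_out(x, direction):
    """Remove the component of each row of x along `direction`.

    x:         (n_samples, d_model) activations.
    direction: (d_model,) dictionary feature direction; need not be unit norm.
    """
    d = direction / np.linalg.norm(direction)
    return x - np.outer(x @ d, d)
```

Because the dictionary is overcomplete, projecting out one direction also changes the components along every other feature that is not orthogonal to it, which is one possible source of side effects.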

glass tinsel
glass tinsel
#

So I'd first want to investigate this issue

bitter turtle
#

yes, I was very confused by that.

glass tinsel
#

And once you've sorted that out, if LEACE still has a smaller effect on perf than orthogonal projection, then I would say that's expected

#

And the extra causal effect of orthogonal projection is due to side effects

bitter turtle
glass tinsel
#

like you fit the eraser on one distribution and apply it on another?

bitter turtle
#

ah, ok, no that shouldn't be happening

glass tinsel
#

ok

#

how big is the dataset

bitter turtle
#

1000 prompts or so

glass tinsel
#

I would do a sanity check with setting method="orth" and affine=False on LeaceFitter because in theory that should give identical results to your Mean method

bitter turtle
#

yep

glass tinsel
#

I should note that while we did do one of these destructive intervention experiments in the LEACE paper

#

I think that the gold standard should really be to achieve fine grained control over model output

#

That's much harder to achieve and also more practically useful

#

In particular I think evaluating methods by how much they screw up the model is sort of perverse; in the LEACE paper we actually do the opposite (lower perplexity / KL is better) since that suggests better surgicality

#

There are lots of ways to screw up a model; trivially you can replace activations with constants or i.i.d. noise

bitter turtle
glass tinsel
#

lower perplexity is better?

bitter turtle
#

I'm evaluating KL divergence from the base model under the fitted intervention on a subset of the Pile
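A minimal sketch of that metric, assuming it is KL(base || edited) over next-token distributions, averaged over token positions. The direction of the KL and the averaging are assumptions.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def mean_kl(base_logits, edited_logits):
    """Mean KL(base || edited) for logits of shape (n_positions, vocab_size)."""
    log_p = log_softmax(base_logits)
    log_q = log_softmax(edited_logits)
    return float((np.exp(log_p) * (log_p - log_q)).sum(axis=-1).mean())
```

An intervention that leaves the logits untouched scores exactly 0, which is why surgicality has to be read alongside how well the target concept was actually removed.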

glass tinsel
#

mmm

bitter turtle
# bitter turtle kl divergences for ablations on (presumably flawed 🤔) dataset

@glass tinsel this is KL-divergence on the Pile under the learned interventions (ignore LEACE probably); the idea was that since our directions are sparsely activating, hopefully we might see that ablating them brings the activations less out-of-distribution, and so we see less KL divergence. However, it seems to be very dataset-dependent

glass tinsel
#

hmm yeah I think you always need to be looking at both effectiveness and surgicality simultaneously

#

bc the identity intervention achieves perfect surgicality

glass tinsel
bitter turtle
#

to clarify I'm not actually that hopeful about this actually being a useful concept editing method, but was just trying to illustrate a possible application/evidence for the learned directions being 'meaningful'

glass tinsel
#

hmm

#

See I could imagine that this might actually be a useful method

#

the end to end version

#

not the reconstruction loss version

#

bc the end to end version is taking into account which directions are most important

bitter turtle
#

agree

#

the autoencoders are not good enough for that yet, though (they are fairly lossy), and I don't really see the benefit of this over e.g. activation editing + good understanding of transformer activation pdf for not-too-insanely-ood activation engineering

glass tinsel
#

how are you planning to make the autoencoders better

bitter turtle
#

I think FISTA for actually good and not terrible sparse coding given a learned dictionary is probably the safest bet.

#

you can do something like K-SVD to optimise dictionaries using FISTA as an encoder
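For context, a compact FISTA sketch for the encoding step with a fixed dictionary. It solves the standard lasso objective with signed codes; a non-negativity constraint, batching, and the K-SVD dictionary update are all left out, and the names are illustrative.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def fista(x, dictionary, l1, n_iter=200):
    """Minimise 0.5 * ||x - c @ D||^2 + l1 * ||c||_1 over the code c.

    x:          (d_model,) activation vector.
    dictionary: (n_features, d_model), one feature direction per row.
    """
    D = dictionary
    step = 1.0 / np.linalg.norm(D @ D.T, 2)    # 1 / Lipschitz constant of the gradient
    c = np.zeros(D.shape[0])
    y, t = c.copy(), 1.0
    for _ in range(n_iter):
        grad = (y @ D - x) @ D.T               # gradient of the squared-error term at y
        c_next = soft_threshold(y - step * grad, step * l1)
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = c_next + ((t - 1.0) / t_next) * (c_next - c)   # momentum extrapolation
        c, t = c_next, t_next
    return c
```

Unlike a single feed-forward encoder pass, this iterates towards the optimal sparse code for the given dictionary, at the cost of `n_iter` matrix products per datapoint.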

bitter turtle
#

fixed the distributional shift (I was ablating from all token positions, including the prompt), and we get these results.

keen pivot
bitter turtle
#

weird mix of reasons, mostly because it shifts activations ood

bitter turtle
glass tinsel
#

If you want to go hardcore you could do Quadratic LEACE. I'm actually kind of curious how big the edit magnitude would be. Haven't gotten around to measuring that for the paper yet.

bitter turtle
keen pivot
bitter turtle
#

The surgicality is, yes, but you can't reasonably conclude it's not doing something spurious without transfer

keen pivot
bitter turtle
#

no, this is KL on pile 10k, and transfer would e.g. be using the same interventions to measure perf change on a different but correlated task

keen pivot