#Sparse Coding
I thought you were looking for similarity between the bases for different layers; I think I'm very confused at a high level here about
- what you are trying to do
- how you are trying to do it
My bad for poor communication lol
no problem
At the highest level I'm aiming for much stronger circuit analysis
And strong analysis of complex circuits
What this might look like is you fix some dataset/distribution of data, or maybe some benchmark/eval
We can use various interpretability techniques to narrow down what attention heads and MLPs are involved, make hypotheses for what functional purpose they serve, maybe do some causal scrubbing
And then great we have a giant computational graph that represents the circuit and appears to be the main stuff that matters for the behavior of interest
We can probably even use dictionary learning to make the graph and circuit even better
Logan did it before
But a missing component I'm seeing in interpretability is the ability to analyze (and utilize!) the edges of that graph. Like we know that sure the output of this plugs into the input of that, but what is that input/output
Well, skipping over more refined stuff like attention head outputs, we can probe into the residual stream at any point and use dictionary learning to decompose it into sparse combinations of features that sometimes seem to have actual meaning to them
What I see in the channel currently is actually we can understand the residual stream pretty damn well actually. This seems pretty huge to me
Ok, what are those same questions but within the scope of sparse coding specifically?
I want to use sparse coding on the residual stream before and after a transformer block, to better understand what the transformer block is doing
I would rather work in
A -> Attn -> A -> MLP -> A
Rather than
B -> Attn -> C -> MLP -> D
Where A,B,C,D are dictionaries for the residual stream.
A is trained on data from different spots
B,C,D are localized to that spot
I guess my confusion is that doesn't that implicitly make the same assumptions as this?
Hmm. Maybe? 😅
I'm boarding a flight now, I'll have time to think about this more
But I think you're right lol. I feel pretty dumb now
Nvm back in my camp. no i don't think it implicitly makes those assumptions, though we may end up having less interpretable individual features. Hmmm
Excited to hear your conclusions on this, I'm doing inter-layer stuff this week
question: do you think that for all the autointerp stuff we should run it on a different chunk of data to the ones that the models/decompositions have been trained on?
at the moment we use chunk 0 to train all of the decompositions and then also for all of the interp stuff
BTW with very minimal modification to your code, I was able to train an autoencoder on roneneldan/tinystories-33m, and it finds meaningful features. Behold, the mojibake feature:
I'll clean up the changes tonight and send in a PR to support that model and to describe the process in the readme, I don't think it'll conflict with any of the stuff you've done recently
Edit: OK, there was more cleanup than I thought. Making PR tomorrow.
have also asked lee, looks like he's online
ah very nice!
wanted to check whether we tend to find directions which match up with either embed or unembed matrices, but if i've not made any mistakes then seems like there's pretty much none of that, which is honestly very surprising to me
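A minimal sketch of that check, assuming the dictionary and the (un)embedding are both stored with one direction per row. The random arrays are stand-ins, `max_cosine_sim` is a hypothetical helper name, and this is NumPy rather than the repo's torch code.

```python
import numpy as np

def max_cosine_sim(dictionary, reference):
    """For each dictionary row, the highest cosine similarity to any reference row."""
    d = dictionary / np.linalg.norm(dictionary, axis=1, keepdims=True)
    r = reference / np.linalg.norm(reference, axis=1, keepdims=True)
    return (d @ r.T).max(axis=1)

rng = np.random.default_rng(0)
unembed = rng.normal(size=(1000, 64))   # stand-in for unembedding directions, (vocab, d_model)
learned = rng.normal(size=(256, 64))    # stand-in for learned dictionary features
learned[0] = 3.0 * unembed[42]          # plant one feature that copies a token direction

mcs = max_cosine_sim(learned, unembed)
baseline = max_cosine_sim(rng.normal(size=(256, 64)), unembed)  # "random" reference level
```

Comparing the histogram of `mcs` against `baseline` is one way to judge whether any features sit meaningfully above random.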
layer patterns seem right
i thought r32 layer 0, which is the case where it benefits from the large ratio the most, might be really latching onto tokens
Oh cool! It is quite above random, though I really do expect layer 5 to have way higher cs than layer 4 in the unembed. Maybe something going on w/ the mean, since the last layer learns less features than the other layers (ie has more dead features).
I do expect unembedding dims to be a linear combination of dictionary features (in the last layer), and to privilege more common tokens over less common ones.
I'd expect ablating the direction to cause the next few tokens to be predicted worse.
Also, there's minimal code to be able to search for features that activate in the last token position of some custom text in my notebook if you've seen it.
Yes, probably. I wrote a little util for activation generation today, I'll add a feature for this and do a PR.
wait whats the diff btwn that and the activation generation funcs that already exist, either for train and interpret?
i should be able to do this today but i feel pretty awful atm so if you did thatd be great
yeah i wouldnt trust the results up at 16 - 32x, some of them aren't converged for sure
ok am doing this now, saving what would be the 101st chunk of the pile as a test set for this stuff
No this is just a command line util for doing that + also generating multiple layers activations at once
Oh lol
look at make_one_chunk_per_layer in standard metrics
no parallelization tbf so it could be way faster
@pallid current
what's the actual task that this data comes from?
wait sorry id been totally stupid, the interp stuff uses openwebtext by default so it's already off the training distribution
Pronoun prediction, dumb hacked together task, need a big dataset to do a solid test for sure
early stages in doing the big auto interp results but early returns are pretty based
lot of recent anxiety about ica quelled by this image lol
general picture emerging is that winrate of sparse coding vs all baselines on top or top-random is very high, but a lot lower on random-only, where ica-top-k is quite competitive and sometimes our dicts just do quite poorly
what's the rough normalized MSE / sparsity you're getting in gpt2sm?
hey, these are the graphs we're getting in the residual stream for pythia70M, we dont have gpt2sm results to hand i dont think
hmm how are you calculating unexplained variance? (or is that the same thing as normalized MSE)?
(normalized MSE: MSE / ((target - target.mean())**2).mean())
yep that's it:
residuals = (batch - x_hat).pow(2).mean()
total = (batch - batch.mean(dim=0)).pow(2).mean()
return residuals / total
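For reference, a self-contained NumPy version of the same ratio (the function name is made up here; the original snippet is torch). It also shows the two anchor points of the scale: a perfect reconstruction scores 0, and predicting the per-dimension mean scores 1.

```python
import numpy as np

def fraction_variance_unexplained(batch, x_hat):
    """Mean squared residual divided by the variance around the per-dimension mean."""
    residuals = np.mean((batch - x_hat) ** 2)
    total = np.mean((batch - batch.mean(axis=0)) ** 2)
    return residuals / total

rng = np.random.default_rng(0)
batch = rng.normal(size=(1024, 512))                 # stand-in activations, (n_tokens, d_model)
mean_prediction = np.broadcast_to(batch.mean(axis=0), batch.shape)

perfect = fraction_variance_unexplained(batch, batch)
mean_only = fraction_variance_unexplained(batch, mean_prediction)
```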
also the points near 0 are a bit hard to tell, is there a log plot?
cool
oh also how big is the latent here
dont have a log plot to hand, can make one tomorrow easily if you're interested. latent dim is 512
no worries no need to go out of your way
we've generally focused on regime of 100 or fewer active features, though the autointerp metrics are surprisingly not that strongly correlated with l1 coef
yeah just trying to get a sense of what reasonable normalized MSE scores are
because the anthropic toy model results were like 1e-3 and I was like "woah that's pretty low"
is that public?
i think we've kind of stopped looking for an exact right like dict size or l1 value like they saw with that bounce, like maybe there is some way but with the llms/autoencoders we're using it seems to always be pretty smooth
I see
another Q: any intuition on where on the sparsity/reconstruction tradeoff you want to be?
(context: I'm currently doing something like the kurtosis based autoencoder thing I described last time, and sparsity is controlled with a very weird set of hparams (as opposed to just tuning L1), so I haven't been paying it much attention. but maybe I should)
the way i was hoping to answer it was to use the autointerp scores, using random scoring, to quantify the overall % of the variance of the layer which is captured by the explanation
I see
basically weighting interp scores by feature variance. issue is that if there's signal in the interp scores of different dict sizes and l1 values (within a reasonable range) then it's pretty slight
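A rough sketch of that weighting, assuming unit-norm decoder rows so that a feature's share of the layer variance is just the variance of its activation. All names and numbers are illustrative, not the project's code.

```python
import numpy as np

def variance_weighted_score(interp_scores, activations):
    """interp_scores: (n_feats,); activations: (n_samples, n_feats)."""
    var = activations.var(axis=0)          # per-feature activation variance
    if var.sum() == 0:
        return 0.0
    return float((interp_scores * var).sum() / var.sum())

interp_scores = np.array([0.8, 0.2])       # made-up autointerp scores
activations = np.array([[0.0, 0.0],
                        [2.0, 0.0],
                        [0.0, 0.0],
                        [2.0, 2.0]])
score = variance_weighted_score(interp_scores, activations)
```

A high-variance feature with a good explanation pulls the overall number up more than a rarely active one does.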
also the performance of these dicts is strongest on top or top-random scoring, the random performance is a little disappointing
maybe worth trying random-among-activating?
like random but throw out anything that the relu clamps to 0
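One way to read that suggestion, as a sketch: score the simulation only on positions where the real feature is nonzero. The helper name and the toy numbers are assumptions for illustration.

```python
import numpy as np

def correlation_among_activating(true_acts, simulated_acts):
    """Pearson correlation restricted to positions where the real feature fired."""
    mask = true_acts > 0
    if mask.sum() < 2:
        return float("nan")
    t, s = true_acts[mask], simulated_acts[mask]
    if t.std() == 0 or s.std() == 0:
        return float("nan")
    return float(np.corrcoef(t, s)[0, 1])

# Toy case: the simulator knows *when* the feature fires but gets magnitudes backwards.
true_acts = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 2.0, 3.0])
simulated = np.array([0.0, 0.0, 0.0, 0.0, 3.0, 2.0, 1.0])

all_tokens = float(np.corrcoef(true_acts, simulated)[0, 1])
among_activating = correlation_among_activating(true_acts, simulated)
```

Here the all-token correlation is positive while the activating-only correlation is negative, so the two scoring schemes can disagree sharply.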
oh interesting, yeah we're screening sentences for nonzero variance but i guess you mean not including zeros in the correlation calculation?
might also be hurt by having to use 3.5 as the simulator, when i've looked at the simulations they're quite painfully dumb often
4 is better but also pretty dumb
tbh I don't actually know what would be a really good metric
also I'm surprised your dictionary is so small
and the sparsity is not that high as a fraction of the dictionary
does larger dictionary+more sparse help?
in my experiments I've been looking at really big really sparse autoencoders (like 100 active/10k dictionary)
yeah i agree it's a bit surprising how well the small dicts work, i think there's some reason that sparse coding seems to work that we dont fully understand/doesn't match our intuitions - though we do go up to 32x ratio = 16k feats, or 72k feats for MLP (though those were probably a mistake 😅). i think it might be that the larger ones take longer to converge and also that they struggle to learn closely related features
you start to get lots of features dying by that stage. we talked about trying reinitialization methods where if a feature hardly ever activates you reinitialize it, randomly or with some residual vector or something, but we haven't put time into it
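A sketch of what such a reinitialization step might look like, using the "some residual vector" option. Nothing here was implemented in the project; the function, the threshold `min_rate`, and the shapes are all hypothetical.

```python
import numpy as np

def reinit_dead_features(decoder, fire_counts, n_seen, residuals, min_rate=1e-4, rng=None):
    """Replace rarely active dictionary rows with normalized reconstruction residuals."""
    rng = np.random.default_rng() if rng is None else rng
    dead = np.flatnonzero(fire_counts / n_seen < min_rate)
    if dead.size == 0:
        return decoder, dead
    picks = rng.choice(len(residuals), size=dead.size)       # sampled with replacement
    new_rows = residuals[picks]
    new_rows = new_rows / (np.linalg.norm(new_rows, axis=1, keepdims=True) + 1e-8)
    updated = decoder.copy()
    updated[dead] = new_rows
    return updated, dead

rng = np.random.default_rng(0)
decoder = rng.normal(size=(6, 4))                    # (n_feats, d_activation)
fire_counts = np.array([120, 0, 75, 0, 300, 9])      # times each feature was nonzero
residuals = rng.normal(size=(32, 4))                 # stand-in for (batch - x_hat) rows
new_decoder, dead = reinit_dead_features(
    decoder, fire_counts, n_seen=10_000, residuals=residuals, rng=rng
)
```

In a real training loop the matching encoder weights and optimizer state would presumably need resetting too; that part is left out.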
ah I see
just an unsubstantiated intuition: because of the l1 penalty, when the features are closely packed it's harder for a dict element to find a new feature than to take the degenerate 'just turn off' solution
got it
hopefully will publish current results to get some interest and then try to dig into really understanding what's going on a bit more
thanks! would love to get your feedback on a draft in a week or two if you've got time
yeah the code is set up to be able to run it, might even have some dicts floating around already. will def run a proper set before we publish. any metrics you'd be particularly interested in?
i dont think we'll have the credits to do that much autointerp on it (unless we got a load more haha 👀), would def be interesting to be able to directly compare with your original results but would need gpt-4 logprobs for that
oh btw what's the reason you chose 10k feats/100 sparsity?
oh well I'm trying many different feature counts
and 100 active is not something I can easily control directly
there are like 4 knobs that might in theory affect the number active? but I've never actually tried to make active go up or down
yeah fair, but is there some signal about what number of feats might be appropriate? (meaning the 10k not the 100)
hmm
I've just been making the dictionary bigger and bigger lol
my intuition is that there must surely be lots of features that don't activate 99% of the time
as opposed to only somewhat more features than model dim that don't activate 70% of the time
unfortunately making the dictionary bigger doesn't always improve loss in my setup unless you tune some knobs
yeah i agree there must in some sense be an incredibly long tail of feats, though i think that in order for the model to be able to actually work with that amount of superposition it must translate dataset features into a latent space which allows a lot of compression
and in that case it's unclear to me what number of feats really means
why would that be necessary?
you can still pick out each feature if you want to use it
most things just won't operate on most features but that's fine
yeah treating each feature on its own you can isolate arbitrarily many separate feats
but we can tell from the non-sparsity of the neuron basis that it's not actually picking out just tiny parts of the residual stream to act on
so there must be some degree of similarity between how it treats similar parts of the residual stream which increases as features get closer
hmm not sure this implies that
i think you should be able to get some kind of interesting bound on superposition by making sure that the noise doesn't grow exponentially as you do sequential computation on the features but i've not yet had the chance to really try and flesh it out
maybe the MLP uses multiple neurons to add lots of bends to a function that operates on one feature
maybe the MLP is actually doing some crazy non neuron basis aligned computation
I don't see reason to expect the MLP basis to be anything too sane
hmmm i think thats kinda possible but high levels of superposition make it incredibly difficult for it to be robust, at least as i understand it
like the position of the nonlinearity will change a lot depending on which other features are active which makes it super hard to build a complex nonlinear func
i ran some mlp tests recently to try and gauge to what extent you still get a robustly nonlinear response curve when features were distributed across multiple neurons and it seems that it gets pretty linear after you start to spread your mlp features over just a handful of neurons
but then clearly we do seem to see feats across neurons, from sparse probing and sparse coding etc
so im pretty confused basically
What were these tests?
generating a synthetic dataset as we would for toy models but constraining each feature to be across no more than n dimensions, and then for each feature, reading it directly with a linear layer + GELU and seeing whether you still got a real nonlinearity
so with n=1 you just see standard GELU
but as you increase n you start to see a pretty flat response curve
this is, for 200 dim space, n=1, 200 feats; n=3 500 feats; n=10 1000 feats
bump at 2 is an artefact of how i do sparsity
Ok, what's the 'linear layer' in the 'linear layer + GELU' here
linear layer is the transpose of the feature generation matrix, so each entry in the linear layer is the coefficients of one of the features
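A guess at the shape of that experiment, reconstructed from the description rather than from the gist. Each feature lives on at most n dimensions, inputs are sparse sums of features, and the readout is the transpose of the generation matrix followed by GELU. The activation distribution, sparsity level, and tanh-approximate GELU are all assumptions.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def make_features(n_feats, d, n, rng):
    """Unit-norm feature directions, each supported on at most n dimensions."""
    F = np.zeros((n_feats, d))
    for i in range(n_feats):
        extra = rng.choice(d, size=n - 1, replace=False)
        dims = np.union1d([i % d], extra).astype(int)
        F[i, dims] = rng.normal(size=dims.size)
    return F / np.linalg.norm(F, axis=1, keepdims=True)

def response(n_feats, d, n, p_active=0.05, n_samples=2000, seed=0):
    rng = np.random.default_rng(seed)
    F = make_features(n_feats, d, n, rng)
    active = rng.random((n_samples, n_feats)) < p_active
    acts = active * rng.uniform(-3, 3, size=(n_samples, n_feats))
    x = acts @ F              # superposed input, (n_samples, d)
    pre = x @ F.T             # read each feature with its own direction
    return acts, pre, gelu(pre)

acts1, pre1, out1 = response(200, 200, n=1)
acts3, pre3, out3 = response(500, 200, n=3)
interference1 = np.abs(pre1 - acts1).max()   # n=1: readout recovers the feature exactly
interference3 = (pre3 - acts3).std()         # n=3: other features leak into the readout
```

With n=1 the pre-activation equals the true feature value, so plotting `out1` against `acts1` is just GELU. With more features than dimensions the leaked interference shifts where each sample sits on the nonlinearity, which is the blurring effect described above.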
Oh ok
hope that makes sense lol, can explain in the morn, am off to bed now
Not really but cool
see you tomorrow/at meeting
feel like ive never managed to explain it well which might mean im chatting shit 😅 , see ya
I'm just quite confused as to why you think this shows anything about MLP structure. ig I misunderstood maybe: is the structure superposed features -> GELU -> linear filter, or superposed features -> linear filter -> GELU?
maybe the latter? not sure what you mean by linear filter
its v short, will upload as a gist
cool will read later
I guess I would be convinced if it was a learned encoding not a random one.
yeah thats definitely a big weakness but its also kinda generous in terms of like, only distributed across a few feats, not crazy ratios or anything. would be interesting to compare neat geometric patterns or learned feats
but i do think its useful as a counterweight to just like, johnson-lindenstrauss therefore exponentially many feats in a layer, like you cant just approach it naively and retain a meaningful nonlinearity, (unless ive misunderstood somehow)
should probably just throw sparse learned feats or something similar into it and see what comes out
ok its very annoying that we didnt realise this earlier but there's a huge difference between tied and untied dictionaries in the MLP
at least by n feats active
e.g. compare number / % active for layer 2, second one is untied
This is middle-MLP, right?
yep
Ah
mmcs for untied on MLP is pretty terrible which somewhat explains the bad results
doesnt explain why we're getting those bad mmcs scores though! what's changed??
@keen pivot have you been doing any MLP training runs recently on the old code? wanna compare hparams
I still have the old code up to compare. Which ones?
LR = 1e-3
I've only got MLP_out.
But I got awful MCS hist doing 3e-4, and just got much better doing 1e-3
run just now?
The above is 1e-3. Doing 3e-4 for 10 chunks, it produced one like this (peaks around 0.4), but maybe 30 above 0.9
hhhhhhhhhhhhhhuh
that's incredibly weird and I don't understand that at all.
So I switched to full-rank ablations on a whim and we still beat LEACE at earlier layers in the model? Need to reconfigure feature selection procedure to count and compare with different database sizes, but it looks like we are finding a single direction to do full-rank ablations on better than LEACE
have you looked at some basic interp on the feature you're ablating just to get a sanity check?
yeah, logan did, 722 (main one) activates on female pronouns or something, which is slightly confusing? going to check if it's gaming the metric somehow
perfectly plausible that it doesn't make much sense to do this kind of thing on a model with such low baseline prediction accuracy
ok well yeah that sounds about right
I do think female pronouns sounds more relevant than "capital letter Q" feature (just to give an example)
oh, for sure
Not too difficult to just grab a middle layer of Pythia 1.4b to train a dict real quick & re-do results, if you want?
yeah will do soonish
what l1 values/dict sizes worked well for that?
I'd expect 0.001 & 4x-8x. Would need to check my run from back then.
👍
@pallid current , Is there an easy way for me to load in the dataset of the first chunk (specifically what PCA/ICA were trained on)? I don't want just the layer activations, but the original text/tokens
hmmm not like suuper easy, depends what exactly you need. in activation_dataset.py you can use make_sentence_dataset to get you a big load of sentences from the beginning of the pile. you can tokenize each sentence and calc how many tokens will go into 1 chunk (2GB / 4 bytes i think) and just stop there and that's your data
note that ICA is trained for 1 chunk but PCA batched across 10
wait one second logan I have a util for this standby pushing now
if you want a more exact idea of what goes in you'd need to modify chunk and tokenize to return sentences
on my branch I edited activation_dataset.py and added a script to generate chunks
I checked. I did
layer=15,
l1 = 1e-3 (w/ 12k, I predict sparsity of ~50-60)
dict=6x is probably best.
n_chunks: probably 20 (I did 10, but was still converging on the 8x dict)
tied, lr=1e-3, though I also had a bias for the decoder.
I expect not having decoder bias is ~fine, should be ~centered at 0
If you can see this.
what batch sizes are these?
mmcs looks good!
for what?
^ this image if this is to me
ah, sure
@pallid current did a PR, this one is probably a bit messy and awful to work through sorry, since I rewrote some datagen code to make it work with more models & added a util for saving activations
--n_chunks=20 --layer=15 --mini_runs=10 --batch_size=1024
From the overview button (on the top-left of wandb)
Though I would like to note that I trained it for 10 mini_runs over 20 chunks. So 200 chunks!
Oh jesus
This one is for sure tied ae, and still converging (when looking at MMCS) after 50 chunks
This is an error on my part, because I mixed up the "n_chunks" & "mini-runs" part. I do think dictionaries trained under 10 chunks (which is what we've normally done) are undertrained for larger dicts
This is for 10 chunks (top) and 5 chunks (bottom)
And this is for 50 chunks (top) and 45 chunks (bottom):
you might want to move the activation width stuff out of utils as it requires transformerlens which takes a while to load
pod shitting itself rn; what's the data in 3 and 5?
dunno abt 3/5 (tho think logan's using 3), i'm running an 8gpu sweep for checking higher lr low batch
can shut it off in a sec
@keen pivot logan can you run something like that but with MLP please?
my small batchsize/higher lr 10 chunk untied mlp didnt see much of a shift, v poor mmcs above like 3e-4
I'm probably 3 & 5. Also training a new 160m
MLP-middle? Tied or untied?
@pallid current, the pca_topk is size 1k, and ica_topk is size 500 in the mnt/.../baselines/ folder. Is there a same-sized version.
on the old code?
wait, worse MMCS as lr increases?
yeah, really want to be able to recreate whatever worked before
no, as l1a increases, soz
oh right
like layer 2, Pythia 70m?
yeah ideally
what was the MCS tradeoff stuff for MLP on old code?
tbh i dont remember the tradeoffs in detail, but generally mcs could be solidly high at the normal l1 ranges
incredibly confused
me too
Running! Currently downloading pile. Can send the aws when it's ready
Also, I had to change a lot of stuff to make it work, so like 50% I screwed up and it's running something we don't want!
@pallid current, any thoughts?
I vaguely remember having much lower l1 values, do I remember this correctly?
oh sorry, yeah this is the PCA top_k encoder splitting it into positive and negative features as separate things
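To make the discrepancy concrete, a sketch of a top-k encoder with and without the sign split. This is not the repo's encoder; `topk_encode` and the random orthonormal basis are stand-ins. Splitting doubles the number of "features" for the same underlying components.

```python
import numpy as np

def topk_encode(x, components, k, split_signs):
    """Project onto components, optionally split into +/- halves, keep the k largest."""
    proj = x @ components.T                               # (batch, n_components)
    if split_signs:
        # positive and negative parts become separate non-negative features
        proj = np.concatenate([np.maximum(proj, 0), np.maximum(-proj, 0)], axis=1)
    idx = np.argsort(-np.abs(proj), axis=1)[:, :k]
    out = np.zeros_like(proj)
    np.put_along_axis(out, idx, np.take_along_axis(proj, idx, axis=1), axis=1)
    return out

rng = np.random.default_rng(0)
components, _ = np.linalg.qr(rng.normal(size=(8, 8)))     # stand-in orthonormal basis
x = rng.normal(size=(4, 8))

unsplit = topk_encode(x, components, k=3, split_signs=False)   # 8 signed features
split = topk_encode(x, components, k=3, split_signs=True)      # 16 non-negative features
```

The two encodings carry the same information (`split[:, :8] - split[:, 8:]` gives back `unsplit`), but feature counts and per-feature statistics are not directly comparable between them.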
whereas the ICA one doesnt do that
agree that shouldnt be a discrepancy
what kinda things needed to change?
Like tied to untied, and setting the hyperparams.
finally got round to properly suppressing those aten warnings, sooo much more satisfying to train now 
ok we've basically waaaay undertrained our mlp dicts
yellow being 110 chunk-epochs, green and purple being the original 10
Whoooo!
Do you still need some trained MLP ones from the old code? (I've got a run here, but I'm unsure on the hyperparams. I did get between 20-100 sparsity though) https://wandb.ai/sparse_coding/sparse coding/groups/EleutherAI%2Fpythia-70m-deduped_2_0817-004629-EleutherAI%2Fpythia-70m-deduped-2_graphs/workspace?workspace=user-elriggs
cheers, from what im seeing the much longer run is sufficient to explain the difference but def good to be able to compare to the old stuff to make sure
Ya, I was mostly unsure cause the MMCS was much lower than the residual stream at the same number of chunks.
May be explained by the tied vs untied though (I remember tied having better MCS given same chunks trained)
Hey, exciting work! I'm trying to run this notebook from your LW post. Is the auto encoder ('/root/sparse_coding/auto_encoders.pkl') available on GitHub / HF?
Should work! Let me know if you run into issues.
Confirmed: PCA top component is the outlier dimension
Guys. PCA_topk & ICA_topk suck soooo much. Like I can't even get a decent one for the apostrophe one.
This is the neuron basis. And I'm zooming in after 5.11 basically:
This is the best one I could find! Out of searching top-10.
BAM: Our dicts rock
Note: Going larger l1 (closer to the "all dead features" solution), there weren't any features found that weren't the outlier dimension.
Going smaller l1 (closer to the "identity/polysemantic" solution), there weren't any features found that didn't also activate for >10% of the dataset.
Explanation of why I'm no longer comparing to LEACE directly since @pallid current asked:
Given a set of labelled points X with classes Z, LEACE guarantees that no linear classifier can predict the Z_i from the X_i well. This is fine and good if your X_i have shape seq_len×d_activation but if you erase on datapoints X_i of size d_activation then you don't guarantee that the model, which essentially sees (seq_len, d_activation) can't implement a linear classifier between them (and in fact it can and does).
This is a problem for direct comparisons as
- it's not reasonable to compare LEACE KL-divergence OOD (e.g. on the pile); plausibly still useful to measure on-distribution
- ablating a single feature direction is equivalent to a rank-seq_len ablation in the seq_len×d_activation space, so the ablations aren't the same
More generally I think we are aiming at much more general activation engineering with sparse feature dictionaries and it's not clear how to measure this
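The rank claim in the second bullet can be checked numerically. A small sketch with toy sizes: removing one direction at every position acts on the flattened seq_len×d_activation vector as a block-diagonal operator, and the subspace it removes has rank seq_len.

```python
import numpy as np

seq_len, d = 5, 16
rng = np.random.default_rng(0)
v = rng.normal(size=d)
v /= np.linalg.norm(v)

P = np.eye(d) - np.outer(v, v)             # removes one direction from a single token
full_op = np.kron(np.eye(seq_len), P)      # the same edit applied at every position
removed = np.eye(seq_len * d) - full_op    # what the edit deletes in the flattened space

rank_per_token = np.linalg.matrix_rank(np.eye(d) - P)
rank_flattened = np.linalg.matrix_rank(removed)

# Sanity check: the flattened operator matches editing each position separately.
X = rng.normal(size=(seq_len, d))
same_edit = np.allclose((X @ P).reshape(-1), full_op @ X.reshape(-1))
```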
yessss so glad to see this, nice one!
so is this why LEACE isn't bang on 0.5 in the comparison graphs you posted yesterday?
well, even with the proper ablations it's not perfect at low levels. At layer 6 on pythia-1.4b it gets something like 0.56 which is still quite bad and worse than ours
I mean, I think I could still compare KL divergence on-distribution, might be something there
Like, even if our ablations are rank-k if we get a lower KL it's useful
wait so is that a proper comparison? like even if it can edit seq_len x activation_d and we just edit a single feature or a few, we can remove downstream perf
I think so yeah
I'll probably write up those results
I'll be surprised if we do get better KL divergence because that would have weird implications for models using data mostly linearly or not. LEACE guarantees minimal edit for no linear classification under any inner product norm, including the one 'expected change to KL div per unit shift in this axis' although realistically it's probably horribly nonlinear
Yeah nvm that it's going to be horribly nonlinear
right as i understand LEACE it removes the ability to predict from the activations at that layer, but assuming that there's non-linear work done between that layer and the output, its plausible that you can improve on LEACE in terms of whether there's like gender differences in the output
but then i suppose you could say even if there's non-linear computation ongoing between layer and output, if at any residual layer the information is stored linearly then linear concept erasure should be able to remove it
@bronze wraith
hahahaha how did i not see that before
@bronze wraith
yeah I'm pretty confused as to why rank-one ablations with our dicts perform so well; perhaps we shift the distribution of the text less so some backup nonlinear weird system doesn't contrib much to the predictions while in LEACE the ablation induces a significant distributional shift so the model uses some nonlinear prediction method
kind of want to see what we can do with a synthetic labelled dataset generated by e.g. GPT-3.5
not sure how you'd show very general erasure though
@bitter turtle I have very little idea what this experiment actually is so you'll have to explain it to me from the beginning
that is entirely fair
I think generally I was misapplying LEACE horribly and so any results I had previously are bad and should be ignored, I'll get back to you when I have actual results
oh ok lol
also, do you have access to concept labels at inference time in this experiment?
If so, you should use OracleFitter / OracleEraser which was in the main branch for a while; I just now pushed a v0.2 PyPI release which has it
Oracle LEACE drops the assumption that the erasure function has to be an affine transformation, and for each component of the representation in any arbitrarily selected orthonormal basis, directly minimizes the squared distance between the original and the scrubbed value
If that still doesn't work you could try Quadratic LEACE, which I've mostly implemented but haven't merged into main yet. It's definitely going to be a less surgical edit but it ensures that no linear or quadratic classifier can extract any info about the concept.
Your use case might help me decide a couple details about how to implement QuadraticFitter and QuadraticEraser. Quadratic LEACE is an oracle method, so it requires concept labels at inference time, and it also has this weird property where you have to like "dispatch" each individual data point to a different affine transform depending on its concept label, which is hard to do in an efficient vectorized way. You can do it quasi efficiently with like torch.unique but I'm still unsure about whether to do this batching/preprocessing step on every call to QuadraticEraser.forward or force the user to do the preprocessing all at once or smth
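To illustrate the dispatch problem only: a NumPy sketch that groups rows by label and applies one affine map per group. This is not the concept-erasure API and says nothing about how the maps are fitted; `np.unique` stands in for `torch.unique`, and the function name and shapes are assumptions.

```python
import numpy as np

def apply_per_class_affine(x, labels, weights, biases):
    """x: (n, d); labels: (n,) ints; weights: (n_classes, d, d); biases: (n_classes, d)."""
    out = np.empty_like(x)
    classes, inverse = np.unique(labels, return_inverse=True)
    for i, c in enumerate(classes):
        rows = np.flatnonzero(inverse == i)           # all points with concept label c
        out[rows] = x[rows] @ weights[c].T + biases[c]
    return out

x = np.ones((4, 2))
labels = np.array([0, 1, 0, 1])
weights = np.stack([np.eye(2), 2.0 * np.eye(2)])      # class 0: identity, class 1: scale by 2
biases = np.array([[0.0, 0.0], [1.0, 1.0]])
out = apply_per_class_affine(x, labels, weights, biases)
```

The loop runs once per distinct label rather than once per row, which is the "quasi efficient" part; whether the grouping is redone on every forward call is exactly the open design question.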
Wasn't planning to give it access to labels at inference time/use oracle erasers, but maybe later on I will.
ok
made some big summary graphs for residual stream
dip in quality seen from some of the 16 and 32 ratio dicts indicates they might be a bit undertrained
Have you guys tried end-to-end training the dictionary to minimize loss btw? I suggested this to @keen pivot and he said it was part of the plan
Definitely on my future todo and Aidan mentioned it first I believe. It might not make the end-of-the-month to-do’s & results, but it’d be nice!
Okay, just couldn't fall asleep, so I checked some perplexity stuff. Currently getting quite good perplexity diff relative to what I got earlier, even on only 10 chunks (like what?). Gonna run a few tests to check it out & see if I'm seeing straight.
What's this test?
as expected we don't have better KL div on-distribution
currently slightly unsure what activation editing directions to pursue, could look at
- ability for activation editing with feature dictionaries to generalise to multiple tasks (not sure what this would look like from a testing perspective)
- comparison to activation editing strategies like nulling the mean-diff vector
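For the mean-diff baseline in the second bullet, a minimal sketch assuming binary concept labels and plain projection (no recentering). The data and the helper name are made up for illustration.

```python
import numpy as np

def null_mean_diff(acts, labels):
    """Project out the direction between the two class means."""
    diff = acts[labels == 1].mean(axis=0) - acts[labels == 0].mean(axis=0)
    direction = diff / np.linalg.norm(diff)
    edited = acts - np.outer(acts @ direction, direction)
    return edited, direction

rng = np.random.default_rng(0)
labels = np.arange(200) % 2
acts = rng.normal(size=(200, 16))
acts[labels == 1, 0] += 2.0                     # plant a class difference on one axis

edited, direction = null_mean_diff(acts, labels)
mean_gap_after = np.abs(
    edited[labels == 1].mean(axis=0) - edited[labels == 0].mean(axis=0)
).max()
```

After the edit the class means coincide, but nothing is guaranteed about higher-order differences between the classes.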
I expect general problems with robustness because it's not even clear what it means to do distributionally-robust activation editing at this point
I think what I'll end up doing for the paper is more like what @keen pivot's doing at the moment, I'll park more ambitious things for another time
Still got these results although need proper L1=0 run for baselining @pallid current
Yooo, still got that weight based MLP residual thing right?
yeah still have that to do
Not entirely sure what I should be doing with that, can't see any immediate inroads to distilling the relationships between features
Multiply the features by MLP-in and compare to the dictionary at MLP-in
not sure that's necessarily a useful distillation bc obviously it's very dependent on the nonlinearity
Sure, but it might work?
sure
I tried from layer 4 to 5 and it sucked, but this might work
You could even try applying the layernorm and GeLU before doing MCS to see if it’s meaningful.
doubtful
In my heart-of-hearts I believe it may work best by integrating over in-distribution text, but that's no longer just weight-based
yes I think the focus on purely weight-based analysis is somewhat misguided
But it would be so cool
Ya, this is basically it. I was only able to get to 30 perplexity by going 6x. 25 is the dream. You could get there by going polysemantic, but no point at that point.
Would be interested to see doing KL-div training at this point!
check it out i found a math feature
I was trying to find features that align with themselves more after they are transformed by a head
that is, features with high cos_sim(W_OV f, f), where f is a learnt dictionary feature and W_OV is a transformation from residual stream to residual stream
For some reason, features that are sorted in this way tend to be highly interpretable (there seems to be a context)
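A sketch of that ranking, assuming W_OV acts on column vectors (W_OV @ f) and features are stored one per row. With a row-vector convention the transpose flips. The arrays are random stand-ins.

```python
import numpy as np

def ov_self_similarity(features, W_OV):
    """cos_sim(W_OV f, f) for every dictionary feature f (rows of `features`)."""
    moved = features @ W_OV.T                    # row i is W_OV @ f_i
    num = (moved * features).sum(axis=1)
    den = np.linalg.norm(moved, axis=1) * np.linalg.norm(features, axis=1)
    return num / den

rng = np.random.default_rng(0)
features = rng.normal(size=(128, 32))            # stand-in dictionary, (n_feats, d_model)
W_OV = rng.normal(size=(32, 32)) / np.sqrt(32)   # stand-in for one head's OV map

sims = ov_self_similarity(features, W_OV)
ranked = np.argsort(-sims)                       # most self-aligned features first

# Sanity checks on known maps
sims_scaled = ov_self_similarity(features, 2.0 * np.eye(32))
sims_flipped = ov_self_similarity(features, -np.eye(32))
```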
@keen pivot the cosine sims of MLP-out don't look to be especially interpretable, this is a histogram of the gini indexes of MLP_dict @ resid_dict, i.e. a measure of how spread out the effect of one MLP-out direction activating should be; ideally if it were interpretable the effect would be sparse and we'd see this in the plot
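For concreteness, one common way to compute a Gini index per row of such a similarity matrix. This is a sketch; whether absolute values were taken and how the dictionaries were normalized in the plot are assumptions.

```python
import numpy as np

def gini(values):
    """Gini index of |values|: 0 for a uniform spread, (n-1)/n for a single spike."""
    v = np.sort(np.abs(values))
    n = v.size
    if v.sum() == 0:
        return 0.0
    index = np.arange(1, n + 1)
    return float(((2 * index - n - 1) * v).sum() / (n * v.sum()))

rng = np.random.default_rng(0)
mlp_dict = rng.normal(size=(64, 32))      # stand-in MLP-out dictionary
resid_dict = rng.normal(size=(128, 32))   # stand-in residual dictionary
mlp_dict /= np.linalg.norm(mlp_dict, axis=1, keepdims=True)
resid_dict /= np.linalg.norm(resid_dict, axis=1, keepdims=True)

sims = mlp_dict @ resid_dict.T            # (n_mlp_feats, n_resid_feats) cosine sims
ginis = np.array([gini(row) for row in sims])

uniform_case = gini(np.ones(10))
one_hot_case = gini(np.array([0.0, 0.0, 0.0, 5.0]))
```

A sparse, interpretable effect would push the histogram of `ginis` towards 1; random directions like these sit well below that.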
hists of cosine sim
I'm honestly thinking i've got the wrong dicts 🤔
it is mlp_out_l2 that flows into residual_l2 right @pallid current?
It should be! We're doing post_residual for residual which is at the end of the layer.
ok so it works with residual thingies in each layer with REALLY HIGH l1 values (like the highest we use). 1e-3 doesn't seem to have this.
this is between l1 and l2 btw will check others soon
@keen pivot didn't you do something similar to this I can't remember
Okay, so I'm seeing a research direction here (for later). We can develop a measure of monosemanticity by doing several token-level histograms for lower & lower l1 values. They eventually just suck so much, that you say "that's bad!" and move on. Then we have our target KL & do the KL divergence thing.
Nope! Or I can't remember either then
yeah this seems fair for the earlier layers
Why earlier layers specifically?
token-level stuff
I've seen it in layer 5/6, but maybe there's more token-level features earlier?
oh I guess I was talking more generally
I'd expect more re-tokenization/bigram-esque stuff in earlier layers
Yep yep. A statistical argument makes sense.
l3-l4
seems better for larger dictionaries (still resid-to-resid, but this time r=32)
"Do some interpretability stuff" at the bottom 😆
Looks cool! I'm confused what the OV self-multiplication is supposed to do, but you're noticing they're more interpretable. Have you tried looking at like 10 features that don't align w/ any W_OV (of any head in that layer)?
Also is this the same layer's W_OV or the next layer's?
Ya, I don't know. The l1=0.01 is so dead though.
ikr
This is interesting at least
Which models are you loading in? Like the directory?
Could you try tiedlong_residual... in Hoagy's?
/mnt/ssd-cluster/bigrun0308/tied_residual_l3_r32/_9/learned_dicts.pt
/mnt/ssd-cluster/bigrun0308/tied_residual_l4_r32/_9/learned_dicts.pt
this mlp or residual
is that in hoagy's?
one sec
oh cool
there aren't any dicts in there?
@pallid current
Every 20 dicts
sparse_coding_hoagy/tiedlong_tied_residual_l4_r4/_80/learned_dicts.p
Except the last one for whatever reason.
yep these are a lot better
also not seeing much difference for higher l1 which is a something sign
@pallid current do you have zero l1 baselines?
morning everyone 🙂
i've ran them overnight for mlp
ah, what about resid
not for residual, at least across the model yet, soz
aok
run died because i had one dict ratio as an int not float lmao [0.25, 0.5, 1.0, 2.0, 4.0, 8.0, 16]:
but only for untied 🤔 🤔
might not be that tbf
I need to work more on the interpretability. Here is what I have observed: cos_sim(W_OV f_i, f_i) tends to be greater than cos_sim(W_OV f_i, f_j) for j not equal to i. Assuming that features correspond to meaningful directions in the residual stream, a feature transformed by an attention head tends to align more with itself than with other features. My hunch was that W_OV is more like compressing features so that they can be recovered (the recovered feature aligns with the original feature) rather than a self-multiplication. Also, since W_OV is responsible for copying, features that align with themselves might be the ones being copied.
This shows that the OV is responsible for extracting the attribute for some tokens
Attributes, the way it is defined, are taken to be the context for a token, which is very similar to the feature that a token relates to. That is just a guess, I am unsure about this reasoning
I suppose one intuition or idea is that from the mathematical framework for transformer circuits paper, they put plots of the W_OV matrix eigenvalues. they draw blobs around the eigenvalues that cluster in the positive real direction; having a high CS of (W_OV f, f) means that feature is somewhere in that cluster (maybe not aligning perfectly with an eigenvector but thats ok) and with this library we can interpret it
I would think a vector doesn’t align with another feature (i and j) because they’re different features to begin with.
You’d get the same behavior with W_OV= identity matrix.
Thinking more, W_OV from residual to residual is like reading from this direction and writing to that direction.
So here, you’re finding directions that the OV circuit reading from means it writes to that same direction.
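(A minimal sketch of the comparison being discussed, assuming dictionary features are stored as rows and W_OV acts on column vectors as W_OV f. The function name and the random stand-in matrices are hypothetical, not from the project codebase. It also reproduces the identity-matrix sanity check: with W_OV = I every feature trivially aligns with itself, so the diagonal alone is not evidence of copying.)

```python
import numpy as np

def ov_self_alignment(features, W_OV):
    """features: (n_feats, d_model) dictionary rows; W_OV: (d_model, d_model).
    Returns C with C[i, j] = cos_sim(W_OV f_i, f_j)."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    moved = f @ W_OV.T  # row i is (W_OV f_i)
    moved = moved / np.linalg.norm(moved, axis=1, keepdims=True)
    return moved @ f.T

rng = np.random.default_rng(0)
d_model, n_feats = 16, 64
feats = rng.normal(size=(n_feats, d_model))  # stand-in for a learned dictionary

# Identity baseline: the diagonal is exactly 1 for every feature.
identity_scores = np.diag(ov_self_alignment(feats, np.eye(d_model)))

# Random W_OV baseline: gives a reference distribution for "no special alignment".
W_rand = rng.normal(size=(d_model, d_model))
rand_scores = np.diag(ov_self_alignment(feats, W_rand))
```

Comparing the real diagonal against both baselines (and against the off-diagonal entries of the same matrix) is one way to tell whether a high cos_sim(W_OV f, f) is actually unusual.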
I wonder how you could connect this to real examples with sequence position.
Like suppose these features do copy information from one sequence to another. Can we see specific examples where that happens at specific sequence positions.
@bitter turtle for the record, I don’t quite understand the Gini coefficient stuff interpretation, and may be too tired to understand atm. But to give a take:
We want to know some statistic of how much one dictionary matches with another. For MLP-out, we expect lots of matches, much more than random, especially since it’s just an addition with attention.
@pallid current, a better version of this graph would show ratios 1-6, and I predict there's improvements up to ratio 6 for perplexity.
I agree, I'll just throw down my ideas so far:
- we want a given upstream feature to have a sparse impact on downstream features, i.e only impact a few and not all features uniformly (as random would) -- hence gini coeff stuff
- MCS is probably not a great indicator here. Many low cosine similarity values probably indicate randomness/unconnectedness. Having a CS of about 0.4 seems to indicate a degree of sparsity in terms of what features impact what other features
- we can probably also look at the covariance/correlation instead of the cosine sim (I would do this, but for some reason I am getting like 1800 dead features :/ )
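(A rough sketch of the Gini idea from the bullets above, assuming both dictionaries are stored as rows of unit-normalisable vectors. The helper names are made up for illustration. A Gini near 0 means an upstream feature touches all downstream features about equally, and a value near 1 means its effect is concentrated on a few.)

```python
import numpy as np

def gini(x):
    """Gini coefficient of |x|: 0 = perfectly uniform, (n-1)/n = all mass on one entry."""
    x = np.sort(np.abs(np.asarray(x, dtype=float)))
    n = x.size
    total = x.sum()
    if total == 0:
        return 0.0
    ranks = np.arange(1, n + 1)
    return float((2 * ranks - n - 1) @ x / (n * total))

def gini_per_upstream_feature(upstream_dict, downstream_dict):
    """One Gini value per upstream feature, computed over its cosine sims to every downstream feature."""
    u = upstream_dict / np.linalg.norm(upstream_dict, axis=1, keepdims=True)
    d = downstream_dict / np.linalg.norm(downstream_dict, axis=1, keepdims=True)
    sims = u @ d.T  # (n_upstream, n_downstream)
    return np.array([gini(row) for row in sims])

rng = np.random.default_rng(0)
ginis = gini_per_upstream_feature(rng.normal(size=(10, 8)), rng.normal(size=(20, 8)))
```

Running the same function on two random dictionaries, as in the last two lines, gives the "uniform impact" baseline to compare the histogram against.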
Why not look at a feature, figure out what it does, and see if weight-based connections gives you any predictive ability on that specific feature?
I want to sort out the covariance stuff first
Also have you got the histogram thing integrated with the new setup?
If not send over the code anyway and I'll give a go on converting it
yep thats something im interested in. well originally i was thinking: can we find in a somewhat unsupervised fashion, prompts that result in activating these OV directions
but also how to track when information is flowing from one feature to another along a qk pair
Is that some conceptual work to figure this out, or just try the first 3 experiments that come to mind and see what sticks?
I just trained the 160m & sent it off to Lucia & Lovis. Whooo! My first Trello to-do done. WHooOOOooOOOooOOOoooo
a healthy mix of both
inspired from a few papers i read recently
Nice! Definitely feel free to post intermediate results & half-baked hypotheses here:)
like this was a hacky idea we threw together, the idea being that if we do for some reason find some semblance of copying behavior then it might be indicative of richer more abstract features, rather than something token-level. that would be harder to prove but we can just chuck it in and see what we get, so we did
What if you did that, but just on dictionary learned on attn_out?
what, as in compute (f_i, W_OV^-1 f_j)?
No. F_i is from a dictionary trained solely on the output of attn_out. Then you'll get non-token-level features for sure.
oh yeah
narmeen has just informed me that we already have those dictionaries. so i guess we should look at them lol
lololo
been struggling with the autointerp on mlps all day. very mixed results on the longer runs, usually doing quite well but num dead feats is high even with l1 at literally zero
Weeeeiiird
This maybe makes sense if it's learning an orthogonal basis/double that basis
Like I wouldn't expect any dead features up to ratio 2 but maybe from 2< I would
Would appreciate it if someone could provide feedback on this idea
For the purpose of an enumerative safety result for reward models learned during RLHF:
- extract n parameters for which the largest updates occurred during RLHF
- transplant them to the base model to confirm their applicability in reducing loss on some measure that quantifies the success of your RLHF
- (assuming the success of the last step:) obtain a representation exhibiting less superposition of some layer like in https://www.lesswrong.com/posts/wqRqb7h6ZC48iDgfK/tentatively-found-600-monosemantic-features-in-a-small-lm for both the base model and the RLHFd model (assumedly the difference should at least partially encode information about the learned reward model)
- try to train an autoencoder to reconstruct activation vectors for the RLHFd model when given activation vectors from the base model as input in order to quantify how good the reconstructions from the previous step were (better reconstructions should be better at predicting the behavior of the RLHFd model assuming the autoencoder is able to properly reconstruct inputs for just the base and RLHF models separately)
Ideally resulting in an understanding of which parameter shifts caused different features to affect the outputs of the inspected layer
@keen pivot Hi! What marc/er mentioned is the project I told you I was working on when we met IRL
My initial take is that this would require improvements in the sparse autoencoders we are currently using, we are not currently able to accurately reconstruct inputs to the degree I'd think would be needed for this kind of experiment. I'm also not sure why the final step helps quantify goodness of reconstruction for this, could you elaborate maybe? I'm also not sure how you would be able to usefully extract information out of the difference between the decompositions learned in step 3. Excited to hear more!
Lol, I was about to connect you two!
It may make sense to just learn a dictionary on the reward model’s layers (and find features that lead to low and high reward). Does this make sense?
Additionally, you’ll have the RLHF model do better than the base model on some metric.
If you have dicts for both, then you can find the features that are responsible for one doing better than another.
How would you use dict differences in the third paragraph?
You find the features that are useful for a given task for each dict, then see which ones are different (ie MCS, qualitative analysis, input/outputs?)
I guess I'm just not sure how robust our learned dictionaries are to like initialisation state etc.
Ya that's good. So doing that experiment too would be good (ie train two dicts of the same size & check their different features on the same task)
Hey, is there a notebook for this LW post available anywhere?
There's an older version of code here: https://github.com/loganriggs/sparse_coding/blob/main/interpreting_sparse_dictionaries.ipynb
w/ the dictionary here: https://huggingface.co/Elriggs/autoencoder_layer_2_pythia70M_5_epochs/tree/main
Then the latest stuff:
https://github.com/loganriggs/sparse_coding/blob/main/lucia_and_lovis.ipynb
W/ the dictionaries for pythia-160m also in that directory
Yoooo, ablating the apostrophe feature mostly only affects the "s". (Note: I specifically searched for a feature that activated on the sentence: |Then we went to Dave'|, which is the possessive apostrophe, rather than other contractions)
When working w/ dicts across layers, I noticed the dicts seem:
- monotonically decreasing in MMCS (& # of features above 0.9)
- monotonically increasing in peak MMCS w/ respect to l1
layer 5 really seems to be something very different and less suited to sparse coding it seems, always looks v different
like the graphs though!
Layer 5 is so weird
prob be good to log the xaxis
It's weird that last layer peaks when first layer drops. Would need to run on another model before making any conclusions
Oh, to clarify: This is MMCS from dicts with their l1, um, "neighbor". So 3e-3 & 4e-3, and 4e-3 & 5e-3.
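(For reference, a minimal sketch of MMCS between two dictionaries, assuming features are stored as rows. The function name is illustrative rather than the one in the repo. For each feature in the first dictionary it takes the maximum cosine similarity to any feature in the second, then averages.)

```python
import numpy as np

def mmcs(dict_a, dict_b):
    """Mean max cosine similarity from dict_a to dict_b, plus the per-feature maxima."""
    a = dict_a / np.linalg.norm(dict_a, axis=1, keepdims=True)
    b = dict_b / np.linalg.norm(dict_b, axis=1, keepdims=True)
    max_sims = (a @ b.T).max(axis=1)
    return float(max_sims.mean()), max_sims

rng = np.random.default_rng(0)
dict_a = rng.normal(size=(32, 8))
dict_b = dict_a[rng.permutation(32)] * 3.0  # same directions, shuffled and rescaled
score, per_feature = mmcs(dict_a, dict_b)
n_above = int((per_feature > 0.9).sum())  # the "# of features above 0.9" count
```

Note the measure is not symmetric: mmcs(a, b) and mmcs(b, a) can differ when the dictionaries have different sizes.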
This is log, but on a smaller dict (last one was ratio 8, this is ratio 1)
doing some graphs of % features active for the appendix
What L1 values are we using for MLP?
And what do the reconstruction-sparsity plots for MLP look like?
You can maybe factor the legend out
mlp generally have been using a sweep of 8 from 1e-4 to 1e-2, though as you can see there's no real need to go above 1e-3
tied:
untied:
The purpose of the final step is to be able to benchmark the accuracy of your interpretation of the reward model. Assumedly if you're able to predict the activations of the RLHFd model using just the base model and the autoencoder you've trained to reconstruct activations from the RLHFd model given activation vectors from the base model your method has done a good job
This sounds good, I'm mostly just looking to reduce noise by narrowing our scope to include mostly components of the model involved in reward modelling
Yes, but I'm not sure how possible this is. It would be interesting if possible, but again 🤷
Like, have people even successfully done base model activations -> RLHFd model activations with an autoencoder? The final step seems to rely on this being possible.
Not from what I've seen. If it's not possible it won't have been that large of a time investment (I would imagine) and might be useful later if sparse coding continues to be pursued
Sure
various activation editing techniques across depth; for the first few layers there are single features we can ablate that basically solve the task, but in later layers there aren't
the most reasonable explanation I can come up for this is that since we are ablating sparsely-activating directions, our edits are more 'in-distribution' than e.g. LEACE's edits at earlier layers
oh, 'mean' here is 'ablate along the difference-in-means direction when considering all token positions as unique datapoints' which is like laughably untrue but hahahaha idk any other techniques that operate on a per-token-position basis
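(A sketch of what the 'mean' baseline could look like under that description, assuming "ablate along" means projecting the direction out of every activation. The function names and toy data are hypothetical. Every token position is treated as an independent datapoint, which is the simplification flagged above.)

```python
import numpy as np

def diff_in_means_direction(acts, labels):
    """acts: (n_tokens, d_model); labels: (n_tokens,) boolean class per token position."""
    direction = acts[labels].mean(axis=0) - acts[~labels].mean(axis=0)
    return direction / np.linalg.norm(direction)

def ablate_direction(acts, direction):
    """Remove the component of each activation along a unit direction."""
    return acts - np.outer(acts @ direction, direction)

rng = np.random.default_rng(0)
acts = rng.normal(size=(200, 8))
labels = rng.random(200) < 0.5
acts[labels] += 2.0 * np.eye(8)[0]  # plant a class signal along the first axis

direction = diff_in_means_direction(acts, labels)
edited = ablate_direction(acts, direction)
```

After the edit every activation has zero component along the direction, so the two class means can no longer be separated along it.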
histograms for various best dict features for ablating for pronoun prediction
Good that these make sense for the early layers where it works!
One concern w/ layer 3 & 4 (which isn't much of a concern because they're bad) is that they might be outlier features, where ablating them makes a lot of tasks get worse (one indication is that the top tokens are sort of sorted by token frequency, but you could just do the "visualize feature" function to see if it activates for the first token & first delimiter).
yeah, I can ~tell which ones they are by their activation magnitude, I'll check for you
Ok, initially it doesn't seem like they are
Ya the activation mag is a tell, though I have seen some positional features with lower mag, I think?
I think for future activation editing investigations (inc potentially investigations to work into this current paper if we still have time after I get back from holiday), it'd be really useful to have a sweep at 4xdict size and some l1 value done of pythia-410m or something of equivalent scale, so I might set one off shortlyish; @keen pivot @pallid current do you have any similar requirements/wants wrt sweeps of larger models?
i dont have any particular needs but agree that sweeps of larger models would be good
my main ask would be to include some quite large dict sizes and make sure to train for a long time to make sure the bigger ones aren't undertrained, to see if we can get a sense for where diminishing returns to size comes in for larger models
Like train for 100 chunks, save every 5?
yeah i would want to see the equivalent of 20-30 chunks for 70m so yeah probably up to 100 accounting for larger activations and maybe slower convergence, and making sure that we're tracking mmcs and loss/sparsity over time to see if we're missing out
been testing some variants of forcing the mlp directions to strictly be in the positive quadrant (and bumping up all the inputs by min(gelu))
loss curves are an absolute rollercoaster, and generally terrible in terms of convergence speed but i do kinda suspect there's something there
notably, that purple line is 2x overcomplete, competitive in loss (eventually, after being about 10,000x worse for the first 50 epochs 😅) with normal runs, and maintains about 100% live features
v possible its learning the identity or some degen solution tho
turns out it was learning some weird degen solution but one which meant that all the autointerp came out as 'newlines/periods', at least for the nonzero l1 runs
much to ponder, but leaving it for now
New lines periods is like the outlier dimension for Pythia models
@cosmic yarrow that kind of thing is absolutely something I'm interested in doing, feel free to discuss in here
(wait, do people receive pings if they are in threads?)
I didn't receive a ping but very glad I checked this thread! I currently have a decent amount of free time and would be happy to help in whatever way.
I've been adapting Yun's code for this, but if you are thinking of having these experiments as part of the existing codebase for this project I'd be happy to ditch my codebase and work on a fork or contribute to your efforts.
I mean I'm kind of dissatisfied with the current codebase as-is, I'd like to clean it up/redesign things to have more flexibility in general. I'm also on holiday for about two weeks at the moment, so I won't be working on it in that time. It probably makes sense to integrate it with the current codebase though, yeah
I think in general I at least am kind of unsure what will happen after we finalise the current paper
Good to know! I'll check back in in two weeks and we can discuss more then—or whenever you're ready. In the meantime I'll continue refactoring the code I have so it's in a better place to integrate, in case that's what ends up happening. Have a nice vacation!
hey @cosmic yarrow, what kind of experiments are you thinking about doing?
I was looking into reproducing/extending Yun's paper more, either by doing multiple layer dictionaries or using the FISTA/solve dict/basically K-SVD iterative method described in the paper.
How to get involved in this project
My overall goal is to see if some of the circuit tracing/causal tracing methods can be adapted to explore these dictionaries. The first thing I wanted to do is extend Yun's method by also training a dictionary for the mid-residual stream, right after attention. And then, using the COUNTERFACT dataset, adapt methods like Geva et al https://arxiv.org/pdf/2304.14767.pdf for tracing information flow across the sparsified layers. This is all very hand-wavy for now, apologies if it doesn't make sense.
oh hey. me and a couple others are thinking about the same stuff. we're diving on mor geva's work rn
me, @hallow wyvern, @onyx compass, @frank vortex
im setting up notebooks for causal intervention techniques w/ counterfact dataset with narmeen, firstuserhere is getting a head start on geva's work, faulsname spent time a few weeks back getting familiar with the dictionary learning codebase
we should chat
Yes, definitely, would love to be involved in this effort!
I've got some causal ablation type stuff on my fork of the GitHub repo if that's helpful for you guys. TL;DR of as far as I got is that
- our features are decent for e.g. concept erasure at early layers (on pythia-70m)
- we can identify like a single feature which is responsible for IOI (by which I mean 'if you change this feature on the corrupted activation to its activation on the clean data you get like 60% of the way to the behaviour on the clean data')
- you almost certainly want to use a dictionary set for a model better than pythia-70m lol
Also if I were to do it again I would mean-center the activations for every layer
You also probably don't want to directly convert ACDC, the amount of graph edges is absolutely absurd for even moderate dictionary sizes
Oh! Also I ran into a bunch of annoyances with imperfect reconstructions @cosmic yarrow absolutely foresee FISTA at least partially solving those. I would train with FISTA like Yun et al if I were to do it again
Sorry for not remembering, but did you all work with methods that allow for the dictionary dimension to exceed the hidden dim?
someone share the github link of this project, i have done some work in interpretability area, the project seems interesting and i want to work on it if there are some ideas to be tested
Though we should update the readme.
I have loads of ideas to test! Let me get to the office and I can send a list.
okies, meanwhile i'll get the hang of the project repository
Yes, we used ratios from 0.5-32x the number of hidden dimensions. However, we used a kind-of-shit linear autoencoder which was kind of overzealous with eliminating noise, and so we got imperfect reconstructions
We got these kind of curves in terms of activations-per-example against reconstruction loss.
FISTA should converge to better (more exact) solutions in the highly sparse regime, is my thinking
You can squint Very Hard and see our encoders as kind of being almost a single iteration of (L)ISTA, to give you an idea of how 'good' our encoders are
yeah, makes sense that running several optimizations steps of ISTA or FISTA would help
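(A minimal FISTA sketch for generating sparse codes against a fixed dictionary, assuming atoms are stored as rows of D and the objective is 0.5*||x - z @ D||^2 + lam*||z||_1. This is the textbook algorithm on toy data, not Yun et al's code or the project's. The first iteration from z = 0 is just "multiply by the dictionary, then shrink", which is the loose sense in which a tied linear encoder with a bias resembles one ISTA step.)

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def objective(x, D, z, lam):
    return 0.5 * np.sum((x - z @ D) ** 2) + lam * np.sum(np.abs(z))

def fista(x, D, lam, n_iter=200):
    """x: (d_model,), D: (n_feats, d_model). Returns a sparse code z of shape (n_feats,)."""
    L = np.linalg.norm(D, 2) ** 2  # Lipschitz constant of the gradient (largest singular value squared)
    z = np.zeros(D.shape[0])
    y = z.copy()
    t = 1.0
    for _ in range(n_iter):
        grad = (y @ D - x) @ D.T
        z_next = soft_threshold(y - grad / L, lam / L)
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = z_next + ((t - 1.0) / t_next) * (z_next - z)  # momentum step
        z, t = z_next, t_next
    return z

rng = np.random.default_rng(0)
D = rng.normal(size=(64, 16))
D /= np.linalg.norm(D, axis=1, keepdims=True)  # 4x overcomplete, unit-norm atoms
z_true = np.zeros(64)
z_true[[3, 17, 40]] = [1.0, -0.8, 1.2]
x = z_true @ D
lam = 0.05

z_one = fista(x, D, lam, n_iter=1)     # roughly what a single encoder pass can do
z_many = fista(x, D, lam, n_iter=200)  # iterated solver
```

For dictionaries trained with a ReLU encoder one would probably want the non-negative variant, replacing soft_threshold with np.maximum(v - t, 0.0).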
I bet against this, but would love for someone to run the experiment!
Interesting, what's your thought process?
Didn’t you do a multi layer LISTA?
I think that failed because of accursed convergence/not enough iterations etc etc.
Oh, then I no longer bet against this! Lol
I think FISTA with our current dicts would be good, less sure (but still pretty sure) about dicts trained with FISTA.
Oh lol
Initializing with our current dicts?
I was kind of expecting doing the KL thing to close the gap (at least for functional equivalence), but a better solver would be complementary to this.
No, even with just using FISTA to generate the sparse codes
And still just a linear layer for the decoder? I saw Yun do some hessian thing for that part as well
i still do have some feeling that our autoencoder methods should be helpful in that they should more closely track which features can be easily pulled out by a linear layer. though tbh the residual stream has so much capacity that it's quite likely this is barely an issue
Hmm, I'm not so sure actually! I think it might close the gap slightly, in the lens of KL div, but I think there'll be some leftover
I'm kind of viewing it as more 'we are using sparsity to disentangle the latents' rather than anything mechanistic ATM tbh
Yeah, just linear decoder, what was the 'Hessian thing'?
Yeah I also think the algo for feature extraction, if that is indeed what's happening in MLPs, could be slightly different. You could condition on certain features being 'off' rather than just eliminating noise with a bias, which might partially explain why untied might work better (do we see this? No idea, can't remember, maybe a little in certain circumstances)
btw one thing that i think we've underexplored so far is whether there's actually a maximum number of features that we tend to find. like we often see that with mlp we don't see the largest ratio having the larger number of active features. curious if that's eventually also the case with residual stream at some size, like it seems d(active_feats)/d(num_feats) is declining at 32x, some layers more than others, but we havent checked if it ever hits 0 for residual stream
running 64 and 96 on gpt2sm to test this, though should prob go back to pythia70m because the extra dims hurt a lot for this
This code: https://github.com/zeyuyun1/TransformerVis/blob/main/sparsify_PyTorch.py#L16
Which is used here on the dictionary (ie decoder): https://github.com/zeyuyun1/TransformerVis/blob/main/train.py#L172
Oh, I think that's just how they optimise the dictionary.
@gilded merlin, just my general list of things-to-do:
- Learn circuits for many target tasks: adversarial examples, chess/othello, in-context learning, deception, truthfulness, sycophancy, etc.
Like the dictionaries aren't perfect atm, but trying to extract circuits from real things will still inform what heuristics to use (then better dictionaries can be slotted in later).
- Better dictionaries: FISTA stuff above (& I want to chat w/ some Harvard people here who do dictionary learning as their research once our paper's on arxiv) and KL-divergence penalty (this is what I'm doing this week).
- Activation engineering using our learned dictionary features
- Better automatic circuit detection: (A) how to go forward (e.g. layer 3->4), (B) connect w/ dictionaries learned on MLP-out & Attn-out, (C) weight based connections (ie features in residual connect to the features in MLP, and weights connect them. Can you predict this from just the weights?)
- Connecting circuits learned w/ datapoints. How do datapoints lead to learned features/dictionaries? This could even be paired up w/ developing Deep Learning theories since it's easier to develop theories when you have specific examples of circuit-formation.
- Refactoring code for my manual interp stuff & make a standalone colab notebook that can load in a dictionary from e.g. hugging face
- Optimization help: code in perplexity check every N batches, fix wandb display (or some equivalent), have a changing l1 value to specify a set sparsity (ie features/token)
- Updated Github w/ minimal code to run on a new model & look at it in details.
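(One possible shape for the "changing l1 value to specify a set sparsity" item, as a hedged sketch: a multiplicative controller that raises the l1 coefficient when more features per token are active than the target and lowers it otherwise. The function name, the rate, and the observed values are all made up for illustration.)

```python
import math

def update_l1(l1_alpha, observed_l0, target_l0, rate=0.1):
    """Nudge the l1 coefficient towards a target number of active features per token."""
    error = (observed_l0 - target_l0) / target_l0
    return l1_alpha * math.exp(rate * error)

# Toy usage: pretend these are the mean active features/token measured every N batches.
l1 = 1e-3
history = []
for observed in [60.0, 45.0, 30.0, 20.0, 18.0]:
    l1 = update_l1(l1, observed, target_l0=20.0)
    history.append(l1)
```

The multiplicative form keeps l1 positive and treats it on a log scale, which matches how the sweeps above are spaced.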
Optimize the dictionary for what? (I'm assuming reconstruction)
For this week, I'm going to get that KL thing working & get the perplexity check coded up as well.
@keen pivot can the experiments in the repository be easily run on a colab?
For training a dictionary, no.
which of above tasks can be done on colab
It was possible at one point, and it's on the to-do to make them so
A colab notebook is going to not be so great because you need other files.
I think (6) is the one for this then
yeah i know, but you do need a gpu for this?
You very much need a GPU for it to go reasonably, and the Colab one should work
But the current repo is optimized for like multiple GPU's and having the repo loaded, which is doable to put in a colab
But will be difficult for pushing useful PR's
You could convert this file: https://github.com/loganriggs/sparse_coding/blob/main/lucia_and_lovis.ipynb
To a notebook. You need to have the dictionary loaded from pythia160m, and the autoencoders folder from: https://github.com/HoagyC/sparse_coding
and I'm unsure how to import that to a notebook.
so task 6 is for demonstration easiness basically
Yep
I will say vast.ai is quite cheap compute & only takes a day or so to learn.
It would help!
Will start on that, if any problems come i will ping you
In-context learning (ICL) is one of the most powerful and most unexpected
capabilities to emerge in recent transformer-based large language models
(LLMs). Yet the mechanisms that underlie it are poorly understood. In this
paper, we demonstrate that comparable ICL capabilities can be acquired by an
alternative sequence prediction learning method ...
Have you seen this paper, in above list you mentioned learning circuits for target tasks i.e in-context learning
this paper might give some insight or food for thought regarding in-context learning, i read it recently
When will this be done/how did it go?
finished now (for a couple of layers), havent had a chance to look at results yet, will v soon tho, currently getting the proper results for the correlation btwn interp scores and e.g. kurtosis and skew. took longer than expected cos needed a batched version of the moment calculations which was a little awkward
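(A sketch of one way to batch the moment calculations, assuming feature activations arrive as (n_tokens, n_feats) chunks. It accumulates raw power sums and converts them to skew and kurtosis at the end. The class name is hypothetical, and raw power sums can lose precision when means are large relative to the spread, so treat this as illustrative.)

```python
import numpy as np

class StreamingMoments:
    """Accumulates sums of x, x^2, x^3, x^4 per feature across batches."""

    def __init__(self, n_feats):
        self.n = 0
        self.sums = np.zeros((4, n_feats))

    def update(self, batch):
        batch = np.asarray(batch, dtype=np.float64)
        self.n += batch.shape[0]
        for k in range(4):
            self.sums[k] += (batch ** (k + 1)).sum(axis=0)

    def skew_and_kurtosis(self):
        m1, m2, m3, m4 = self.sums / self.n  # raw moments E[x^k]
        var = m2 - m1 ** 2
        mu3 = m3 - 3 * m1 * m2 + 2 * m1 ** 3
        mu4 = m4 - 4 * m1 * m3 + 6 * m1 ** 2 * m2 - 3 * m1 ** 4
        return mu3 / var ** 1.5, mu4 / var ** 2  # kurtosis is non-excess (normal = 3)

rng = np.random.default_rng(0)
data = rng.exponential(size=(1000, 5))
tracker = StreamingMoments(n_feats=5)
for start in range(0, 1000, 100):
    tracker.update(data[start:start + 100])
skew, kurt = tracker.skew_and_kurtosis()
```

Dead features have zero variance and would give a division by zero here, so they would need masking before correlating against interp scores.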
now running results, running it for lots of different chunks so its mega slow, results in about an hr
hmm the number of feats just keeping growing
will switch to pythia 70M and keep cranking it up
might have to keep turning the batches up as well, this is 32 chunks, and the time taken to converge to this shape gets higher as you increase the size
worried that the 10chunk 32x dicts were a bit undertrained
now setting off 32, 64, 128 and 256 on pythia70M for 64 chunks. time might be way too long and might have to settle for a single layer, results are generally v consistent across non final layers
hmm eta like 5 days :(( will restrict to layer 2
still will have to wait like 2 days for long enough 256x results
remind me on wednesday morn to check the 256x results lol, other ones will be done by morn
Holy shit pahaha; we should look into ways to optimise this for scaling purposes maybe
got some level of results from the larger dict sizes. not seeing any reduction in the ability of the model to incorporate more and more features, even as the % of active features keeps falling. these results are a bit odd in that they dont show as much of a dropoff towards high or low values, which we see in both the gpt2sm runs just above and the previous pythia70m runs.
256 is almost certainly undertrained (still running) this is after 32 chunks
How do they look on the reconstruction-sparsity plot
OOM trying to answer this lmao
sorry went for lunch but i can just take the sample size down
results are really annoying though bc they dont match up with the previous small batch runs
pain to see but there's no improvement at these sizes. you can see that 32-128 are on top of each other and 256 is undertrained
Is that l1 in the legend? Edit: Wait it can't be cause we have different sparsities. What's in the legend? @pallid current
really dont understand why sparsity isnt improving
legend is ratio and the total number of feats
from 1k to 131k
I really feel like we will see vastly different results if we use better encoders (like e.g. FISTA) and we'll get better translation from dict size to FVU/sparsity; I think maybe part of what's happening with our current models is that our current models need to be more zealous with their cutoff the more directions we have, and so they don't necessarily improve FVU.
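(For anyone following along, a small sketch of the two axes being argued about, assuming FVU means fraction of variance unexplained relative to the per-dimension mean, and sparsity means the average number of nonzero codes per token. The function names are illustrative.)

```python
import numpy as np

def fvu(x, x_hat):
    """Fraction of variance unexplained: 0 = perfect reconstruction, 1 = no better than the mean."""
    residual = ((x - x_hat) ** 2).sum()
    total = ((x - x.mean(axis=0)) ** 2).sum()
    return float(residual / total)

def mean_l0(codes, eps=0.0):
    """Average number of active (nonzero) features per token."""
    return float((np.abs(codes) > eps).sum(axis=1).mean())

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 8))
mean_only = np.broadcast_to(x.mean(axis=0), x.shape)  # predict the mean everywhere
codes = np.array([[0.0, 1.5, 0.0], [0.2, 0.0, 0.0]])
```

Plotting fvu against mean_l0 for each dictionary size is the reconstruction-sparsity curve; a bigger dictionary "helping" would show up as a lower curve.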
Are these tied or untied?
Concretely I'd expect to see some increase in bias amount the larger we go, and also see better FVU-sparsity given dict size with untied models because then they can do stuff like 'activate if this other feature is not active'
tied
ok well if thats your guess we'd better start testing that - i thought you had some early results from FISTA a couple of months ago and it was underwhelming
That was LISTA and that's because it converged badly
Haven't actually tried regular non-neural-net FISTA
i dont understand why these larger models - with many more active features! - dont increase the bias and become more specific though
tfw i dont have a username for yann.lecun.com https://yann.lecun.com/exdb/publis/pdf/gregor-icml-10.pdf
Well, I guess if they did too much they'd have really shit reconstruction, because we use a ReLU for thingy
yeah but in that case we don't have a mechanism for why they would even bother turning on more features. like if it's not improving reconstruction loss, and there aren't fewer features active per token, then that's all of the losses so what's the value in these extra feats?
This is where the clustering experiment would be useful
Initialisation or something? Since the bias is always initialised to about 0 it might be closer to initialisation to spread everything out
Also think we should try this type of activation out further
Smh what the hell
so like adding back in the bias but with a bit of additional smoothness? makes sense to me
Ye exactly
I think I have a thing on my branch/maybe your branch called Thresholding or something
The KL stuff is going well. I think we can tradeoff some reconstruction loss for KL/perplexity, which I think is what we really want (ie features that have intuitive, causal effects)
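(The thread doesn't spell out the exact objective, so this is only a guess at what trading reconstruction loss for KL could look like: the usual MSE + l1 terms plus a KL term between the model's next-token distribution with the original activations and with the reconstructed ones. All names and coefficients are assumptions.)

```python
import numpy as np

def log_softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def kl_divergence(logits_orig, logits_recon):
    """Mean over tokens of KL(p_orig || p_recon), computed from logits."""
    logp = log_softmax(logits_orig)
    logq = log_softmax(logits_recon)
    return float((np.exp(logp) * (logp - logq)).sum(axis=-1).mean())

def combined_loss(x, x_hat, codes, logits_orig, logits_recon, l1_alpha, kl_alpha):
    mse = ((x - x_hat) ** 2).mean()
    l1 = np.abs(codes).sum(axis=-1).mean()
    return float(mse + l1_alpha * l1 + kl_alpha * kl_divergence(logits_orig, logits_recon))

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 10))
noisy = logits + 0.1 * rng.normal(size=(4, 10))  # stand-in for logits after splicing in x_hat
kl_same = kl_divergence(logits, logits)
kl_noisy = kl_divergence(logits, noisy)
```

Raising kl_alpha relative to the MSE weight is the knob that would favour functional equivalence over exact reconstruction.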
I guess to test this you could do runs with x biases initialised at 0 and the rest at higher values or something and see how it changes the ending no. Active features
oh that rings a bell. i also have a bias-added-back (without smoothing) lying around somewhere, i dont think properly tested
Note: I need to add to my future work list: monosemanticity metric.
If we had a dataset that had very basic features (maybe just token level) and made a metric on how monosemantic it is (maybe defined by a weighted histogram measure?), this may also inform how monosemantic other types of features are.
It would also be a cheap test for checking if our dictionaries at different sparsities/hyperparams are more monosemantic
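(One hypothetical version of a weighted histogram measure, purely as a starting point: for a single feature, sum its activation per token id and report how much of the total activation mass sits on the top k tokens. A token-level monosemantic feature would score near 1. Nothing here is an agreed metric.)

```python
from collections import defaultdict

def top_token_mass(token_ids, activations, k=1):
    """Share of a feature's total activation carried by its k most-activating token ids."""
    mass = defaultdict(float)
    for tok, act in zip(token_ids, activations):
        if act > 0:
            mass[tok] += act
    total = sum(mass.values())
    if total == 0:
        return 0.0  # dead feature on this data
    top = sorted(mass.values(), reverse=True)[:k]
    return sum(top) / total

# Toy feature: fires mostly on token 5, a little on token 7, never on token 9.
score = top_token_mass([5, 5, 7, 9], [2.0, 1.0, 1.0, 0.0], k=1)
```

This only captures token-level monosemanticity, so context-dependent features would score low even when they are clean.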
made a few notes from the textbook im reading now of things particularly relevant to sparse coding, welcome to read here, super messy, will prob keep adding to it https://docs.google.com/document/d/1MzSS2EFXtva5uTWxl7KGjSs_t3Z5mv7yoM2bMsZjATs/edit?usp=sharing
Notes for sparse coding on High Dimensional Data Analysis with Low Dimensional Models by John Wright and Yi Ma Qu, Zhai, Li et al - Analysis of the Optimization Landscapes for Overcomplete Representation Learning They study sparse overcomplete dictionary learning, show that problems can be formu...
I think I also want to scale up the concept erasure stuff to a more capable model, do you think we could do a run on pythia-410m? At maybe like 4x dim size and a few L1 levels over all layers?
Or, if not all layers, maybe every other one
yeah def doable, most of the gpus are free atm, will set off
ok currently running 8 l1 sweep of pyth410, 80 chunks, layers [0,2,4,6,8,10,12]
Oh this is cool! Thanks for sharing
I think maybe we should check out using VAEs (or most likely, something similar with also-useful subcomponents) for this, like we might be able to achieve more robust/in-distribution activation engineering by sampling from e.g. the latent distribution conditioned on the 'deceptiveness' direction being higher than some amount (obviously an oversimplification, you'd also want to preserve other properties of an activation if you were to edit it, and you'd probably find directions in latent-space to preserve semi-autonomously)
(Also have some friends in Bristol that have maybe been wanting to work with this for a bit, @thorny cypress and @coarse flint)
I can't actually think of that many benefits that doing a search over conditioned latent space has over searching over e.g. sphered data for a point with a certain magnitude in a certain direction and minimum distance to some other target point like you might do in concept editing
Unless, like, "covariance doesn't capture sparsity well" or something
Lol #1146607658179252286; Time to read all papers posted in this channel ever
Right it seems the exact thing you want here is the normalising flow autoencoder that @opal basin mentioned ages ago.
ohh?
i never did try guided sampling with it
but it might work!
also you can try training a diffusion model in a normal autoencoder/low-beta VAE latent space to sample from it and guiding the diffusion model sampling process to update the prior from the diffusion model with the evidence from the criterion, guided sampling totally works for diffusion
wait did you guys do the end to end training yet
have you seen the BERTflow paper https://arxiv.org/abs/2011.05864
Pre-trained contextual representations like BERT have achieved great success
in natural language processing. However, the sentence embeddings from the
pre-trained language models without fine-tuning have been found to poorly
capture semantic meaning of sentences. In this paper, we argue that the
semantic information in the BERT embeddings is not...
@keen pivot was working on it. In a week or so when I get back from holiday I want to do more with better autoencoders since ours are quite inaccurate at the moment.
No, interesting
Yep! Very basic result:
The dictionaries have much better perplexity for the same sparsity (though at the cost of reconstruction loss), but don't 100% match the original model's perplexity.
who cares about reconstruction loss 😛
I haven't checked the individual features that are different between them.
One general problem for our work is determining the "goodness" of the model.
I'm currently working on an automated monosemanticity metric for both the input activations & output. It's only on single-token features, but monosemantic on single-token-level features may imply monosemantic on other features.
I'm hoping this helps us actually measure what dictionaries are "better" or not.
Hehe I'm still predicting significant perf improvements (on both perplexity and reconstruction loss) when training better autoencoders, ours aren't even capable of accurately reconstructing data at the moment. I do worry that we'll get some accursed less-monosemantic solution if we switch autoencoder though, maybe something about the current setup incentivises monosemanticity compared to using FISTA for a given sparsity level
I think if you get better perplexity, reconstruction, AND similar sparsity, I expect it to be monosemantic.
Yes it's kind of a small worry, but I'm just slightly skeptical of the 'sparsity induces monosemanticity' story ATM I guess, partially because of the fact that bigger dicts don't work as well as you might expect ( @pallid current have you compared 'mean autointerp' or some weighting of that between dictionary sizes?)
yeah on the pythia70M i ran autointerp over different dict sizes, it's in the draft appendix atm
general pattern was for the first couple of layers, interp scores didnt change with dict size and for middle-latish the interp generally got worse
which is correlated with there being clearer improvements in sparsity/fvu when increasing dict size for those early layers, while the improvement is minimal from about layer 2
Can you explain the connection with bigger dicts bad -> sparsity induces monosemanticity?
And to be clear, my claim is that for the same reconstruction loss & perplexity, a sparser solution is more monosemantic.
I now retract that, I got confused.
are there plots of the pre activation distribution of values for some arbitrary autoencoder neuron, along with the negative biases (ie where the ReLU truncates it)?
Not currently, we should definitely do that.
OpenAI has a different approach for finding directions that also used that visualisation, it'd be interesting to compare the location of our negative bias to theirs, however they set it.
Ok, solid claims
- I think some of the 'sparsity induces monosemanticity' phenomenon is 'our measures of monosemanticity are slightly gameable by sparse activations'
- I think that switching to FISTA will initially reveal solutions with higher sparsity and reconstruction accuracy, but lower monosemanticity, because FISTA is just a much better 'encoder' (similarly to the topk-pca thing from before)
Ffs why are you Leo gao I thought you were two different people, very sorry for confusing!!
no worries
First attempt at a monosemanticity measure. The peak is at 3e-4, which doesn't seem right to me. Though I think I'm too hungry to explain what experiment I did specifically
token entropy or something?
(Okay, getting food in 5 minutes)
I'm measuring monosemanticity on a token level, and looking at the features that activate for single tokens only (e.g. periods, newlines, commas, etc). I assume that all LLMs will dedicate features to these, even if you scale, you'll get feature splitting (ie a feature that activated for all periods now splits into two features that activate for subsets of periods).
So I find these features & count the number of tokens they activate on for that single token, divided by total number of non-zero activations. Weighted means I account for how much that feature activates (e.g. "." with activation 8).
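A minimal sketch of the single-token score described above, for one dictionary feature at a time. The function name, the toy activations, and the use of token id 13 as a stand-in for "." are assumptions for illustration, not the code that was actually run.

```python
import numpy as np

def single_token_score(acts, token_ids, target_token, weighted=False):
    """Share of a feature's nonzero activations that fall on target_token.

    acts:      (n_tokens,) activations of one dictionary feature
    token_ids: (n_tokens,) token id at each position
    weighted:  if True, weight each position by its activation value
    """
    acts = np.asarray(acts, dtype=float)
    token_ids = np.asarray(token_ids)
    active = acts > 0
    if not active.any():
        return 0.0  # dead feature: no activations to score
    on_target = active & (token_ids == target_token)
    if weighted:
        return float(acts[on_target].sum() / acts[active].sum())
    return float(on_target.sum() / active.sum())

# Toy example: the feature fires on two "." tokens and one other token.
acts = np.array([8.0, 0.0, 2.0, 0.0, 6.0])
tokens = np.array([13, 5, 7, 13, 13])
unweighted = single_token_score(acts, tokens, 13)
weighted = single_token_score(acts, tokens, 13, weighted=True)
```

The weighted variant rewards features whose off-target activations are small, which is one reading of "account for how much that feature activates".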
Okay, for next todo's:
- Verify these single-token features are indeed monosemantic in the low-l1 regime (in case of bugs in code)
- Check other features to see if they're monosemantic across l1's
Would be nice: have guaranteed features in a slightly more complex model (like TRACR code in superposition decomposed), then we can check if those features are monosemantic for a given l1 value.
got centered data running, trying an r4 l3 run at the moment
lmao
How are you doing this btw
im centering it when running setup_data. also normalizing variance. currently the means/stds aren't saved anywhere which needs to change but they're just the means and stds of the first chunk
As in, proper sphering/whitening? Are you decorrelating the data as well @pallid current?
no, just subtract mean and divide by std
Hmm
I think you should probably be decorrelating as well. You can use the BatchedPCA to implement efficientish sphering
i'm open to trying it but don't understand the transformation well enough to have a feel for how that would interact with the features
gonna have to mostly pass it onto you, i'm on holiday from tomorrow and its MATS final presentations today
It is true that these high sparsity (e.g. 350/500 d_model) dictionaries have single-token level monosemantic features. It is also true that many, many other features are polysemantic. I've now got a few more ideas:
- Different monosemantic datasets - build a dataset of a feature that's more complex than single-token level. single-token level may be low-hanging fruit for the model (especially since I'm doing quite common words). So doing more complex features may be a better measure of monosemanticity in general.
- Measuring how much a feature is just copying a dimension in the residual stream - residual stream is (mostly) polysemantic. So we can figure out how much a feature's [activations/variation] can be explained by 1 neuron basis element (I feel like there's an established way of doing this, but not familiar with it). Additionally, instead of looking at 1 feature at a time, we could look at specific datapoints & see if their encoding is more like an identity or not.
- For a given sparsity, S, that means a token has S features activating on average. We could qualitatively examine a few example sentences to find the features that are most reconstructing it. e.g. For the sentence " Of the 5 donuts, he ate all 5", we could find all features that activate for the last token (i.e. " 5"), and make statements like "30% of reconstruction is 'single digits' feature, 25% is 'repeated token', etc". Better dictionaries will just make more sense here.
Like, normalising variance component-wise is definitely not a basis-invariant operation and I guess I don't want to induce any privileged basis on the activations
hmm dyou reckon we should either just mean center or fully whiten then? i was gonna just center to begin but then i thought about those outlier dimensions and added the stds
I mean I'd probably want to try both tbh
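For reference, the three preprocessing options being compared here (mean-centering only, per-dimension unit variance, full whitening/sphering) can be sketched as below. This is a plain NumPy illustration on toy correlated data; the function names are made up, and it does not use the BatchedPCA helper or the real setup_data code.

```python
import numpy as np

def center(x):
    """Subtract the per-dimension mean."""
    return x - x.mean(axis=0)

def standardize(x, eps=1e-8):
    """Center, then divide each dimension by its own std (basis-dependent)."""
    xc = center(x)
    return xc / (xc.std(axis=0) + eps)

def whiten(x, eps=1e-8):
    """PCA whitening: center, rotate onto eigenvectors, rescale to unit variance."""
    xc = center(x)
    cov = xc.T @ xc / (len(xc) - 1)
    evals, evecs = np.linalg.eigh(cov)
    return xc @ evecs / np.sqrt(evals + eps)

# Toy correlated data standing in for residual-stream activations.
rng = np.random.default_rng(0)
mix = np.eye(4) + 0.5 * np.ones((4, 4))
x = rng.normal(size=(2000, 4)) @ mix + 3.0

s = standardize(x)
w = whiten(x)
corr_s = s.T @ s / len(s)        # off-diagonals stay large: still correlated
cov_w = w.T @ w / (len(w) - 1)   # approximately the identity
```

Standardizing leaves the correlations in place and depends on the coordinate basis, which is the concern raised above; whitening removes the correlations as well.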
also the run failed at like 64 epochs for some reason but there's dicts in /mnt/ssd-cluster/pythia70m_centered/.../_63 if anyone wants to check on them
Could you run one on just mean-centered data before you go on holiday I guess?
yup can do
Lol how come
not sure, no proper error, it just kinda hung, said something about a wandb network error but not sure if that's symptom or cause
yeah i can do that, though rn it's slightly slow to run any decent analysis on it with new chunks etc, you'll prob have to do that
speaking of slow feedback loops, did anyone look at the gpt-2-small results? they look really good!
clear gains up to 96x ratio, and look at the y-axis - it starts at 0.02!
i know! im pretty shocked
need to get the interp on it asap
cant believe im gonna go on holiday to NYC instead of grinding on this 😭
Initial hypothesis
- pythia is incredibly uncentered, gpt2small isn't maybe? This seems easy to test, just look at the means of everything
I remember Neel Nanda saying something about this, this seems low-cost-to-test-and-potentially-very-high-value
yeah hang on wasn't there some plot of the centeredness of different models
cant immediately find anything for pythia, gpt2 looks well centered https://www.lesswrong.com/posts/BMghmAxYxeSdAteDc/an-exploration-of-gpt-2-s-embedding-weights
I vaguely remember looking at pythia s and it wasn't, might be hallucinating tho
I'm having strong words with my past self if the whole problem was mean centeredness
That doesn't even make any sense 😕
this is latest result, 4x ratio
Time to rewrite everything
this is previous, mixed ratios
How long you in nyc for
A rewrite before ICLR is possible then maybe
Phew
I predict these directions are less useful meaning less monosemantic, otherwise my mental model for features in MLPs is broken, but tbh I didn't trust it before anyway
Yeah agreed from a theoretical perspective I don't understand it in the mlp, except maybe just to say that the mlp may be used to perform arbitrary function approx that isn't very tied to the neurons themselves, and this still exhibits a sparse structure
But I think that's mostly already true for our normal sc on mlps
Will have to actually interp and see
set off a run at ratio [4,8,16,32] residual stream centered all layers
still unit variance but not decorrelated (soz)
I'm kinda worried that doing unit variance without decorrelating will squish important info and maybe harm perf possibly idk
will have a quick look after the first model is done and see if it looks decent, will cancel and rerun without altering variance if not, otherwise will try set off in a couple days or maybe airport lol
Enjoy NYC btw
idly wondering whether
- GPT-J and the Pythia-v0 models are also uncentered
- whether gpt-neo models are centered
because something that pops out to me is the initialization scheme used for gpt-j and Pythia (section 2.1.3 of the neox-20b paper) is different than standard. this is just a drive-by observation, feel free to disregard 😅
Ah, no that maybe rings a bell 🤔 might well be the reason, we shall see!
ok got some results back from large centered runs (including unit var, not fully whitened) on pythia70m, first impressions are that it's not doing anything too crazy
even on l5 surprisingly
i think the actual takeaway from the last day or two might be that gpt2sm >> pythia70m
plane is boarding now no time to test whitening but set off a run without zerovar just to compare
priority should be to run more tests on those large gpt2sm dicts tho imo
Second hypothesis: something something outliers. Do they differ significantly between the different models?
Wait, wtf. I can look at these, and check their perplexity
Hypothesis on the difference in FVU between Pythia and GPT small?
They both have outliers, but I think Pythia also had them for the first delimiter (eg period/newline) but not GPT small
What about their magnitude
I guess I'd also be interested in FVU against sphered when trained on sphered
WAIT ughhh deleted the graphs above they are bs, i didnt pass the argument to the HF activation function only the baukit one 🤦♂️
norm(mean) by layer:
still worried i've done something wrong somehow but redone it with centered data (no unit variance this time) and not seeing any diff in l5 perf
I'm not getting crazy low reconstruction loss to match this graph
Maybe crisis of confusion averted
Still think that philosophically speaking we should be centering/allowing learned center points but initialise to mean-center
I did reconstruction, not FVU, but they should be similar
@keen pivot hahaha wait how? I can't go back and check the code rn but i dont even know what bug would get the fvu wrong
Unless the fvu isn't normalized properly when switching to gpt2sm?
Tho I think the most interesting thing was the fact that 96x was better than 64x, do you still see that?
I have not noticed more scaling is better for perplexity at least!
Wasn't that also true for certain layers of pythia though?
Oh, I'm not testing on data it was trained on, so maybe!
Hmm yeah layer 0 and 1 so yeah not that surprising for layer 2 gpt2sm tbf
I'm doing layer 4
So the original image was just layers 0 & 1?
Original was 2 I'm pretty sure, there's only 1 layer which has the big dicts i think?
Layers 2/4/6/8 all have it!
Not the best graph, but here's perplexities for layer 4
Oh note: I don't think the first column is original, because it shouldn't change I think (unless I'm shuffling data)
Wait so did you try to replicate my original L2 Graph?
Nope! It’s just what I had running and it finished running.
I’m hoping tomorrow we’ll get it figured out
Nice, sorry to drop a bomb and then go on hol lol
Figured it out: gpt2 has large activations, so variance is large, so FVU is smaller (since FVU = MSE/Var)
Surely the MSE would be similarly scaled though? What is 'median activation' here exactly? Median activation norm?
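A small sketch of the two effects being argued about. It assumes FVU is computed as MSE divided by the total variance around the per-dimension mean; the real codebase may normalise differently, and the data here is synthetic.

```python
import numpy as np

def fvu(x, x_hat):
    """Fraction of variance unexplained: MSE / variance of x around its mean."""
    mse = np.mean((x - x_hat) ** 2)
    var = np.mean((x - x.mean(axis=0)) ** 2)
    return mse / var

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 8))
x_hat = x + 0.3 * rng.normal(size=x.shape)   # imperfect reconstruction

base = fvu(x, x_hat)
# Scaling activations and reconstruction together leaves FVU unchanged.
scaled = fvu(70 * x, 70 * x_hat)

# Adding one high-variance dimension that is reconstructed exactly
# lowers FVU a lot, although the error on the other dimensions is the same.
outlier = 30 * rng.normal(size=(1000, 1))
with_outlier = fvu(np.hstack([x, outlier]), np.hstack([x_hat, outlier]))
```

So a pure change of scale should not move FVU, while extra variance concentrated in dimensions the dictionary captures almost perfectly can.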
I just got the batch at layer N, and took the median
Elementwise?
I don't think there was much difference in MSE between Pythia & gpt2 small. Can check
Median of everything, lol
Ok, so I'm quite confused as to why that's on the graph? Like, what are you trying to show with it? (Just slightly confused 😅)
Just general statistics. I don't know what I'm doing
But ya, I think max-activation -> high variance is the thing
I'd be very surprised if our autoencoders were significantly not scale-invariant, unless something something initialization magnitude or whatever.
Like, maybe get a batch of pythia data, scale so that mean pythia activation mag = mean gpt2small activation mag, and see what happens to the FVU
Like the difference between a variance of 35 & 0.5 is 70x, which I think fully explains the FVU difference
I think just dividing by 70 in this case is also correct thing from my above message
I'm really confused as to how that could be the case.
Uh, sure, try that
What the hell
You're right. Hoagy's results are more like FVU/20
Oh I misread this
Could you plot the unscaled things as well?
Wait, are you retraining?
Wait, what are the lines here?
FVU of pythia-70m layer 2
But what are the different lines?
FVU, FVU/20, FVU/70
Ok, so I think maybe you misunderstood what I meant;
I'm confused as to why FVU would change so much with activation mag, so I wanted to test this by training autoencoders on scaled activations but the same underlying dataset, and see if anything changes
I should be able to plot MSE from the pythia one & gpt2 small & gpt2 small will be 3x better.
Wait, why? isn't FVU calculated by MSE/var & var is affected by large values?
To test the hypothesis that the change in magnitude is the thing that is causing the error change, and not anything else structural
But so is MSE
Oh, I think I have access to the scaled Pythia ones
Not if the large values are explained by outlier dimensions which the dictionary is super good at capturing
I think it's probably better to scale by average magnitude tbh, since this scaling is not rotation invariant
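Scaling one model's activations so their average norm matches another's could look roughly like this. The helper names are hypothetical and the arrays are random stand-ins for a batch of Pythia and GPT-2 small activations.

```python
import numpy as np

def mean_norm(x):
    """Average L2 norm over a batch of activation vectors."""
    return np.linalg.norm(x, axis=-1).mean()

def match_mean_norm(src, ref):
    """Rescale src by a single scalar so its mean norm equals that of ref."""
    return src * (mean_norm(ref) / mean_norm(src))

rng = np.random.default_rng(0)
src = rng.normal(size=(512, 8))          # stand-in for pythia activations
ref = 6.0 * rng.normal(size=(512, 8))    # stand-in for gpt2-small activations
matched = match_mean_norm(src, ref)

# The statistic is unchanged by any rotation of the basis.
q, _ = np.linalg.qr(rng.normal(size=(8, 8)))
rotated_norm = mean_norm(src @ q)
```

Because the scale factor is one scalar computed from norms, it does not privilege any basis, unlike dividing each dimension by its own std.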
Ok, so
- I was counting this as a structural difference (i.e., more of the norm is in the outlier dimension)
- could we look at the outlier dimensions?
Sorry for the confusion 😅
So the pythia-70m_centered in /mnt is bad because it's not scaled correctly?
What statistic on outlier dimensions?
Hmm 🤔 maybe norm of max dimension/norm of activation, or look at the distributions for dimensions in the normal basis?
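One possible reading of "norm of max dimension / norm of activation", sketched in NumPy. The second helper, which ranks dimensions by mean absolute activation, is an added assumption about how outlier dimensions might be picked, and the data is a toy batch with one planted outlier.

```python
import numpy as np

def max_dim_fraction(x):
    """Per token: share of the L2 norm carried by the largest-magnitude coordinate."""
    return np.abs(x).max(axis=-1) / np.linalg.norm(x, axis=-1)

def top_outlier_dims(x, k=2):
    """Indices of the k residual dimensions with the largest mean |activation|."""
    return np.argsort(np.abs(x).mean(axis=0))[::-1][:k]

rng = np.random.default_rng(0)
x = rng.normal(size=(500, 16))
x[:, 3] += 50.0                      # plant an uncentered outlier dimension

frac = max_dim_fraction(x).mean()    # close to 1 when one dimension dominates
dims = top_outlier_dims(x, k=2)
```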
Yeah, it wasn't uncorrelated, so it could have squished things strangely compared to the outlier. However, I think this is maybe fine since the outliers are mostly basis-aligned?
It's pretty cheap to look at it now and do stuff with it.
I'm not confident in me quickly setting off a run making it uncorrelated
I'm back home this evening, I could set one off maybe
Does normal basis mean just residual basis?
Yeah
What are the axes here?
They both have only like 7-10 extreme values.
Histogram: x is activation
Ok! So this seems like a significant difference. Pythia's outliers are waaaay smaller in terms of standard deviations outside mean
So the hypothesis of "Sparse autoencoders capture the outlier dimensions well, so FVU will change drastically depending on outlier dimension std"?
Sure, something like that.
So no holy grail by just centering data?
Did you find perplexity changes significantly between gpt-2 and pythia?
No, unf I don't think
Still think we should be doing this though
Hoagy has an image showing some improvement doing it.
Trying to work out whether downgrading the importance of getting the outliers exactly right, i.e. by whitening/sphering, is a reasonable thing to do
Yeah I think it'll definitely improve perf, but I also don't think it fully explains (or even mostly explains, given the results you just presented) the improvement
I think given the other results which didn't seem to show much improvement we should be careful to plot these on the same graph
So hard to really read on diff graphs with diff axes
What are "these"?
Those fvu vs sparsity graphs you linked that I sent a couple days ago
I can do a comparison of KL-div under various transformations (mean-centering, sphering) on pythia when I get back
Broadly think KL-div is the thing we should focus on minimising
KL-div stuff worked on the gpt-2 models pretrained.
It's a significant difference! Let me get a better graph between the two.
Trouble is, I also don't know what the reasonable comparison for relative perplexity diff between models is ://
I think they're close enough in this case that you can compare approximate apples-to-apples
@bitter turtle, pretty big differences!
Oh, nevermind. I'm such a loooser. I need to compare by sparsity
You're just objectively correct though
I mean the objectively correct thing to do would be axhline or something
Oh ya, that works too
Hmm yeah pbbly
Okay, I think I want to plot by both sparsity & sparsity/d_model, but I expect them to mostly be the same
Looks good!
Like very good!
This was run on ~250k tokens for calculating perplexity
Note on KL: how is the model normally trained w/ EOT tokens? Is it masked? Does this affect how we'd want our autoencoders to have low KL-div with the model, or is this a total nothing-burger of a concern?
I could also look at datapoints that the reconstructed model is worse at predicting & see if there's some statistic that separates them. For example, maybe it is mostly high activating datapoints.
ok this is like the opposite of what I expected maybe? kind of implies we shouldn't be sphering things
Can you elaborate?
If we sphere, it downweighs the importance of the outlier dims wrt MSE, and maybe the above graph shows that outliers relatively more important since the GPT2 one preserves them better or something?
Obvs will actually test
I think we can see the MSE for just the outlier dimension, averaged over the batch & compare w/ GPT2 & Pythia. Though I'm unsure how to pull out just the outlier dimension component of MSE.
Yeah when I set it off I'm going to log that
Pbbly tmmrw got home too late
I'm going to approximate by 'FVU for largest activating component'
If we're looking at this from the perspective that L1 acts to disentangle latents maybe it'd be interesting to implement something like https://arxiv.org/abs/2205.05862 with a sparse autoencoder
Variational Auto-Encoder (VAE) has become the de-facto learning paradigm in
achieving representation learning and generation for natural language at the
same time. Nevertheless, existing VAE-based language models either employ
elementary RNNs, which is not powerful to handle complex works in the
multi-task situation, or fine-tunes two pre-traine...
Relevant tldr diagram
wtf.
tbf I haven't tried 'KL under reconstruction' before so not sure what I should expect
this is also layer 5
so maybe thats weird
yeah looks like a product of being layer 5, this is layer 2
not seeing too much difference between centered vs not
Okay, this is how I'm framing this:
We know gpt2 does better on perplexity for a given sparsity (and perplexity probably transfers to KL?). We could find all statistical differences between the two, and see if we could replicate one to the other (either pythia->gpt2 or vice versa).
not sure what you're addressing here (or what 'this' here refers to)
Like you & hoagy have been doing centering data in order to replicate the goodness of dicts on gpt2.
Since centering didn't replicate, there may be other statistics of gpt2 data that are responsible.
yeah im like 80% sure it's the rel size difference of the outliers
I would probably expect e.g. the sphered results for gpt-2 small and pythia to look approximately equal if this is the case
Weren't you going to do the FVU on the outlier dimensions?
yes, haven't got around to that yet
I'm currently just seeing the correct L1 range for the sphered data
Okay, so I do reconstruction for both, but don't reconstruct the outlier dimensions, just pass those through. If perplexities are about equal then, then it's the outlier dimensions
What do you mean just seeing?
Like that's what you're currently working on?
Oh, ya.
You just do the sparsity?
wdym?
Like select l1's to get a features/datapoint (ie sparsity) between 5 & d_model.
oh, yeah
oh, I changed something and now it's the same as the other ones, duh
I'm dumb
relevant info; seems about the same weirdly
this is layer 3
unsure what's going on with sphered data in the low-sparsity setting, perhaps they are undertrained
the sphered one looks like layer 5.
maybe 'approximately equal' was a bit extreme 😅 but I think they look similar at least
I would maybe put this as "not conclusive but decent evidence that it's mostly the outliers"
this is gpt-2-small, running pythia-160m which should be more equivalent (in model size) now:
pythia-160m is a good choice!
What's the graph? You say "this is gpt-2-small, running pythia-160m", so is the graph gpt2 or pythia160m?
oh, this is gpt-2 small
pythia-160m 🤔
I'd say there is definitely some other structural difference here then
(in addition to the outliers being vastly more significant compared to the norm)
I don't think this is a very informative graph, but
gpt2-small
pythia
contribution looks ~constant by sparsity
basically shows that 'if you only allow the top-2 directions (the outliers), gpt-2 has better performance than pythia' which would be evidence for 'gpt-2 is better because it can proportionally represent outliers better'
@keen pivot
Would you be able to put the two on the same plot, and much different colors?
they have the same axes, but yees?
Wait no!
????
I think this makes sense, like outliers were typically very positive or very negative, and not both, so they can't nicely be captured by a sparse code which is mean-centered
How does centering relate to outliers? If you center, then outliers contribute less to variance?
well, basically I don't think the outlier dims are mean-centered particularly.
They're not
And doing this to other dimensions has little effect?
doing what to other dimensions?
mean centering
seemingly overall it has little effect
overall I think mean centering is 'closer' to the correct value
hold on, median-centering might be closer to being correct
I did try learning dictionaries that didn't include the outlier dimensions, & they didn't have much improved performance.
I should get on the perplexity-but-exclude-outlier-dimensions things
No big difference when removing top-2 outlier dims
I mean, a big difference for the really awful l1s for pythia
And zooming in:
And for sure, the perplexity-diff goes to 0 if you use my code and find the top 500 outlier dims, replacing them, so that probably works.
So outlier dimensions don't explain the difference in perplexity
Could you summarise your interpretations of these plots?
Maybe the difference in perplexity is because gpt2 better reconstructs its outlier dimensions, so let's just run the perplexity-under-reconstruction both normally and when "carrying through" 2 outlier dimensions (ie just replace the reconstruction of the outlier dimensions with the actual outlier dimensions).
If the "carry through top-2 outlier dims" one makes the perplexities match more between gpt2 & pythia, then the cause is the outlier dimensions.
In the graph above, it doesn't really make a difference at all, so the outlier dimensions are not the cause
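The "carry through" step described above, sketched on its own. Only the patching is shown; feeding the patched activations back into the model to get perplexity is omitted, and the dimension indices and data below are illustrative.

```python
import numpy as np

def carry_through(x, x_hat, dims):
    """Return the reconstruction with the given dimensions copied from the original."""
    patched = x_hat.copy()
    patched[:, dims] = x[:, dims]
    return patched

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 16))
x[:, [3, 11]] *= 40.0                       # two planted outlier dimensions
x_hat = x + rng.normal(size=x.shape)        # stand-in for a dictionary reconstruction

dims = [3, 11]
patched = carry_through(x, x_hat, dims)
other = [d for d in range(x.shape[1]) if d not in dims]
```

If the perplexity gap between models closed after this patch, the outlier dimensions would be the likely cause; per the messages above, it did not.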
Hmm, I don't think GPT-2 "better reconstructs" its outliers necessarily, I'd expect e.g. FVU on only the outlier dimensions (note that this is different to what I tested before, before I tested 'FVU on the entire thing, but only allowing outlier features to be active') to be about the same between the models.
I just think that both dicts learn to predict the outliers fairly well/almost perfectly, since there's a strong incentive to do so compared to all the other dimensions. Then, GPT-2 gets better overall FVU because more of the norm is in something it learns perfectly, the outliers, hence the lack of a change in your plots.
@keen pivot
sorry stupid question is this using the L2 loss or the cross entropy loss? if you train the autoencoder to minimize CE it should just automatically do the right thing with the outliers right?
Ah, we were trying to explain the difference in performance between our GPT2-small and Pythia models, I think the end-to-end training thing is separate.
ok
erasure across depth scores for 410m, slightly wild
I'm still concerned I'm using LEACE unfairly, I might try and 'fake' a more in-distribution dataset by prepending random samples of the Pile or something.
runs using 10-shot and 6-shot prompting respectively. (the one on the right, where LEACE is worse, is 6-shot prompting)
Example prompt so you get a feel for what I'm doing:
My name is Connie and I am a female. My name is Eva and I am a female. My name is Mary and I am a male. My name is Paris and I am a female. My name is Jamie and I am a female. My name is Dorothy and I am a female. My name is Edison and I am a male. My name is Alex and I am a male. My name is
Maurice and I am a male. My name is Cary and I am a male. My name is Ana and I am a[completion]
kind of spooky dataset tbh
The only thing it doesn’t predict is the perplexity is better for GPT2 dicts.
Why is Mary a male? Unless this is just an example
what
Why show the mean edit magnitude over the KL diff? I also have no idea what the mean edit magnitude means.
checking dataset rn
I mean, Mary can be whoever they want to be, but like stereotypes, you know?
@bitter turtle, making a circuit for a few layers from the feature you found for gender would be cool. Like I can manually interp the ones in layer 8 which seem to do good, & back-chain to previous layer features. I could also do this for mlp_out & attn_out.
kl divergences for ablations on (presumably flawed 🤔) dataset
turns out I had set the threshold for 'name commonness' to an OOM less than I meant to
rerunning
is there some concise explanation of what this experiment is about
"test how different erasure techniques impact model capabilities on a simple task"
(except doing concept erasure at a specific layer, and not concept scrubbing)
got it, what concept are you erasing?
gender prediction from name
got it. how does this work for the dictionary feature thing?
like you just locate the feature most correlated with gender?
no, it's stupider than that; filter for features above a freq. activation threshold, and compare their erasure ability on a test dataset
what is erasure ability? like fitting a probe on the post-erasure activations?
and seeing the loss
sorry, end-to-end model score on the test dataset
oh ok so the model itself is prompted to predict gender and then you also try erasing gender from the activations and see how bad that makes the model?
What is "mean" here?
Also how do you actually perform the dictionary-based erasure
project the direction learnt by the dictionary to 0
project difference-in-means vector to 0
so it's an orthogonal projection?
yes
in the like, activation space
yes
not in some higher dimensional space in which the overcomplete basis is orthogonal
okay
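The projection discussed above, as a minimal sketch: an orthogonal projection in activation space that zeroes the component along one direction, here a difference-in-means direction. Labels and activations are synthetic, and the same `project_out` would apply to a dictionary feature's direction.

```python
import numpy as np

def diff_in_means(acts, labels):
    """Direction between the two class means (labels are 0/1)."""
    return acts[labels == 1].mean(axis=0) - acts[labels == 0].mean(axis=0)

def project_out(acts, direction):
    """Orthogonal projection in activation space: zero the component along direction."""
    u = direction / np.linalg.norm(direction)
    return acts - np.outer(acts @ u, u)

rng = np.random.default_rng(0)
acts = rng.normal(size=(200, 8))
labels = rng.integers(0, 2, size=200)
acts[labels == 1] += np.array([2.0, 0, 0, -1.0, 0, 0, 0, 0])  # planted concept shift

d = diff_in_means(acts, labels)
erased = project_out(acts, d)
```

Note this zero-ablates the component; a mean-ablation variant would instead set it to its average value over the dataset.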
What LEACE does is very similar to this, as I'm sure you know. Both methods are linear projections with the same null space and only differ in their column space
On the layers where the edit magnitude is higher for LEACE than for orthogonally projecting onto the difference in means direction, there must be some bug or some distribution shift because we prove that LEACE can be no worse than the orthogonal projection here
So I'd first want to investigate this issue
yes, I was very confused by that.
And once you've sorted that out, if LEACE still has a smaller effect on perf than orthogonal projection, then I would say that's expected
And the extra causal effect of orthogonal projection is due to side effects
what do you mean by 'distribution shift' here?
like you fit the eraser on one distribution and apply it on another?
ah, ok, no that shouldn't be happening
1000 prompts or so
I suppose the other issue is we mean ablate rather than zero ablate
I would do a sanity check with setting method="orth" and affine=False on LeaceFitter because in theory that should give identical results to your Mean method
yep
I should note that while we did do one of these destructive intervention experiments in the LEACE paper
I think that the gold standard should really be to achieve fine grained control over model output
That's much harder to achieve and also more practically useful
In particular I think evaluating methods by how much they screw up the model is sort of perverse; in the LEACE paper we actually do the opposite (lower perplexity / KL is better) since that suggests better surgicality
There are lots of ways to screw up a model; trivially you can replace activations with constants or i.i.d. noise
I'm kind of trying to do this here
lower perplexity is better?
I'm evaluating KL divergence from the base model under the fitted intervention on a subset of the Pile
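A sketch of that KL measurement, given logits from the base model and from the model under intervention at the same token positions. The direction KL(base || edited) and averaging over positions are assumptions; the logits below are random placeholders.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def mean_kl(base_logits, edited_logits):
    """Mean over positions of KL(base || edited) between next-token distributions."""
    lp = log_softmax(base_logits)
    lq = log_softmax(edited_logits)
    return float((np.exp(lp) * (lp - lq)).sum(axis=-1).mean())

rng = np.random.default_rng(0)
base = rng.normal(size=(5, 10))              # (positions, vocab) placeholder logits
edited = base + rng.normal(size=(5, 10))     # logits after some intervention

kl_same = mean_kl(base, base)
kl_edit = mean_kl(base, edited)
```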
mmm
@glass tinsel this is KL-divergence on the Pile under the learned interventions (ignore LEACE probably); the idea was that since our directions are sparsely activating, hopefully we might see that ablating them brings the activations less out-of-distribution, and so we see less KL divergence. However, it seems to be very dataset-dependent
hmm yeah I think you always need to be looking at both effectiveness and surgicality simultaneously
bc the identity intervention achieves perfect surgicality
(fwiw I think the LEACE paper should have been better at this, and I'm trying to think of ways to measure it rn)
to clarify, I'm not that hopeful about this being a useful concept editing method; I was just trying to illustrate a possible application of, and evidence for, the learned directions being 'meaningful'
hmm
See I could imagine that this might actually be a useful method
the end to end version
not the reconstruction loss version
bc the end to end version is taking into account which directions are most important
agree
the autoencoders are not good enough for that yet, though (they are fairly lossy), and I don't really see the benefit of this over e.g. activation editing + good understanding of transformer activation pdf for not-too-insanely-ood activation engineering
how are you planning to make the autoencoders better
I think FISTA for actually good and not terrible sparse coding given a learned dictionary is probably the safest bet.
you can do something like K-SVD to optimise dictionaries using FISTA as an encoder
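For reference, a minimal FISTA sketch for the encoder step only: given a fixed dictionary `D`, it solves the lasso problem `min_z 0.5 * ||x - D z||^2 + lam * ||z||_1`. The dictionary here is random, and `lam`, `n_iter`, and the toy signal are arbitrary assumptions. The K-SVD-style dictionary update mentioned above is not implemented.

```python
import numpy as np

def soft_threshold(v, thresh):
    """Proximal operator of thresh * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - thresh, 0.0)

def lasso_objective(D, x, z, lam):
    r = x - D @ z
    return 0.5 * (r @ r) + lam * np.abs(z).sum()

def fista(D, x, lam, n_iter=500):
    """Sparse code of x in dictionary D (columns are atoms) via FISTA."""
    L = np.linalg.norm(D, ord=2) ** 2   # Lipschitz constant of the gradient
    z = np.zeros(D.shape[1])
    y, t = z.copy(), 1.0
    for _ in range(n_iter):
        grad = D.T @ (D @ y - x)                          # gradient at y
        z_next = soft_threshold(y - grad / L, lam / L)    # proximal step
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = z_next + ((t - 1.0) / t_next) * (z_next - z)  # momentum step
        z, t = z_next, t_next
    return z

rng = np.random.default_rng(0)
D = rng.normal(size=(20, 50))
D /= np.linalg.norm(D, axis=0)          # unit-norm atoms
z_true = np.zeros(50)
z_true[[3, 17, 41]] = [1.0, -2.0, 1.5]  # a 3-sparse ground-truth code
x = D @ z_true

lam = 0.05
z_hat = fista(D, x, lam)
```

In a K-SVD-like loop this encoder would alternate with a dictionary update; unlike a single feed-forward autoencoder pass, it optimises the code per input, at the cost of many more matrix multiplies.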
fixed the distributional shift (I was ablating from all token positions, including the prompt), and we get these results.
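A sketch of what restricting the ablation to non-prompt positions might look like, assuming activations of shape (positions, hidden) and a known prompt length. `ablate_after_prompt` and `prompt_len` are hypothetical names, and the actual fix may differ.

```python
import numpy as np

def ablate_after_prompt(acts, direction, prompt_len):
    """Project `direction` out of `acts` only at positions >= prompt_len."""
    u = direction / np.linalg.norm(direction)
    edited = acts.copy()                 # leave the input untouched
    tail = edited[prompt_len:]
    edited[prompt_len:] = tail - np.outer(tail @ u, u)
    return edited

rng = np.random.default_rng(0)
acts = rng.normal(size=(12, 4))          # (positions, hidden) stand-in activations
direction = rng.normal(size=4)
edited = ablate_after_prompt(acts, direction, prompt_len=8)
```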
Is LEACE just bad here because it's only one layer, or also the few shot prompting/iid thing?
weird mix of reasons, mostly because it shifts activations ood
kl for these results. holy shit I need to optimise, this took ages
fwiw it's not clear to me that these results are "bad" for LEACE; they could be interpreted as good because LEACE is being more surgical. The other methods may be harming prediction ability through "spurious" channels
If you want to go hardcore you could do Quadratic LEACE. I'm actually kind of curious how big the edit magnitude would be. Haven't gotten around to measuring that for the paper yet.
I meant to test this via transfer to a different dataset, which is kind of like most of the point, but never got around to it
Isn't this shown in the KL divergence though?
The surgicality is, yes, but you can't reasonably conclude it's not doing something spurious without transfer
Is this KL on the gender dataset, and transfer is KL on e.g. Pile-10k?
no, this is KL on Pile-10k, and transfer would e.g. be using the same interventions to measure perf change on a different but correlated task
I think you gotta look at the specific features being ablated to do better generalization. Dictionaries give you that option (since there are many features found).
energy