#Sparse Coding
I'd like to have the top-5 of those features found, then provide data on which inputs they activate on & which outputs they affect (including the direction found by LEACE if that's meaningful?).
Though the output of that's going to be kind of lame. Like to get the effect on the output, you run on a lot of data, which will include part of the distribution you'd be testing for.
yes, the activation frequency thing is basically just pointless here imo. output might be interesting.
Though I think this is just the way it is & it's fine? Like dictionaries allow you to select from multiple different features, and alternative methods don't.
I did look at activations for ablated features in the previous run btw
How much it activates? Ya not useful!
or rather, the token set it activates on
still don't think that this is this useful, for the reasons you say "lame"
output cool however
I think the output is lame for the same reason: you're directly checking the test distribution (probably) when finding the effect on the output on e.g. Pile-10k.
also turns out i've been incorrectly plotting activation edit amount this entire time, it should be for the other experiment
which is weird because it's more surgical than LEACE under that measure
this is for feature selection based off linear erasure amount, i.e. how badly a logistic regression model performs when trying to linearly discriminate based off the activations under the projection sending a dictionary component to nullspace
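A rough sketch of that "linear erasure amount" measure, on synthetic activations with a tiny numpy logistic-regression probe. The helper names `project_out` and `probe_accuracy`, the data, and the use of train accuracy are all assumptions for illustration, not the actual experiment code:

```python
import numpy as np

def project_out(acts, direction):
    """Orthogonally project activations onto the nullspace of one direction."""
    unit = direction / np.linalg.norm(direction)
    return acts - np.outer(acts @ unit, unit)

def probe_accuracy(acts, labels, steps=500, lr=0.5):
    """Fit a logistic-regression probe by gradient descent; return train accuracy."""
    x = acts - acts.mean(axis=0)
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
        grad = p - labels
        w -= lr * x.T @ grad / len(labels)
        b -= lr * grad.mean()
    preds = (x @ w + b) > 0
    return float((preds == (labels == 1)).mean())

rng = np.random.default_rng(0)
d, n = 16, 2000
concept = np.zeros(d)
concept[0] = 1.0  # hypothetical dictionary component carrying the concept
labels = rng.integers(0, 2, size=n)
acts = rng.normal(size=(n, d)) + 3.0 * np.outer(2 * labels - 1, concept)

acc_before = probe_accuracy(acts, labels)
acc_after = probe_accuracy(project_out(acts, concept), labels)
print(acc_before, acc_after)
```

The lower `acc_after` is, the more linearly erased the concept is under that projection; candidate features could be ranked by it.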
this being said, the optimal feature to ablate here is often the second or third best one under this measure of goodness
The optimal feature to ablate (optimal as in leads to reduced prediction ability when ablated) is 2nd or 3rd under measure for logistic regression?
tried out transfer for a Very Shit dataset which is basically a rehash of the other one for pronoun instead of gender prediction
original dataset
transferred results
nowhere near robust but maybe promising
this is 'best of top-4 best dictionaries for other dataset' as judged by this measure
@keen pivot I can dm you the dictionary indexes now
or in a few mins, sorry
Summarisation of current alg:
- get 4 candidate directions via testing for erasure with linear binary classifiers
- sort by perf on test dataset
- measure perf on main and other dataset
This is currently a Very Shit and Uncool algorithm, but I have hope that something similar would work with a more sanely designed autoencoder
probably not that relevant anymore, but maybe useful for other people's stuff, I figured out that our previous attempts at larger/deeper autoencoders basically failed bc they had a fuckton of dead neurons at the final thresholding layer, but this can be remedied by using something like e.g. softplus instead of the ReLU
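For anyone wondering why softplus helps here, a minimal numpy sketch (values are illustrative only): a ReLU unit with a negative pre-activation gets exactly zero gradient, so it can stay dead, while softplus always passes a small positive gradient.

```python
import numpy as np

def relu_grad(z):
    """Derivative of ReLU: exactly zero for non-positive pre-activations."""
    return (z > 0).astype(float)

def softplus(z):
    """log(1 + exp(z)), computed stably."""
    return np.logaddexp(0.0, z)

def softplus_grad(z):
    """Derivative of softplus is the sigmoid, which is always positive."""
    return 1.0 / (1.0 + np.exp(-z))

pre = np.array([-6.0, -2.0, -0.5, 0.5, 2.0])  # example pre-activations
relu_g = relu_grad(pre)
soft_g = softplus_grad(pre)
print(relu_g)
print(soft_g)
```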
@bitter turtle , do you know how we're downloading data now that the pile's down? I basically want like 2 billion tokens, but openwebtext is like 8M datapoints (want more like 100k), and I don't know how to download it or to get streaming=True to do what I intuitively want.
I am honestly Suffering, and just firming downloading openwebtext
there is definitely a smarter way but I haven't Suffered enough to bother to look
How many tokens do we train our models on normally?
I could just host an openwebtext-100k, which'd be one solution
1 chunk is ~2M tokens I think, for 512-d resid
halves for double activation size
Sick
So I'm just setting off a run for training on openwebtext w/ gpt2 w/ KL & perplexity.
So maybe this will just converge to model's original perplexity and we're done as far as performance (everything else would just be converging more efficiently!)
Downloading the dataset took like 20 min. Chunking & tokenizing is like 1hr. I'll just check back on it tomorrow & hope there's no weird bug that happens (the wandb above is from running on pile-10k, so it works on that at least!)
Hmm, I wouldn't think so if we are trying to use these AEs to study the behaviour of a preexisting LM; I expect that it would be possible for the AE to improve perplexity by doing nontrivial computation rather than 'reconstructing activations to have the same semantic meaning' or whatever. This is fine for studying e.g. a model trained with many high-rank sparse disentanglement layers as part of its arch (like CRATE or something https://arxiv.org/abs/2306.01129), which is for sure something I want to do, but I worry it wouldn't be as useful for studying the non-sparse-expanded model @keen pivot
Sorry for hogging GPUs @keen pivot
doing acdc-type-stuff with dicts is expensive asf
But it's trained on KL.
Though I do agree that something like CRATE would be cool cause functionally equivalent is really what we want.
even so, it could be doing something accursed, less likely with KL tho
Wait, how would functionally equivalent be accursed?
internals might be different
seems significantly less likely/problematic tho
as in practically irrelevantly problematic
Do you know what data the gpt2 models were trained on? Pile or openwebtext?
not a clue sorry
Perplexity for gpt2: 28
Perplexity for dictionary: 32
This is before training on KL on openwebtext. That's pretty good! On Pile-10k it was (30, 40) for (gpt2, dict-base)
got these done for layers 4,6..18, took an insane amount of time (hours)
will probably use 12 because it looks the most aesthetic/illustrates what I want to illustrate the best, will include others in appendix or smthn maybe
@bronze wraith do these look better in B+W?
I'm worried that they aren't that distinguishable
(less bumpy one is normal dataset, more bumpy one is untrained transfer)
yep, those are fine in black and white!
After training directly on KL & reconstruction we went from 32.6->30.6 (it converged!) with 28 being the original perplexity. Very good! lol.
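For reference, a small numpy sketch of the two numbers being discussed: perplexity as exp of the mean next-token negative log-likelihood, and the KL between the original model's output distribution and the one obtained with the dictionary's reconstruction patched in. The logits here are random stand-ins, and the KL direction (original || patched) is an assumption; the actual training code may differ.

```python
import numpy as np

def log_softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def perplexity(logits, targets):
    """exp(mean negative log-likelihood of the target tokens)."""
    logp = log_softmax(logits)
    nll = -logp[np.arange(len(targets)), targets]
    return float(np.exp(nll.mean()))

def kl_divergence(orig_logits, patched_logits):
    """Mean KL(original || patched) over positions."""
    lp = log_softmax(orig_logits)
    lq = log_softmax(patched_logits)
    return float((np.exp(lp) * (lp - lq)).sum(axis=-1).mean())

rng = np.random.default_rng(0)
orig_logits = rng.normal(size=(5, 50))  # stand-in for the original model's logits
patched_logits = orig_logits + 0.1 * rng.normal(size=(5, 50))  # stand-in for the patched run
targets = rng.integers(0, 50, size=5)

ppl_orig = perplexity(orig_logits, targets)
ppl_patched = perplexity(patched_logits, targets)
kl = kl_divergence(orig_logits, patched_logits)
print(ppl_orig, ppl_patched, kl)
```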
I want to check which datapoints are badly represented to see if there's a trend (though, it might be because it's not a big enough dictionary to capture all the features! currently doing a 6x dictionary)
The larger perplexity difference last time was (probably) due to the Pile-10k which was a different distribution than Openwebtext
Running one on 16x ratio & checking # of dead features now for both 8x & 16x (currently 16x has 4k-5k dead, which is ~40% of the features. Definitely need to do that soft ReLU that @bitter turtle mentioned)
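A minimal sketch of a dead-feature count, assuming "dead" means the feature never activates above a threshold on the evaluation batch (the function name and the toy `codes` matrix are made up for illustration):

```python
import numpy as np

def count_dead_features(codes, threshold=0.0):
    """codes: [n_tokens, n_features] non-negative dictionary activations."""
    ever_active = (codes > threshold).any(axis=0)
    return int((~ever_active).sum())

codes = np.array([
    [0.0, 1.2, 0.0, 0.0, 0.3],
    [0.0, 0.0, 0.0, 0.7, 0.0],
    [0.0, 0.4, 0.0, 0.0, 0.0],
])
n_dead = count_dead_features(codes)
dead_fraction = n_dead / codes.shape[1]
print(n_dead, dead_fraction)
```

In practice the count depends on how many tokens are checked, since a rare feature can look dead on a small batch.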
mhmm would be interesting
Remark: if you LEACE away the gender concept in the residual stream at layer i, the tuned lens output at layer i will have no better than chance gender prediction accuracy. So if the final layer output is better than chance this has to mean that later layers are recovering the gender info in their residual updates.
You could track this process with the tuned lens
This is a really cool suggestion, I'll try to look at this soonish!
I expect a linear probe to be able to discern gender under the dictionary ablation, but I'm not sure what tuned lens would do.
Yeah for sure think we should do a comparison
Between purely ReLU and softplus
For our single-layer models
@pallid current would it be possible to do an ICA run for pythia-410m at layers [0,2,4,6,8,10,12,14,16,18,20,22], or would that take too long to converge?
Also I realise this is maybe a slightly unfair comparison, do you think I should be comparing to PCA-but-its-doubled-for-only-positive-components as well @pallid current @keen pivot?
Trouble is no-one has really done decomp of residual stream measured like this so I am essentially making up baselines
I think the PCA is fine as is.
Maybe it would be interesting to contrast both
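One possible reading of "PCA-but-doubled-for-only-positive-components", sketched in numpy: take the top-k principal directions, append their negations, and apply a ReLU to the codes so every feature is non-negative like a dictionary feature. This is an assumed construction for illustration, not necessarily the baseline used in the paper.

```python
import numpy as np

def positive_only_pca(acts, k):
    """Top-k PCA directions plus their negations, with non-negative codes."""
    centered = acts - acts.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:k]                                      # [k, d]
    dictionary = np.concatenate([components, -components])   # [2k, d]
    codes = np.maximum(centered @ dictionary.T, 0.0)         # [n, 2k]
    return dictionary, codes

rng = np.random.default_rng(0)
acts = rng.normal(size=(200, 12))  # stand-in for residual stream activations
dictionary, codes = positive_only_pca(acts, k=3)
reconstruction = codes @ dictionary
print(dictionary.shape, codes.shape)
```

Because relu(a) * v + relu(-a) * (-v) = a * v, the doubled version reconstructs exactly the same rank-k projection as ordinary PCA; only the sign convention of the features changes.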
putting this here for tomorrow
Hi 👋 I was pointed to sparse coding / superposition. I'm wondering if there is a canonical reference on the models people are fitting?
For context, I read some papers and may have some outsider insights. I don't want to wade in without checking how far people are doing the same things under sparse coding / superposition.
The main approach is basically as it's described here: https://www.lesswrong.com/posts/z6QQJbtpkEAX3Aojj/interim-research-report-taking-features-out-of-superposition
Thanks for clarifying!
For code versions (we've tried many model types!) look here: https://github.com/HoagyC/sparse_coding/blob/main/autoencoders/learned_dict.py#L124
@keen pivot
So, I'm still pretty worried / unconvinced by the experiments in Section 4 of the arXiv paper tbh
An alternative hypothesis that explains the data is something like "the dictionary ablation is messing up the model's capabilities more overall"
and the mean squared error doesn't necessarily capture this. one of the nice things about LEACE is that it actually minimizes a bunch of different squared error metrics all at once
It's sadly in the appendix, though I've wanted it in the main paper.
Yes I think it's quite important for your case.
Maybe we can convince Aidan to hot-swap it in for ICLR submission.
And I mean, maybe I'm just misunderstanding something or I'm biased toward my own method, but on priors it just seems really bizarre that the reconstruction loss-based dictionary would Pareto dominate LEACE here. I don't see how this could happen, so I'm still worried something is wrong in the experimental setup
if you're using an end-to-end loss, it makes a lot of sense. you're directly minimizing the effect on the model when learning the dictionary
but this just seems like magic
On another note, it seems like the end-to-end loss is strictly better across the board, and you have it working, so I'm a bit confused why you didn't exclusively use that for the paper
~~Project ~~Future creep
I mean, do you have all the results for end to end?
You'll probably have a better chance to get in at ICLR with it!
There's nothing to lose!
It makes more sense, and is more in line with the tuned lens stuff we already did!
Nope! Don't even have the code base properly set up to run it easily
But yes! It would be much better!
okay if I had me or someone else dedicate a day to fixing it
like
what would it take to just switch this lol
because I feel like this is fairly important
I expect we could code it & re-do all results in a week, which is in time for the deadline.
In detail, imo:
- slight rewrite of intro for loss function
- (possibly) Re-run auto-interp results (lowest priority)
- Rerun concept erasure (high priority)
- Rerun IOI (med priority)
- Rerun features on dictionary (med priority)
- Rerun Auto-circuit (in the works already, so no big deal)
- Extra section (or appendix cause 9 pages) showing perplexity-under-reconstruction.
Code:
- (Pretrain on reconstruction only) function that loads in a model & runs on KL-divergence
Ya, I think separating a pretraining & KL is good because of the extra compute cost. Larger models might not even be possible to train on KL given GPU constraints (or a headache if it's across clusters).
No commitment though! I'll mention it to them tomorrow morning when we meet up.
So I kind of think you're implementing LEACE wrong because I would expect both LEACE and diff-in-means to achieve basically 0.5 prediction ability by the last couple layers. I see that with diff-in-means but not LEACE.
More generally like, I wouldn't expect LEACE and diff in means to be dramatically different
@bitter turtle Did you ever sanity check that setting method="orth" and affine=False on LeaceFitter gives the same result as diff in means?
the dict feature curve here is highly noisy. like if you compute the area under the curve it's not obvious that dict features beat diff in means ablation
also, LEACE doing super well at layer 2 but not at later layers also seems sus
in general I wouldn't really expect any of these methods to do well at all at early layers
because the model is going to recoup performance
if y'all have like, a particular script or notebook you're using to generate the results I'd be interested in inspecting it
otoh, I guess it depends on the prompt a bit
https://twitter.com/_akhaliq/status/1703600599722279400 is this the paper for this thread?
Sparse Autoencoders Find Highly Interpretable Features in Language Models
paper page: https://t.co/0zrBV222od
One of the roadblocks to a better understanding of neural networks' internals is polysemanticity, where neurons appear to activate in multiple, semantically distinct…
I gotta be honest it's stressing me out that AK tweeted this out when I feel really unconfident in the results
can't be retracted now
Going forward just... if y'all are going to put an EleutherAI affiliation on an interp paper I'd like to review it okay? I think there was some miscommunication about like, the extent to which I had seen this
yes, it produced exactly the same curve as diff-in-means.
very sorry about this.
did you do orth and affine True?
just trying to get to the bottom of the actual difference
maybe relevantly I did a comparison against a randomly-initialised dictionary baseline, (also will do one against an l1_alpha=0 baseline)
ohhh my
it's a bit... h
wait so did you do this experiment?
no, hotfixing it in now
okay cool thank you
affine does seem like the kind of thing which would explain the weirdness
honestly the weirdest plot is this one
because... how is the model getting 0.6 acc when you're making the genders linearly indistinguishable at layer 22
with leace
but then not with diff in means
how are you numbering the layers exactly? is zero the output of the first transformer layer or the embeddings
layer 0 is output of first layer
so layer 23 would be the thing that gets fed into the unembedding?
maybe important detail; we use few-shot prompting, and are only fitting LEACE to the last prompt (the one we actually want to intervene on, and have cleanish labelling for)
Why?
do you do the same for diff in means?
why we use few-shot or why we only intervene on the final prompt
What do you mean by final prompt
so like the task might be pronoun prediction, and we go
Bob went to the store where he bought a cat. Carol went to the store where she bought a dog. Rob went to the store where [ask it for completion]
the final prompt begins at Rob
okay so you're not talking about separate prompt templates or anything like that
nono
okay so if you don't mind, my two main requests for empirical results are:
- run this with method="orth", affine=True on LeaceFitter
- show layer 23
can do
this is normal, orth+not affine, orth+affine
I'm about to pass out bc it's 2 AM here but
the mean prediction ability can't be right? the unembedding is a linear classifier and we prove in the LEACE paper that you can't do better than chance for any convex loss anyway if the means of the classes are equal
and while acc isn't convex
it would be pretty bizarre
if it were that different
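To make the "means of the classes are equal" condition concrete, here is a numpy sketch of a plain diff-in-means orthogonal projection (not LEACE itself, and not the `concept-erasure` library): after projecting out the difference of class means, the two class means coincide on the data the eraser was fit on. Data and names are synthetic assumptions.

```python
import numpy as np

def diff_in_means_eraser(x, z):
    """Orthogonal projection removing the difference-of-class-means direction."""
    delta = x[z == 1].mean(axis=0) - x[z == 0].mean(axis=0)
    unit = delta / np.linalg.norm(delta)
    return np.eye(x.shape[1]) - np.outer(unit, unit)

rng = np.random.default_rng(0)
z = rng.integers(0, 2, size=500)                     # binary concept labels
x = rng.normal(size=(500, 8)) + np.outer(z, rng.normal(size=8))

proj = diff_in_means_eraser(x, z)
x_erased = x @ proj                                  # proj is symmetric
gap = x_erased[z == 1].mean(axis=0) - x_erased[z == 0].mean(axis=0)
print(np.abs(gap).max())
```

If a linear readout on `x_erased` still beats chance by a wide margin, the usual suspects are a mismatch between where the eraser was fit and where it was applied, or later computation recovering the concept.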
yep position of intervention weirdness?
ok going to sleep now
OTOH if you only fit on & ablate the last token position instead of flattening & treating as IID & intervening on all, you have the expected behaviour @glass tinsel
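The two fitting setups being contrasted, sketched with numpy shapes only (random stand-in activations; the variable names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
batch, seq, d = 64, 10, 16
acts = rng.normal(size=(batch, seq, d))   # stand-in residual activations
labels = rng.integers(0, 2, size=batch)   # one concept label per prompt

# Option 1: flatten every position and treat tokens as IID samples,
# repeating each prompt's label across its whole sequence.
x_all = acts.reshape(batch * seq, d)
z_all = np.repeat(labels, seq)

# Option 2: keep only the final position, where the label actually applies.
x_last = acts[:, -1, :]
z_last = labels

print(x_all.shape, z_all.shape, x_last.shape, z_last.shape)
```

With option 1 most rows carry a label that describes a later token, so an eraser fit this way is solving a different problem from one fit and applied at the last position only.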
I feel like this could explain the difference
I did debate only intervening on the last position for a while ages ago, because of the LEACE performance weirdness
but I didn't because I am viewing this as a form of deep steering; this conceptualisation of it should be stressed more in the paper, this is an oversight on my part
@pallid current for this post: https://www.lesswrong.com/posts/wqRqb7h6ZC48iDgfK/tentatively-found-600-monosemantic-features-in-a-small-lm
Am I right in saying that dictionary 0 is just another portion of decoder weights but for a way smaller dictionary size?
Sorry just realized you weren't the author of that post
yeah logan wrote it, but yeah for that post dictionary 0 and dictionary 1 are just any two sets of decoder weights trained on activations for the same layer of the same model
often the pattern is that dict 0 is a smaller dict being compared to the next size up but there's nothing fundamental about that, we're just trying to understand whether multiple similar sparse autoencoders are learning the same direction as a proxy for that direction being a good feature
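A minimal sketch of that comparison, assuming the check is a max cosine similarity between each decoder row of one dictionary and all rows of the other (the dictionaries here are random stand-ins, with two directions deliberately shared):

```python
import numpy as np

def max_cosine_similarity(dict_a, dict_b):
    """For each row of dict_a, the best cosine match among rows of dict_b."""
    a = dict_a / np.linalg.norm(dict_a, axis=1, keepdims=True)
    b = dict_b / np.linalg.norm(dict_b, axis=1, keepdims=True)
    return (a @ b.T).max(axis=1)

rng = np.random.default_rng(0)
small_dict = rng.normal(size=(4, 32))
# The larger dictionary shares the first two directions (rescaled) plus new ones.
large_dict = np.concatenate([3.0 * small_dict[:2], rng.normal(size=(6, 32))])

mcs = max_cosine_similarity(small_dict, large_dict)
print(mcs)
```

Features with a score near 1 were found by both dictionaries, which is the proxy for "this direction is a real feature".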
Makes sense, thanks
yes, but as we do interventions on all token positions the classifier isn't strictly linear; the reason we do interventions at all token positions is because of the framing as deep guidance
this feels like more of a dataset/task issue than anything
GOOOD I’m not going crazy
Absolutely
I don’t think I fully grokked this before. I know you said something about fitting on all token positions at one point but I didn’t know that’s how you were doing it for this particular experiment
I don’t think fitting and intervening on all residual stream positions is the way to do deep steering
If anything deep steering should be done in the key value cache
v happy i am not either tbh
ah, sorry for miscommunication
I think this is plausibly doable for the experiment? we have tried some dicts on non-residual-stream-stuff
Yeah I'd support trying this
with the end to end loss
lol
is this confounded by the fact that dictionary features are "smaller" than PCA features bc they form an overcomplete basis?
Yes, but we are able to achieve better performance in terms of amount-of-activation-norm-changed
I'd like to rerun these results with positive-only-pca as well.
Could you explain more of your intuitions about this btw?
So for one thing, changing stuff in the residual stream means you are "directly" changing the token predictions through the identity branch, in addition to changing the transformer layer outputs
which is likely going to cause more changes than you really want
More generally I think we should be trying to figure out ways to only do an erasure or concept edit when you really need to
Perhaps using Mahalanobis distance from the training distribution of the concept eraser/editor
mhmm, this seems reasonable; I guess with sparsely activating features you can kind of target this by only intervening when the activation is nonzero
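A sketch of what "only intervening when the activation is nonzero" could look like: project the feature's decoder direction out of the residual stream only at positions where that feature fires. The function name, shapes, and random data are assumptions for illustration.

```python
import numpy as np

def project_out_where_active(acts, codes, decoder, feature):
    """Remove one decoder direction, but only on rows where the feature is active."""
    unit = decoder[feature] / np.linalg.norm(decoder[feature])
    active = codes[:, feature] > 0
    edited = acts.copy()
    edited[active] -= np.outer(acts[active] @ unit, unit)
    return edited

rng = np.random.default_rng(0)
decoder = rng.normal(size=(6, 8))                    # [n_features, d]
codes = np.maximum(rng.normal(size=(5, 6)), 0.0)     # [n_tokens, n_features]
codes[0, 2] = 0.0                                    # feature 2 inactive on token 0
codes[1, 2] = 1.5                                    # feature 2 active on token 1
acts = rng.normal(size=(5, 8))                       # [n_tokens, d]

edited = project_out_where_active(acts, codes, decoder, feature=2)
print(np.abs(edited - acts).sum(axis=1))
```

Positions where the feature is silent are left exactly as they were, which keeps the total edit small.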
wait wait wait
so when you rank them
are you ranking based on the final token position
yes, this is highly problematic and i kind of hate this experiment etc etc aeeeeeeee
Yes this experiment should be removed ASAP
it's totally misleading tbh
because like
this causes you to basically find the dictionary feature that is closest to the LEACE or diff in means direction on the last token position
which obviously is going to do better
and then you compare against LEACE and diff in means fit on the whole sequence
how would you recommend addressing this
step one is to update the arxiv paper today removing the whole section
oh, ofc
step two: let me think a sec
other than that
I think it's clear that this should happen by now, but I could have really done with having this feedback earlier
yep, I would have given the feedback earlier if I knew the paper was going to be put on arxiv
in any case
we will do better in the future
communication is cursed
So, I think this is going to be a bit tricky in general because LEACE and diff in means are sort of designed for concept erasure out of the box, whereas with dictionary features you need to do extra optimization
for sure
but at minimum, the type of optimization you do for dictionary features
should be like, basically identical to the kind you do for LEACE
it should be the same objective
that is being optimized
either that or like
you need to have some argument
that hey
in the actual world
there's some reason you can't do this optimization problem with LEACE but you can with dictionary features
or smth
I expect that in any "fair fight" between LEACE and reconstruction loss based dictionary features LEACE is just going to win
your hope for beating LEACE is to use end to end loss
in part because dictionary features are restricting themselves to orthogonal projections
it might also help to like
take the dictionary subspace
say, this is the "concept subspace" to neutralize
then LEACE that
because that'll get you better surgicality
I still think it'll do worse than LEACE'ing away gender directly if your dictionary is based on reconstruction loss
but it might help with beating LEACE when you rerun the experiment with end to end loss
on another note, I don't actually understand why we're using an orthogonal projection in the residual stream here, it seems like you could do some galaxy brained thing to make an even smaller edit using the overcomplete basis
this would actually be fascinating
literally just ease and algorithmic complexity; this motivates the IOI feature identification experiment
do we know of a way of doing overcomplete projections
if I sat down and thought about it I could probably derive something
wdym
specifically
I'm not totally sure, but some edit that actually makes use of the overcomplete basis
i dont know what the prior should be on the dict_feature vs LEACE or diff in means but i do think there's a clear potential reason why learned features would win - because they take advantage of pretraining to find semantically meaningful features, so even with a small sample size you might be able to grab exactly the right direction, while with a sample size of 30 in a 512 dimensional space, your diff-in-means or LEACE direction will have a lot of noise
(when moving to a separate test set)
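A quick synthetic check of the noise claim, assuming isotropic unit-variance noise and class means one unit either side of the origin along a single true direction (all of this is a toy setup, not the real activations): with 30 samples in 512 dimensions the diff-in-means estimate is mostly noise.

```python
import numpy as np

def direction_recovery(n_per_class, d=512, seed=0):
    """Cosine between the estimated diff-in-means direction and the true one."""
    rng = np.random.default_rng(seed)
    true_dir = np.zeros(d)
    true_dir[0] = 1.0
    pos = rng.normal(size=(n_per_class, d)) + true_dir
    neg = rng.normal(size=(n_per_class, d)) - true_dir
    est = pos.mean(axis=0) - neg.mean(axis=0)
    return float(est @ true_dir / np.linalg.norm(est))

cos_small = direction_recovery(15)     # 30 samples in total
cos_large = direction_recovery(5000)   # 10,000 samples in total
print(cos_small, cos_large)
```

How bad this is in practice depends on the real signal-to-noise ratio of the concept, which this toy does not model.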
now this is just confusing me, is this another confounder 👀
you can pretrain LEACE too
in a deep way?
yes
how
hold on I have a call rn
ok
Ok so there’s a few things you can do
The most annoying part of fitting a LEACE eraser is estimating the covariance matrix of X because it’s O(d^2) parameters
I'm more concerned about how you label things accurately in the middle of a model
We use this thing https://arxiv.org/abs/1308.2608
In this work we construct an optimal linear shrinkage estimator for the covariance matrix in high dimensions. The recent results from the random matrix theory allow us to find the asymptotic deterministic equivalents of the optimal shrinkage intensities and estimate them consistently. The developed distribution-free estimators obey almost surely...
Estimating a covariance matrix is fine mod weird instabilities and accidental negative eigenvalues, right, or is that the issue?
It’s not really negative eigenvalues, you can ensure it’s always psd
The sample covariance matrix is low rank if the sample size is less than the dimension though
This thing has a hyperparameter in it which is the matrix you shrink toward
Like when you have near zero samples what do you assume the covariance matrix is
The “uninformative” thing to shrink toward, which is what we do, is the identity matrix times the trace of the sample covariance matrix
But you could shrink toward any psd matrix under weak conditions
Including a sample covariance matrix estimated on the Pile
Or whatever
How would you go about accurately labelling activations when run on the Pile though, if you are doing an intervention at e.g. layer 12/24?
The nice thing about this is that you simply don’t need labels
This is totally unsupervised
It’s just for the covariance matrix of X
And then you estimate the cross covariance of X and Z on the labeled data
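A rough numpy sketch of the idea, with heavy assumptions: a fixed shrinkage intensity `alpha` (the linked paper estimates the intensity from data), a target scaled as trace/d times the identity so the trace is preserved, and hypothetical helper names. It is not the library's `shrinkage.py`.

```python
import numpy as np

def shrink_covariance(sample_cov, target, alpha):
    """Convex combination of the sample covariance and a PSD target."""
    return (1.0 - alpha) * sample_cov + alpha * target

def cross_covariance(x, z):
    """Sample cross-covariance of activations x [n, d] and labels z [n, k]."""
    xc = x - x.mean(axis=0)
    zc = z - z.mean(axis=0)
    return xc.T @ zc / (len(x) - 1)

rng = np.random.default_rng(0)
d, n = 32, 10                                   # fewer labelled samples than dimensions
x = rng.normal(size=(n, d))
z = np.eye(2)[rng.integers(0, 2, size=n)]       # one-hot labels, shape [n, 2]

sample_cov = np.cov(x, rowvar=False)            # rank at most n - 1
identity_target = np.eye(d) * np.trace(sample_cov) / d
shrunk = shrink_covariance(sample_cov, identity_target, alpha=0.2)
sigma_xz = cross_covariance(x, z)               # needs labels; the covariance of x does not
print(np.linalg.matrix_rank(sample_cov), np.linalg.matrix_rank(shrunk), sigma_xz.shape)
```

Swapping `identity_target` for a covariance matrix estimated on a large unlabelled corpus would be the "pretrained" variant described above.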
Rn the library doesn’t let you do this but it wouldn’t be hard to add
Would require modifying shrinkage.py slightly, I can do that today
Could also add a routine for LEACing an arbitrary subspace
then I guess the other issue is working out how to resolve this
I want to help you guys show that end to end sparse coding is useful, if it is, which I hope it is
@bronze wraith
this one
I created #sparse-coding if y'all want to use it
also if you hate the name lmk
I wanted it to be more general than just sparse coding
@keen pivot said he wanted threads
@pallid current @keen pivot check #behind-the-scenes
Hey all, I’m Rob 👋. I’ve been doing some independent research on sparse coding on and off for a few months. I just stumbled across this channel a few days ago and would be really excited to get involved.
I also wanted to say, congrats! I saw the paper recently put up by this group on arXiv, hopefully there’s not too much work left. I would be curious to know if this group plans to continue researching in this area after or if there are other groups you would recommend reaching out to?
Hey r0bk! There's definitely plenty of work left to do. What are your specific research interests?
(I have a list of my own future work directions in: https://www.lesswrong.com/posts/CkFBMG6A9ytkiXBDM/sparse-autoencoders-future-work)
@pallid current I keep getting feedback on the paper about “it’s a shame the method doesn’t work out for the last layer”. I’d be up for some human interp benchline. Probably not in time for ICLR, but could at least post it or add it to the paper if reviewers say similar things
Hey Logan! I took a read through your future work post and out of those research areas I’d say I’m most interested in the “Circuits across time”, “ACDC” and “Better Sparse Autoencoders” directions. Most of my focus to date mainly falls into that last bucket, specifically on attempting to identify if dictionary features found are monosemantic through relational metrics and on optimising the sparse autoencoder training for more monosemanticity (both have shown some early potential in toy models). But with that said, the other two areas have been on my mind a lot. Maybe a few quick questions from my side:
- Are there any areas (aforementioned or otherwise) that would be particularly useful for the group to look into at the moment?
- Reading through the chat history in this channel I’ve seen a few codebases posted, what would be the best to play with if I wanted to get aligned with what the group has already done?
(I am a coauthor/contributor to the gh repo you mentioned) awesome! easiest way to run an ensemble of dicts is basic_l1_sweep.py which we recently implemented. you generate activation data & run as follows:
# generate the activation data
> python generate_test_data.py --model="EleutherAI/pythia-70m-deduped" --layers 2 --n_chunks=10
# run a basic sweep (default l1_alpha range is 10^-4 to 10^-2 at 16 log-spaced intervals)
> python basic_l1_sweep.py --dataset_dir="activation_data/layer_2" --output_dir="output_basic_test" --ratio=4
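For a sense of what that default sweep covers, a numpy one-liner that builds a log-spaced grid from 10^-4 to 10^-2. Treating "16 log-spaced intervals" as 16 grid points including both endpoints is an assumption; check `basic_l1_sweep.py` for the exact construction.

```python
import numpy as np

# 16 log-spaced l1_alpha values between 1e-4 and 1e-2 (endpoints included)
l1_alphas = np.logspace(-4, -2, 16)
print(l1_alphas)
```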
if you've got a multi-gpu setup you can do training runs on all gpus simultaneously, check out big_sweep_experiments.py for examples of how to configure
I'd like to hear more about the monosemanticity metrics you are looking at (if I'm not misunderstanding, you are optimising for a monosemanticity metric right? if true that would be slightly crazy); I have a couple ideas but I haven't really found anything where the gradient signal isn't fucked
Specifically this repo is what Aidan’s referring to: https://github.com/HoagyC/sparse_coding
Like Aidan, I’m very interested in your current ideas for relational metrics and how you optimize for monosemanticity.
Could you go into more details?
+1
+1