#Sparse Coding
Where’d you hear this?
ml engineer said it's a common thing, tryna find a paper or something that shows it but no luck
It's because the model has already updated its weights based on this data. So you should expect it to be better.
i mean sure, but the idea that there's carry over from like the first 1B tokens when the model is still only learning absolute basics, and which persists through the next like TB of training tokens is super surprising to me!
couple things,
is Z(i) supposed to be {x \in domain : ReLU(Wx + b) is 0 in the ith component}?
If you start with two features with Z(i)=Z(j) how do you get that the Z(i) are still disjoint from the domain?
re 1: Z(i) is the hyperplane on which the ith relu switches over from activating to not activating, and it's useful to talk about that in general, even if it doesn't intersect the domain of reconstruction
re 2: if Z(i)=Z(j), it may be the case that the Z(i) intersect the domain. however, I conjecture that when this happens, you can "cancel" these features and improve your L1 loss term. As a trivial example: take any reconstruction, then add on two features f_1 and f_2 which are negatives of each other, and which activate equal amounts (ie their rows in the W and b matrices are equal, but their rows in the D matrix are negatives). These have the same Z(i) planes, which can be anywhere, and they completely cancel out in the reconstruction, but they add unnecessarily to the L1 loss on the c term, so they should be trained out
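A minimal sketch of that trivial example, in plain Python. The base dictionary, the extra encoder row, the bias and the test point are all made-up numbers for illustration; the only thing taken from the message above is the construction (equal rows in W and b, negated rows in D).

```python
def relu(v):
    return [max(0.0, t) for t in v]

def encode(W, b, x):
    # c = ReLU(W x + b), one entry per dictionary feature
    return relu([sum(w * t for w, t in zip(row, x)) + b_i
                 for row, b_i in zip(W, b)])

def decode(D, c):
    # x_hat = sum_i c_i * D[i], with one decoder row per feature
    dim = len(D[0])
    return [sum(c_i * row[k] for c_i, row in zip(c, D)) for k in range(dim)]

def l1(c):
    return sum(abs(t) for t in c)

# Base dictionary (hypothetical): the two axis-aligned features.
W_base = [[1.0, 0.0], [0.0, 1.0]]
b_base = [0.0, 0.0]
D_base = [[1.0, 0.0], [0.0, 1.0]]

# Add f_1 and f_2: identical encoder rows and biases, negated decoder rows.
extra_w, extra_b, extra_d = [0.6, 0.8], -0.2, [0.3, -0.5]
W_pad = W_base + [extra_w, extra_w]
b_pad = b_base + [extra_b, extra_b]
D_pad = D_base + [extra_d, [-t for t in extra_d]]

x = [0.7, 0.4]
c_base, c_pad = encode(W_base, b_base, x), encode(W_pad, b_pad, x)
x_hat_base, x_hat_pad = decode(D_base, c_base), decode(D_pad, c_pad)

print("reconstructions:", x_hat_base, x_hat_pad)
print("L1 of codes:    ", l1(c_base), l1(c_pad))
```

Both dictionaries reconstruct the point equally well, but the padded one pays for two extra activations in the L1 term, which is the pressure that should train the pair out.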
cool that clarifies a bunch thanks
Okay, bad news: I found a simple example where you can get a perfect reconstruction with better L1 loss by using non-canonical features than by using canonical features.
The example:
Sample x and y coordinates iid from the uniform distribution [0,1], so your domain is the unit square in R^2 (and thus your canonical features are (0,1) and (1,0)). Rediscovering those features, you get a perfect reconstruction, and the L1 loss at (x,y) is |x|+|y|, which has an EV of 1 over this distribution of points.
Alternatively, learn these weight and bias matrices:
W = 1/sqrt(2) * [[1, 1], [1, -1], [-1, 1]] (a 3x2 matrix, one row per feature)
b=0
D= W^T (the transpose)
Then this also gives a perfect reconstruction, and L1 loss at (x,y) is sqrt(2)*max(x,y), which has an EV of 2sqrt(2)/3≈.943<1 over this distribution of points.
(This is basically Hoagy's example from before. Note that my previous theorem doesn't apply here, because Z(2)=Z(3))
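A quick numerical check of the example above, as a sketch in plain Python. It samples the unit square, encodes with ReLU(Wx) and decodes with D = W^T, then compares the mean L1 cost against the canonical features. The sample count and seed are arbitrary choices.

```python
import math
import random

S = 1.0 / math.sqrt(2.0)
W = [[S, S], [S, -S], [-S, S]]   # encoder rows from the example, b = 0

def encode(x, y):
    return [max(0.0, r[0] * x + r[1] * y) for r in W]

def decode(c):
    # D = W^T, so x_hat[k] = sum_i W[i][k] * c[i]
    return [sum(W[i][k] * c[i] for i in range(3)) for k in range(2)]

rng = random.Random(0)
n = 100_000
max_err = 0.0
l1_noncanonical = 0.0
l1_canonical = 0.0
for _ in range(n):
    x, y = rng.random(), rng.random()
    c = encode(x, y)
    x_hat, y_hat = decode(c)
    max_err = max(max_err, abs(x_hat - x), abs(y_hat - y))
    l1_noncanonical += sum(c) / n      # activations are non-negative after ReLU
    l1_canonical += (x + y) / n        # |x| + |y| on the unit square

print(f"max reconstruction error: {max_err:.2e}")
print(f"mean L1, canonical features:     {l1_canonical:.3f}")
print(f"mean L1, non-canonical features: {l1_noncanonical:.3f}")
```

The reconstruction error should sit at floating-point noise, and the non-canonical mean should land a little below the canonical one, close to 2sqrt(2)/3.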
got autointerp working with the nano model on ICA (which works well) and random directions (which doesnt)
More generally, if you have an invertible matrix M, and W consists of M stacked on top of -M, and D consists of M^-1 stacked on -M^-1, that also gives a perfect reconstruction. This works because you're "cheating" the ReLU via x=ReLU(x)+ -1*ReLU(-x).
However, this can be detected by looking at minimum cosine similarities between the features in the dict. If the minimum cosine similarity is -1, that means this sort of cheating is happening.
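Here is a small sketch of both halves of that: the M / -M construction reconstructing a point with mixed signs, and the minimum pairwise cosine similarity over the dictionary directions flagging it. The 2x2 matrix M and the test point are arbitrary assumptions; the decoder directions are taken to be the columns of M^-1 and their negatives.

```python
import math

M = [[2.0, 1.0], [1.0, 3.0]]          # any invertible matrix (hypothetical choice)
det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
M_inv = [[M[1][1] / det, -M[0][1] / det],
         [-M[1][0] / det, M[0][0] / det]]

def matvec(A, v):
    return [sum(a * t for a, t in zip(row, v)) for row in A]

def reconstruct(x):
    pos = [max(0.0, t) for t in matvec(M, x)]    # ReLU(Mx)
    neg = [max(0.0, -t) for t in matvec(M, x)]   # ReLU(-Mx)
    # M^-1 ReLU(Mx) - M^-1 ReLU(-Mx) = M^-1 (Mx) = x
    a, b = matvec(M_inv, pos), matvec(M_inv, neg)
    return [p - q for p, q in zip(a, b)]

def cosine(u, v):
    dot = sum(p * q for p, q in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

# One decoder direction per feature: columns of M^-1, then their negatives.
cols = [[M_inv[0][j], M_inv[1][j]] for j in range(2)]
directions = cols + [[-t for t in col] for col in cols]

min_cos = min(cosine(directions[i], directions[j])
              for i in range(len(directions))
              for j in range(i + 1, len(directions)))
x = [-1.0, 0.5]
x_hat = reconstruct(x)
print("x_hat:", x_hat, "min cosine similarity:", min_cos)
```

On a real dictionary the same check is just the minimum over the off-diagonal entries of the normalized Gram matrix of the decoder rows.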
Okay, I ran this test on the data I had from Hoagy previously, and these MCS=-1 features aren't showing up. That's one bullet dodged!
You got details? & ICA is the baseline to compare the dictionary with?
no details yet, just running on 10 features, and also i totally bodged the random directions calculation the first time (it was just analysing random noise lol) , it's weirdly good now!
i think over the 10 features, random might actually have scored the best.. 🤔 🤔 small sample size etc but still.. odd. i would expect it to still do quite well at finding a pattern in top activations, but would expect very poor scores in the random part of top-and-random scoring. need to make sure that random part is actually happening
if random isn't low then it puts into question the validity of finding additional interpretable features with sparse coding. might need to move straight to pythia
I think we have a few pictures of it happening around here somewhere
W/ pythia, I could plot the top activating examples for a random direction in neurons to check if it looks interpretable.
@bronze wraith I need your math help. If I want to ablate a feature, it makes sense to ablate the direction specified by the decoder; however, for the encoder, there's a negative bias, ReLU & mostly-positive neurons (because of GeLU in MLP).
This means there isn't a "neuron direction" for the encoder. Ex. If I have a bias of -1, 2 positive neurons, then (2,0), (1, 1), (0,2) all activate it, but this isn't a direction (or vector I can ablate)
There is at least a direction specified by the encoder, right? I think in the example you're giving, the neuron activation is 1*x+1*y-1, so in the encoder you'll see a row that is [1,1]. Is that sufficient for your purposes?
There's a direction specified by the encoder, but I'd like to force the neurons to never enter a possible configuration that causes that feature to activate w/o ruining other things.
In the example, I think it'd be sufficient to project along the line: -x + 1.
What exactly is the goal here? What you describe should work for a single feature, but if your features aren't orthogonal, those kinds of projections may not commute.
I'd like to see the causal role of a feature. One way is to ablate it.
My current method is to subtract (feature vector * feature activation), w/ the vector specified by the decoder.
This may just be the best method, but it does rely on the reconstruction being high-quality, which I haven't checked yet.
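For concreteness, a toy sketch of that subtraction method in plain Python. The two-feature dictionary is hypothetical: feature 0 is the "bias of -1, two positive neurons" case from above (encoder row [1, 1]), and the decoder rows are made up so that the arithmetic is easy to follow.

```python
def relu(v):
    return [max(0.0, t) for t in v]

def encode(W, b, x):
    # feature activations c = ReLU(W x + b)
    return relu([sum(w * t for w, t in zip(row, x)) + b_i
                 for row, b_i in zip(W, b)])

def ablate_feature(x, W, b, D, i):
    """Subtract (activation of feature i) * (decoder row i) from the activation vector x."""
    c_i = encode(W, b, x)[i]
    return [t - c_i * d for t, d in zip(x, D[i])]

# Hypothetical toy dictionary over two neurons.
W = [[1.0, 1.0], [1.0, -1.0]]
b = [-1.0, 0.0]
D = [[0.5, 0.5], [0.5, -0.5]]   # decoder rows, one per feature

x = [2.0, 0.0]
before = encode(W, b, x)
x_ablated = ablate_feature(x, W, b, D, 0)
after = encode(W, b, x_ablated)
print("before:", before, "ablated x:", x_ablated, "after:", after)
```

In this toy, feature 0 stops firing after the ablation because its encoder row has dot product 1 with its decoder row, and feature 1 is untouched because its encoder row is orthogonal to the ablated direction. Neither property is guaranteed for a learned dictionary. Note also that the ablated vector has a negative neuron value, which ties back to the mostly-positive-neurons concern.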
and what are you measuring after removing the feature? the reconstruction quality?
Logit difference in output true tokens
Input: vol. 1 at www.boost.com
feature activates on "www"
Effect on output: Lowers the logit assigned to token "boost"
While I'm doing this ablation stuff, I'm going to investigate just training on lots of data over lots of epochs.
anyone know anything about using pandas with big data? main bottleneck atm is that it's taking outrageously long amount of time to either save df to csv, or to convert df columns which are lists into strings which then save fast
A few years ago I recall having issues with this, I think I ended up just converting to numpy arrays. Here are some GPT4 suggestions
got some results comparing neuron basis and random transform on the tiny model and got this 😮
basically i think this means that there's no use doing work on the nano model
unless i've bodged something, which is obv v possible
replacing with graph with means
these are autointerp scores using gpt4
eek
I'll run the same test on Pythia tomorrow and see what we get, I don't expect the same results
I would definitely look at specific examples that activate high MCS features and those that activate random ones.
At least for Pythia, they look quite monosemantic and I could explain them!
Ok so i ran a test on non-repeated data on the residual stream of pythia and hesitantly I don't see the MCS dropoff you see when training on the post-activation-in-mlp dataset
mini-batch increases as you go rightwards. each mini-run is ~2M datapoints = 2GB data (residual stream width = 512)
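Back-of-envelope check on those sizes, as a sketch. The 2 bytes per value (float16 storage) is an assumption that makes ~2M datapoints at width 512 come out at about 2GB; the 2048 MLP width is likewise assumed for comparison.

```python
def activation_bytes(n_datapoints, width, bytes_per_value=2):
    # bytes_per_value=2 assumes float16 storage (an assumption, not confirmed here)
    return n_datapoints * width * bytes_per_value

residual = activation_bytes(2_000_000, 512)
mlp = activation_bytes(2_000_000, 2048)   # assuming an MLP 4x wider than the residual stream

print(f"residual stream chunk: {residual / 1e9:.2f} GB")
print(f"MLP chunk:             {mlp / 1e9:.2f} GB ({mlp // residual}x larger)")
```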
also running a repeated-data run atm for comparison
I believe I was running at 3e-4
You seem to be getting best results at 1e-3
Is this the default layer 2 for Pythia?
residual stream post-computation at layer 2
so like, data coming out of layer 2/into layer 3
Okay, so this is great. Like if the only reason it didn’t scale is because I picked an awful l1 value, then we’re good!
I'm very uncertain about this, but I don't think we should expect a priori for SAEs to work on MLP-post-activation data, TMOS assumptions might not hold well and SAEs are fairly heavily predicated on those to work
What are the assumptions?
Also, it’s very cheap to just run two experiments on the pre and post
just the whole 'data mostly explainable as a bunch of sparsely-activated features which interfere slightly' thing
It’s in the utils.py file
I can run a few to check if there’s a better l1 than 1e-3 for both pre and post
And which one is better.
Would still be interested in links for SAE assumptions, but again the test is pretty cheap! (a couple hours run)
I meant around MLP layers at all; I don't have a clear, complete model of how MLPs in transformers actually interact with said sparsely-activated features in the residual stream @pallid current wanted to test this
same thing for repeated data; it's pretty much the same afaict
yeah I think pre would work better, but probably still worse than the residual stream by a little bit
in general I think we should be looking at the residual stream
The residual stream would also be cheap, though harder to compare apples to apples with the MLP
but like, I don't see why we should expect SAEs to work at all on the MLP since I don't think the hypothesis 'MLP activations look like a sparsely activated overcomplete feature set' is particularly good or true, nor do I see why we should expect that a priori
like, obviously it has to have some similarities since it is writing to/from a sparse overcomplete feature set (the residual stream) via a linear map but I can't tell how similar it should be
it's also not clear to me why we should expect the MLPs to 'output' significantly more features than the dimensionality of the MLP
Might get to work implementing @pallid current's toy model tomorrow
I'm currently replicating your results & gonna try to find a better l1 value if it exists, & train on that for lots of data.
Additionally, will run one on the residual stream
Also I think this is kind of max-data for pile-10k for residual stream activations (when it's truncated to 256 tokens/line + standard max-sentences or whatever, I don't really know how the dataset is configured)
Like I got 8x2GB chunks out of it max
you could mess around with the truncation or do a different dataset
(since the residual stream dataset should be 1/4 the size (in bytes) of the activation dataset @keen pivot)
To make it work w/ residual stream, you have to change both cfg.mlp_width & the tensor_name for transformer lens. (Both in utils.py under make_tensor_name)
Ah, we're one l1-alpha value off.
I did this
moderately confused
I thought you were doing mlp the whole time. I just mis-read your message initially.
Replicating now: Residual stream:
I do think there may be a slightly better l1 value. The sparsity of 1e-03 is ~0 and 3e-3 is ~20 (so 20 features per datapoint on average, which still seems high)
Looks like 3e-3 is great! I can run a more fine-grained experiment later to check if it should be 2 or 4, but mostly looks good. Running a larger run w/ larger dicts on that l1-value.
@bitter turtle There is a degradation in residual stream (top one is later). This is after 8 epochs of 8 chunks (=16GB) each w/ refreshed data every time
Also, I'd expect it to not really plateau here at 30% of 512 (the residual stream size), because the toy model didn't. But this may again be a data diversity thing, or a "we should train it on multiple epochs" thing, or something else.
I think the data is repeated here. Afaict, you can only get 16GB of 512-dim activations from pile10k with our current data processing setup
Yeah, what did the toy model plateau at?
Also, I used batch-size 1024
I'm doing just pile, which should be default. nope. pile-10k is default. Thanks for mentioning this!
Usually ~1, but sometimes lower for larger sizes (I vaguely remember 80%, but never something like 30% which is what we're getting. Additionally, the toy may just be caused by the way we scale feature frequency which would make additional features even less likely)
Yeah pile is fucking big like the first shard is 500GB of text
I may or may not be downloading it on 4 different rented GPU's to learn dicts for all Pythia layers.
I am currently committing to the MLP post activation for Neel stuff, but next week, I'd like to investigate more into residual stream if you haven't solved it all yourself by then!
I'm also not sure how significant the 2nd SF is.
What's 2nd SF? Also this was pile-10k I think, so it was just the same 16GB
What I meant is I'm not sure how significant a change the 2.95 to 2.91 is
Anyway wrt this I'm going to try and get these results with synthetic data and see what happens with that
Notice something interesting. It seems the best l1 value changes. Here the top one is later & 3e-5 appears to do better than 3e-4 (which starts to degrade/stagnate). This is running on 20GB of pile for Pythia-layer-4, repeated for 5 mini-runs. The above is between mini-run3 & 4 (w/ 0 indexing).
This is important because one way to pick an l1 value is to just run it & see which one does better, but here, you don't see it until 20GB*5 repeats in! This replicated across layers 1 & 5 (I'm still waiting on layer 3, which is just taking 2x as long as the others 🤷) Edit: Layer3 also has the same behavior.
In response, I'm running them all again for 60GB repeated 5 times (3x the GB) to see if optimal L1 shifts any more left. The idea here is to just spend a lot of compute & maybe backtrack how we could determine optimal L1 from other corollaries.
I'm also intending on saving to aws every mini-run, so any of y'all can see the dict. Additionally, we can check MCS between dicts at different L1's to see if there is a better l1 value for every size dict.
Thanks @pallid current for coding up the mini-run code, wandb, & saving to aws. You're amazing!:)
Super weird jump to 70% of 2k features in layer 1. That's the most features I've seen.
Just wait until I wake up & these runs are done. This isn't even my final form.
those look really good! looks worth really cranking up the n_mini_runs, and also the dict size and possibly the l1 if it's not a problem at the highest val you tested
i can try testing those runs on our new stronger cluster soon, though need to write some parallelism code.
@bitter turtle could you add a PR to add an option to use residual stream please?
here's a recent paper which uses sparse probing on the MLP output to find basic features, i think this is the closest thing to motivation for the MLP work that i know of https://arxiv.org/abs/2305.01610
Despite rapid adoption and deployment of large language models (LLMs), the
internal computations of these models remain opaque and poorly understood. In
this work, we seek to understand how high-level human-interpretable features
are represented within the internal neuron activations of LLMs. We train
$k$-sparse linear classifiers (probes) on th...
pre is just a rotation of a subspace of the residual stream so i agree residual stream > pre_non_lin but i still think there's good reason this might work on postnonlin. those recent results make it look like we should focus on residual stream
if we're doing that we should also really think about how what we're doing builds on, rather than just replicates, the transformer factors paper that i'm always posting https://arxiv.org/abs/2103.15949
Transformer networks have revolutionized NLP representation learning since
they were introduced. Though a great effort has been made to explain the
representation in transformers, it is widely recognized that our understanding
is not sufficient. One important reason is that there lack enough visualization
tools for detailed analysis. In this pap...
it's big but not that big, the first shard is more like 30GB
wtf are these results?? 70% in 2k but 0 in 4k?
RIGHT!???
Lololololol
This is layer 1 though. I've normally been doing layer 2, so I don't know the effect there.
Layer 4
Layer 5
waaaaait, is the MMCS where you remove the high MCS feats screwing something
like you've removed all the feats in the 4K dict?
This is just our normal code
ok i see you're not like removing the feats from the dict in the MMCS code (i dont understand the hungarian thing)
but like......... wtf, can we see the histograms?
hungarian thing just does the best matching thing given all vectors Cosine_sim
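To make the difference between plain MCS and the matched version concrete, here is a sketch in plain Python. It is not the repo's MMCS code: the one-to-one matching is done by brute force over permutations, which only works for tiny dictionaries (in practice something like `scipy.optimize.linear_sum_assignment` solves the same assignment problem). The two toy dictionaries are made up.

```python
import itertools
import math

def cosine(u, v):
    dot = sum(p * q for p, q in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def max_cosine_sims(dict_a, dict_b):
    """For each feature in dict_a, its best cosine similarity with any feature in dict_b."""
    return [max(cosine(u, v) for v in dict_b) for u in dict_a]

def best_matching(dict_a, dict_b):
    """One-to-one matching that maximises total cosine similarity (brute force)."""
    sims = [[cosine(u, v) for v in dict_b] for u in dict_a]
    n = len(dict_a)
    best = max(itertools.permutations(range(len(dict_b)), n),
               key=lambda perm: sum(sims[i][perm[i]] for i in range(n)))
    return list(best), [sims[i][best[i]] for i in range(n)]

def fraction_above(values, threshold=0.9):
    return sum(1 for v in values if v > threshold) / len(values)

# Toy dictionaries: dict_a has two near-duplicate features.
dict_a = [[1.0, 0.0], [0.9, 0.1]]
dict_b = [[1.0, 0.0], [0.0, 1.0]]

mcs = max_cosine_sims(dict_a, dict_b)
perm, matched = best_matching(dict_a, dict_b)
print("MCS per feature:    ", [round(v, 3) for v in mcs], fraction_above(mcs))
print("matched similarities:", [round(v, 3) for v in matched], fraction_above(matched))
```

With plain MCS both near-duplicates count as high-similarity features, while the one-to-one matching only lets one of them claim the shared partner.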
Layer 4:
Layer 1:
Like, I think this works then. Of course I need to do more checks on the actual features, and we'd need to figure out how to get consistent results, but like: just use a lot of compute to overcome our ignorance for proof-of-concept
Layer 3 is still like veeeery slow. I'll let it finish out, but man, it's only done 2/5 mini-runs & everything else is done:
lol i dont know how you can look at that and think it works, to me it screams that something is broken
what do the other metrics look like for the 8k and 16k dicts?
Just unstable
so that's 2000 high MCS feats for a 2k dimensional MLP?
which ones? I also think I overwrite our normal variables using wandb 😢
Like every mini-run
Yep! & The 4k might have more! I'm saying it might be unstable because the 4k relies on the 8k to be good, and the larger dict just dies? Ya still so weird
yeah agreed
Given how noisy this is, I don't want to update too heavily on this. We should probably do a bunch of runs and average them. Also, yeah, didn't Dan say something about annealing L1 over time?
Which one is which?
Real
This one is the first two mini_runs for layer 1
Left is run 1, right is run 2?
Sorry. Left is later (run 2, though technically 1 w/ 0-indexing)
Oh ya, it's unclear which l1 value is ultimately best & how to determine that w/o running all the experiments.
One thing to note w/ residual & MLP is the sparsity of the MLP w/ tiny l1s (where layer 1 did best) is ~800 or ~400 for 2k & 4k respectively. Sparsity is calculated as average features per token. That's just crazy. Not real.
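For reference, the sparsity number as described (average features per token) can be sketched like this. The activation rows are made up, and the `eps` threshold is an assumption; the actual code may simply count strictly positive activations.

```python
def mean_active_features(codes, eps=0.0):
    """Average number of dictionary features with activation > eps per token."""
    return sum(sum(1 for a in c if a > eps) for c in codes) / len(codes)

# Hypothetical post-ReLU codes for three tokens over a 4-feature dictionary.
codes = [
    [0.0, 1.2, 0.0, 0.3],
    [0.0, 0.0, 0.0, 0.9],
    [0.5, 0.0, 0.0, 0.0],
]
sparsity = mean_active_features(codes)
print("features per token:", sparsity)
```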
Residual stream has like 2-3, lol
high features per token is (unsurprisingly) connected to the low l1 val, it goes way lower for the higher l1 vals
Ya, but like, high MCS? So there's at least a converged way to learn to "sparsely" reconstruct between the 2k & 4k dictionaries.
Edit: They could both learn the identity. Especially w/ such a low l1 value. I can check this!
Also, hoagy, you know how to download aws bucket stuff from command line? I tried
aws s3api get-object --bucket DOC-EXAMPLE-BUCKET1 --key dir/my_images.tar.bz2 my_downloaded_image.tar.bz2, equivalent for mine, but it gave me a Syntax error in their code 😢 (theirs uses python2).
I can look it up more later (atm just manually downloading locally, then uploading); not a bottleneck!
no, whenever i've done it i do it via my python utils (can just spin up a python instance)
i also ran 50 auto interps random vs neuron_basis on pythia70M, still no distinction 🤔
now very confused, might be a layer thing, might be a bug i dunno
anyone got a good recent pythia sparse coding run to do a comparison to?
honestly im so confused by the recent set of results, i'd really like to talk it through with someone soon
i've a couple for the residual stream on aws
@pallid current I've submitted a PR for vectorised ensemble training for SAEs, should be more performant and gpu-utilising, the merge looks a bit horrific, I'll let you deal with that 😬
I'm also going to do the transformers toy model in a different repo, i cba to deal with the merge conflicts atm
@keen pivot if you have the energy to change the training loop code (I don't unfortunately), the vectorised code should be a lot better for hyperparameter tuning
I definitely can't this week, but it looks good!
Hi @keen pivot @bronze wraith @pallid current, I think we should have a chat about standardising some metrics at some point in future (I want to establish some sort of principled way to reason about goodness-of-extraction-procedure, and maybe implement a testbed for comparing extraction procedures robustly); also @pallid current said something about standardising PRs, tests, etc
Found a good example to show the "features get lost over time" thing
What's this on?
Layer 1 Pythia. 20GB*20 times (fresh data though)
MLP?
@keen pivot what's your situation with neel and final projects?
i think we should do a meeting on monday to plan what we need to do next because it's starting to feel scattered again
my tentative plan is that we should identify all of the variables of interest, do a bit of work to make the code more parallelizable, and then use the eleutherAI cluster to do a much broader sweep than we've done so far, so that we can have a single pile of data to look at and draw conclusions from (and then presumably branch off a bit)
I’ve got a few awesome features to show and a few failures (like logit lens on feature direction)
Will be done today (interview is tomorrow, hear back on acceptance Monday)
good luck!
I might be a bit pressed for time on Monday (moving house) might be available in the evening tho
Oh, what happened with logit lens on feature direction?
It usually just shows nonsense tokens. Tuned lens might help?
Ablating that feature direction does tend to have a meaningful effect (e.g. ablating the "www." feature affects urls)
On gpt-2 or pythia?
pythia
I think I'm done. I can't make changes past 9pm my time, but feel free to look at it!
https://docs.google.com/document/d/1KqHe9NL9NuJ_yaKJc__eX6kjBtRcROePoLApjOhGBUU/edit?usp=sharing
Interpreting Sparse Dictionaries Sparse dictionaries can learn meaningful features, w/ max cosine similarity (MCS) correlating w/ meaning & monosemanticity. Note: MCS is between two different dictionaries (ie rows in the decoder’s linear layer) w/ the intuition being if two dicts learn the same ...
Okay, I've got a few sources of evidence that the Pythia dictionary that learned ~2k features actually did converge on identity:
- Several features do look polysemantic like normal neuron interp
- Looking at one feature, it only has 1 neuron that activates >0.3 for the same max-activating datapoints & 1 outlier positive weight in the decoder (ie activating it only reconstructs one neuron)
- The L1 was way lower than expected
- Sparsity was crazy high (400 or 800 features/datapoint)
@pallid current @bronze wraith @bitter turtle I'm getting meaningful, monosemantic features for even low-MCS features in Pythia-mlp-layer-1. This was trained on 60GB*5 (repeating data). For MCS-above 0.9, it went from 55% to 45% during training, but looking at several features, they all seem meaningful.
Oh sweet, if you send me the dict I'll run a comparison on autointerp with random and neuron basis
How many GB is pile-10k?
In activations?
I think. I want to compare when I do n-chunks=30 for Pile, I think that's 60GB.
@bitter turtle @bronze wraith , We could have a voice call again tomorrow, same time as last (GMT-16:00, 12pm eastern, 5pm UK)
I can write a few topics-to-discuss beforehand, so the call is short & useful. Please include what you'd like to talk about!
From Hoagy earlier:
my tentative plan is that we should identify all of the variables of interest, do a bit of work to make the code more parallelizable, and then use the eleutherAI cluster to do a much broader sweep than we've done so far, so that we can have a single pile of data to look at and draw conclusions from (and then presumably branch off a bit)
Notably, we do already have a few dictionaries on the bucket to look at now, which may inform what type of information we're missing (to code and get for the next iteration of dicts)
5pm is maybe a little early for me unfortunately, how is GMT-17:00?
For me, top of my head:
- Have evidence that the features we learned are indeed meaningful (intend to write an LW post)
- Evidence that dicts on pythia-layer-1 learns meaningful features for even low-MCS
Proposal: Can look at dicts that go low-to-high-to-low MCS, see if it ends w/ meaningful features & what happens
General Proposal: Look at "features across checkpoints" pythia style, w/ more frequent checkpoints earlier. I'm suggesting spending ~20 minutes/feature (after time spent writing functions) .
How are you currently evaluating the meaningfulness of features (in the case without autointerp)?
Like that doc I sent earlier, let me re-link: https://docs.google.com/document/d/1KqHe9NL9NuJ_yaKJc__eX6kjBtRcROePoLApjOhGBUU/edit?usp=sharing
I mainly focused on gathering a lot of information that helps narrow down the hypothesis of what property of inputs that feature is activating on.
This can be repeated for the output (as in, we think the model has developed a discriminator for property X, which was useful for predicting Y, so gain info & test for that), as well as intermediate layer's features.
Btw. the doc is like 15 pages, mostly images, only 1k words. I can answer any questions about it!
Hi @keen pivot , read the doc, very cool! Some thoughts:
- I had a bit of trouble following what was being shown in each image. Captions might help?
- One way we could test robustness of our understandings of the features: after getting the text description (e.g. "period after www"), we/GPT write new text we think will activate the neuron, then run it through the LM again and test if we can activate the neuron on demand.
- I'm worried by the fact that our features are ~half punctuation/math/urls, it makes me think we're picking up just a certain kind of neuron, but "useful" semantic meaning is either not in this part of the LM or is less able to be found by our method.
Thanks Robert!
- Agreed. Writing a post and clarifying pics better currently.
- This is covered in the “created examples” and the token search section
- This is Pythia-70M in an earlier layer, so maybe it doesn’t have such high level features. I could train one on mid layer 13 Billion though
from looking at the neuron-in-haystack paper (https://arxiv.org/pdf/2305.01610.pdf) they also used pythia70M and found a lot of sparse, meaning-related action in the early layers, exactly the kind of thing that we'd want to be able to find
so i dont think it's just a scale issue
maybe a skill issue instead?
Can you give an example?
If it's:
but neuron 1B.L6.N3108
activates on return if and only if it is in the context of Go code
I found two like this, w/ one being the opening $ in latex
looking at section 5.1
Like general french neuron or Go (programming language) neuron?
5.1 is compound words, which were also found (but not necessarily the same ones)
which ones are you thinking of? my general feeling looking at that is that they're showing superposition at a much much more granular level, whereas the things we're finding, or at least managing to interpret are much broader
i'm wondering whether we should use the neuron in a haystack approach to find some directions which we reckon are exactly the kind of thing we would hope to find, and try to check that sparse coding is actually something that's capable of finding it
https://docs.google.com/spreadsheets/d/1p1Wu4vJ1fKYsMtjrXFboQpIOl_sKue-dgabFt2EYw0s/edit?usp=sharing
Top MCS
neuron_id,MCS,Features,Monosematic like:,Repeats
0,0.9983,Newlines, periods, unclear,0
1,0.9895,Unclear,0
2,0.9873,www, decreased by http://,1
3,0.9859,period after www,1
4,0.9852,quotation mark after another quotation mark, but maybe a period before?,1
5,0.9802,the word type after a hyp...
I think some of these are pretty specific? (edit: e.g. just "x", & I know there are other char-level features like "w", & ", and") I also expect other re-tokenization words (like Harvard) to be here in layer 2 & also layer 1.
Though I'm all for this! Wes sent me the link to the datasets used (https://www.dropbox.com/scl/fo/sb5jwfki7t4kvk2rr38t0/h?dl=0&rlkey=gor3lctozovy8417p8k27zbun).
Which should be easy to check once you've trained a k-sparse probe.
Just to confirm, we've moved the meeting to this time slot?
reading this paper now, are we doing anything about this?
This example also underscores the dangers of “interpretability illusions” caused by interpreting neurons using just the maximum activating dataset examples
not really, the autointerp stuff does top-and-random scoring which tests ability to work on random as well as top
but it doesn't do anything to rule out the possibility of it also responding to other things
if autointerp does both top-and-random then that should give us some insurance against that "interpretability illusion"
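A rough sketch of why scoring on top-and-random fragments gives some of that insurance. This is not the autointerp code: the scoring here is a plain Pearson correlation between true and predicted activations, and the activations, the "explanation" predictions, and the fragment counts are all invented for illustration.

```python
import math
import random

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    if vx == 0 or vy == 0:
        return 0.0   # correlation undefined for a constant series; treat as no signal
    return cov / math.sqrt(vx * vy)

def top_and_random_indices(activations, n_top, n_random, rng):
    order = sorted(range(len(activations)), key=lambda i: activations[i], reverse=True)
    top, rest = order[:n_top], order[n_top:]
    return top, rng.sample(rest, n_random)

def score(true_acts, predicted_acts, idx):
    return pearson([true_acts[i] for i in idx], [predicted_acts[i] for i in idx])

# Hypothetical feature: fires hard on two fragments, but also weakly on many others.
true_acts = [0.3, 5.0, 0.1, 4.0, 0.6, 2.0, 0.2, 0.4, 1.5, 0.5]
# An explanation that only accounts for the top-activating fragments.
partial_pred = [0.0, 5.0, 0.0, 4.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

rng = random.Random(0)
top, rand = top_and_random_indices(true_acts, n_top=2, n_random=4, rng=rng)
idx = top + rand
score_perfect = score(true_acts, true_acts, idx)
score_partial = score(true_acts, partial_pred, idx)
print("top:", top, "random:", rand)
print("perfect explanation:", round(score_perfect, 3), "top-only explanation:", round(score_partial, 3))
```

An explanation that only fits the top examples loses score on the random fragments, though as noted above this still only catches what the random sample happens to cover.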
I think so yeah. @keen pivot @pallid current ?
This is my primary concern with current things
I'd recommend reading my doc. I know the images aren't good & I'm making a better one not optimized for Neel. I do uniform examples
Haven't really been able to read through your doc unfortunately logan (just got back from a school meetup thing actually)
Hoagy also is going. I chatted w/ him earlier today
Now is GMT-17:00 right?
An approach I saw for dictionary learning: https://en.wikipedia.org/wiki/K-SVD
In applied mathematics, k-SVD is a dictionary learning algorithm for creating a dictionary for sparse representations, via a singular value decomposition approach. k-SVD is a generalization of the k-means clustering method, and it works by iteratively alternating between sparse coding the input data based on the current dictionary, and updating ...
reposting the meeting notes doc here https://docs.google.com/document/d/1J7tGoAhlTqrFfqHgAjehe3HANX2wmb7RlxeVwiEHslM/edit?usp=sharing
Results: Concerns: Dictionary has to reconstruct negative activations & noise? Learned features that are low-MCS Experiments: (logan) - Provide rigorous explanation of low-MCS meaningful features Provide random directions to compare (logan ask Robert) Example of a feature that could be spread ...
also i wanted to say before logan left, lee's pushing to get a meeting with olah and anthropic types over the next week or two
i'll try to get you guys included which should be possible, if not then i can grab questions from you here
For targeting specific features we could literally just do feature learning on e.g. truthful QA activations
Nora convinced me that one metric to use is model editing metrics. Specifying which model editing metrics would be clarifying on its own, and also help compare different ways of training.
There are a few different baselines to compare against here which would be nice to contrast!
Though I will say some TruthfulQA answers aren’t the more truthful, but the more weird answer. I’ll need to give more details later.
Wdym
Any sort of model steering (like make the model more honest or only perform circuit X) can be compared with previous work on model editing.
Oh right sure
@keen pivot would definitely like to do that with this
So like train on activations generated on e.g. truthful QA then select+edit features with the most variance between+consistency within true/false classes (basically VINC but manually on directions restricted to those found by sparse dicts)
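One way the "variance between + consistency within" selection could be sketched is a Fisher-style ratio per dictionary feature. To be clear, this is a stand-in and not VINC itself; the feature names and activation values are invented.

```python
from statistics import mean, pvariance

def separation_score(acts_true, acts_false):
    """Squared gap between class means over summed within-class variance."""
    gap = mean(acts_true) - mean(acts_false)
    within = pvariance(acts_true) + pvariance(acts_false)
    return gap * gap / (within + 1e-12)   # small constant avoids division by zero

# Hypothetical feature activations on true vs false statements.
features = {
    "f0": ([0.9, 1.1, 1.0], [0.1, 0.0, 0.2]),   # separates the classes cleanly
    "f1": ([0.5, 1.5, 0.1], [0.4, 1.2, 0.3]),   # noisy, little class signal
}
ranked = sorted(features, key=lambda k: separation_score(*features[k]), reverse=True)
print(ranked)
```

The top-ranked features would then be the candidates to edit and compare against model-editing baselines.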
this seems functionally equivalent to sparse autoencoders
agreed in spirit it's similar, though the approach i think is quite different in the fact that it can learn the dictionary activations however it likes, rather than being Relu(Linear(Y)), and the practice of optimizing dict elements 1 by 1 i imagine would lead to very different results
+1 to hoagy's reply: it has the same inputs and outputs (activations and a dictionary of features, respectively), but the internal process is different. If we're lucky, k-SVD would be some combination of faster/make more accurate reconstructions/find "more meaningful" features compared to the autoencoder.
still not sure why me and logan are getting different interp results but i am finding a difference between neuron_basis and sparse coding on pythia layer 1
dotted lines are 2SD from mean
will increase sample size later, its busy trying to perform ICA atm
on the other hand, not much evidence of relationship between MCS and autointerp score
relationship might eventually prove 'significant' but high correlation looks unlikely
For this layer 1, ya I think lower MCS was still meaningful to me.
I can show that hopefully today or tomorrow!
I mean, since it has the same objective, I guess I'd be surprised if it didn't converge on the same dist of dictionaries. The activation thing hoagy pointed out is a very good point tho, I guess maybe we should look at testing more powerful encoders/making the decoder an affine transformation of the sparse dictionary for the big sweep.
making the decoder an affine transformation
what dya mean?
do you still think this is a bug on your end or a more fundamental thing
currently, our setup looks like
dict = ReLU(Affine(x))
output = Linear(dict)
maybe we should make the output = Affine(dict)
my intuition is that that should 'strengthen' the abilities of the encoder slightly
also good point that it might converge faster
One possible reason they'd come to different results is the L1 penalty, which is in the SA but not in k-SVD. The SA is willing to accept a poor reconstruction (even sending every vector to 0) if it minimizes activations enough. In contrast k-SVD doesn't have an L1 penalty, so its reconstructions dont skew towards 0.
oh shit yeah my fault i am blind
@bronze wraith where was the implementation you mentioned for K-SVD?
It’s a python library, ksvd! https://github.com/nel215/ksvd
Right now I’m playing with it and I’m finding it a little fiddly (it throws errors if your dimensionality is off)
right, we might need to come up with a batched, streaming implementation of that if we want to scale it @bronze wraith
It's unclear to me if adding a bias to the decoder would be good or bad (I really don't know!)
same! one way it might be better is that it might allow the encoder more flexibility in denoising the input signal, but ofc (for MLP activations at least) we already see large biases, and it might just exacerbate those.
I think adding a decoder bias is approximately equivalent to centering the original dataset at (0,0), so that might help you build intuition.
For auto-interp, atm it's useful for detecting monosemanticity in the input distribution (ie we assume GPT-4 coming up w/ hypotheses & GPT-3.5 creating accurate predictions in held-out text correlates w/ the underlying feature having a simple description across the entire feature activation range). This can be repeated w/ the features effect on the output & other layer's features.
But, this still leaves open the "interestingness" of features. Two desirable properties are:
- The features explain all behavior on the data distribution we care about (ie low reconstruction loss as discussed yesterday)
- We can simply express any feature we actually care about (e.g. deception, honesty, australian accent, etc) using these features (w/ "simply" maybe just meaning sparse)
When I look at decoder weights, there's a lot used for reconstructing the negative neuron activations from the GeLU. This seems like it'd be generally true for all features learned, but would cause a problem if multiple features activate at the same time because they'd try to reconstruct the original distribution, but would overlap w/ each other.
The learning process would probably learn correlations for features so each feature only handles 1/N% of the job for reconstructing normal neurons, but there will be noise. I was thinking the bias might help here, but I don't think so.
I've found a paper discussing a GPU implementation of matching pursuit (~the sparse approximation algo that the original K-SVD algo used (I think it actually used OMP, can't remember)): https://arxiv.org/pdf/0809.1833.pdf, and the other part of K-SVD seems to be just SVD so I think we're good on the parallelisation front
for residual stream data yeah, different for MLP thingies maybe?
what do you mean 'negative neuron activations'?
oh, you mean the model uses GELU instead of ReLU and the decoder devotes features to those small negative parts?
Let me clarify why i said that: fix some bias v in the decoder. Let's say we: 1. shift all inputs back by v (x -> x-v), 2. update the encoder biases to undo this, and 3. remove the bias vector from the decoder. These transformations together should not change the l2 loss term (both your inputs and outputs are shifted back by v, which cancels out), nor the l1 loss term (the encoder activations are exactly the same). So for any decoder-with-a-bias, you can make an exactly equivalent sparse autoencoder without a bias, on a shifted input set, which has the exact same loss. (Epistemic confidence: high)
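(A minimal numpy sketch of that equivalence, in case it helps check the argument. The sizes, the random weights and the names E, b, D, v are all made up for illustration; this is not our training code.)

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, batch = 4, 8, 16           # hypothetical activation size, dict size, batch size
E = rng.normal(size=(n, d))      # encoder weights
b = rng.normal(size=n)           # encoder bias
D = rng.normal(size=(n, d))      # dictionary (decoder weights)
v = rng.normal(size=d)           # decoder bias
x = rng.normal(size=(batch, d))  # inputs

def relu(z):
    return np.maximum(z, 0.0)

# Autoencoder WITH a decoder bias v, on the original inputs.
c_with = relu(x @ E.T + b)
err_with = (c_with @ D + v) - x

# Autoencoder WITHOUT a decoder bias, on inputs shifted back by v.
# The encoder bias absorbs the shift so the pre-activations are unchanged.
x_shift = x - v
b_shift = b + E @ v
c_without = relu(x_shift @ E.T + b_shift)
err_without = (c_without @ D) - x_shift

same_codes = np.allclose(c_with, c_without)      # identical codes -> identical l1 term
same_errors = np.allclose(err_with, err_without) # identical residuals -> identical l2 term
```

Both flags should come out True, since the shift cancels in the encoder pre-activations and in the reconstruction residual.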
(Epistemic confidence: medium-low): I assumed that centering the dataset at (0,0) minimizes the overall amount of l1 activations needed to make a reconstruction, and therefore this centering would be optimal. But I might be wrong about this
I'm not sure this is true for the MLP activation case, where the activation mean is far from the origin. If the toy model assumptions are true, you have a solution with ~zero (ignoring noise) l2 reconstruction loss and 1/k l1 loss for k-sparse activations when the data is not centered.
agree with this tho
Almost! The decoder doesn't devote features, but for a given feature, some of the weights are negative to account for this. Maybe that's ideal. idk
Here's a hist of weight*max-activation for that feature. The minimal negative value in-distribution is -0.17 for the neuron basis because of the GeLU
Oh, sorry one more thing about having a decoder with a bias: you can implement them in autoencoders w/o bias, though there is an L1 penalty. In particular, since the encoder has a bias, you have your encoder learn a feature that always activates exactly 1, and the dictionary element corresponding to that will be your bias. There is an l1 cost since you're always activating that internal feature, but if we specified some features as not having an L1 penalty, this would bypass that (i.e. instead of our L1 penalty being ||y||_1, we make it ||Py||_1, where P is a diagonal matrix of 0s and 1s).
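(Rough sketch of the masked penalty, assuming the always-on "bias" feature sits at index 0. The codes and the mask here are invented toy values, and `masked_l1` is just a hypothetical helper name.)

```python
import numpy as np

def masked_l1(codes, penalised):
    """Batch mean of ||P y||_1, where P = diag(penalised) has 0/1 entries."""
    return np.abs(codes * penalised).sum(axis=1).mean()

# Toy codes for 2 datapoints and 4 features; feature 0 always activates at exactly 1.
codes = np.array([[1.0, 0.0, 2.0, 0.0],
                  [1.0, 0.5, 0.0, 0.0]])
penalised = np.array([0.0, 1.0, 1.0, 1.0])  # diagonal of P: feature 0 is exempt

plain = np.abs(codes).sum(axis=1).mean()    # ordinary ||y||_1, averaged over the batch
masked = masked_l1(codes, penalised)        # ||P y||_1, averaged over the batch
```

The difference between the two penalties is exactly the constant cost of the always-on feature, so exempting it removes the pressure to shrink the learned bias.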
it does have a minimum required sparsity level though afaict, which might turn out to be equivalent? who knows. will be fun to implement.
Yeah, right now I'm running tests to compare speed/reconstruction accuracy/mmcs between SA and kSVD on the toy data. I think I'll have results today!
sick
@bronze wraith , could you give a concrete example of a feature that may be spread across two MLP layers in a Transformer? This is based off the "concatenate multiple layer's activations as input" which both you (& Neel) brought up.
How about "BERT Rediscovers the Classical NLP Pipeline" (https://arxiv.org/pdf/1905.05950.pdf )? They used linear probes on the internal activations of BERT to try to extract stuff like "part of speech", and find that this info is spread across multiple layers. E.g. this part of section 3.2:
We would like to estimate at which layer in the encoder a target (s1, s2, label) can be correctly predicted... A naive classifier at a single layer cannot either, because information about a particular span may be spread out across several layers, and as observed in Peters et al. (2018b) the encoder may choose to discard information at higher layers.
(emphasis added; link to Peters et al, which I haven't read: https://aclanthology.org/N18-1202.pdf). In the attached image (part of Figure 2), the blue bars are showing which layers are important for correctly probing part of speech, and you can see that info is spread across several layers
Made my much better post! https://www.lesswrong.com/posts/wqRqb7h6ZC48iDgfK/tentatively-found-600-monosemantic-features-in-a-small-lm
Update from last: I coded the quantile thing wrong, so now it does look like ~ 60 neurons (still maybe)
Also: Compared w/ the identity dictionary with specific feature seeming to represent only 1 neuron
Looks like pretty strong evidence there, especially the K_delta, thanks!
I do think then that there will be features more simply represented across multiple layers & doing it only by each layer will have the same representational capacity, but require multiple individual features to do so.
One problem noticed when training the dictionaries at different layers is that later layers learned way less features & had more dead features (e.g. even 50% of 2k). Data:
Dead neurons (layer 1,3,4,5) (skipped 2 because I already had a dict for it when I trained, but trained on different data, so not including it):
0, 1.7k, 1.3k 1.2k (out of 2k)
High-MCS features:
46%, 4%, 14%, 13% (out of 2k)
Possible explanations:
- Other dictionaries need a different l1 value (not likely imo. I swept l1 values from 1e-5 to 1e-3 & MCS > 0.9 features plummeted for 1e-3)
Note: it is max for 3e-05, but I expect that's because of learning identity because features/token = 700.
- There are many more features in early layers (focused on grouping tokens & re-tokenizing certain pairs) than middle layers (higher level features?) and later layers (re-tokenizing features). The dead neurons are caused by iterating over the same dataset. Evidence for this here: https://wandb.ai//sparse_coding/sparse coding/reports/Layer-3--Vmlldzo0ODA5ODA1?accessToken=w6c775vurw0wtu80n3617dl41ts9wvqjfnm41uc4xj1pi0yc7vd009lu4ntu8zvr
If (2), then just running mini-runs w/ fresh data should show increasing number of features in layer 3 at that l1 value. I think I do a fresh-data comparison for layer-1, so I can check that now at least.
Update: can't tell. layer-1 has 0 dead neurons even when you repeat data, so can't tell generalization behavior to layer-3, which is the one I linked. Additionally, this is the layer that when trained on new data goes up to 40% (MCS above 0.9), then down to 3%. This also needs to be explained.
Possible explanation: The smaller & larger model are simply learning different features because there are so many, especially at such an early layer & w/ so much data.
Additionally may be a thing where dicts of different sizes are biased to learn different features (which was brought up before here), so a previously proposed test was to learn two dicts of same size w/ different initializations.
Additionally, I could look at the features learned by the dictionary. If low-MCS features are meaningful, then seems true.
Update on ksvd: when I compared it head-to-head with sparse autoencoders on the toy data, it failed to reconstruct the original features. The MMCS was something like .266 instead of .999 for the sparse autoencoders. Tomorrow I'll try tinkering with it to see if there's a fix!
@bitter turtle , I'm going to get to it soon (like tomorrow or the next?), but if you'd like to see if the currently learned features across layers have meaningful connections/circuits, it'd be great to have another set of eyes here.
Glad it's working!:)
I'm curious how well you or I'd do on the task we're giving GPT-4/3.5. Like number normalized activations given a hypothesis (or form the hypothesis). How hard would it be to set that up?
Sure can do
Just send over the dicts you're looking at ig
Have you checked the dictionaries K-SVD is learning? OMP learns positive and negative entries for the sparse dictionaries; it might be something to do with this maybe.
I think you're right its using sparse (positive and negative) combinations, instead of sparse positive combinations, and the issue is probably there. I'll try looking into positive versions of ksvd or see if there's a workaround. Oh, and one thing I should do is look at absolute value of cosine similarity, because it might be learning -1*feature (which would be great but would be ignored by mmcs)
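(Sketch of the absolute-value variant, assuming both dictionaries are stored with one feature per row and that the max is taken over learned features for each ground-truth feature. `mean_max_cos` is a hypothetical helper, not the function in our repo.)

```python
import numpy as np

def mean_max_cos(learned, truth, absolute=False):
    """For each ground-truth feature, take its best cosine sim with any learned feature, then average."""
    L = learned / np.linalg.norm(learned, axis=1, keepdims=True)
    T = truth / np.linalg.norm(truth, axis=1, keepdims=True)
    sims = T @ L.T                  # (n_truth, n_learned) cosine similarities
    if absolute:
        sims = np.abs(sims)         # treat -1 * feature as a match
    return sims.max(axis=1).mean()

truth = np.eye(3)
learned = -np.eye(3)                # every feature recovered, but with flipped sign

mmcs = mean_max_cos(learned, truth)                  # signed version misses the flipped features
mmacs = mean_max_cos(learned, truth, absolute=True)  # absolute version counts them
```

On this toy case the signed score is 0 and the absolute score is 1, which is the failure mode described above.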
yeah there's a thing called 'nonnegative OMP', i think scikit-learn probably has an implementation
ok @bronze wraith cancel that it doesn't, here's an implementation: https://github.com/davebiagioni/pyomp
just replace the calls to orthogonal_mp_gram with calls to this in the ksvd thingy
more results from autointerp, this time on gpt2 small, still the neuron basis totally failing to beat the random baseline but much higher scores overall. i only recently noticed they use layer 10 of gpt-2 small for their autointerp comparison so i'll run that next
it's not clear why they used layer 10 of small when the rest uses XL, might well be cherry picked
Gah, I've spent the entire day trying to implement a nonnegative version of K-SVD for the GPU/PyTorch, only to learn that most non-negative least squares algorithms are like deeply not designed for GPUs. Some people have written CUDA kernels for them, and I could do it with GD, but I'm probably going to throw in the towel for the moment and try and do more useful things tomorrow.
If @pallid current could set me up with the 8xA40 rig maybe I could get to setting up some generic parallelised code for the big sweep?
Checking the claim "Low-MCS features are meaningful":
Layer 1: maybe slight correlation w/ MCS & meaningfulness by Logan's standard. Important difference here is layer 1 has ~50% features learned. Also, the low-MCS features felt lamer. https://docs.google.com/spreadsheets/d/1BnFaqn8W9aM1rlosiYFG64VVeeYQNJWFYM-Qt4qj-0o/edit?usp=sharing
Layer 2: (in the LW post, clear correlation, but dominated by dead features)
Layer 4: Also dominated by dead features, but clearer correlation w/ MCS. For the top-MCS features, 75% monosemanticity, whereas low-MCS features are 25%.
https://docs.google.com/spreadsheets/d/1DaPl4sm7KvKr2eVtf2DGSFXLWQyxZaommXm6F1uSnRw/edit?usp=sharing
Sheet1
Link to autoencoder: ? (need to double-check w/ wandb)
dict id,Id,MCS,Feature,Monosemantic? ,autointerp expl,auto interp score,autointerp match,Note
1910,0,1,=" especially after ref-type,1,elements of physical addresses and names.,-0.03,0
1597,1,0.99,big spaces then usepackage,0,the word...
Sheet1
Id,MCS,Feature,Monosemantic? ,Note
0,0.99,Begining & end of first sentence (like a header?),1,0.7619047619
1,0.99,grammar? Uhhh???,0,1/5 tokens activate it, predominately 3 neurons
2,0.99,grammar? Uhhh???,0,1/2 tokens activate it, predominately 2 neurons
3,0.99,Opening ( for english, but ...
I also may have noticed a pattern in feature vectors that make sense vs those that don't. Ones that don't have a fairly symmetric weight distribution, whereas "real" features have longer tails.
I could show this by plotting MCS by mean & std of the weights. lol nope. No correlation at all.
I believe this will be mostly correct, but misleading because different neurons have different activation distributions, so plotting weight histograms by quantile. I'll also need to multiply the weight by the max-activation of that value I think.
Actually cancel this I can probably just use LASSO and hope for the best. Same number of hyperparams to tune so it's probably* fine*.
God I am falling down so many rabbit holes for different sparse factorization mechanisms.
more graph posting, here are the runs from layer 10 gptsmall, this one really should match the results in the direction-finding part of the openai autointerp paper. instead the results are much better, but also don't show the same size of difference between neurons and random directions
@bitter turtle let me know when you're free today and i'll get you set up on the eleuther compute, i'll be in the office in about an hour
@pallid current are you available to do this
Hey Aidan, sorry had a meeting earlier so stayed home for a bit, on the train in now, will be there around 12 o'clock
Here, the neuron-basis has a median(?) score of like .29 & random is .23, and in the paper it's 0.15 & 0.037?
mean not median, and 0.06 for random not 0.03 but essentially yes
just before the graph in https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html#sec-direction-finding
'We find that the average top-and-random score after 10 iterations is 0.718, substantially higher than the average score for random neurons in this layer (0.147), and higher than the average score for random directions before any optimization (0.061).'
Isn't that optimizing a direction that's explainable?
0.7 is post optimizing, 0.147 is neuron basis, 0.061 is random direction
Thanks for explaining:)
So it looks like you're finding really awesome random directions then, lol
Like this is their random-only scoring.
Is it legal for this to go below 0?
careful, it's confusing, that's random-only scoring which means that they evaluate their ability to predict the activations on random samples
not that it's a random direction
and also that's layer 10 for gpt2 XL, not small
but yeah it is possible to go <0, im not sure why they cut it off
Lol. Okay, read the paper some more and got it!
Have you tried looking at the random directions that scored highly yourself?
well i've looked at the activations that i feed into gpt-4 and those make sense, and also match up to me to the feature activations i got out when i got the activations on their own
my guess for why i'm seeing surprisingly high scores is that i'm working with 50k fragments from openwebtext, but those fragments are from the very beginning of the corpus, which means there are multiple fragments from the same bit of text
and so the top-scoring fragments in the validation set are quite likely to come from the same paragraph as the top-scoring fragments in the train set, which makes the task quite a lot easier
changing it now to only take max 1 fragment from each sentence. i think sentences are uncorrelated but will check
But would that help for random directions? I would expect a random direction to represent many features as you move along it, so would only do good when predicting other high-activating examples, and fail terribly for random-activating ones.
yeah i think something like this is going on, but we're doing random-and-top scoring and i guess most of the score is coming from top
(i think neuron basis is also pretty bad for random samples)
Just a heads-up: pbbly gonna switch to safetensors for the big sweep, it allows memory-mapped/'lazy' tensors which is pretty important for not eating memory when we're doing parallel runs
ok sounds good
hmm i guess you might not need to idk yet
i dont know anything about safetensors but seems like it's on the up so interested to play around with it
this solves the problem of surprisingly high scores. unfortunately, it also makes the neuron basis completely useless, so i guess all previous results are spurious and still work to do. ah well.
https://docs.google.com/spreadsheets/d/1BnFaqn8W9aM1rlosiYFG64VVeeYQNJWFYM-Qt4qj-0o/edit?usp=sharing
Looking at the features of layer 1 across training (for 2k dictionary). My understanding of feature tracks w/ self-similarity.
Layer1 5 repeats
Green means monosemantic, yellow means idk, red means polysemantic
There are 3/6 examples that mostly stay the same feature.
2/6 appear to represent meaningful, monosemantic features, but change over time.
1/6 might represent some meaningful features; it's unclear, but it does clearly change meaning across time
For the 4k dict, I checked how many features have a self-similarity of 0.9 (after 60GB of data), which then dropped below 0.6:
"The number of vectors that reach at least 0.9 and then drop below 0.6 is 1707."
So plotting all of them, there are 20-40% of features that "changed" (though I haven't proved that high self-similarity implies a meaningful, monosemantic feature)
For the 2k dict, it's like 600/2k, so 25% of features that change drastically.
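(For reference, a sketch of how a "reached 0.9 then dropped below 0.6" count could be computed, assuming you have one self-similarity series per feature across checkpoints. The toy series and the helper name are invented; the real numbers above came from the actual runs.)

```python
def count_reach_then_drop(self_sim, high=0.9, low=0.6):
    """Count features whose self-similarity reaches `high` and later falls below `low`."""
    count = 0
    for series in self_sim:
        reached = False
        for s in series:
            if s >= high:
                reached = True
            elif reached and s < low:
                count += 1
                break  # count each feature at most once
    return count

# One list per feature, one value per checkpoint (made-up values).
toy = [
    [0.20, 0.95, 0.97, 0.99],  # reaches 0.9 and stays there
    [0.30, 0.92, 0.70, 0.50],  # reaches 0.9, then drops below 0.6 -> counted
    [0.10, 0.40, 0.50, 0.55],  # never reaches 0.9
    [0.50, 0.30, 0.91, 0.95],  # is low before reaching 0.9, not after
]
n_changed = count_reach_then_drop(toy)
```

Only the second toy feature counts as "changed" here, since the drop has to come after the feature first reached the high threshold.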
One solution may be to train the dict again, but if a feature's self-sim> 0.9 after each mini-run, we learn a gradient mask (Hoagy's suggestion) to keep those features.
This also lends itself to a stopping point: if all features are frozen (because they were >0.9 self sim after a minirun), then we're done.
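(Very rough sketch of the gradient-mask idea in plain numpy, assuming one dictionary feature per row. The step function, learning rate and self-similarity values are placeholders, not how the real training loop is written.)

```python
import numpy as np

def masked_update(dictionary, grad, frozen, lr=0.1):
    """One gradient step that leaves frozen rows (features) untouched."""
    mask = (~frozen)[:, None].astype(dictionary.dtype)  # 0 for frozen rows, 1 otherwise
    return dictionary - lr * grad * mask

dictionary = np.ones((3, 2))
grad = np.ones((3, 2))
self_sim = np.array([0.95, 0.50, 0.92])  # hypothetical self-similarity after a mini-run
frozen = self_sim > 0.9                  # freeze features that look converged

updated = masked_update(dictionary, grad, frozen)
all_frozen = bool(frozen.all())          # stopping condition: every feature frozen
```

Rows 0 and 2 stay fixed while row 1 keeps training, and the run would stop once `all_frozen` is True.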
On the topic of "Other ways to do dictionary learning", this suggests an adaptive l1 parameter. Additionally, the previous section discusses using fast ISTA (FISTA) for the alternating updates (like K-SVD does), which lecun papers also use.
paper link: https://arxiv.org/pdf/2108.11730.pdf
I was trying to figure out if this sentence related to the "an" feature, and the translation was just a bit surprising.
Turns out though that "ad" is italian for "to", but only when the following word starts w/ a vowel (else it's "a"). So very much like "an"!
One big confusion I have is that the max-activating logit diff doesn't ever make sense, even in the last layer, where I'm just directly unembedding.
For example, in distribution, the "an" feature affects the log-prob prediction of vowel-starting-words, but only 1/30 max-diff tokens start w/ vowels, and the one that is is "ural", with no beginning space, so doesn't really count (I think, though I'd count "est" in "c'est")
Note: Transformer Lens uses twice the GPU memory as just loading in the model normally. I think baukit is the way to go for larger models & we'll just need to multiply by GeLU for the activations.
Oh yes! Oh oh! ahaha Oh...yes
right got some good parallelized training code setup, it's a bit weird and finicky to use since I wanted to do vectorized model ensembling and that doesn't vibe well with multiprocessing/shared memory sometimes so I came up with a bunch of hacks to get around it
brief walkthrough of current system if you want to use it (code on my gh)
current setup is you write some functions like this
import torch
import torch.nn as nn
import torch.nn.functional as F

class FunctionalSAE:
    @staticmethod
    def init(activation_size, n_dict_components, l1_alpha, bias_decay=0.0, device=None):
        params = {}
        buffers = {}
        # encoder and decoder are both (n_dict_components, activation_size)
        params["encoder"] = torch.empty((n_dict_components, activation_size), device=device)
        nn.init.orthogonal_(params["encoder"])
        # zero-init the bias; torch.empty would leave it holding uninitialised memory
        params["encoder_bias"] = torch.zeros((n_dict_components,), device=device)
        params["decoder"] = torch.empty((n_dict_components, activation_size), device=device)
        nn.init.orthogonal_(params["decoder"])
        buffers["l1_alpha"] = torch.tensor(l1_alpha, device=device)
        buffers["bias_decay"] = torch.tensor(bias_decay, device=device)
        return params, buffers

    @staticmethod
    def loss(params, buffers, batch):
        # c: (batch, n_dict_components) non-negative codes
        c = torch.einsum("nd,bd->bn", params["encoder"], batch)
        c = c + params["encoder_bias"]
        c = F.relu(c)
        # normalise each dictionary element (row) to unit norm
        normed_weights = F.normalize(params["decoder"], dim=-1)
        x_hat = torch.einsum("nd,bn->bd", normed_weights, c)
        l_reconstruction = F.mse_loss(x_hat, batch)
        l_l1 = buffers["l1_alpha"] * torch.norm(c, 1, dim=1).mean()
        # l2 decay on the encoder bias
        l_bias_decay = buffers["bias_decay"] * torch.norm(params["encoder_bias"], 2)
        return l_reconstruction + l_l1 + l_bias_decay, (c, l_reconstruction, l_l1)
and make a bunch of instances of it like:
all_models = []
for dict_size in dict_sizes:
    models = [FunctionalSAE.init(activation_size, dict_size, l1_alpha) for l1_alpha in l1_alphas]
    all_models.append(models)
then, vectorize all the models with the same internal dimensions with FunctionalEnsemble (each ensemble on a different GPU) and send it to dispatch_on_chunk to do all the multiprocessing
it's weird but it works and ensembling is good
especially for our small dict sizes
@manic wind Did you ever work on your idea of contrastive learning of features and their effect on the output?
Ooh, what was that
If you really want to do good model editing based off internals, it makes sense to just directly learn which neurons have which effect on the output.
So in this case, optimize for encodings of neuron activations and their respective logits to be similar.
Ah no, ended up working on causal models and relating them to transformers, but would be interested if there's any ideas with the contrastive learning. It was just an idea, didn't do anything towards implementing
I think this’d be a great project to at least try.
I could get some initial results and then try to find contrastive learning people to work on it.
what, so, find directions that correspond to max logit change or something?
Go for it. I unfortunately don't have time to help out, most likely
But would be excited to know if it has any hope
isn't that like, one cross-covariance matrix calculation (assuming ~linearity)
I don’t think so. For images and captions, there’s structure to what’s an image of a cat that will link all cat things in the latent direction.
Here, we’d be learning which neuron directions correspond to which changes in logits.
I guess that's kind of what I meant maybe slightly
Ya, I’m unsure on the current literature what people tend to do. Linearity of input might not work here.
anyway on a different tack what hyperparameter ranges should we look at; I am pretty close to being able to start off a run on the pod @pallid current @bronze wraith
For which setting?
what literature?
Contrastive learning
all of them, predominantly dict ratio, l1 coef, l2 bias weight decay coef
that seems unspecific
like sparse feature extraction is a type of contrastive learning
Typically 3e-4 works for the Pythia models so around there? (Though I think a little higher may be better)
Not more than 0.1, not less than 1e-6
cool
The thing to look at here is the features/datapoint. Shouldn’t be <1 or >100
I was considering using a package for a particular contrastive learning method called CEBRA that was actually designed for neuroscience experiments. That may or may not be the right move though
Totally don’t know about weight decay
what is the time-series you're considering here?
yeah im probably only going to look at 2-3 decay settings (including 0). If it is good, we can look more into it later
Weight decay is kind of weird because of the distribution of neuron activations, but I agree that something like this should work
Oh, also, does anyone know if the encoder turns out to be some scaling of the transpose of the decoder? it seems reasonable that it should, and i'm wondering if we can get away with doing (x * D.T) @ batch instead of E @ batch
The thought was to replace time with logit values or something like that. It really depends on the details of what you want
Sure! I actually don't have time now but can explain later if you start a channel (or DM)
The median activation of neurons in layer 2 (pythia post GeLU) has a few outliers, w/ the majority being negative because of the GeLU.
Weight decay would interfere w/ this reconstruction, but not if we had the data normalized (I think)
Oh ya, a tied encoder/decoder for the linear layer would be interesting. This would get around the problem I have of there not being a specific direction to ablate in the encoder.
I don't think so. Just divided one by the other, so the resulting ratio should be a near constant if so. Could've coded it wrong, but both are size 4k,2k & div was same shape
Normalise how here?
Oh, I meant to include scaling by a vector
Like if you normalise the columns/rows or whatever and then compare are they similar
probably pre-GeLU
So like take pre-GELU data? Hmm idk might as well residual stream it
Maybe we should just take ReLU of the activations and blindly guess that no useful information is stored in the negatives
Oh I meant normalize pre-GeLU, then apply it. Though maybe normalizing post-GeLU is equivalent
Could be tested by doing that & checking the CE loss. I believe Noa Nabeshima did that & did see an effect on like early layers or something
centered mean & std of 1
Though maybe a weight decay & bias in decoder as you mentioned earlier?
that seems like not a very good idea for MLP activations, especially since we don't have a bias on the decoder, and the directions probably don't start at the mean of the data. Also not sure what you gain from whitening the data here. Centering makes sense for the residual stream for sure. Also, not sure it makes much sense for pre-GeLU, since it probably severely impacts model performance and we wouldn't be looking at model activations anymore
Do you have any results on this front
Ah, thanks for reminding me!
And in case I swapped the row & columns
If the concern is model performance, then you can save the mean/std you use to normalize & then undo it when you do model editing
But if the bias on decoder helps, then that may just be easier to do.
I mean like I highly doubt we'll learn useful sparse dictionaries if we center the MLP activations
and train on that
since we won't be searching for a dictionary the model 'uses'
like, the data is intrinsically not centered
hmm
maybe look at mean cosine similarity? Like, check if E @ D where they're both normalised has ~ones on the diagonal?
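(Sketch of that check, assuming E and D are both stored as (n_dict, activation_size), so the product would be E @ D.T and we only need its diagonal. The matrices below are random stand-ins, not trained weights.)

```python
import numpy as np

def encoder_decoder_alignment(E, D):
    """Cosine similarity between each encoder row and the matching decoder row."""
    En = E / np.linalg.norm(E, axis=1, keepdims=True)
    Dn = D / np.linalg.norm(D, axis=1, keepdims=True)
    return np.sum(En * Dn, axis=1)  # same as the diagonal of En @ Dn.T

rng = np.random.default_rng(0)
D = rng.normal(size=(6, 4))                        # stand-in dictionary
E_tied = rng.uniform(0.5, 2.0, size=(6, 1)) * D    # encoder = per-feature scaling of D
E_free = rng.normal(size=(6, 4))                   # unrelated encoder

cos_tied = encoder_decoder_alignment(E_tied, D)    # all ones if E is a row-scaled copy of D
cos_free = encoder_decoder_alignment(E_free, D)    # generally well below one
```

If the trained encoder really were a per-feature scaling of the decoder, the per-row cosines would all sit near 1 regardless of the scale factors, which is why normalising first is more robust than dividing one matrix by the other.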
I just don't have any dict/encoder pairs to hand unfortunately:(
I'm throwing in the towel with KSVD. I found it worked very well if you know ahead of time how many features there are, but it struggled if you increase that number by just a bit. In the attached diagram, the 10 standard basis vectors were taken as features in 10D space, and we randomly chose 3 to be active at the same time. The horizontal axis is the number of features you told it to find in each OMP search, ranging from 3 (the correct value) to 6. At x=3, it converges in 23 epochs to the correct features, but for x>3 it takes >100 epochs to converge and at 100 epochs the MMACS (mean max absolute cosine sim) is bad. (I ran these til convergence and the MMACS never became great.)
Thanks for trying! Do you think this is irrecoverable? Generally methods like this would have the same problem?
One reason we may want a changing l1 value is because larger dictionaries have high feature-activations/datapoint.
We could instead vary the l1 set point: start it low & increase it until the feature-activations/datapoint are at most X (and maybe decrease it if it goes well below, but I don't expect that). Then we can just vary this set point when doing parameter sweeping
@pallid current @bitter turtle @bronze wraith Getting really good results looking at residual stream layer 2 (thanks for pushing residual stream Aidan!)
Like types of features (and ablation effect):
- Single token detectors (strong effect on bigrams)
- German detectors (strong effect on specific German words)
- Words after places (other statistical bi/tri-gram stuff? Unsure)
I'm also getting like 1k 2.5k features.
For the image, it's the ablation of text effect. So ablating the location-token makes the next token's feature activation go to 0.
Importantly, this has a much stronger & intuitive effect when ablating the direction.
Notably I made two changes at once (which I need to untangle)
- switched to residual
- Added a bias to decoder (which may affect the logit-diff ablation effect)
I'm somewhat confused by these results, because this is literally a textbook algorithm, so it's weird to me that it wouldn't perform particularly well. So either I'm doing something wrong (and aidan and the package implementing this also did that stuff wrong), or there's some magic sauce in sparse autoencoders that I don't understand
Gotcha. Is this expected if you don't get the underlying number of features right? I would expect it should work to produce at least a linear decomposition.
Just not the underlying linear decomposition in your results
yeah, it isn't converging to a highly accurate reconstruction (it gets within about 1e-4, but stops improving, whereas with the exact right number of features it gets within 1e-6 quickly and keeps improving). and I could believe that this would recover some features, just not the canonical ones, but if thats the case, why do sparse autoencoders succeed at that task?
This is the only mention of K-SVD in that dictionary paper I linked, but only said it's not useful for large datasets. I would've expected it to work here though.
It'd be a weird coding error for it to work at the right number of features and not the others
Brass tacks though: I think we’re done. Like these results are huge. There’s a few due diligence things to do, better codebase, and applications, but everyday research can handle that.
we might want to try a kind of hybrid approach where we use OMP/FISTA for the encoder and just regular optimisation/least-squares regression for the dictionary
hoagy was saying Anthropic was looking into that for some reason
on a similar vein we could tie the weights of the encoder to the decoder
This seems an overly strong claim. I'd still like to see
- more thorough analysis of good hyperparameters (inc. bias on decoder, bias norm etc)
- large-scale (auto)interp on (all/sufficiently-representative sample of high MCS features)
- causal scrubbing with learnt features on specific algorithmic tasks
- an analysis of how complete the feature set we learn is (for example, performance when using idk only features > 0.9; might be a bit pointless because it's heavily overcomplete, but could equivalently do 'performance loss when replacing layer X with reconstruction from sparse dictionary'
- also, if features are truly correctly represented by the TMOS model, we should probably have a sparseish covariance matrix when rotated into the top-k features for k<dimensionality of residual stream. not sure we have this
like, ideally, we should be able to identify circuits using this, and we haven't explored this yet
I'm not even convinced that residual stream representations are 'mostly' linear
Something something ROME edit failures or whatever
I also agree on all of these (though unclear on the meaning of the last one). Like work does still need to be done, but maybe I'm more confident that this will just end up working.
This is one of my questions: Residual stream seems really good, but I'm expecting a large amount of redundancy if you learn dictionaries across multiple layers, which makes circuit like stuff hard.
wdym
so like if features are sparse and mostly represent independent things then we should see that the covariance of activated features reflects this
I guess I'm using the dictionary as a proxy for feature activation here which might be problematic idk feel like that should cancel out
The residual stream carries information from layer 1 to 2 so I expect dictionaries learned for both will have a large amount of overlap
Lecun's paper actually shows differences across early, mid, & late layers: https://arxiv.org/pdf/2103.15949.pdf
surely that makes circuit stuff easier? like, if we can treat some directions in both dictionaries as equivalent, we have some of the work done for us
covariance of activated features in one layer?
I'm on data, don't want to load pdf; what differences?
Ah, I agree. It just seemed wasteful, but we can totally be wasteful & inefficient if it works.
Early layers were more like single meaning words & later layers were higher level features (I'd assume like the german one or repeated tokens one?)
Though I saw several different types in layer 2 of Pythia
oh, right, that, yeah there's going to be some difference but it's a continuum and for neighbouring layers I'd expect high similarity
I guess I want better denoising encoders for this
Posted recent results on the residual stream:
https://www.lesswrong.com/posts/Q76CpqHeEMykKpFdB/really-strong-features-found-in-residual-stream
I’m enjoying reading this! If you want us to promote this or any other materials y’all put out on the EleutherAI blog or social media just ask.
I noticed a couple weirdnesses in this paragraph:
To be clear, I am running datapoints through Pythia-70M, grabbing the activations mid-way through at layer 2's MLP after the GeLU, & running that through the autoencoder, grabbing the feature magnitudes ie latent activations
- “Pythia 70M” is actually named Pythia 160M. I know having the model names change is annoying, but it’s much less annoying than having different models in the suite follow different naming conventions!
- The MLP and Attention layers in Pythia are computed in parallel. Does “after Layer 2’s MLP” mean “before Layer 2’s attention layer writes to the residual stream”?
Finally I was wondering if you had tried applying the Tuned Lens and if the Logit Lens was giving you better results, or if you hadn’t tried the Tuned Lens yet. IIRC the Logit Lens does work reasonably well for Pythia, but you should expect its behavior to fall apart when examining other models like BLOOM and GPT-Neo
Also, @keen pivot you now have the Research Lead role. The primary change this brings is the ability to pin and un-pin posts, edit channel descriptions, delete posts, and assign low-level roles.
Please use this power primarily to manage this channel, but if you feel like spending some time doing miscellaneous moderation tasks and cleaning up spam we’ll hardly complain. You don’t have the power to ban people; if someone needs banning that will need to be referred to a Staff member.
Add to this list:
- more toy model stuff trying to fit the behaviour of the actual data
- testing hypotheses about why the residual stream outperforms MLP stuff using toy models (maybe look at sae effectiveness with noisy data)
- compare perf over entire dataset Vs like QA or math
(keep aiming to look into this and never doing it)
(maybe I will eventually, but not particularly confident about actually getting any useful results)
Thanks!:) I'll probably take you up on getting it promoted, probably w/ a better post this week.
- Updated the Pythia 70M to 160M on both posts; thanks! [Edit: Looking at the table of the bottom of https://huggingface.co/EleutherAI/pythia-160m, pythia-70M is 6 layers & 160M is 12. I'm currently using the 6 layer one, so unable to square this]
- Correct. Last post was "mid-MLP" as in after the first linear layer & activation function. The latest post is the residual stream w/ much better results (link: https://www.lesswrong.com/posts/Q76CpqHeEMykKpFdB/really-strong-features-found-in-residual-stream)
- Regarding Tuned/Logit Lens: the latest post does get much, much better results here, this is for both the logit lens & ablating the feature direction. I would like to integrate Tuned Lens here though, especially for larger models & if applying to BLOOM/OPT/GPT-Neo.
late to this sorry, but i asked Lee and he said that he tried it, but that he couldn't get it to learn well (low MMCS), but that he thinks it should work, and is interested to see it tried again
The strongest effect of residual stream is a different set of features (maybe a lot of embedding ones?, but also some overlap) & much stronger direction ablation effect/logit lens.
I still haven't trained a non-affine decoder on residual stream yet to disentangle that part.
Ugh I should never say things after midnight. Apparently I just had a brain fart and you were right about model sizes
Not sure what you mean by 'ablation effect' here; could you elaborate?
Happens. Thanks for the update!
Both logit lens & ablating the feature direction rely on a good feature direction.
If we have a feature direction, we can project onto its orthogonal complement every time it has non-zero activation. What I actually do is subtract direction*magnitude.
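The two ablation variants described above can be sketched as follows. This is only an illustration: `acts`, `direction` and `feature_acts` are hypothetical names for the activations, the dictionary row for the feature, and that feature's latent activation per datapoint.

```python
import numpy as np

def ablate_feature(acts, direction, feature_acts):
    """Subtract direction * magnitude, where the magnitude is the feature's
    latent activation. Datapoints with zero activation are left unchanged.

    acts: (n, d), direction: (d,), feature_acts: (n,)
    """
    return acts - feature_acts[:, None] * direction[None, :]

def project_out(acts, direction, feature_acts):
    """Project onto the orthogonal complement of the feature direction,
    but only for datapoints where the feature is active."""
    unit = direction / np.linalg.norm(direction)
    projected = acts - (acts @ unit)[:, None] * unit[None, :]
    active = (feature_acts != 0)[:, None]
    return np.where(active, projected, acts)
```

The difference is that `project_out` removes everything along the direction, whereas `ablate_feature` removes only the amount the dictionary attributes to that feature.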
something i'd be interested to test in the upcoming sweep is what would happen if for the MLP activations we constrained the features to be positive only.
you'd need to allow a bias to help it account for negative activations
but i wonder if it would help it towards correct solutions given that we expect most features to be pretty much entirely positive - it's kinda strange to think about what a negative valued feature would look like in the MLP, like working only in the negative range of the GELU would make it extremely sensitive to interference. i suppose you could have a 'feature' which cancels out the activation of other features at certain times, though maybe that would better be understood as just being a part of how the activation conditions for the positive features are defined
will check later today whether we're seeing significant positive activations in our current dicts
sent an email to wes gurnee about getting the distributed features that they found in early pythia layers which respond to particular n-grams
am now on PST time btw! and will be properly back in the swing of things on monday
yea I was thinking about literally just slapping ReLU on the activations when we're preprocessing
should be equivalent mod dead neurons ig
why would preprocessing with relu have the same effect as constraining features to be positive?
no incentive to learn negative features
the kind of negative interference i'm imagining could still happen with positive-only activations i think
might be misunderstanding what you mean by positive here
just that each entry in the decoder matrix would be positive only
if you mean 'only directions with +ve coefs' should be the same
well, no incentive to learn negative features
or, if negative features were learnt, they would be equivalent to 0
like, I don't see why the encoder should need to use the negatives in the activation at all
oh I am totally misunderstanding yeah constrain it to be positive ignore me 👍
like, I would see passing the activations through ReLU as "cleaning the GELU noise so the dictionary doesn't pay attention to it" and constraining the dict to have +ve coefs as "limiting the computation the dictionary can do", is that what you're getting at @pallid current?
yeah i think this is about right. i don't think it's right to call GELU negatives noise (i actually wonder if it helps reduce interference) because i highly doubt that the model would work well if you added the extra relu, but yeah 'limiting the computation' is about right
i'm wondering if there are activation vectors that the model learns to explain as (a * feature x - b * feature y) which doesn't fit my model of how features should work
but that also might be a problem for my understanding of features so i would only be doing it as a tentative test
Can probably test this
which bit? i'm really interested in what would make a good test for whether gelu helps with interference, probably along the lines of the test that you wrote up a couple of weeks ago
somewhat related but there's something i really want to test with that setup that i thought of last night. setup is you have multiple MLP layers in sequence which are trying to calculate n distinct features, where n is more than total number of neurons. question is whether, for each feature that is calculated, are they calculated in one layer only, or does the neuron e.g. do some preprocessing in layer 1 to calculate in layer 2, or perhaps use layer 2 to clean up interference from layer 1?
total meaning 'more than both MLP layers combined' here?
yup
- learnable l1 (based off features/activation)
- Re-init features when dead
- Perplexity difference
  - a. When replacing whole layer w/ reconstruction
  - b. Just high-MCS features (and potentially high-MCS datapoints only)
- Keep features if self-sim=0.9 after N_GB & not dead
- Decoder L2/simplicity term
- Affine decoder vs linear
- Changing toy model to better match current performance
  - a. write down current differences
  - b. Residual stream outperforms (maybe look at sae effectiveness with noisy data)
- compare perf over entire dataset Vs like QA or math @bitter turtle , what metric for performance did you mean
- Circuit finding & causal interventions (on algorithmic tasks?) @bitter turtle
  - a. How to find circuits if doing residual stream
- Tuned Lens
- Better wandb/aws setup
  - a. Easily get graphs on same page (Do we just manually do it every time? Groups mess it up)
  - b. naming scheme when uploading to aws isn't useful. Just timestamp, when model_name & layer would help
- Large model features
  - a. switch to bau-kit for >6B models
  - b. 1B-param features
- Auto-interp - good for hypothesis refining of input, but what about:
  - a. hypothesis testing on effect on output (ablating direction/logit lens)
  - b. marking "interesting" features (or categorizing features in general?)
- Aiden's TMOS k-covariance thing(?) @bitter turtle
- Talk to expert in dictionary learning
  - a. Anthropic
  - b. maybe MIT person Logan knows?
- Compare w/ Baselines: PCA & Reconstruction ICA
Aiden, could you clarify these things? (I think you'll need to explain some of them again, sorry. I can read back later, but currently pinning post. No hurry though)
ok so the circuit/smaller dataset stuff is basically because I slightly feel like you'd get more linearity/truthful representation by a sparse basis on [some] limited algorithmic tasks, and I wanted to see if that was the case.
@pallid current when we do the big run to compare things, it’d be good to have a set seed for the data generation as opposed to just shuffling. Also, I believe the Pile is already shuffled by default
agree on the seed, what does it mean to shuffle by default? just the parameter that goes into load_dataset(shuffle=True)?
in the current setup, they all get fed the exact same data at the same time, just need to specify the seed at the start
True over the course of one run, but not two runs, right?
I think shuffled in the shards by default
What do you mean by 'run' here? Also I was talking about the current setup for the big run on the pod sorry for not specifying that
If the big run handles all hyperparams we care about it should be good, but it’d be good to still have a set seed for the data shuffle if we care about replicating later or think of some other setting we’d like to compare to.
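A tiny sketch of what "a set seed for the data shuffle" buys us, using only the standard library. The helper name is hypothetical. With Hugging Face `datasets`, the analogous step would be calling `Dataset.shuffle(seed=...)` on the loaded dataset, which is an assumption about how the data gets loaded in the sweep.

```python
import random

def shuffled_indices(n_examples, seed):
    """Return a permutation of range(n_examples) that depends only on the
    seed, so two separate runs see the data in the same order."""
    rng = random.Random(seed)  # private RNG, untouched by other random calls
    indices = list(range(n_examples))
    rng.shuffle(indices)
    return indices
```

Logging the seed alongside the other hyperparameters is then enough to replay the same data order in a later comparison run.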
We got recommended a paper written a few years ago ( @pallid current , I forgot the person's name?) for variational sparse encoding: http://proceedings.mlr.press/v115/tonolini20a/tonolini20a.pdf.
Section 2 gives a nice overview of related work to contextualize it, but I'm confused on the claim of what normal sparse coding is missing that VAE's help w/ (or why not just an AE like our work?). Haven't read more than 10 min, but would appreciate if someone else could look at it!
@keen pivot willing to go through this with you so we can bounce thoughts off eachother; first thoughts are that
- we get more control over the latent space by using a VAE, which might result in better learning/convergence if we set our hyperparams right
- typically VAEs use a more powerful recognition model than our current encoders; probably useful if transformer representations are fundamentally nonlinear/better denoising is needed than a simple ReLU layer
I believe @blazing yoke was talking about this in #eliciting-latent-knowledge at one point, they might have been the one to recommend you the paper. I'd be very interested in pursuing this.
Thanks!:)
Also a note on AEs vs VAEs: https://stats.stackexchange.com/questions/324340/when-should-i-use-a-variational-autoencoder-as-opposed-to-an-autoencoder is probably fine.
If you'd like to take ownership of implementing it, that'd be great. I'm currently doing different sparsity constraints atm, and maybe even tied embeddings if we're not doing that in the big sweep
Hey - I mentioned the paper to Lee a few days ago so may have been me. I've been loosely keeping tabs on this thread. It's a good point that the paper didn't compare with normal sparse coding. A friend of mine wrote the paper, so I'd be very happy going through it with you. The answer aidan posted is pretty accurate. There's many principled differences between VAEs (which aren't really auto-encoders), and auto-encoders.
would absolutely love that, I have no formal grounding in VAEs or pretty much any principled machine learning, would find this v useful
Generally, and in my experience, autoencoders need a lot more hacks for learning the kinds of representations you want. They suffer from things like mode collapse, and it turns out the isotropic gaussian latent space is kind of an okay choice in VAEs.
VAEs are super easy to implement though - I could show you the ML/code side in < hour, and the probabilistic theory in a couple hours.
Yep, it was from Lee. Thanks!
Ignore most of this code that I haven't touched in a long time, but this is the de-facto implementation. https://github.com/SalmanMohammadi/odd-one-out-representation-learning/blob/7989b74de0aa76f5a63dabda7baf1c0105adfd5a/models/models_disentanglement.py#L111
oh was about to say found one but would still value a summary paha
Not sure if you're still looking for thoughts on the paper @bitter turtle - I hadn't seen it before but I've skimmed through it. Could you share the summary you found?
Intuitively, from my perspective, it follows a different probabilistic derivation than the VAE paper
not the summary
but was cited by http://proceedings.mlr.press/v115/tonolini20a/tonolini20a.pdf and im confused about the intuition
Yes, I was.
did you get anywhere with this?
one sec let me bring up the messages you sent originally
I guess you were approaching it from a slightly different perspective
When trained effectively, the Variational Autoencoder (VAE) can be both a
powerful generative model and an effective representation learning framework
for natural language. In this paper, we propose the first large-scale language
VAE model, Optimus. A universal latent embedding space for sentences is first
pre-trained on large text corpus, and t...
Yeah, I ended up deciding the right thing to implement was probably this.
But I ended up doing other projects first.
Currently working on https://github.com/JD-P/minihf
could you elaborate on why?
There don't exist a lot of language VAE architectures.
But um, Katherine made a flow encoder thing you could add to it that would make it implement the right inductive bias for language.
?
Which was inspired by https://arxiv.org/pdf/1908.11527.pdf
Normal VAEs don't really work well for language
So you have to use like, an iVAE or one with hyperbolic geometry
We were going to combine Katherine's flow model with Optimus.
You'd probably be better off asking her for the technical details, since I didn't implement any of the flow model.
@opal basin
Oh, the normalizing flow part was to allow the autoencoder to learn distributions with arbitrary/weird shapes while retaining the ability to sample latents
Because with a VAE your posterior is normally diagonal Gaussian which is not good for language
We haven't actually tried it on text yet
If you don't care about the ability to sample from the distribution of latents, or determine the information content/likelihood of a latent, you can use a normal autoencoder (not VAE) which also lets it take on arbitrary shapes
Implementing this was still on our todo so, would be very happy if you did.
we'd be doing it for transformer internal representations not text, not sure if the non-gaussianity thingy still holds there
ahhh
why don't gaussians work for text
tbh i'm not clear on the exact details but they impose a Euclidean geometry on the space and for text you want the ability to stuff trees into the latents, which means you want hyperbolic geometry. this is kind of conjecture
"Adding Gaussian noise imposes Euclidean geometry and this is empirically not good for text" is the part that isn't conjecture i think.
Euclidean geometry can have a bunch of things wrong with it https://arxiv.org/pdf/2002.05227.pdf
riiiight so would the conjecture imply that transformer internal reps are well described by something with hyperbolic geometry, so maybe it still applies here? we might end up using it if it works better, im adding it to the ever-growing list of things to test at some point in the future
nods
I wanted to be able to sample from the distribution of latents and determine their likelihood so I took a normal autoencoder and added an RNODE (https://arxiv.org/abs/2002.02798) that converted from its latent space to N(0, I) and back.
Training neural ODEs on large datasets has not been tractable due to the
necessity of allowing the adaptive numerical ODE solver to refine its step size
to very small values. In practice this leads to dynamics equivalent to many
hundreds or even thousands of layers. In this paper, we overcome this apparent
difficulty by introducing a theoretical...
But um, to my memory I said a couple different things about VAEs and ELK. The most relevant one is probably that the use of a decoder only transformer makes ELK harder because you can't easily get embeddings and explore the latent space of the model and characterize it. A VAE is useful because it lets you estimate the information content of the latents.
It worked quite well for CIFAR-10.
The sampled fakes actually looked like real images, which you usually don't get fully with a VAE!
Which for ELK is important because it lets you figure out if your translator is hiding detail. If there's a mismatch between the amount of information in the latent and the explanation, you know something funny is going on.
don't understand, could you say this in slightly longer form, also what's this for?
Wasn't there some complication you ran into with scaling that caused you not to use it for text right away?
With a normal autoencoder you can't sample from the distribution of latents to be able to generate completely new fake images that resemble the training set, because it can take on any arbitrary distribution.
With a Gaussian VAE you try to approximate the posterior by sampling from the N(0, I) prior and decoding that, which you have tried to make the posterior (encoder output) resemble, and this kind of works
yeah but the RNODE thing
With my flow autoencoder you can sample from N(0, I) and run it through the flow model to obtain totally new latents to decode.
RNODE is a continuous normalizing flow, an invertible map between two arbitrary probability distributions.
In this case it maps between N(0, I) and whatever the learned distribution of encoder outputs is.
oh shit ok 👍 magic black box to convert distributions gotcha
+1, but in practice a VAE works by learning the mu and sigma for N(mu, sigma), and you sample from that rather than N(0, 1)
But you'd like to be able to sample from the prior and get things that look like the training set but aren't. It just doesn't usually work that well in practice.
(Hers does though)
for sure
But yeah, she was trying to figure out how to scale it and that's where we got sidetracked by other things I think.
Initially I don't think we care that much about sampling from the training distribution, plus superposition kind of conjectures that features follow ~a spike and slab distribution (https://www.tandfonline.com/doi/abs/10.1080/01621459.1988.10478694) so I'm guessing that http://proceedings.mlr.press/v115/tonolini20a/tonolini20a.pdf will end up higher on our list of priorities
Had you heard of the spike and slab distribution before?
nope
When Francesco brought it up for his paper he said it was an (obscure?) Physics thing.
but it seems close to what superposition is saying; toy data for testing SAE methods is generated using ~that distribution
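For reference, a rough sketch of generating toy data with approximately spike-and-slab coefficients. Assumptions: each feature is active independently with probability `p_active`, the slab is uniform on [0, 1), and the ground-truth feature directions are random unit vectors. The actual toy-data generator in the repo may differ on all three points.

```python
import numpy as np

def sample_spike_and_slab(n_samples, n_features, p_active, rng):
    """Coefficients that are exactly 0 (the spike) with probability
    1 - p_active, and otherwise drawn from the slab."""
    mask = rng.random((n_samples, n_features)) < p_active
    slab = rng.random((n_samples, n_features))  # uniform slab (assumption)
    return mask * slab

def make_toy_activations(n_samples, n_features, dim, p_active, seed=0):
    """Sparse combinations of random unit-norm feature directions."""
    rng = np.random.default_rng(seed)
    features = rng.normal(size=(n_features, dim))
    features /= np.linalg.norm(features, axis=1, keepdims=True)
    coefs = sample_spike_and_slab(n_samples, n_features, p_active, rng)
    return coefs @ features, coefs, features
```

With `n_features > dim` this gives data in superposition, and the returned `features` matrix is the ground truth a learned dictionary can be scored against.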
I’m familiar with spike-and-slab regression
oh?
[deleted article because it’s not very accessible, let me look for a better one]
Is that just linreg with a spike+slab prior on the coefs
Yeah
It’s spiritually similar to ridge and lasso
It’s used to quickly work through a large number of mostly useless variables
Spike and slab is a Bayesian model for simultaneously picking features and doing linear regression. Spike and slab is a shrinkage method, much like ridge and lasso regression, in the sense that it shrinks the “weak” beta values from the regression towards zero. Don’t worry if you have never heard of any of those terms, we will explore all of the...
yeah honestly given my complete lack of background in this it's all very much a blur and they morph into one
At the end of the day it’s like quibbling over the names of different types of knives. Maybe useful for people who really care about knives but you probably just want to make sure it’ll work and not cut you.
pahaha fair enough that's mildly encouraging
If it's ok given we have so many random encoder/decoder things to try, I might spin some off to the AIS group I corun at my uni, we're trying to upskill ourselves atm with research projects. People would be @coarse flint and maybe @thorny cypress
Oh please do!
sound
bit confused about this in the answer: 'VAEs are known to give representations with disentangled factors [1] This happens due to isotropic Gaussian priors on the latent variables. Modeling them as Gaussians allows each dimension in the representation to push themselves as farther as possible from the other factors.' as i understand it, if you have an isotropic gaussian prior then the distribution should be invariant to rotation, which means that there's nothing distinct about the particular basis. where am i going wrong?
also finding it difficult to get an intuition for the role of the pseudoinputs in the VAEs. @kind scroll if you've got time to walk us through the paper some time in the next week that'd be brilliant!
the prior is isotropic, but the posterior is diagonal - the encoder outputs a mean and a log variance per dimension.
I suspect the answer is wrong and it's actually due to the posterior being diagonal and thus not rotation invariant
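To make the isotropic-prior vs diagonal-posterior point concrete, here is a small sketch (function names are illustrative, not from any of the linked codebases). It shows the usual reparameterised sample and KL term for a diagonal Gaussian posterior, and checks what a rotation does to the covariance.

```python
import numpy as np

def reparameterise(mu, log_var, rng):
    """Sample z ~ N(mu, diag(exp(log_var))) as mu + sigma * eps."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dims."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)

def rotate_covariance(log_var, rotation):
    """Covariance of the posterior after rotating the latent space."""
    cov = np.diag(np.exp(log_var))
    return rotation @ cov @ rotation.T
```

An isotropic covariance is unchanged by `rotate_covariance`, while a diagonal one with unequal variances picks up off-diagonal terms, so a rotated latent space is no longer representable by the diagonal posterior family. That is the sense in which the posterior, not the prior, singles out a basis.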
what are pseudoinputs?
instead of having a purely Gaussian prior you have a set of learned pseudoinputs which you feed into the latent space predictor to get your prior(s) which you then mix like mixture of gaussian
This is better because (?) and it incentivises high-variance posteriors/latents/eh
The paper that introduced it (?) https://arxiv.org/pdf/1705.07120.pdf does a decent job but I'm still confused
ohh
This is correct, I think vanilla VAEs don't have true disentanglement because their latent spaces can be rotated arbitrarily
Be happy to!
i believe beta-VAEs are supposed to be the "extra disentangled" flavor
yes, but they only have isotropic gaussian priors, and we are wondering why that prior encourages disentanglement
I'll link a couple papers later today. The short answer is: it doesn't really. It's partly just nice for an analytical form of the evidence lower bound.
re PCA as a baseline, fitting a GLoRA decomposition might be interesting to investigate as well. https://arxiv.org/abs/2306.07967
We present Generalized LoRA (GLoRA), an advanced approach for universal
parameter-efficient fine-tuning tasks. Enhancing Low-Rank Adaptation (LoRA),
GLoRA employs a generalized prompt module to optimize pre-trained model weights
and adjust intermediate activations, providing more flexibility and capability
across diverse tasks and datasets. More...
- Review of different methods:
  - l_1/2 - Nothing noticeably different than l_1 (no better sparsity for the same reconstruction loss). Worse high MCS (though maybe optimizing for an even better alpha would work). Didn't look at individual features.
  - Adding noise - can't compare (sparsity, reconstruction loss) because it's at a strict disadvantage to the non-noise one (would need to compare on clean data for both). Learns maybe 1/3 number of features (boo) but 0 dead features (yaaay).
Looking at individual features, many appear just as meaningful as l1, however, the logit lens is much worse for some of them (Logit lens being worse isn't as meaningful as ablated direction being just the same). Additionally, there are like 10 features for (beginning and end of first sentence), whereas l1 only has 1 of those.
Could try tied embedding for both l1 & noise to compare.
Though, I'm unsure how normalizing the weights of the decoder (which are now the encoder) will affect things?
awesome
I've just about got the big sweep code debugged we can set it off tomorrow
https://github.com/Baidicoot/sparse_coding/blob/main/big_sweep.py <- someone think of good hyperparam settings
I guess I'm confused about implementing the tying. Like I want them to be in the same direction, but I'd be okay w/ different biases. Same w/ only normalizing the decoder
Residual stream or MLP? & any specific hyperparams?
can do either not sure which one we should do first
I guess just good lr values, good values to test, also not a hyperparam but whatever we want to keep track of
For tracking, I want features/datapoints ('sparsity') & dead features:
```python
# Count features whose mean activation over the sampled datapoints is exactly zero
dead_features = (dict_levels.detach().mean(dim=0)==0).count_nonzero().item()
```
Default lr has been good to me.
what length of time are you calculating dead_features over? if i remember correctly, if you look over a long enough period of time you rarely see any dead features, it just drops off over time
I think pretty long. Looks like 1000 batches
ok fair, in that case i think we'll want to measure the average activation, or total number of non-zero activations to understand whether they're dying
maybe even also average activation, given that it's active
and average activation for dead features will be 0
like, we want to distinguish between 99.99% 0 activity and something a tiiiiny bit, vs usually 0 with occasional strong activation, which could be a healthy feature
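The extra statistics discussed above (activation frequency, average activation, and average activation given that the feature is active) could be computed along these lines. This is a NumPy sketch with a hypothetical `codes` array of latent activations, not the tracking code from the sweep.

```python
import numpy as np

def feature_activity_stats(codes):
    """codes: (n_datapoints, n_features) array of latent activations."""
    active = codes != 0
    n_active = active.sum(axis=0)
    mean_when_active = np.divide(
        codes.sum(axis=0),
        n_active,
        out=np.zeros(codes.shape[1]),
        where=n_active > 0,  # avoid dividing by zero for dead features
    )
    return {
        "dead": int((n_active == 0).sum()),
        "frequency": n_active / codes.shape[0],
        "mean": codes.mean(axis=0),
        "mean_when_active": mean_when_active,
    }
```

A feature with tiny `frequency` but large `mean_when_active` is the "usually 0 with occasional strong activation" case, which a plain dead-feature count cannot separate from a feature that is barely alive.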
I'm unsure, could be very rare features
yeah possibly tho i think we found that it was super rare to have high MCS at those low activation frequencies. not conclusive for sure tho cos you'd expect rare feats to be found less often
Update: Ya, adding noise (instead of an l1) does produce sparsity & sometimes meaningful features, but not as good as l1 & importantly the logit lens just sucks w/ it (compared to l1)
Okay, I now don't feel strongly about dead-features/average_activations.
@bitter turtle, early results are in & tied embedding looks quite good. I believe you're the one that suggested both tied embeddings & residual, both of which have been really good, so thanks!:)
Additionally, tied embeddings may allow it to work on the MLPs if we want.
Of biggest note: I'm getting ~1.4x as many features w/ the same amount of data. Plus the added benefit of reading in from the same direction we're writing out (though I have two different biases because they're different shapes)
Of also big note: average features per token went from 100 (untied embedding) to ~20 (tied embedding) when optimizing over L1_alpha for high-MCS.
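A minimal sketch of what "tied embeddings" means here, under some assumptions: one weight matrix `W` is shared by encoder and decoder, its rows are normalised for both uses, and there are two separate biases because they have different shapes. Names and the exact loss weighting are illustrative, not the actual training code.

```python
import numpy as np

def tied_sae_forward(x, W, b_enc, b_dec):
    """x: (n, d) activations, W: (n_feats, d) shared dictionary,
    b_enc: (n_feats,), b_dec: (d,)."""
    D = W / np.linalg.norm(W, axis=1, keepdims=True)  # unit-norm features
    c = np.maximum(x @ D.T + b_enc, 0.0)  # encoder reads along the same directions
    x_hat = c @ D + b_dec                 # decoder writes along them
    return c, x_hat

def sae_loss(x, c, x_hat, l1_alpha):
    """Mean squared reconstruction error plus L1 penalty on the codes."""
    mse = np.mean((x - x_hat) ** 2)
    l1 = np.mean(np.abs(c).sum(axis=1))
    return mse + l1_alpha * l1
```

Because reading and writing use the same rows of `D`, each feature's encoder direction is by construction the direction it contributes to the reconstruction.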
features = MCS > 0.9?
Yep, though maybe it should be 0.8 as the heuristic
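For anyone following along, the "features = MCS > 0.9" count can be sketched like this: for each feature in one dictionary, take its maximum cosine similarity with any feature in a second dictionary (typically trained separately, which is an assumption about the setup here), then count how many clear the threshold.

```python
import numpy as np

def max_cosine_similarity(dict_a, dict_b):
    """For each row of dict_a, the largest cosine similarity with any row
    of dict_b. Both dictionaries hold one feature per row."""
    a = dict_a / np.linalg.norm(dict_a, axis=1, keepdims=True)
    b = dict_b / np.linalg.norm(dict_b, axis=1, keepdims=True)
    return (a @ b.T).max(axis=1)

def count_shared_features(dict_a, dict_b, threshold=0.9):
    """Number of features in dict_a with MCS above the threshold."""
    return int((max_cosine_similarity(dict_a, dict_b) > threshold).sum())
```

Dropping `threshold` to 0.8 is a one-argument change, so both heuristics are cheap to report side by side.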
sweet, how many you getting on what activation dim?
pythia-70M, so 500
Looks like 1k for the 2k dict
Probably more if I went larger
And the features & logit lens makes a lot of sense.
fiiiiiinally managed to replicate @keen pivot's interpretations of high MCS neurons using the openai autointerp system
no idea what the bug was but yeah should be able to have more confident autointerp results coming along in the next few days and can actually help with the main bulk of the work
i'm a bit behind but i think the first thing to do is a comparison of high MCS / low MCS? most important is to do sparse coding vs neuron basis, but i think we need to adjust for the negative biases before we can call this a fair comparison
I should have a better dictionary to link you to on the bucket by tomorrow.
I think the comparison is fine as long as we make a note of the concerns when we report on it; conditional on the SAE having some reconstruction loss below a threshold I don't see why we couldn't compare them directly.
For sure you would be less able to view the directions discovered by the SAE as wholly meaningful if it turned out the bias mattered a lot but I'm thinking that even in that eventuality you could still get some mileage out of the representation.
oh lit yeah that'd be good
hmmm i think this would definitely be fair if we could show that performance degradation when replacing the activations with the reconstruction was fairly low, but i don't think that's likely any time soon, and i'm not sure what other threshold for sufficiently low recon loss we could use
am more excited by just comparing neuron basis with negative bias + relu applied, but will also compute the raw comparison
the features are so much richer now that it's fixed 😊
Feature 8, explanation='phrases and keywords related to the legal system, law and legislature.'
Feature 8, score=0.29
Feature 9, explanation=' underscore characters, especially in the context of code or programming syntax.'
Feature 9, score=0.26
Feature 10, explanation='terms related to astronomy and cosmology.'
Feature 10, score=0.41
this is without the aforementioned adjustments to the neuron basis so take with a pinch of salt
but:
though also, i'm not seeing a strong relationship btwn MCS and autointerp score
How many tests is that, just wondering what levels would give less noisy data
roughly 60, 60, 120. can scale up easy tho i think the difference is very clearly significant
neuron basis vs sparse code is like 5 sig diff in means
missed a few actually so new one is
fun but shouldn't get excited until we apply the negative bias, i expect impact to be large
hmmmm first indication is that adding the relu(activation + bias) makes the neuron_basis interp worse, good if true but very surprising to me
i suppose it could cut off legit activations, making the simulation less accurate
still, need to do a bit of checking before i feel confident
ok i think it's working as intended !
graph's a mess but it's looking really good! more hyped about sparse coding than i've been in ages!
How'd you do neuron-basis-bias?
just took random biases from the encoder and added them to the neuron output, then added a relu
all biases were negative so it makes some level of sense
i realised that the bias should be scaled by the norm of the encoder tho, might implement that tonight, otherwise tomorrow morn
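A minimal sketch of the neuron-basis-bias baseline described above, not the actual pipeline code. It assumes `neuron_acts` is a `(n_tokens, n_neurons)` array of raw neuron outputs and `encoder_bias` is the learned dictionary's encoder bias vector; the function name, the shapes, and sampling with replacement are all assumptions made for illustration.

```python
import numpy as np

def neuron_basis_with_bias(neuron_acts, encoder_bias, seed=0):
    """Apply relu(activation + bias) per neuron, with biases drawn at random
    from a learned encoder's bias vector (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    n_neurons = neuron_acts.shape[1]
    # One borrowed bias per neuron, sampled with replacement (an assumption).
    borrowed = rng.choice(encoder_bias, size=n_neurons, replace=True)
    return np.maximum(neuron_acts + borrowed, 0.0)

# Toy data: 4 tokens, 3 neurons, and an all-negative bias vector.
neuron_acts = np.array([[0.5, -1.0, 2.0],
                        [3.0, 0.2, -0.3],
                        [0.1, 1.5, 0.9],
                        [-2.0, 0.0, 4.0]])
encoder_bias = np.array([-0.5, -1.0, -1.5, -2.0])
adjusted = neuron_basis_with_bias(neuron_acts, encoder_bias)
```

With all-negative biases the adjusted activations can only shrink or be zeroed, which fits the worry that the adjustment may cut off legitimate activations.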
wonder if there's a more principled way. in the openai autointerp paper they get the gradient of the feature wrt the interp score, but i think that would then be unfair in the other direction, as well as sounding like a pain
you could also target a particular sparsity of activation
Would love to see the ICA and PCA baselines, but that looks crazy good!
@pallid current the autointerp develops hypotheses from max-activating examples, right? For me it's misleading because the max-activating are sometimes too specific, so I look at a uniform distribution.
Also, is this the MLP dictionary?
Do you have a list of goals or milestones for auto-interp?
Will run this morning
hmm i mostly see autointerp in the near term as being something which can help give us a better signal on the quality of our learned dictionaries, so i'm keen to join back into the main effort and bring in autointerp when we need signal for what's working
meeting lee in a sec so will try use that to write a proper plan of things to do but i plan to:
- do a bit more work to try and make sure that the comparison with e.g. neuron basis is a fair one
- write a quick post showing the results
- also, if the big sweep goes well, getting autointerp scores for sparse coding and a few baselines on lots of different layers would be worthy of a major writeup, possibly a paper imo, i think it would get a lot of people interested
will try to have the preliminary writeup soon and then talk to anthropic with that in hand
These are the things I thought of for auto-interp. I can try implementing them in a week or two (though looks like you've got PCA/ICA covered!):
- Improving prediction of input:
- Include ablated context one-token-at-a-time effect
- Predicting ablated output
- PCA & ICA (what do top components look like?)
- "Interesting" directions (like accents, medical-speak, SE-speak) & in general categorizing features
getting OOM errors on PCA 😦 not sure what's changed, might just be using more data now, will try both switching to @bitter turtle's batched version and just reducing the amount of data fed in.
Re: dead neurons, I'd also like to get around to doing the dead-neuron reinitialisation (and maybe low-MCS reinitialisation?) at some point
i can have a look at this if you're not about to do it immediately, though i think we should look carefully at some autointerp stuff before doing it because i didn't see strong evidence that low-MCS were significantly less interpretable. logan might have more evidence on this? i want to check if there's a stronger pattern for activation frequency or magnitude
I'm also kind of thinking of this as another method to test goodness of MCS as a metric
what's the measure that would tell us whether MCS is good? just whether reinitialization of low-MCS produces better dicts?
sure
definitely sketchy, I think we should also look into just accumulating tons of other possible (non-ai-supervised?) metrics
in current runs, are we still seeing recon loss and l1 loss rising through the run?
i feel like that should be a bigger part of our metrics than it is
^
suppose at large intervals we could run the perplexity check that we spoke about before, run some comps to see how well the model is able to function with different dicts
btw recently tried the method of applying biases from the encoder but where the bias is scaled down by the norm of the autoencoder, doesn't help the autointerp on the neuron basis at all
I haven't checked for the latest tied embedding, but if low-MCS feature is dead, then that matters, but not in the interesting way.
It does seem plausible that larger dicts may learn or retain different features than smaller ones. Larger ones tend to have slightly larger average features/tokens, so may overwrite different features even on the same dataset.
Regarding bias, the tied embedding has this for the encoder:
I looked at a few in the right cluster & they're typically dead (~0 non-zero activations). This is ~400/2k features. So maybe just originally init bias of encoder to uniform[-1, -3]
I will note, I've noticed before the one odd positive bias feature. It kind of looks like a feature regarding the input, but doesn't have a meaningful effect on the output nor a meaningful logit lens.
I was going to look at a data-centric viewpoint: given a datapoint, how many features activate? Do those features make sense? For example, if most of them make sense, but this positive bias one always activates, then that's a clue that there's funny business going on.
not convinced this is a good idea; want to wait until we do some runs with weight decay on the bias
Gotcha. Are you able to explain your intuition here?
well, ideally we want to find directions corresponding with useful features, and the point of the bias + relu is to act as a bit of a noise-reducer to cancel out the interference effect of other features, which shouldn't* be anything too significant
*we should check the variance of activations, but if they are even like less than 100 (or 1000?) or something (ballpark orders of magnitudes) then the bias is doing something more than basic denoising
this is suuper weird to me, i wouldn't have thought it was possible to be dead without fairly big negative activation, though i spose the space is so large that it's possible to have lots of halves or large segments that are totally dead, like i remember there being some results about the clustering of embedding vectors in smallish parts of the residual stream
agree that initializing negative biases doesnt sound good for similar reasons to aidan. like, you find a direction, and then make the bias negative to remove the noise, but if you just start with large negative bias you're quite likely to just find nothing at all
whats the diff btwn the two graphs?
y-axis. One is MCS & other is non-zero activations
Wouldn't this just mean that the weights feeding into the ~0 bias tend to sum to negative (I want to say negative, but residual stream activations are negative and positive)
Are we also running weight decay on the weights themselves? This is referring more to the anthropic information metric here.
ohhh sorry yeah of course there's a huge reduction in the space post nonlin
only applies to resid stream
which are the graphs from?
not unless something's changed recently
Residual stream, tied
Hmm I guess maybe we might want to, but I'm not sure how good their metric was or whether a norm on the matrix elements would improve it
So short term probably not is my take
i'm still interested in this on theoretical grounds because it seems like we should expect features to be composed of a small-ish number of neurons if it is using the non-lin in the way we expect. it got semi-shelved when we found that the features sparse coding was learning were weirdly less sparse than random vectors. but i understand that seems to have been an artefact of working with the nanoGPT model?
uh. how are you defining sparse here?
im just confused how they can be less sparse than random vectors, which should just be not sparse
a diversity metric on the vector, (simpson index in here https://en.wikipedia.org/wiki/Diversity_index)
maximally nonsparse by this metric would be every element being equal
comes out basically the same as entropy
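A rough sketch of the Simpson-index sparsity measure mentioned above. The chat does not say how the vector is turned into proportions, so using absolute values normalised to sum to 1 is an assumption, as is the function name.

```python
import numpy as np

def simpson_index(v):
    """Sum of squared proportions: 1.0 for a one-hot vector, 1/n when all
    entries have equal magnitude. Proportions are |v_i| / sum|v_j| (assumed)."""
    p = np.abs(v) / np.abs(v).sum()
    return float(np.sum(p ** 2))

n = 512
one_hot = np.zeros(n)
one_hot[0] = 1.0
uniform = np.ones(n)
gaussian = np.random.default_rng(0).standard_normal(n)

sparse_score = simpson_index(one_hot)    # maximally sparse
flat_score = simpson_index(uniform)      # maximally non-sparse
random_score = simpson_index(gaussian)   # random direction, close to flat
```

Under this definition a random Gaussian vector scores only a small constant factor above the 1/n floor, so a learned feature scoring below it would indeed be surprising.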
hmm. is that the definition we want for sparse here?
maybe not exactly but i think it would give a strong signal if the features were focussing on only a few vectors
what dyou think's missing there?
some sort of centering at zero, but yeah would give out strong signals for sure
don't think you can read too much into the sparse coding vs random vectors thing tho, it seems that random vectors should be close to totally unsparse by a better definition
I mean, I literally think 'normal sparsity but close-to-zero' seems reasonable. (or maybe if some coef contributes say 1/100*n_dims of a vectors norm or something arbitrary like that)
i think they are pretty much totally unsparse, so less sparse than that was shocking! either way tho i think its an outofdate result
current residual stream results. here the neuron basis seems to be good?? and matches the sparse-coded features.. but this time the top MCS features are significantly better, and both far outperform random. the green is the sparse coded features after 1 epoch which is somehow worse than random (low sample size)
Are you able to look at specific examples? There's 3 in the neuron basis that have a score of 0.5 (maybe 6 for .5 & 5.5 in total?)
i think these results are basically positive and show that we are getting something legitimate from our dictionaries, and we need to both scale up to more layers, and refine our learning process
Also, would -1 mean reversing its answer would give us 1? ie reverse_score = abs(10-oldscore) or something like that?
yeah like the explanation is perfectly anticorrelated with the activations
I liked your earlier statement of quickly being able to check if a new setting (ie bias or tied) actually helps improve.
yeah i can give you the ids and explanations, ids are [1, 7, 37, 66, 69, 90] and explanations are all pretty similar:
- numeric values, sequences, and lists.
- dates, particularly those written in the format of month and year
- numeric values and codes, including year dates and programming syntax.
- numerical data and product identifiers
- numerical values and sequences
- numeric values, including single digits, multi-digit numbers, and percentages.
Something else I've noticed in the residual stream is we're picking up the weird language model stuff, like the 8bit guy? Oh here: https://timdettmers.com/2022/08/17/llm-int8-and-emergent-features/
When I attended NAACL, I wanted to do a little test. I had two pitches for my LLM.int8() paper. One pitch is about how I use advanced quantization methods to achieve no performance degradation transformer inference at scale that makes large models more accessible. The other pitch talks about emergent outliers in transformers and how […]
For Pythia 1.4b, there's a HUGE activation of 1.5k. Typical activations are in the 1-80 range, w/ maybe 5 as the median max value
One hypothesis is we're picking up on what Tim called these outlier dimensions that models have, which coordinate in the 6Billion range & get really big. Smaller models have lots of these, but they're not coordinated, so you get several of them in the 60 activation range.
Oh, the top_mcs ones are mostly simple ones. Of the top 40, maybe 30 are single category words (e.g. his/her/their) or simple patterns (e.g. the token right after any letter "L"-containing tokens)
I'm curious how it does regarding the more style-based ones. Like the SE style or Chemistry, etc
blog is v good. surprising though as he seems to have clear evidence that those features emerge at a level quite a way above what we'd expect out of pythia70M. he says it's a perplexity based threshold but i cant imagine py70m being capable of matching OPT6.7B
A bit more subtle than that. He claims that these outlier dimensions exist in models as small as 160M, but they start aligning in the 6B region where they shoot up really huge
Which is exactly what I noticed: Pythia-70M has several "beginning & ending of sentence" in the range 60 activations. Pythia 1.4b has 1 of the same type that's 1500 & another that's in the 60 activation range.
Two things of due diligence for myself:
- Check that the direction mainly comes from one dimension (which I believe Tim is claiming)
- Check the amount of features of this type for both models, plus their activation range
ahh cheers yes i should have kept reading. your 1600 values is way off the charts still tho:
"The phase shift happens around 6.7B, where 100% of layers use the same dimension for outliers. At this point, a couple of things happen rapidly:
Outliers become very large quickly. They grow from about 15 for a 6B model to about 60 for a 13B model. OPT-66B has outliers of size around 95, which indicates this growth phase is temporary."
can imagine these things being very sensitive to training setups tho
Ya this first one (the 1500 one) is mainly two dimensions:
For comparison, here's top-MCS 100 which is the feature "the [noun]"
plotting the contribution of each dimension?
The weight of the decoder & it's tied
This is the other one which is activation 50 for only first words (as opposed to beginning & ending):
resid or mlp?
residual
Though looking back at the google sheet, It (ie the feature "beginning & ending of sentences" w/ high activations) is also the top-MCS feature for layer 2,3,4 MLP
Additionally there's a high-positive bias one w/ similar properties:
Notably there are several overlapping residual dimensions (e.g. 568, 516, 468, 1326, 934, etc) w/ the first high-MCS & high-activations one.
Notably notably, they seem like opposite of each other when you look at activating examples.
Wrt getting disentangled features for VAEs: https://arxiv.org/abs/1907.04809
The framework of variational autoencoders allows us to efficiently learn deep
latent-variable models, such that the model's marginal distribution over
observed variables fits the data. Often, we're interested in going a step
further, and want to approximate the true joint distribution over observed and
latent variables, including the true prior ...
Oh ya, the 1600 value part may be caused by our normalized decoder(?) Haven't thought about it much
The fuck
Yeah how are you measuring this
Measuring what? The 1.5k activation is the top-MCS feature activation for the learned dictionary for the 1.4b model in residual stream
more graph horror but finding that the sparse coding directions strongly beat ica and pca, as well as all others, on pythia layer1 mlp
principal components start interpretable but fade quickly
also started separating out the top vs random scoring. results are generally good, the score goes down, but for any where the top score is >0.2, the random score is also almost always solidly positive
average drop of maybe 0.1
Oh that's awesome! Glad you've been working on this!
Are the top PCA/ICA directions just these outlier dimensions?
Their predicted text is usually pretty easy too
Oh, would be good to rerun toy example stuff on tied embedding autoencoder.
I think I'm getting dead neurons by overtraining.
i'm not seeing a correlation between activation level and interp score 🤔, also not with interp and MCS
Note this is tied residual, and my previous correlations were about untied (and MLP sometimes)
Things are looking good on the MCS & MMCS front. For residual, we're getting MMCS of 0.8 for 4k (d_model=500), 50% above 0.9, and the histograms look really good.
I'm getting the same results for layers 1-4, will additionally do layers 5 (& 0 for the heck of it). Usually I'd need a lot more data to get almost these good results for untied embedding, and the features/token here is ~20 where previously it was ~100
20 also seems right because you have features like "a-letter words" & "after a-letter words" along w/ a bunch of low-activation "noise" which also tends to make sense on its own, but also might point towards a problem.
@bronze wraith , would you be able to think through the math of a tied embedding & what would be best to best reconstruct the original data?
As an example, suppose the weights are [1,2] & the latent feature is 10. Then the reconstructed feature is [10,20].
But if the input was [10,20] then the latent feature would be 50. I guess you could have a negative bias of -40 for the encoder for it to work on this example, but that's as far as I got & thought you'd be better at this kind of thing.
(additionally, we're normalizing the decoder weights because of the l1 penalty for typical dictionary learning, but that now means we're also normalizing the encoder weights)
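A small worked version of the [1,2] example above, as a sketch only. It models a single tied feature with no decoder bias; `tied_roundtrip` is a hypothetical helper, not part of the project code.

```python
import numpy as np

def tied_roundtrip(w, x, bias=0.0):
    """One-feature tied autoencoder: encode with w, decode with the same w."""
    c = max(float(w @ x) + bias, 0.0)   # latent = relu(w . x + b)
    return c, c * w                     # reconstruction = c * w

x = np.array([10.0, 20.0])
w_raw = np.array([1.0, 2.0])
w_unit = w_raw / np.linalg.norm(w_raw)

c_raw, xhat_raw = tied_roundtrip(w_raw, x)               # latent 50, overshoots
c_fix, xhat_fix = tied_roundtrip(w_raw, x, bias=-40.0)   # the -40 bias patch
c_unit, xhat_unit = tied_roundtrip(w_unit, x)            # unit-norm weights
```

In this toy case the -40 bias only repairs this one input, while unit-norm tied weights reconstruct any non-negative multiple of [1,2] exactly with zero bias. That suggests the normalisation may be helping rather than hurting here, at least for a single feature with no interference.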
I'm rather behind on this thread, can you explain what you mean by tied embeddings (or point me to an explanation)?
Also, I'll give it a shot, but I'm about to leave for a weekend trip, so I might not have much to say until next week!
The encoder and decoder are the same linear transformation, but transposed
Oh, no problem. Enjoy your trip!:)
Here is the relevant bits of the model definition.
import torch
import torch.nn as nn

class TiedSAE(nn.Module):  # placeholder name and signature: the pasted snippet omitted the class header
    def __init__(self, activation_size, n_dict_components):
        super().__init__()
        self.decoder = nn.Linear(n_dict_components, activation_size, bias=True)
        # Create a bias layer
        self.encoder_bias = nn.Parameter(torch.zeros(n_dict_components))
        # Encoder is a Sequential with the ReLU activation
        # No need to define a Linear layer for the encoder as its weights are tied with the decoder
        self.encoder = nn.Sequential(nn.ReLU())

    def forward(self, x):
        # decoder.weight has shape (activation_size, n_dict_components)
        c = self.encoder(x @ self.decoder.weight + self.encoder_bias)
        # Apply unit norm constraint to the decoder weights
        self.decoder.weight.data = nn.functional.normalize(self.decoder.weight.data, dim=0)
        # Decoding step as before
        x_hat = self.decoder(c)
        return x_hat, c
One thing I think I can say about tied embeddings: if your encoding matrix is M=[I_n, -I_n], I think you get a perfect reconstruction (i.e. M^T ReLU(Mx)=x for all input vectors x).
This works because Mx is [x, -x], whose negative terms get zero'd out by the ReLU, and then when it is multiplied by M^T you get back the original.
This is a solution to "given that our embeddings will be tied, what dictionary features could we learn to get a good reconstruction", but doesn't account for a) the L1 penalty, or b) the noisiness of the training process. And this solution isn't unique: you can get a perfect reconstruction with tied embeddings M=[U, -U] for any unitary matrix U (https://en.wikipedia.org/wiki/Unitary_matrix).
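A quick numerical check of the perfect-reconstruction claim above, as a sketch. It treats M as the 2n x n matrix with I_n stacked on top of -I_n, restricts to real inputs (so "unitary" means orthogonal), and uses no bias; the helper name is made up for illustration.

```python
import numpy as np

def tied_relu_reconstruct(M, x):
    """Tied autoencoder with no bias: x_hat = M^T relu(M x)."""
    return M.T @ np.maximum(M @ x, 0.0)

rng = np.random.default_rng(0)
n = 5
x = rng.standard_normal(n)

I = np.eye(n)
M_identity = np.vstack([I, -I])                     # shape (2n, n)

# Random orthogonal U from a QR decomposition.
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
M_rotated = np.vstack([U, -U])

err_identity = np.abs(tied_relu_reconstruct(M_identity, x) - x).max()
err_rotated = np.abs(tied_relu_reconstruct(M_rotated, x) - x).max()
```

Both errors come out at floating-point precision, because relu(z) - relu(-z) = z for any z, so the ReLU pair recovers Ux and U^T undoes the rotation.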
Those are some examples I thought through before, but Logan I'm sure you had some other questions in mind too. What other angles do you want to think through this from?
Oh mostly if there's a better architecture to use & to gain clearer thinking about this in general. Like "we for sure need biases" or "normalizing the encoder is strictly worse than ..."
I'll think about this over the weekend, and let you know if I come up with anything!
Great, thanks!:)
Also, no problem if you just enjoy your trip!
@keen pivot generally we should note that the L1 penalty basically solves this rotation-invariance if the underlying latents are sparse (rotation doesn't preserve L1 norms, and L1 norms are minimised when the rotation produces as sparse data as possible). Not certain how interference affects this when we have an overcomplete basis, but for e.g. binary features with some constant maximum interference between distinct features you get minimum necessary bias for total denoising when the rotation is aligned to the underlying overcomplete basis (this is basically the reason that I think L2 norm on the bias is a good idea; also in this case - perfect denoising - you get perfect reconstruction with the tied weights)
Of course, real life is More Complicated than this, but I guess this kind of explains the intuition for the guess of 'hmm tying weights seems like a goodish idea'
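To illustrate the point that rotation does not preserve L1 norms, here is a tiny 2D sketch (a toy case only, with a single active latent and no interference):

```python
import numpy as np

def rotate(v, theta):
    """Rotate a 2D vector by theta radians."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]]) @ v

sparse_latent = np.array([1.0, 0.0])          # only one feature active
angles = np.linspace(0.0, np.pi / 2, 91)      # 0 to 90 degrees in 1-degree steps

l1_norms = np.array([np.abs(rotate(sparse_latent, t)).sum() for t in angles])
l2_norms = np.array([np.linalg.norm(rotate(sparse_latent, t)) for t in angles])

best_angle = angles[np.argmin(l1_norms)]      # L1 is smallest when axis-aligned
worst_l1 = l1_norms.max()                     # sqrt(2), reached at 45 degrees
```

The L2 norm stays at 1 for every angle, while the L1 norm is |cos| + |sin|, which is minimised only when the rotation lines the code up with the sparse latent.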
really like and agree on the first bit but why l2 norm on the bias? is the idea that we should try to remove only as much of the interference as is necessary, but no more? if so agree but then i think the algo should already do that? (though i suppose empirically seems the biases are just creeping more n more negative so yeah maybe necessary)
also is there a theoretical reason why in the tied autoencoder we have a bias on the decoder? i don't have strong opinions either way but i remember @worldly hinge being quite anti it, cant remember why
The only reason is to allow the decoder to still reconstruct statistics of the data w/o dedicating weights to it. I haven't tried w/o it, but that's pretty easy to try
i promise i'll fix these graphs soon but......
separating out top and random scoring, sparse coding outperforms all baselines for both top and random on the residual stream!!
Partially but also that we can obtain the minimum necessary bias to remove all noise by aligning the dictionary to the latents, so the L2 also encourages disentanglement. I think. Hopefully.
Would be interested to know. I think we should be adding the bias back on , surprised he is anti this. I am blindly guessing tho
yeah adding the bias back on makes sense to me too
@bitter turtle what's the status of 'the big run'? are we ready to try out multiple layers * multiple dictionary learning approaches
Can go whenever code works I think
Should currently be set up to save every chunk
@pallid current currently set up to just test different parameters (i.e. do a big grid search over L1, dict size, L2 reg)
Could set up to test tied weights Vs not, etc, wasn't sure what would be useful ig
splitting out only top scores also makes the value of sparse coding in MLP more clear:
Sorry don't understand the graph: top scores by what metric? Which label corresponds to those scores?
right sorry i never explained this at all. so what the autointerp does is to take 5 out of the 20 fragments of 64 tokens which have the highest average feature activation, from a pool of 50000 fragments. these are the 'top' fragments. it uses those to generate a hypothesis for what the feature 'is'. then, it takes another 5 random fragments. it uses the explanation to generate a guess for what the activations will be across both the top and random fragments. it then scores the explanation based on the correlation between predicted and actual activations.
so the score that i've been reporting previously is the correlation across those 10 fragments, called 'top and random' scoring.
but we found that this was a bit misleading for some of the residual stream neurons, because the explanation was able to distinguish clearly between the top fragments and the random fragments at the fragment level - ie high in the top fragments, low in the randoms, but it couldn't predict any of the variation within the fragments
so instead i'm now showing scores for correlation within the top fragments and within the random fragments separately
and on both of these measures, the sparse-coded features come out very clearly ahead, for both residual stream and mlp
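A toy sketch of why 'top and random' scoring can mislead, using synthetic numbers rather than real autointerp output. The simulated explanation here only knows whether a fragment is a top or a random one, and its token-level predictions are otherwise noise; all variable names and magnitudes are made up.

```python
import numpy as np

def correlation(pred, true):
    """Pearson correlation between predicted and actual activations."""
    return float(np.corrcoef(pred, true)[0, 1])

rng = np.random.default_rng(0)
n_tokens = 5 * 64   # 5 fragments of 64 tokens in each group

# Actual activations: high on top fragments, near zero on random ones.
true_top = 5.0 + rng.random(n_tokens)
true_rand = 0.2 * rng.random(n_tokens)

# Predictions get the fragment-level picture right but nothing within it.
pred_top = 8.0 + rng.random(n_tokens)
pred_rand = 1.0 + rng.random(n_tokens)

combined_score = correlation(np.concatenate([pred_top, pred_rand]),
                             np.concatenate([true_top, true_rand]))
top_score = correlation(pred_top, true_top)
random_score = correlation(pred_rand, true_rand)
```

The combined score is close to 1 while both within-group scores sit near 0, which is the failure mode that motivated reporting the top and random scores separately.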
ok, but is the code working? do you want to push to main and i can help debug if it's needed
yeah code works afaict; trained a few dicts for a very small amount of time on a small amount of data and loss did expected things, and it was faster I think
Can push to main if you want
On mobile ATM tho
cool no worries will have a look from your branch
and yeah i'm particularly keen to test different versions of the architecture, like i think they're going to be the most interesting kind of results, given that we're definitely seeing some level of signal
Wdym by this ('distinguish' particularly)?
Cool, I can start a grid search for tied weights and not and bias reg and not tomorrow?
sorry not search
big grid training thingy
oh right gotcha
yeah mb badly worded
'I can try a bunch of different hyperparams for tied and not tomorrow'
yeah that sounds great
what was the conclusion wrt reconstruction ICA in the end?
no idea haven't tried it never got round to it
can also try that
Not particularly expecting it to be much different from tied weights
yeah is there actually any difference (except maybe the smooth_l1 loss)?
no bias
no bias would be kinda interesting from an autointerp point of view (though could just remove the biases manually) just because it means that the found directions are on a totally even footing with the baselines
yeah don't think it's particularly interesting. More interested generally looking at explicitly nonlinear things, or better approaches at sparse coding (like FISTA or OMP or something equivalent that is nicely parallelised) and still using linear dictionaries
yeah agreed
have you looked at them at all?
slightly but just to the point of 'aaargh this is a nightmare to get to run fast'
i think they're good things to do at some point but unless they're super easy i'm leaning towards just running a really good sweep + auto-interp + additional analysis of the resulting data and aiming to publish based on that
Yeah that sounds good I agree with that plan
is there anything i can help with on the infra side today?
Could implement tied weight stuff, better logging, saving to long term storage, that kind of thing? Also forgot but I am hosting a friend's birthday tomorrow I probably won't be able to work on it until sun
It's a bit cursed, sorry about that, but it was the least hacky way I could think to implement proper ensembling
sure its better than whatever i'd have hacked up. tho i havent looked yet 😅
yeah basically short explanation is
- torch optimisers are not vectorisable
- defaulted to not using autograd at all and instead used torchopt for stateless + vectorisable optimisers
i.e. basically Jax in pytorch
posted my results using autointerpretation on LW here: https://www.lesswrong.com/posts/ursraZGcpfMjCXtnn/autointerp-finds-sparse-coding-beats-alternatives
time to scale up ! 🦾
Hey I am fairly new to interp stuff so sorry for asking but could this be dumbed down a bit: 'We give a feature an interpretability score by first generating a natural language explanation for the feature, which is expected to explain how strongly a feature will be active in a certain context, for example ‘the feature activates on legal terminology’. Then, we give this explanation to an LLM and ask it to predict the feature for hundreds of different contexts, so if the tokens are [‘the’ ‘lawyer’ ‘went’ ‘to’ ‘the’ ‘court’] the predicted activations might be [0, 10, 0, 0, 8]. The score is defined as the correlation between the true and predicted activations.'
I guess the main confusion I have is how you end up measuring the true activations of a feature and the activations predicted by the LLM. Like what do these look like and how can you compare them? Maybe you've explained this already in the LW post and I've missed it.
Ya, I'm trying to transition entirely to baukit for both training and interventions, which would be worth it for scaling to 66B parameters.
I can also handle the perplexity check & maybe work w/ someone to do the activation adding/subtracting part? There's team shard folks & nina (who also does SERI, maybe also turntrout's mentee?), who can work on activation engineering using our found directions. I'm reaching out to them now.
oh, yeah i know nina, i liked the activation addition stuff she was doing
not, like, well, but she's probably just down the hall lol
I would look at the previous posts for dictionary learning & openAI's blog post on it to more fully understand it. It took me a few weeks to actually grok this project when I first got into it.
yeah agree with logan that the best thing to do is to read the resources i linked in the sparse coding summary, i think people have found robert huben's explainer (https://www.lesswrong.com/posts/a4oPE4xJqkYSz6jMS/explaining-taking-features-out-of-superposition-with-sparse) the most helpful. the basic picture is that you learn a simple autoencoder where the activations in latent space of the autoencoder are the feature activation levels
Is it true that the EV of the L1 norms is minimized when the learned features are aligned with the real features? I know it seems to work empirically, but in the one example I worked it doesn't pan out theoretically. The example: the real features are the standard basis vectors in R^2, so your data is sampled uniformly from the unit square. Take two choices of learned dictionaries: the canonical 2-element dictionary {(0,1), (1,0)}, and a 3-element dictionary which is at 45 degree angles to the canonical one {(1,1), (-1,1), (1,-1)}. Both of these can learn a perfect reconstruction of the data, and it turns out that the 3-element dictionary actually has a lower L1 penalty term (by ~8%). I'm legitimately confused about why a rotated basis is optimal here, but the experiments seem to find the canonical basis. It might be some combination of 1) learning canonical features requires fewer features, 2) even if rotated beats canonical when constrained by perfect reconstruction, if you trade off with reconstruction error canonical is better, 3) it's sensitive to sampling space, and something in the many correlated dimensions/activations matters, 4) canonical features are easier to learn for some reason, and the training gets stuck in this configuration which is "suboptimal" from a loss function perspective but optimal for our actual goals
Had some thoughts about tied embeddings: if there is no bias term, they are piecewise-positive transformations. Meaning that when you partition the domain by the hyperplanes where the ReLU terms switch on/off, on each subset of the domain they are given by x-> (M^T)Mx, and the M matrix in each section will be the tied embedding matrix with rows zeroed out depending on which ReLU terms are active. Positive transformations are nice because they have an orthonormal basis with respect to which they are a diagonal scaling. However, I think the orthonormal bases of different pieces don't have to agree, nor do the bases have to align with the planes which separate the pieces. Finally, you can roll the bias term into the learned tied embedding in the usual way (by replacing the vector x=(x_1, ..., x_n) with (x_1, ..., x_n, 1)), and when you tie that into the matrix, you need to not score the L1 activation or the reconstruction loss of the last component, but otherwise you can store everything in a big tied matrix (sorry if thats unclear).
How does this look under spike and slab distributed basis vectors? Also, I guess what we are trying to find with sparse coding is the most sparse factorisation of the activation distribution, under the assumption that that is more interpretable. Interesting that the algo learns (0, 1) (1, 0)
how does this work? this seems to me to be like the claim that you can do a shorter trip by taking 2 sides of a triangle than one
i think this might be because you need to rescale (1,1), (1,-1) and (-1, 1) to have unit norm
Every representation has an advantage in encoding things closer to its basis vectors. For instance, with the 3-element dictionary, you have a lower L1 cost to represent (1,1), since you can take the (1,1) vector directly there (after normalizing that vector, you end up with a L1 cost of sqrt(2), in contrast to an L1 cost of 2 if you go along the canonical basis vectors). The canonical dictionary elements are more cost-effective near the axes, whereas the 3-element dictionary is more cost-effective near the line y=x. And given this sampling space (uniform across the unit square), the 3-element dictionary just barely squeaks it out.
I was already including that in the calculation (I gave the unnormalized vectors for ease of writing, but I should have said that). To represent (x,y) with the standard vectors, your L1 cost is x+y, but to represent it with the 3-element dictionary the L1 cost is max(x,y)*sqrt(2). Here's a spreadsheet showing the 3-element dictionary having lower average L1 cost (sampling points from the unit square): https://docs.google.com/spreadsheets/d/1rsEbKy_16qwOGguw0Vbxf60Nogkmqmyw6961w4ytbfE/edit#gid=0
Uniform Frequency
Datapoint #,X,Y,L1 cost to reconstruct (canonical features),L1 cost to reconstruct (rotated features),Relative Frequency,Average L1 cost (canonical),Average L1 cost (rotated)
0,0,0,0,0,1,1,0.9642365198
1,0.1,0,0.1,0.1414213562,1
2,0.2,0,0.2,0.2828427125,1
3,0.3,0,0.3,0.42426406...
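For anyone who wants to check this without the sheet, here is a small Monte Carlo sketch. It takes the two cost formulas from the message above as given (x+y for the canonical vectors, max(x,y)*sqrt(2) for the normalized 3-element dictionary); the function names and sample count are arbitrary choices.

```python
import math
import random

def canonical_cost(x, y):
    """L1 cost of representing (x, y) with (1,0) and (0,1)."""
    return abs(x) + abs(y)

def rotated_cost(x, y):
    """L1 cost with the unit-norm 3-element dictionary, as stated in the chat."""
    return math.sqrt(2) * max(x, y)

def average_costs(n_samples=200_000, seed=0):
    """Average both costs over points sampled uniformly from the unit square."""
    rng = random.Random(seed)
    total_canonical = 0.0
    total_rotated = 0.0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        total_canonical += canonical_cost(x, y)
        total_rotated += rotated_cost(x, y)
    return total_canonical / n_samples, total_rotated / n_samples

canonical_avg, rotated_avg = average_costs()
print(f"canonical: {canonical_avg:.4f}  rotated: {rotated_avg:.4f}")
```

Under continuous uniform sampling the exact averages are 1 for the canonical vectors and 2*sqrt(2)/3 (about 0.943) for the 3-element dictionary, since E[max(x,y)] = 2/3 on the unit square.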
What do you mean by spike and slab distributed basis vectors?
like p probability of being 0 otherwise uniform
I've set off a couple runs testing all combinations of the following parameters:
- tied vs not tied
- l1 coef \in [0.0031622776601683794, 0.01, 0.03162277660168379, 0.1]
- bias l2 decay \in [0.0, 0.05, 0.1]
- dict ratio \in [2, 4, 8]
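For reference, a sketch of how the grid above enumerates. The dictionary keys are hypothetical names for illustration and are not the argument names used in the actual sweep script; only the values come from the list.

```python
from itertools import product

tied_options = [True, False]
# The four l1 coefficients are half-decade steps from 10^-2.5 to 10^-1.
l1_coefs = [10 ** exponent for exponent in (-2.5, -2.0, -1.5, -1.0)]
bias_l2_decays = [0.0, 0.05, 0.1]
dict_ratios = [2, 4, 8]

# One config per combination; key names are placeholders.
configs = [
    {"tied": tied, "l1_coef": l1, "bias_l2_decay": decay, "dict_ratio": ratio}
    for tied, l1, decay, ratio in product(
        tied_options, l1_coefs, bias_l2_decays, dict_ratios
    )
]

print(len(configs))  # 2 * 4 * 3 * 3 = 72 runs
```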
should be good in a couple hours or so (guessing?)?
https://wandb.ai/sparse_coding/sparse coding/runs/ybxhr7hf <- wandb for the run, a bit jank because im multiprocessing it improperly and not syncing, basically the indexes are ~meaningless
ah gotcha sorry. with the way you're generating the data, there's nothing particularly special or sparse about the basis dimensions (0, 1) or (1, 0) though so it makes sense to me that you wouldn't expect them to minimize l1 loss. if that were true we'd just learn the identity function at every point
if there was a high likelihood of only having X or Y active (which would happen if they were sparse) then you'd recover (0,1) (1,0) i think
Ah, sure. I added a third tab to that sheet where the frequency is increased along the axes, and the canonical basis performs better there. I think that's what aidan meant as well.
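A sketch of the sparse case, assuming the spike-and-slab sampling described earlier (each coordinate is 0 with probability p, otherwise uniform on [0,1]) and the same two cost formulas as before. This is not the weighting used in the sheet's third tab, just one way to make the axes more likely.

```python
import math
import random

def sample_coord(rng, p_zero):
    """Spike-and-slab coordinate: 0 with probability p_zero, else uniform on [0, 1]."""
    return 0.0 if rng.random() < p_zero else rng.random()

def average_costs(p_zero, n_samples=200_000, seed=0):
    """Average L1 cost for the canonical and the 3-element dictionary."""
    rng = random.Random(seed)
    total_canonical = 0.0
    total_rotated = 0.0
    for _ in range(n_samples):
        x = sample_coord(rng, p_zero)
        y = sample_coord(rng, p_zero)
        total_canonical += x + y
        total_rotated += math.sqrt(2) * max(x, y)
    return total_canonical / n_samples, total_rotated / n_samples

for p_zero in (0.0, 0.1, 0.2, 0.5):
    canonical_avg, rotated_avg = average_costs(p_zero)
    winner = "canonical" if canonical_avg < rotated_avg else "rotated"
    print(f"p={p_zero}: canonical={canonical_avg:.4f} rotated={rotated_avg:.4f} -> {winner}")
```

In closed form the averages under this model are (1-p) for the canonical vectors and sqrt(2)*(1-p)*(2+p)/3 for the 3-element dictionary, so the canonical basis wins once p exceeds 3/sqrt(2) - 2, roughly 0.12.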
Residual stream?
nice this sounds super good, let me know when and where and i'll look at auto interp on some
if it's only a couple of hours, can you run the same sweep across all layers and mlp vs residual?
also i worry those l1 coefs might be a bit high, at least without any reinitialization
I would nail down the l1 first, which is typically the same across layers (though a unique l1 for mlp and another for residual)
Oh, I think the L1 is different for tied & untied. I agree that these l1 values are ~~too~~ a bit high.
For
MLP tied: 6e-4 (1e-4 is identity)
Residual tied: 3e-3 (which Aidan is indeed checking, though I'm unsure how the bias will interact)
MLP
ah, ok!
I'll let this run conclude then do one looking for L1 values
also my time guess was off by a factor of 2, probably can rewrite for more speed somewhere
I think I'm bottlenecking on one GPU looking into it
It looks pretty rad so far. Thanks for working on it!
as in it's not parallelizing across the 8 at all atm? or just it's over using 1
it's overusing 1 and it has to wait periodically to sync kinda
doing some experiments to see how ability to use superposition varies with residual dimension
lol i was gonna follow that message up with a graph but then i realised it had a big flaw 😅, hopefully will have something tomorrow
Lovely! Thanks
I guess I'll wait until you have results, but interested in elaborations!
How hard would it be to integrate the MCS plots? Or at least, I'm not seeing them in wandb
l1 and reconstruction loss seem to be monotonically decreasing which is better than i was often seeing before
bit odd that we're seeing periodic spikes, i guess at the beginning of a chunk..
@bitter turtle is there any way of taking the saved dictionaries from your run yesterday back into the original class?
Getting some pretty bad perplexities for replacing the model w/ the dictionary reconstruction for pythia-70m-tied layer 3:
Dict Size | Perplexity | Reconstruction Loss
512 | 180.98 | 0.0964
1024 | 152.60 | 0.0870
2048 | 127.85 | 0.0804
4096 | 111.12 | 0.0763
8192 | 104.80 | 0.0753
full model | 25.11 | 0.000
And the perplexity code is pretty simple, so I think I coded it right (& pythia 410m got ~11 perplexity, which makes directional sense)
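For readers following along, a minimal sketch of the standard perplexity definition (exp of the mean per-token negative log-likelihood). This is not the code that produced the numbers above; the function name and the toy input are assumptions.

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood).

    token_log_probs: natural-log probabilities the model assigned to each
    correct next token.
    """
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

# Toy check: a model that always gives the right token probability 1/4.
toy_ppl = perplexity([math.log(0.25)] * 10)

# Perplexities convert back to per-token loss via the natural log, which makes
# the gap between two rows of the table easier to read.
loss_gap = math.log(104.80) - math.log(25.11)

print(toy_ppl, loss_gap)
```

On that scale the 8192 dictionary is roughly 1.4 nats per token worse than the full model.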
ok that's interesting. not super surprising though it would have been great if we didnt see this
which dicts are you using?
Oh, sorry I sent it then (jk jk)
It should be specified above(?)
Oh, it's actually layer 3, I can link the aws location. Any other identifying info?
just seeing if they were the ones you trained or aidan's recent ones
I updated my message to include the equivalent reconstruction loss, so we can compare w/ Aidan's runs (though his is MLP, and this is residual stream)
ok thats v interesting because just eyeballing some of aidans runs, we're seeing recon loss about an OOM lower
It is MLP, but I'm unsure what else would be different besides the bias l2 decay
i don't know what would be different either (tho i think its possible bias l2 is actually v important for preventing dead feats) but the loss curves look waay more stable, i remember seeing that they seemed to plateau pretty hard and even start rising
Probably the untied parts? Nope, just checked & it's also low
It might just be the MLP. I expect reconstruction to hurt perplexity less in an MLP layer too, but I would definitely like this same run for residual stream!
Yo @bitter turtle , could you set off a run for residual stream? I can also look through your code and try if you're not able to.
we should really learn how to run the train loop haha
I think I've got it. I am copying over the output from the last run, cause I think it'll overwrite it
Aidan ran "big_sweep.py", and I think I just need to set "use_residual" to True
Can do tomorrow I need to sleep, haven't worked on anything today sorry
Nooo problem. I'll give it a try for an hour and give up if not.
@pallid current , I ran something at https://wandb.ai/sparse_coding/sparse coding/runs/h4zcf2jq
ok so reconstruction loss already seems to be like half of the ones you quoted above??
Yep, and there are different l1 values (from 0.003 to 0.1), which means that the higher l1 value should suck in reconstruction, but it doesn't?
I haven't figured out the naming scheme for l1 values yet
Slight hitch: I didn't delete the activations_data folder, so it re-used that and used the mlp data & maybe even the mlp-sized model(?) Re-running
It ran for 7 times (out of 30 from the MLP) Then I got:
[W CudaIPCTypes.cpp:15] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
The reconstruction losses still look good, though I'm still confused about what the l1 values correspond to.
hey, this work looks unbelievably high value. I look forward to grokking it and seeing if there's any way i can contribute, even if it's just by getting more people looking at it
i had a call with alex turner. He's gonna look more into dictionary learning too
this isn't the actual error message, this is just saying that it crashed unexpectedly. If you scroll up a bunch you might see a better error?
lol, well it's gone now. We'll figure it out eventually!
I'm off for the night as well. I am having difficulty wrangling the learned dictionaries (since it's separated by encoder, decoder, encoder-bias).
I could code up something to handle it tomorrow, but if it's not too hard to save the whole autoencoder, that'd be preferred for me.
Glad to see you here! Feel free to ask any questions here:)
instead of going through the hassle of matching encoder to cfg, this eve i ran an experiment suggested by @worldly hinge where i clustered the directions and then ran autointerp on the features in that cluster. ran on 10 clusters, 3 clusters just never activated, 1 seemed to consistently give v similar explanations ('hyphens') and the others were at least somewhat varied
most fun one was a cluster of number features ! :
421: 'instances of the number one.'
559: "the digit '3' in the text."
1112: 'instances of the number 5.'
1459: 'numbers, particularly two-digit numbers where the second digit is a high number.'
1503: 'numbers, particularly single digit numbers.'
1744: ' numerical values, especially those used in a counting or sequence context.'
1954: 'the number 4 in the text.'
full results:
@keen pivot for the similar looking cluster, the hyphen one and also the one that centres around 'was'/'are'/'is' where there are some near identical autoexplanations i'd be interested to see if you find that they have distinct meanings
Oh that's actually a sick idea
@pallid current did you manage to run it?
@keen pivot I'll rewrite the saving code today, so that it saves in a more intuitive format!
What's like the size of these clusters (in max cosine dim and number of directions or some other metric you think will be better)
Or maybe 'what part of the unit sphere does each cluster enclose' or something
Just trying to get a feel for how spaced out they are
Which dictionary is this? If you can link the path on the instance/node (what’s the term?), I can load from that.
right got better saving (it's seemingly very slow now for some reason?) should I run a residual stream run @keen pivot? What parameter settings?
I do want a residual stream one. Thanks! Same parameter settings, but l1 shifted lower by 1 (which should translate to -5 instead of -4, if that makes sense?)
oh ok!
I think Hoagy wants an MLP one
It’s okay if you have favorites
pahaha
Hmm I only get 8 chunks, I think we just might be reaching max data for pile10k?
@keen pivot can you remember how many chunks you normally get for residual stream?
We could do the pile’s first shard
I can’t!
also true
current format is each model is saved as a dictionary {"params": {"encoder": ..., "decoder": ..., ...}, "buffers": {...}}, and you can check hyperparameters in a JSON file hyperparams.json saved with the models @keen pivot
accidentally goofed, 10m
a problem (maybe): I’ve noticed a discrepancy between transformer lens and transformers library precision of Pythia model residual stream, and it's unclear how much it’d affect our results: https://github.com/neelnanda-io/TransformerLens/issues/346#issuecomment-1641576171
Lee pointed me here and I have a bit more time free to contribute. I plan to read up on your posts so far, but as I understand it, the high-level idea is to take a trained LLM which has features which may or may not be in superposition (have you thought of how you'd measure this?), and then train sparse autoencoders to recover features. Is that the gist of it?
If you're still thinking of trying a sparse VAE I'm happy to contribute there, too!
- if there's currently any problems/challenges you have I'd love to hear about them
@keen pivot should be in /mnt/ssd-cluster/resid_layer_2_19_07
Yep! I'm unsure how to measure the amount of superposition, but there's definitely feature packing (which may be superposition).
One argument: there's 50k vocab items which are then embedded into 500 dimensional space; probably feature packing there.
Another: optimality - more optimal to pack features as long as you have sparse features.
Best argument: wes found features in superposition in his paper (neurons in a haystack)
Yeah couple things about this (measuring superpositionness directly):
- we don't really know what variances to expect for activation tails, and the variances for different features vary WILDLY
- someone mentioned finding directions with high kurtosis (as a proxy for how spike-and-slabby the distributions of activations are)
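A small sketch of the kurtosis idea on synthetic 1-D samples. The spike-and-slab parameters and sample sizes are arbitrary assumptions; with real data the samples would be activations projected onto a candidate direction.

```python
import numpy as np

def excess_kurtosis(samples):
    """Fourth standardized moment minus 3 (so a Gaussian scores about 0)."""
    samples = np.asarray(samples, dtype=float)
    centered = samples - samples.mean()
    variance = np.mean(centered ** 2)
    return np.mean(centered ** 4) / variance ** 2 - 3.0

rng = np.random.default_rng(0)
n = 100_000

# Dense baseline: Gaussian activations along a direction.
gaussian = rng.normal(size=n)

# Spike-and-slab: 0 with probability p_zero, otherwise uniform on [0, 1].
p_zero = 0.9
spike_and_slab = np.where(rng.random(n) < p_zero, 0.0, rng.random(n))

gaussian_kurtosis = excess_kurtosis(gaussian)
sparse_kurtosis = excess_kurtosis(spike_and_slab)
print(f"gaussian: {gaussian_kurtosis:.2f}  spike-and-slab: {sparse_kurtosis:.2f}")
```

The sparse samples score much higher than the Gaussian ones, which is why kurtosis gets suggested as a proxy, though as noted it says nothing about what variance to expect in the tails.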
honestly don't see much point measuring it 'directly' since that would probably amount to 'see how well something shown to fit on superposed data fits on activation distributions' which is just training SAEs
but then again rough thoughts would be interested to hear proposals
I think the idea I had behind measuring it is more for scientific purposes - being able to quantitatively measure the degree of superposition would let you say a particular technique for sure takes something out of superposition, or reduces superposition by X quanta in this LLM. Then you'd also be able to look at, for example, how autointerp methods relate to superposition, and go on to using it in your model selection criteria
i.e. we've trained these two models but this one has higher superposition and we can't understand it as well, let's not deploy
(also spitballing here)
I mean, afaict 'superposition' isn't a sufficiently rigorous definition to be used like that; what we are really testing here isn't 'does a model do TMOS-style superposition' but rather 'is the abstraction detailed in TMOS a useful one'.
agreed
Like you could measure 'max spike-and-slabbiness of the distribution of activations over rotations' but that feels unfounded and too-many-holes-having.
holes being degenerate cases
dunno why I said holes
anyway my take is that the metric for deployness or whatever will be something like 'how accurately can we abstractly describe functionality' or something which isn't neccesarily directly superposition-related. Like, I feel like if we can measure superposition there's a good chance that that mesurement method also has an unsuperpositonification mechanism as an immediate corollory
spelling
I think this is the most compelling real-world example of TMOS-style superposition I've seen so far which I'm sure you've seen too (if you have more please send!) https://distill.pub/2020/circuits/zoom-in/#claim-2-superposition. I think it's important because it relates inductive biases in model representations to downstream tasks, which is the real measure
(and I agree, I think superposition is a useful concept but doesn't directly relate to an atomic, measurable phenomenon)
I'm not sure how strongly to take that evidence. I vaguely remember hearing somewhere that there is some nuance in how they generate those images. Also, that seems to mostly be entanglement (as in, viewing the activations in the 'wrong' basis) which is something you also see in e.g. VAEs
Point taken! Though wdym with entanglement?
like, rotation as opposed to 'compression'
In general when I'm thinking about superposition I'm thinking about it more as a useful lens to view activations rather than something stronger and natural-abstraction-hypothesis-assuming
so like, stronger than Nora's views and less strong than Olah's I guess.
Inductive bias analysis thing is an interesting point though, let me think about that for a sec
Ok, I'm not sure how you'd measure that without also measuring superposition.
Could you expand more on what you envision?
One grounding of amount of superposition is how many features our dictionary learns w/ eps-diff in perplexity.
perplexity? Like perplexity under intervention with reconstructed features?
Not sure how that would work; surely if there is some subspace our SAEs fail to describe then there is a minimum perplexity gain
The hyperparams look great, thanks!
Phew
Is there a way to easily download the model given the .pt file?
it's just saved as a dict
(a lot of papers in representation learning talk about this idea, https://arxiv.org/abs/1811.12359 is a good one, though the slant in representation learning is a bit different, imo it seems a bit nicer if all we want to do is understand the representations, rather than learn useful representations like world models)
I have torch.load()-ed it, but was hoping for a one-liner for
autoencoder = ...
atm, I can define an autoencoder and assign each relevant part to the part in the dictionary, which is doable, but I may be missing the intended way to load it.
Ah, no, not yet sorry
I was referring there to the authors hypothesising that models find it useful to store some less-important features in superposition rather than dedicating e.g. an axis-aligned dimension to it
I think people generally call non-axis-alignment 'entanglement' and the compression thing 'superposition', and also where are you referring to here? The distill.pub paper?
ye
No problem! Just making sure I wasn't missing anything
I'm used to representation learning-lingo and there's lots of terms being used in the alignment space w.r.t disentanglement/representations that I'm not fully used to haha
Yep no idea same tbh