#Sparse Coding


pallid current
#

oh duh lol soz

keen pivot
#

Where’d you hear this?

pallid current
#

ml engineer said it's a common thing, tryna find a paper or something that shows it but no luck

torn star
#

It's because the model has already updated its weights based on this data. So you should expect it to be better.

pallid current
bitter turtle
#

couple things,
is Z(i) supposed to be {x \in domain : ReLU(Wx + b) is 0 in the ith component}?
If you start with two features with Z(i)=Z(j) how do you get that the Z(i) are still disjoint from the domain?

bronze wraith
# bitter turtle couple things, is Z(i) supposed to be `{x \in domain : ReLU(Wx + b) is 0 in the ...

re 1: Z(i) is the hyperplane in which the ith relu switches over from activating to not activating, and it's useful to talk about that in general, even if it doesn't intersect the domain of reconstruction
re 2: if Z(i)=Z(j), it may be the case that the Z(i) intersect the domain. however, I conjecture that when this happens, you can "cancel" these features and improve your L1 loss term. As a trivial example: take any reconstruction, then add on two features f_1 and f_2 which are negatives of each other, and which activate equal amounts (ie their rows in the W and b matrices are equal, but their rows in the D matrix are negatives). These have the same Z(i) planes, which can be anywhere, and they completely cancel out in the reconstruction, but they add unnecessarily to the L1 loss on the c term, so they should be trained out
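
A small numerical sketch of the cancelling-pair example above. The base autoencoder here is random, and all shapes, seeds and names (`run`, `W_pair`, ...) are illustrative assumptions rather than anyone's actual setup:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 3, 4, 50                      # input dim, base features, datapoints
X = rng.normal(size=(n, d))
W = rng.normal(size=(k, d))             # encoder weights
b = rng.normal(size=k)                  # encoder biases
D = rng.normal(size=(d, k))             # decoder (dictionary), one column per feature

def run(X, W, b, D):
    """Return (reconstruction, per-datapoint L1 of the codes)."""
    codes = np.maximum(X @ W.T + b, 0.0)    # c = ReLU(Wx + b)
    return codes @ D.T, codes.sum(axis=1)   # codes are non-negative, so sum = L1

# Two extra features with identical encoder rows and biases,
# but decoder columns that are negatives of each other.
w_new = rng.normal(size=d)
b_new = 0.5
d_new = rng.normal(size=d)
W_pair = np.vstack([W, w_new, w_new])
b_pair = np.concatenate([b, [b_new, b_new]])
D_pair = np.hstack([D, d_new[:, None], -d_new[:, None]])

recon_base, l1_base = run(X, W, b, D)
recon_pair, l1_pair = run(X, W_pair, b_pair, D_pair)
extra_l1 = l1_pair - l1_base            # pure cost: the pair adds nothing to the output
```

The pair leaves the reconstruction untouched while `extra_l1` is non-negative everywhere, which is the sense in which the L1 term should push such features out.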

bitter turtle
#

cool that clarifies a bunch thanks

bronze wraith
#

Okay, bad news: I found a simple example where you can get a perfect reconstruction with better L1 loss by using non-canonical features than by using canonical features.
The example:

Sample x and y coordinates iid from the uniform distribution [0,1], so your domain is the unit square in R^2 (and thus your canonical features are (0,1) and (1,0)). Rediscovering those features, you get a perfect reconstruction, and the L1 loss at (x,y) is |x|+|y|, which has an EV of 1 over this distribution of points.
Alternatively, learn these weight and bias matrices:
W = 1/sqrt(2) * [[1, 1], [1, -1], [-1, 1]]
b = 0
D = W^T (the transpose)
Then this also gives a perfect reconstruction, and L1 loss at (x,y) is (x+y+|x-y|)/sqrt(2) = sqrt(2)*max(x,y), which has an EV of 2sqrt(2)/3≈.943<1 over this distribution of points.

#

(This is basically Hoagy's example from before. Note that my previous theorem doesn't apply here, because Z(2)=Z(3))
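
A quick numerical check of the unit-square example, as a sketch. The sample size and seed are arbitrary choices, not something from the thread:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(200_000, 2))   # points (x, y) in the unit square

# Non-canonical dictionary from the example: three features, zero bias.
W = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0]]) / np.sqrt(2)
D = W.T                                        # decoder, shape (2, 3)

codes = np.maximum(X @ W.T, 0.0)               # c = ReLU(Wx), shape (N, 3)
recon = codes @ D.T                            # x_hat = D c, shape (N, 2)
noncanonical_l1 = codes.sum(axis=1)            # codes >= 0, so sum = L1 norm

# Canonical features (1,0) and (0,1): the code is just (x, y).
canonical_l1 = np.abs(X).sum(axis=1)

max_recon_error = np.abs(recon - X).max()
mean_canonical = canonical_l1.mean()           # should be close to 1
mean_noncanonical = noncanonical_l1.mean()     # should be close to 2*sqrt(2)/3
print(max_recon_error, mean_canonical, mean_noncanonical)
```

Both dictionaries reconstruct exactly, so the only difference between them is the average L1 of the codes.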

pallid current
#

got autointerp working with the nano model on ICA (which works well) and random directions (which doesnt)

bronze wraith
# bronze wraith Okay, bad news: I found a simple example where you can get a perfect reconstruct...

More generally, if you have an invertible matrix M, and W consists of M stacked on top of -M, and D consists of M^-1 placed side by side with -M^-1 (i.e. D = [M^-1, -M^-1]), that also gives a perfect reconstruction. This works because you're "cheating" the ReLU via x = ReLU(x) - ReLU(-x).
However, this can be detected by looking at minimum cosine similarities between the features in the dict. If the minimum cosine similarity is -1, that means this sort of cheating is happening.
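
A sketch of both the cheat and the detection. `M` is an arbitrary well-conditioned matrix chosen for illustration, and `min_cosine_similarity` is a hypothetical helper name, not a function from the project code:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
# A small perturbation of the identity, so comfortably invertible.
M = np.eye(n) + 0.1 * rng.normal(size=(n, n))
M_inv = np.linalg.inv(M)

W = np.vstack([M, -M])             # encoder, shape (2n, n)
D = np.hstack([M_inv, -M_inv])     # decoder, shape (n, 2n)

X = rng.normal(size=(1000, n))     # inputs of either sign
codes = np.maximum(X @ W.T, 0.0)   # ReLU(Mx) and ReLU(-Mx) side by side
recon = codes @ D.T                # M^-1 (ReLU(Mx) - ReLU(-Mx)) = x
max_recon_error = np.abs(recon - X).max()

def min_cosine_similarity(features):
    """Smallest pairwise cosine similarity between the rows of `features`."""
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    return (unit @ unit.T).min()

min_cos_encoder = min_cosine_similarity(W)     # rows of W are encoder features
min_cos_decoder = min_cosine_similarity(D.T)   # columns of D are dictionary elements
```

Each feature has an exact negative in the dictionary, so the minimum cosine similarity comes out at -1 for both encoder and decoder.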

bronze wraith
keen pivot
pallid current
#

i think over the 10 features, random might actually have scored the best.. 🤔 🤔 small sample size etc but still.. odd. i would expect it to still do quite well at finding a pattern in top activations, but would expect very poor scores in the random part of top-and-random scoring. need to make sure that random part is actually happening

#

if random isn't low then it puts into question the validity of finding additional interpretable features with sparse coding. might need to move straight to pythia

upper basin
keen pivot
keen pivot
#

@bronze wraith I need your math help. If I want to ablate a feature, it makes sense to ablate the direction specified by the decoder; however, for the encoder, there's a negative bias, ReLU & mostly-positive neurons (because of GeLU in MLP).

This means there isn't a "neuron direction" for the encoder. Ex. If I have a bias of -1, 2 positive neurons, then (2,0), (1, 1), (0,2) all activate it, but this isn't a direction (or vector I can ablate)

bronze wraith
keen pivot
bronze wraith
keen pivot
#

I'd like to see the causal role of a feature. One way is to ablate it.

My current method is to subtract (feature vector * feature activation), w/ the vector specified by the decoder.

#

This may just be the best method, but it does rely on the reconstruction being high-quality, which I haven't checked yet.
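
A minimal sketch of that ablation, assuming a plain ReLU encoder and a decoder with one column per feature. The function name, argument names and shapes are assumptions for illustration, not the actual project code:

```python
import numpy as np

def ablate_feature(activations, encoder_w, encoder_b, decoder, feature_idx):
    """Subtract one feature's decoder contribution from the activations.

    activations: (n_tokens, d_model)   encoder_w: (n_features, d_model)
    encoder_b:   (n_features,)         decoder:   (d_model, n_features)
    """
    codes = np.maximum(activations @ encoder_w.T + encoder_b, 0.0)
    # Per token: (feature activation) * (decoder direction of that feature).
    contribution = np.outer(codes[:, feature_idx], decoder[:, feature_idx])
    return activations - contribution

# Toy check: two axis-aligned features in a 3-dimensional space.
encoder_w = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
encoder_b = np.zeros(2)
decoder = encoder_w.T
acts = np.array([[2.0, 1.0, 5.0], [0.0, 3.0, -1.0]])
ablated = ablate_feature(acts, encoder_w, encoder_b, decoder, feature_idx=0)
```

The ablated activations would then be patched back into the model to measure the change in output logits. As noted above, this only removes what the dictionary attributes to the feature, so it is only as trustworthy as the reconstruction.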

bronze wraith
keen pivot
#

Logit difference in output true tokens

#

Input: vol. 1 at www.boost.com
feature activates on "www"
Effect on output: Lowers the logit assigned to token "boost"

keen pivot
#

While I'm doing this ablation stuff, I'm going to investigate just training on lots of data over lots of epochs.

pallid current
#

anyone know anything about using pandas with big data? main bottleneck atm is that it's taking outrageously long amount of time to either save df to csv, or to convert df columns which are lists into strings which then save fast

torn star
pallid current
#

got some results comparing neuron basis and random transform on the tiny model and got this 😮

#

basically i think this means that there's no use doing work on the nano model

#

unless i've bodged something, which is obv v possible

#

replacing with graph with means

#

these are autointerp scores using gpt4

bitter turtle
#

eek

pallid current
#

I'll run the same test on Pythia tomorrow and see what we get, I don't expect the same results

keen pivot
bitter turtle
#

Ok so i ran a test on non-repeated data on the residual stream of pythia and hesitantly I don't see the MCS dropoff you see when training on the post-activation-in-mlp dataset

#

mini-batch increases as you go rightwards. each mini-run is ~2M datapoints = 2GB data (residual stream width = 512)

#

also running a repeated-data run atm for comparison

keen pivot
#

I believe I was running at 3e-4

#

You seem to be getting best results at 1e-3

#

Is this the default layer 2 for Pythia?

bitter turtle
#

residual stream post-computation at layer 2

#

so like, data coming out of layer 2/into layer 3

keen pivot
#

Okay, so this is great. Like if the only reason it didn’t scale is because I picked an awful l1 value, then we’re good!

bitter turtle
#

I'm very uncertain about this, but I don't think we should expect a priori for SAEs to work on MLP-post-activation data, TMOS assumptions might not hold well and SAEs are fairly heavily predicated on those to work

keen pivot
#

What are the assumptions?

#

Also, it’s very cheap to just run two experiments on the pre and post

bitter turtle
#

just the whole 'data mostly explainable as a bunch of sparsely-activated features which interfere slightly' thing

keen pivot
#

I can run a few to check if there’s a better l1 than 1e-3 for both pre and post

#

And which one is better.

#

Would still be interested in links for SAE assumptions, but again the test is pretty cheap! (a couple hours run)

bitter turtle
#

same thing for repeated data; it's pretty much the same afaict

bitter turtle
#

in general I think we should be looking at the residual stream

keen pivot
#

The residual stream would also be cheap, though harder to compare apples to apples with the MLP

bitter turtle
#

but like, I don't see why we should expect SAEs to work at all on the MLP since I don't think the hypothesis 'MLP activations look like a sparsely activated overcomplete feature set' is particularly good or true, nor do I see why we should expect that a priori

#

like, obviously it has to have some similarities since it is writing to/from a sparse overcomplete feature set (the residual stream) via a linear map but I can't tell how similar it should be

#

it's also not clear to me why we should expect the MLPs to 'output' significantly more features than the dimensionality of the MLP

#

Might get to work implementing @pallid current's toy model tomorrow

keen pivot
#

I'm currently replicating your results & gonna try to find a better l1 value if it exists, & train on that for lots of data.

Additionally, will run one on the residual stream

bitter turtle
#

Also I think this is kind of max-data for pile-10k for residual stream activations (when it's truncated to 256 tokens/line + standard max-sentences or whatever, I don't really know how the dataset is configured)

#

Like I got 8x2GB chunks out of it max

#

you could mess around with the truncation or do a different dataset

#

(since the residual stream dataset should be 1/4 the size (in bytes) of the activation dataset @keen pivot)

keen pivot
keen pivot
keen pivot
#

Oh I understand everything now

#

Sorry

bitter turtle
#

moderately confused

keen pivot
#

I thought you were doing mlp the whole time. I just mis-read your message initially.

#

Replicating now: Residual stream:

#

I do think there may be a slightly better l1 value. The sparsity of 1e-03 is ~0 and 3e-3 is ~20 (so 20 features per datapoint on average, which still seems high)

#

Looks like 3e-3 is great! I can run a more fine-grained experiment later to see if it should be 2 or 4, but mostly looks good. Running a larger run w/ larger dicts on that l1-value.

keen pivot
#

@bitter turtle There is a degradation in residual stream (top one is later). This is after 8 epochs of 8 chunks (=16GB) each w/ refreshed data every time

#

Also, I'd expect it to not really plateau here at 30% of 512 (the residual stream size), because the toy model didn't. But this may again be a data diversity thing, or a "we should train it on multiple epochs" thing, or something else.

bitter turtle
bitter turtle
#

Also, I used batch-size 1024

keen pivot
keen pivot
# bitter turtle Yeah, what did the toy model plateau at?

Usually ~1, but sometimes lower for larger sizes (I vaguely remember 80%, but never something like 30%, which is what we're getting). Additionally, the toy result may just be caused by the way we scale feature frequency, which would make additional features even less likely.

bitter turtle
keen pivot
#

I may or may not be downloading it on 4 different rented GPUs to learn dicts for all Pythia layers.

#

I am currently committing to the MLP post activation for Neel stuff, but next week, I'd like to investigate more into residual stream if you haven't solved it all yourself by then!

bitter turtle
keen pivot
bitter turtle
#

What I meant is I'm not sure how significant a change the 2.95 to 2.91 is

bitter turtle
keen pivot
#

Noticed something interesting. It seems the best l1 value changes. Here the top one is later & 3e-5 appears to do better than 3e-4 (which starts to degrade/stagnate). This is running on 20GB of pile for Pythia-layer-4, repeated for 5 mini-runs. The above is between mini-run 3 & 4 (w/ 0 indexing).

#

This is important because one way to pick an l1 value is to just run it & see which one does better, but here, you don't see it until 20GB*5 repeats in! This replicated across layers 1 & 5 (I'm still waiting on layer 3, which is just taking 2x as long as the others 🤷) Edit: Layer3 also has the same behavior.

#

In response, I'm running them all again for 60GB repeated 5 times (3x the GB) to see if optimal L1 shifts any more left. The idea here is to just spend a lot of compute & maybe backtrack how we could determine optimal L1 from other correlates.

#

I'm also intending on saving to aws every mini-run, so any of y'all can see the dict. Additionally, we can check MCS between dicts at different L1's to see if there is a better l1 value for every size dict.
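
For reference, a sketch of one common way to compute MCS between two dicts: for each feature in one dictionary, take its highest cosine similarity with any feature in the other. The function name and toy dictionaries are assumptions, the 0.9 threshold is the one used elsewhere in this channel, and the Hungarian-matching variant that comes up later is not covered here:

```python
import numpy as np

def max_cosine_similarities(dict_a, dict_b):
    """For each feature (row) of dict_a, the best cosine similarity to any row of dict_b."""
    a = dict_a / np.linalg.norm(dict_a, axis=1, keepdims=True)
    b = dict_b / np.linalg.norm(dict_b, axis=1, keepdims=True)
    return (a @ b.T).max(axis=1)

# Toy dictionaries: rows are features. The larger dict contains rescaled
# copies of the first two small-dict features, but not the third.
small = np.eye(3)
large = np.array([
    [2.0, 0.0, 0.0],
    [0.0, 0.5, 0.0],
    [1.0, 1.0, 1.0],
    [1.0, -1.0, 0.0],
])

mcs = max_cosine_similarities(small, large)
mmcs = mcs.mean()                      # mean max cosine similarity
frac_high_mcs = (mcs > 0.9).mean()     # share of features with MCS above 0.9
```

Running the same comparison between dicts trained at different L1 values, instead of different sizes, only changes which two matrices get passed in.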

#

Thanks @pallid current for coding up the mini-run code, wandb, & saving to aws. You're amazing!:)

keen pivot
#

Super weird jump to 70% of 2k features in layer 1. That's the most features I've seen.

#

Just wait until I wake up & these runs are done. This isn't even my final form.

pallid current
#

i can try testing those runs on our new stronger cluster soon, though need to write some parallelism code.

#

@bitter turtle could you add a PR to add an option to use residual stream please?

pallid current
# bitter turtle but like, I don't see why we should expect SAEs to work _at all_ on the MLP sinc...

here's a recent paper which uses sparse probing on the MLP output to find basic features, i think this is the closest thing to motivation for the MLP work that i know of https://arxiv.org/abs/2305.01610

pallid current
#

if we're doing that we should also really think about how what we're doing builds on, rather than just replicates, the transformer factors paper that i'm always posting https://arxiv.org/abs/2103.15949

pallid current
pallid current
keen pivot
keen pivot
#

This is layer 1 though. I've normally been doing layer 2, so I don't know the effect there.

#

Layer 4

#

Layer 5

pallid current
#

waaaaait, is the MMCS where you remove the high MCS feats screwing something

#

like you've removed all the feats in the 4K dict?

keen pivot
#

This is just our normal code

pallid current
#

yaaaaa but like, no way that first graph you sent is right

#

suuuurely not

keen pivot
#

I do do (lol) 1024 batch_size

#

So 🤷

pallid current
#

ok i see you're not like removing the feats from the dict in the MMCS code (i dont understand the hungarian thing)

#

but like......... wtf, can we see the histograms?

keen pivot
#

Layer 4:

#

Layer 1:

#

Like, I think this works then. Of course I need to do more checks on the actual features, and we'd need to figure out how to get consistent, but like: just use a lot of compute to overcome our ignorance for proof-of-concept

#

Layer 3 is still like veeeery slow. I'll let it finish out, but man, it's only done 2/5 mini-runs & everything else is done:

pallid current
#

what do the other metrics look like for the 8k and 16k dicts?

pallid current
#

so that's 2000 high MCS feats for a 2k dimensional MLP?

keen pivot
#

Like every mini-run

keen pivot
pallid current
#

yeah agreed

bitter turtle
bitter turtle
keen pivot
bitter turtle
#

Left is run 1, right is run 2?

keen pivot
keen pivot
#

One thing to note w/ residual & MLP is the sparsity of the MLP w/ tiny l1s (where layer 1 did best) is ~800 or ~400 for 2k & 4k respectively. Sparsity is calculated as average features per token. That's just crazy. Not real.

Residual stream has like 2-3, lol
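
A sketch of the sparsity number as described above (average count of active features per token). The function name and the zero threshold are assumptions:

```python
import numpy as np

def mean_active_features(codes, threshold=0.0):
    """Average number of features with activation above `threshold` per token.

    codes: (n_tokens, n_features) array of post-ReLU feature activations.
    """
    return (codes > threshold).sum(axis=1).mean()

# Toy codes for three tokens over four features: 2, 1 and 1 active features.
codes = np.array([
    [0.0, 1.5, 0.0, 2.0],
    [0.0, 0.0, 0.0, 0.1],
    [3.0, 0.0, 0.0, 0.0],
])
sparsity = mean_active_features(codes)
```

A value in the hundreds for a 2k dict means a large share of the dictionary fires on every token, which is why it looks more like a dense relabelling of neurons than a sparse code.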

pallid current
#

high features per token is (unsurprisingly) connected to the low l1 val, it goes way lower for the higher l1 vals

keen pivot
#

Ya, but like, high MCS? So there's at least a converged way to learn to "sparsely" reconstruct between the 2k & 4k dictionaries.
Edit: They could both learn the identity. Especially w/ such a low l1 value. I can check this!

#

Also, hoagy, you know how to download aws bucket stuff from command line? I tried
`aws s3api get-object --bucket DOC-EXAMPLE-BUCKET1 --key dir/my_images.tar.bz2 my_downloaded_image.tar.bz2` (the equivalent for mine), but it gave me a Syntax error in their code 😢 (theirs uses python2).

I can look it up more later (atm just manually downloading locally, then uploading); not a bottleneck!

pallid current
pallid current
#

i also ran 50 auto interps random vs neuron_basis on pythia70M, still no distinction 🤔

#

now very confused, might be a layer thing, might be a bug i dunno

#

anyone got a good recent pythia sparse coding run to do a comparison to?

pallid current
#

honestly im so confused by the recent set of results, i'd really like to talk it through with someone soon

bitter turtle
bitter turtle
#

@pallid current I've submitted a PR for vectorised ensemble training for SAEs, should be more performant and gpu-utilising, the merge looks a bit horrific, I'll let you deal with that 😬

#

I'm also going to do the transformers toy model in a different repo, i cba to deal with the merge conflicts atm

#

@keen pivot if you have the energy to change the training loop code (I don't unfortunately), the vectorised code should be a lot better for hyperparameter tuning

keen pivot
bitter turtle
#

Hi @keen pivot @bronze wraith @pallid current, I think we should have a chat about standardising some metrics at some point in future (I want to establish some sort of principled way to reason about goodness-of-extraction-procedure, and maybe implement a testbed for comparing extraction procedures robustly); also @pallid current said something about standardising PRs, tests, etc

keen pivot
#

Found a good example to show the "features get lost over time" thing

bitter turtle
#

What's this on?

keen pivot
#

Layer 1 Pythia. 20GB*20 times (fresh data though)

bitter turtle
#

MLP?

pallid current
#

@keen pivot what's your situation with neel and final projects?

#

i think we should do a meeting on monday to plan what we need to do next because it's starting to feel scattered again

#

my tentative plan is that we should identify all of the variables of interest, do a bit of work to make the code more parallelizable, and then use the eleutherAI cluster to do a much broader sweep than we've done so far, so that we can have a single pile of data to look at and draw conclusions from (and then presumably branch off a bit)

keen pivot
#

Will be done today (interview is tomorrow, hear back on acceptance Monday)

pallid current
#

good luck!

bitter turtle
bitter turtle
keen pivot
bitter turtle
#

On gpt-2 or pythia?

keen pivot
keen pivot
#

I think I'm done. I can't make changes past 9pm my time, but feel free to look at it!
https://docs.google.com/document/d/1KqHe9NL9NuJ_yaKJc__eX6kjBtRcROePoLApjOhGBUU/edit?usp=sharing

keen pivot
#

Okay, I've got a few sources of evidence that the Pythia dictionary that learned ~2k features actually did converge on identity:

  1. Several features do look polysemantic like normal neuron interp
  2. Looking at one feature, it only has 1 neuron that activates >0.3 for the same max-activating datapoints & 1 outlier positive weight in the decoder (ie activating it only reconstructs one neuron)
  3. The L1 was way lower than expected
  4. Sparsity was crazy high (400 or 800 features/datapoint)
keen pivot
#

@pallid current @bronze wraith @bitter turtle I'm getting meaningful, monosemantic features for even low-MCS features in Pythia-mlp-layer-1. This was trained on 60GB*5 (repeating data). For MCS-above 0.9, it went from 55% to 45% during training, but looking at several features, they all seem meaningful.

pallid current
#

Oh sweet, if you send me the dict I'll run a comparison on autointerp with random and neuron basis

keen pivot
#

How many GB is pile-10k?

pallid current
#

In activations?

keen pivot
#

I think. I want to compare when I do n-chunks=30 for Pile, I think that's 60GB.

keen pivot
#

From Hoagy earlier:

my tentative plan is that we should identify all of the variables of interest, do a bit of work to make the code more parallelizable, and then use the eleutherAI cluster to do a much broader sweep than we've done so far, so that we can have a single pile of data to look at and draw conclusions from (and then presumably branch off a bit)
Notably, we do already have a few dictionaries on the bucket to look at now, which may inform what type of information we're missing (to code and get for the next iteration of dicts)

bitter turtle
#

5pm is maybe a little early for me unfortunately, how is GMT-17:00?

keen pivot
#

For me, top of my head:

  1. Have evidence that the features we learned are indeed meaningful (intend to write an LW post)
  2. Evidence that dicts on pythia-layer-1 learn meaningful features for even low-MCS

Proposal: Can look at dicts that go low-to-high-to-low MCS, see if it ends w/ meaningful features & what happens

General Proposal: Look at "features across checkpoints" pythia style, w/ more frequent checkpoints earlier. I'm suggesting spending ~20 minutes/feature (after time spent writing functions) .

bitter turtle
#

How are you currently evaluating the meaningfulness of features (in the case without autointerp)?

keen pivot
# bitter turtle How are you currently evaluating the meaningfulness of features (in the case wit...
#

I mainly focused on gathering a lot of information that helps narrow down the hypothesis of what property of inputs that feature is activating on.

This can be repeated for the output (as in, we think the model has developed a discriminator for property X, which was useful for predicting Y, so gain info & test for that), as well as intermediate layer's features.

#

Btw. the doc is like 15 pages, mostly images, only 1k words. I can answer any questions about it!

bronze wraith
#

Hi @keen pivot , read the doc, very cool! Some thoughts:

  1. I had a bit of trouble following what was being shown in each image. Captions might help?
  2. One way we could test robustness of our understandings of the features: after getting the text description (e.g. "period after www"), we/GPT write new text we think will activate the neuron, then run it through the LM again and test if we can activate the neuron on demand.
  3. I'm worried by the fact that our features are ~half punctuation/math/urls, it makes me think we're picking up just a certain kind of neuron, but "useful" semantic meaning is either not in this part of the LM or is less able to be found by our method.
keen pivot
pallid current
#

from looking at the neuron-in-haystack paper (https://arxiv.org/pdf/2305.01610.pdf) they also used pythia70M and found a lot of sparse, meaning-related action in the early layers, exactly the kind of thing that we'd want to be able to find

#

so i dont think it's just a scale issue

keen pivot
keen pivot
keen pivot
#

Like general french neuron or Go (programming language) neuron?

keen pivot
pallid current
#

i'm wondering whether we should use the neuron in a haystack approach to find some directions which we reckon are exactly the kind of thing we would hope to find, and try to check that sparse coding is actually something that's capable of finding it

keen pivot
# pallid current which ones are you thinking of? my general feeling looking at that is that they'...
#

I think some of these are pretty specific? (edit: e.g. just "x", & I know there are other char-level features like "w", & ", and") I also expect other re-tokenization words (like Harvard) to be here in layer 2 & also layer 1.

keen pivot
bronze wraith
bronze wraith
pallid current
#

but it doesn't do anything to rule out the possibility of it also responding to other things

bronze wraith
bitter turtle
bitter turtle
keen pivot
bitter turtle
#

Haven't really been able to read through your doc unfortunately logan (just got back from a school meetup thing actually)

keen pivot
bitter turtle
#

Now is GMT-17:00 right?

bronze wraith
#

An approach I saw for dictionary learning: https://en.wikipedia.org/wiki/K-SVD

In applied mathematics, k-SVD is a dictionary learning algorithm for creating a dictionary for sparse representations, via a singular value decomposition approach. k-SVD is a generalization of the k-means clustering method, and it works by iteratively alternating between sparse coding the input data based on the current dictionary, and updating ...

pallid current
#

also i wanted to say before logan left, lee's pushing to get a meeting with olah and anthropic types over the next week or two

#

i'll try to get you guys included which should be possible, if not then i can grab questions from you here

bitter turtle
#

For targeting specific features we could literally just do feature learning on e.g. truthful QA activations

keen pivot
#

Nora convinced me that one metric to use is model editing metrics. Specifying which model editing metrics would be clarifying on its own, and also help compare different ways of training.

keen pivot
keen pivot
keen pivot
# bitter turtle Wdym

Any sort of model steering (like make the model more honest or only perform circuit X) can be compared with previous work on model editing.

bitter turtle
#

Oh right sure

bitter turtle
#

So like train on activations generated on e.g. truthful QA then select+edit features with the most variance between+consistency within true/false classes (basically VINC but manually on directions restricted to those found by sparse dicts)

bitter turtle
pallid current
bronze wraith
# bitter turtle this seems functionally equivalent to sparse autoencoders

+1 to hoagy's reply: it has the same inputs and outputs (activations and a dictionary of features, respectively), but the internal process is different. If we're lucky, k-SVD would be some combination of faster/make more accurate reconstructions/find "more meaningful" features compared to the autoencoder.

pallid current
#

still not sure why me and logan are getting different interp results but i am finding a difference between neuron_basis and sparse coding on pythia layer 1

#

dotted lines are 2SD from mean

#

will increase sample size later, it's busy trying to perform ICA atm

#

on the other hand, not much evidence of relationship between MCS and autointerp score

#

relationship might eventually prove 'significant' but high correlation looks unlikely

keen pivot
#

I can show that hopefully today or tomorrow!

bitter turtle
pallid current
#

making the decoder an affine transformation
what dya mean?

bitter turtle
# pallid current

do you still think this is a bug on your end or a more fundamental thing

bitter turtle
#

my intuition is that that should 'strengthen' the abilities of the encoder slightly

bitter turtle
bronze wraith
bitter turtle
#

oh shit yeah my fault i am blind

#

@bronze wraith where was the implementation you mentioned for K-SVD?

bronze wraith
#

Right now I’m playing with it and I’m finding it a little fiddly (it throws errors if your dimensionality is off)

bitter turtle
#

right, we might need to come up with a batched, streaming implementation of that if we want to scale it @bronze wraith

keen pivot
bitter turtle
bronze wraith
keen pivot
#

For auto-interp, atm it's useful for detecting monosemanticity in the input distribution (ie we assume GPT-4 coming up w/ hypotheses & GPT-3.5 creating accurate predictions in held-out text correlates w/ the underlying feature having a simple description across the entire feature activation range). This can be repeated w/ the feature's effect on the output & other layers' features.

But, this still leaves open the "interestingness" of features. Two desirable properties are:

  1. The features explain all behavior on the data distribution we care about (ie low reconstruction loss as discussed yesterday)
  2. We can simply express any feature we actually care about (e.g. deception, honesty, australian accent, etc) using these features (w/ "simply" maybe just meaning sparse)
keen pivot
# bronze wraith I think adding a decoder bias is approximately equivalent to centering the origi...

When I look at decoder weights, there's a lot used for reconstructing the negative neuron activations from the GeLU. This seems like it'd be generally true for all features learned, but would cause a problem if multiple features activate at the same time because they'd try to reconstruct the original distribution, but would overlap w/ each other.

The learning process would probably learn correlations for features so each feature only handles 1/N of the job for reconstructing normal neurons, but there will be noise. I was thinking the bias might help here, but I don't think so.

bitter turtle
bitter turtle
bitter turtle
#

oh, you mean the model uses GELU instead of ReLU and the decoder devotes features to those small negative parts?

bronze wraith
# bitter turtle for residual stream data yeah, different for MLP thingies maybe?

Let me clarify why i said that: fix some bias v in the decoder. Let's say we: 1. shift all inputs back by v (x -> x-v), 2. update the encoder biases to undo this, and 3. remove the bias vector from the decoder. These transformations together should not change the l2 loss term (both your inputs and outputs are shifted back by v, which cancels out), nor the l1 loss term (the encoder activations are exactly the same). So for any decoder-with-a-bias, you can make an exactly equivalent sparse autoencoder without a bias, on a shifted input set, which has the exact same loss. (Epistemic confidence: high)
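
The three steps can be checked numerically. This is a sketch on a random autoencoder. All shapes and names are assumptions, and the encoder-bias update `b + W v` is one way of spelling out "undo the shift":

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 5, 8, 100
X = rng.normal(size=(n, d))
W = rng.normal(size=(k, d))      # encoder weights
b = rng.normal(size=k)           # encoder biases
D = rng.normal(size=(d, k))      # decoder weights
v = rng.normal(size=d)           # decoder bias

# Autoencoder with a decoder bias: x_hat = D ReLU(Wx + b) + v
codes = np.maximum(X @ W.T + b, 0.0)
recon = codes @ D.T + v
l2 = ((X - recon) ** 2).sum(axis=1)

# Equivalent autoencoder without a decoder bias:
X_shift = X - v                  # step 1: shift inputs back by v
b_shift = b + W @ v              # step 2: W(x - v) + b + Wv = Wx + b
codes_shift = np.maximum(X_shift @ W.T + b_shift, 0.0)
recon_shift = codes_shift @ D.T  # step 3: no bias in the decoder
l2_shift = ((X_shift - recon_shift) ** 2).sum(axis=1)
```

The codes match exactly, so the L1 term is unchanged, and the per-datapoint squared errors match as well.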

#

(Epistemic confidence: medium-low): I assumed that centering the dataset at (0,0) minimizes the overall amount of l1 activations needed to make a reconstruction, and therefore this centering would be optimal. But I might be wrong about this

bitter turtle
bitter turtle
keen pivot
keen pivot
bronze wraith
#

Oh, sorry one more thing about having a decoder with a bias: you can implement them in autoencoders w/o bias, though there is an L1 penalty. In particular, since the encoder has a bias, you have your encoder learn a feature that always activates exactly 1, and the dictionary element corresponding to that will be your bias. There is an l1 cost since you're always activating that internal feature, but if we specified some features as not having an L1 penalty, this would bypass that (i.e. instead of our L1 penalty being ||y||_1, we make it ||Py||_1, where P is a diagonal matrix of 0s and 1s).
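
A sketch of the always-on feature and the masked penalty. Names and shapes are assumptions, and `penalty_mask` holds the diagonal of P:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 5, 8, 100
X = rng.normal(size=(n, d))
W = rng.normal(size=(k, d))
b = rng.normal(size=k)
D = rng.normal(size=(d, k))
v = rng.normal(size=d)                      # the decoder bias we want to emulate

# Reference: autoencoder with an explicit decoder bias.
recon_bias = np.maximum(X @ W.T + b, 0.0) @ D.T + v

# Extra feature with zero encoder weights and bias 1, so ReLU(0*x + 1) = 1 always;
# its dictionary element is v.
W_ext = np.vstack([W, np.zeros((1, d))])
b_ext = np.append(b, 1.0)
D_ext = np.hstack([D, v[:, None]])
codes_ext = np.maximum(X @ W_ext.T + b_ext, 0.0)
recon_ext = codes_ext @ D_ext.T

# L1 penalty ||Py||_1 with P = diag(penalty_mask): the always-on feature is exempt.
penalty_mask = np.ones(k + 1)
penalty_mask[-1] = 0.0
plain_l1 = np.abs(codes_ext).sum(axis=1)
masked_l1 = np.abs(codes_ext * penalty_mask).sum(axis=1)
```

The plain penalty is larger than the masked one by exactly 1 per datapoint, which is the cost the mask removes.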

bitter turtle
bronze wraith
bitter turtle
#

sick

keen pivot
#

@bronze wraith , could you give a concrete example of a feature that may be spread across two MLP layers in a Transformer? This is based off the "concatenate multiple layer's activations as input" which both you (& Neel) brought up.

bronze wraith
# keen pivot @bronze wraith , could you give a concrete example of a feature that may ...

How about "BERT Rediscovers the Classical NLP Pipeline" (https://arxiv.org/pdf/1905.05950.pdf )? They used linear probes on the internal activations of BERT to try to extract stuff like "part of speech", and find that this info is spread across multiple layers. E.g. this part of section 3.2:

"We would like to estimate at which layer in the encoder a target (s1, s2, label) can be correctly predicted... A naive classifier at a single layer cannot either, because information about a particular span may be _spread out across several layers_, and as observed in Peters et al. (2018b) the encoder may choose to discard information at higher layers."
(emphasis added; link to Peters et al, which I haven't read: https://aclanthology.org/N18-1202.pdf). In the attached image (part of Figure 2), the blue bars are showing which layers are important for correctly probing part of speech, and you can see that info is spread across several layers

keen pivot
#

Made my much better post! https://www.lesswrong.com/posts/wqRqb7h6ZC48iDgfK/tentatively-found-600-monosemantic-features-in-a-small-lm
Update from last: I coded the quantile thing wrong, so now it does look like ~ 60 neurons (still maybe)

Also: Compared w/ the identity dictionary with specific feature seeming to represent only 1 neuron

Using a sparse autoencoder, I present evidence that the resulting decoder (aka "dictionary") learned 600+ features for Pythia-70M layer_2's MLP (after the GeLU), although I expect around 8k-16k featu…

keen pivot
keen pivot
#

One problem noticed when training the dictionaries at different layers is that later layers learned way fewer features & had more dead features (e.g. even 50% of 2k). Data:
Dead neurons (layer 1,3,4,5) (skipped 2 because I already had a dict for it when I trained, but trained on different data, so not including it):
0, 1.7k, 1.3k, 1.2k (out of 2k)

High-MCS features:
46%, 4%, 14%, 13% (out of 2k)

#

Possible explanations:

  1. Other dictionaries need a different l1 value (not likely imo. I swept l1 values from 1e-5 to 1e-3 & MCS > 0.9 features plummeted for 1e-3)

Note: it is max for 3e-05, but I expect that's because of learning identity because features/token = 700.

#
  2. There are many more features in early layers (focused on grouping tokens & re-tokenizing certain pairs) than middle layers (higher level features?) and later layers (re-tokenizing features). The dead neurons are caused by iterating over the same dataset. Evidence for this here: https://wandb.ai//sparse_coding/sparse coding/reports/Layer-3--Vmlldzo0ODA5ODA1?accessToken=w6c775vurw0wtu80n3617dl41ts9wvqjfnm41uc4xj1pi0yc7vd009lu4ntu8zvr
#

If (2), then just running mini-runs w/ fresh data should show increasing number of features in layer 3 at that l1 value. I think I do a fresh-data comparison for layer-1, so I can check that now at least.

Update: can't tell. layer-1 has 0 dead neurons even when you repeat data, so can't tell generalization behavior to layer-3, which is the one I linked. Additionally, this is the layer that when trained on new data goes up to 40% (MCS above 0.9), then down to 3%. This also needs to be explained.

keen pivot
# keen pivot If (2), then just running mini-runs w/ fresh data should show increasing number ...

Possible explanation: The smaller & larger model are simply learning different features because there are so many, especially at such an early layer & w/ so much data.

Additionally may be a thing where dicts of different sizes are biased to learn different features (which was brought up before here), so a previously proposed test was to learn two dicts of same size w/ different initializations.

Additionally, I could look at the features learned by the dictionary. If low-MCS features are meaningful, then seems true.

bronze wraith
#

Update on ksvd: when I compared it head-to-head with sparse autoencoders on the toy data, it failed to reconstruct the original features. The MMCS was something like .266 instead of .999 for the sparse autoencoders. Tomorrow I'll try tinkering with it to see if there's a fix!

keen pivot
#

@bitter turtle , I'm going to get to it soon (like tomorrow or the next?), but if you'd like to see if the currently learned features across layers have meaningful connections/circuits, it'd be great to have another set of eyes here.

keen pivot
bitter turtle
#

Just send over the dicts you're looking at ig

bitter turtle
bronze wraith
# bitter turtle Have you checked the dictionaries K-SVD is learning? OMP learns positive and neg...

I think you're right, it's using sparse (positive and negative) combinations instead of sparse positive combinations, and the issue is probably there. I'll try looking into positive versions of ksvd or see if there's a workaround. Oh, and one thing I should do is look at the absolute value of cosine similarity, because it might be learning -1*feature (which would be great but would be ignored by mmcs)
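
(Minimal sketch of the absolute-value variant, assuming the ground-truth toy features are available as rows of a matrix. The function name and the `use_abs` flag are illustrative, not from the repo.)

```python
import numpy as np

def mean_max_cosine_sim(true_feats, learned, use_abs=False):
    """Mean over true features of the best cosine match among learned features.

    With use_abs=True a learned -1*feature also counts as a match.
    Both inputs store one feature per row.
    """
    t = true_feats / np.linalg.norm(true_feats, axis=1, keepdims=True)
    l = learned / np.linalg.norm(learned, axis=1, keepdims=True)
    sims = t @ l.T
    if use_abs:
        sims = np.abs(sims)
    return float(sims.max(axis=1).mean())
```

If the absolute version is much higher than the signed one, that would point at sign flips rather than wrong directions.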

bitter turtle
#

just replace the calls to orthogonal_mp_gram with calls to this in the ksvd thingy

pallid current
#

more results from autointerp, this time on gpt2 small, still the neuron basis totally failing to beat the random baseline but much higher scores overall. i only recently noticed they use layer 10 of gpt-2 small for their autointerp comparison so i'll run that next

#

it's not clear why they used layer 10 of small when the rest uses XL, might well be cherry picked

bitter turtle
#

Gah, I've spent the entire day trying to implement a nonnegative version of K-SVD for the GPU/PyTorch, only to learn that most non-negative least squares algorithms are like deeply not designed for GPUs. Some people have written CUDA kernels for them, and I could do it with GD, but I'm probably going to throw in the towel for the moment and try and do more useful things tomorrow.
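
(For what the GD route might look like: a small projected-gradient sketch of batched non-negative least squares. It's written in NumPy to keep it short, the same matmuls would carry over to GPU tensors. This only replaces the least-squares solve, not OMP's atom selection, and the function name and step count are assumptions for illustration.)

```python
import numpy as np

def nnls_projected_gd(D, x, n_steps=500):
    """Approximately solve min_c ||c @ D - x||^2 subject to c >= 0.

    D: (n_features, dim) dictionary, x: (batch, dim). Returns c: (batch, n_features).
    """
    gram = D @ D.T
    # Step size 1/L, with L the largest eigenvalue of the Gram matrix.
    step = 1.0 / np.linalg.eigvalsh(gram)[-1]
    xD = x @ D.T
    c = np.zeros((x.shape[0], D.shape[0]))
    for _ in range(n_steps):
        grad = c @ gram - xD                   # gradient of 0.5 * ||c @ D - x||^2
        c = np.maximum(c - step * grad, 0.0)   # project back onto c >= 0
    return c
```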

#

If @pallid current could set me up with the 8xA40 rig maybe I could get to setting up some generic parallelised code for the big sweep?

keen pivot
#

Checking the claim "Low-MCS features are meaningful":
Layer 1: maybe slight correlation w/ MCS & meaningfulness by Logan's standard. Important difference here is layer 1 has ~50% features learned. Also, the low-MCS features felt lamer. https://docs.google.com/spreadsheets/d/1BnFaqn8W9aM1rlosiYFG64VVeeYQNJWFYM-Qt4qj-0o/edit?usp=sharing

Layer 2: (in the LW post, clear correlation, but dominated by dead features)

Layer 4: Also dominated by dead features, but clearer correlation w/ MCS. For the top-MCS features, 75% monosemanticity, whereas low-MCS features are 25%.
https://docs.google.com/spreadsheets/d/1DaPl4sm7KvKr2eVtf2DGSFXLWQyxZaommXm6F1uSnRw/edit?usp=sharing

#

I also may have noticed a pattern in feature vectors that make sense vs those that don't. Ones that don't have a fairly symmetric weight distribution, whereas "real" features have longer tails.

I could show this by plotting MCS by mean & std of the weights. Lol, nope. No correlation at all.

I believe this will be mostly correct, but misleading because different neurons have different activation distributions, so I'll plot weight histograms by quantile instead. I'll also need to multiply the weight by the max-activation of that value I think.

bitter turtle
#

God I am falling down so many rabbit holes for different sparse factorization mechanisms.

pallid current
#

more graph posting, here are the runs from layer 10 gptsmall, this one really should match the results in the direction-finding part of the openai autointerp paper. instead the results are much better, but also dont show the same size of difference between neurons and random directions

#

@bitter turtle let me know when you're free today and i'll get you set up on the eleuther compute, i'll be in the office in about an hour

bitter turtle
#

cool

#

free most times

bitter turtle
pallid current
#

Hey Aidan, sorry had a meeting earlier so stayed home for a bit, on the train in now, will be there around 12 o'clock

keen pivot
pallid current
keen pivot
#

Gotcha. Where are you getting 0.06 from?

#

(Like a ctrl-f in the paper)

pallid current
#

'We find that the average top-and-random score after 10 iterations is 0.718, substantially higher than the average score for random neurons in this layer (0.147), and higher than the average score for random directions before any optimization (0.061).'

keen pivot
#

Isn't that optimizing a direction that's explainable?

pallid current
#

0.7 is post optimizing, 0.147 is neuron basis, 0.061 is random direction

keen pivot
#

Thanks for explaining:)

#

So it looks like you're finding really awesome random directions then, lol

#

Like this is their random-only scoring.

keen pivot
pallid current
#

not that it's a random direction

#

and also that's layer 10 for gpt2 XL, not small

pallid current
keen pivot
#

Have you tried looking at the random directions that scored highly yourself?

pallid current
#

well i've looked at the activations that i feed into gpt-4 and those make sense, and to me they also match the feature activations i got when i computed the activations on their own

#

my guess for why i'm seeing surprisingly high scores is that i'm working with 50k fragments from openwebtext, but those fragments are from the very beginning of the corpus, which means there are multiple fragments from the same bit of text

#

and so the top-scoring fragments in the validation set are quite likely to come from the same paragraph as the top-scoring fragments in the train set, which makes the task quite a lot easier

#

changing it now to only take max 1 fragment from each sentence. i think sentences are uncorrelated but will check
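
(Sketch of that filter, assuming each fragment carries some identifier for where it came from. The `sentence_id` field in the usage line is hypothetical.)

```python
def one_fragment_per_group(fragments, group_key):
    """Keep only the first fragment seen for each group (e.g. source sentence or document)."""
    seen = set()
    kept = []
    for frag in fragments:
        key = group_key(frag)
        if key not in seen:
            seen.add(key)
            kept.append(frag)
    return kept
```

Usage would be something like `one_fragment_per_group(fragments, lambda f: f["sentence_id"])`. Grouping by document instead would be the stricter version if sentences turn out to be correlated.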

keen pivot
pallid current
#

(i think neuron basis is also pretty bad for random samples)

bitter turtle
#

Just a heads-up: pbbly gonna switch to safetensors for the big sweep, it allows memory-mapped/'lazy' tensors which is pretty important for not eating memory when we're doing parallel runs

pallid current
#

ok sounds good

bitter turtle
#

hmm i guess you might not need to idk yet

pallid current
#

i dont know anything about safetensors but seems like it's on the up so interested to play around with it

pallid current
keen pivot
#

https://docs.google.com/spreadsheets/d/1BnFaqn8W9aM1rlosiYFG64VVeeYQNJWFYM-Qt4qj-0o/edit?usp=sharing

Looking at the features of layer 1 across training (for 2k dictionary). My understanding of feature tracks w/ self-similarity.

#

Green means monosemantic, yellow means idk, red means polysemantic
There are 3/6 examples that mostly stay the same feature.
2/6 appear to represent meaningful, monosemantic features, but change over time.

1/6 might represent some meaningful features; it's unclear, but it does clearly change meaning across time

keen pivot
#

For the 4k dict, I checked how many features have a self-similarity of 0.9 (after 60GB of data), which then dropped below 0.6:
"The number of vectors that reach at least 0.9 and then drop below 0.6 is 1707."

So plotting all of them, there are 20-40% of features that "changed" (though I haven't proved that high self-similarity implies a meaningful, monosemantic feature)

#

For the 2k dict, it's like 600/2k, so 25% of features that change drastically.
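
(Sketch of the count, assuming the self-similarities are stored as a checkpoints x features array. The array layout and function name are my assumptions here.)

```python
import numpy as np

def count_rise_then_drop(self_sim, high=0.9, low=0.6):
    """Count features that reach >= high at some checkpoint and later fall below low.

    self_sim: (n_checkpoints, n_features), ordered by training time.
    """
    # True from the first checkpoint at which the feature reached `high`, onwards.
    reached = np.logical_or.accumulate(self_sim >= high, axis=0)
    dropped_after = reached & (self_sim < low)
    return int(dropped_after.any(axis=0).sum())
```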

#

One solution may be to train the dict again, but if a feature's self-sim > 0.9 after each mini-run, we learn a gradient mask (Hoagy's suggestion) to keep those features.

This also lends itself to a stopping point: if all features are frozen (because they were > 0.9 self sim after a minirun), then we're done.
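
(Toy sketch of the freeze-and-stop loop with plain SGD, assuming one feature per decoder row. All names are illustrative. With an optimizer that keeps momentum, zeroing the gradient alone might not fully stop a row from moving, so that would need checking.)

```python
import numpy as np

def masked_sgd_step(decoder, grad, frozen, lr=1e-3):
    """One SGD step that leaves frozen rows (features) untouched.

    decoder, grad: (n_features, dim); frozen: (n_features,) boolean mask.
    """
    mask = (~frozen)[:, None].astype(decoder.dtype)
    return decoder - lr * grad * mask

def update_frozen(frozen, self_sim, threshold=0.9):
    """After a mini-run, also freeze every feature whose self-similarity exceeds the threshold."""
    return frozen | (self_sim > threshold)

def all_frozen(frozen):
    """Stopping condition: every feature has been frozen."""
    return bool(frozen.all())
```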

keen pivot
#

On the topic of "Other ways to do dictionary learning", this suggests an adaptive l1 parameter. Additionally, the previous section discusses using fast ISTA (FISTA) for the alternating updates (like K-SVD does), which LeCun papers also use.
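
(For reference, a compact sketch of the textbook FISTA iteration for the sparse-code step with a fixed dictionary. This is the generic algorithm, not code from the linked paper or from our repo, and the names and step count are placeholders. It gives signed codes, so a positive-only version would need a clamp on top.)

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def fista(D, x, l1, n_steps=200):
    """Approximately solve min_c 0.5 * ||c @ D - x||^2 + l1 * ||c||_1.

    D: (n_features, dim) dictionary, x: (batch, dim). Returns c: (batch, n_features).
    """
    gram = D @ D.T
    L = np.linalg.eigvalsh(gram)[-1]   # Lipschitz constant of the smooth part's gradient
    xD = x @ D.T
    c = np.zeros((x.shape[0], D.shape[0]))
    y, t = c.copy(), 1.0
    for _ in range(n_steps):
        grad = y @ gram - xD
        c_next = soft_threshold(y - grad / L, l1 / L)
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = c_next + ((t - 1.0) / t_next) * (c_next - c)   # momentum (extrapolation) step
        c, t = c_next, t_next
    return c
```

In an alternating scheme this would be the code update, with the dictionary update done separately between calls.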

keen pivot
#

I was trying to figure out if this sentence related to the "an" feature, and the translation was just a bit surprising.

Turns out though that "ad" is italian for "to", but only when the following word starts w/ a vowel (else it's "a"). So very much like "an"!

#

One big confusion I have is that the max-activating logit diff doesn't ever make sense, even in the last layer, where I'm just directly unembedding.

For example, in distribution, the "an" feature affects the log-prob prediction of vowel-starting-words, but only 1/30 max-diff tokens start w/ vowels, and the one that does is "ural", with no beginning space, so doesn't really count (I think, though I'd count "est" in "c'est")

keen pivot
#

Note: Transformer Lens uses twice as much GPU memory as just loading in the model normally. I think baukit is the way to go for larger models & we'll just need to multiply by GeLU for the activations.

keen pivot
#

Oh yes! Oh oh! ahaha Oh...yes

bitter turtle
#

right got some good parallelized training code setup, it's a bit weird and finicky to use since I wanted to do vectorized model ensembling and that doesn't vibe well with multiprocessing/shared memory sometimes so I came up with a bunch of hacks to get around it

#

brief walkthrough of current system if you want to use it (code on my gh)
current setup is you write some functions like this

import torch
import torch.nn as nn
import torch.nn.functional as F


class FunctionalSAE:
    @staticmethod
    def init(activation_size, n_dict_components, l1_alpha, bias_decay=0.0, device=None):
        params = {}
        buffers = {}

        # encoder and decoder both store one dictionary element per row
        params["encoder"] = torch.empty((n_dict_components, activation_size), device=device)
        nn.init.orthogonal_(params["encoder"])

        # torch.empty is uninitialised memory, so start the bias at zero explicitly
        params["encoder_bias"] = torch.zeros((n_dict_components,), device=device)

        params["decoder"] = torch.empty((n_dict_components, activation_size), device=device)
        nn.init.orthogonal_(params["decoder"])

        buffers["l1_alpha"] = torch.tensor(l1_alpha, device=device)
        buffers["bias_decay"] = torch.tensor(bias_decay, device=device)

        return params, buffers

    @staticmethod
    def loss(params, buffers, batch):
        # c: (batch, n_dict_components) non-negative feature activations
        c = torch.einsum("nd,bd->bn", params["encoder"], batch)
        c = c + params["encoder_bias"]
        c = F.relu(c)

        # normalise each dictionary element (row) to unit norm
        normed_weights = F.normalize(params["decoder"], dim=-1)

        x_hat = torch.einsum("nd,bn->bd", normed_weights, c)

        l_reconstruction = F.mse_loss(x_hat, batch)
        l_l1 = buffers["l1_alpha"] * torch.norm(c, 1, dim=1).mean()
        # bias decay penalises the encoder bias, not the decoder weights
        l_bias_decay = buffers["bias_decay"] * torch.norm(params["encoder_bias"], 2)

        return l_reconstruction + l_l1 + l_bias_decay, (c, l_reconstruction, l_l1)

and make a bunch of instances of it like:

all_models = []
for i, dict_size in enumerate(dict_sizes):
    models = [FunctionalSAE.init(activation_size, dict_size, l1_alpha) for l1_alpha in l1_alphas]
    all_models.append(models)

then, vectorize all the models with the same internal dimensions with FunctionalEnsemble (each ensemble on a different GPU) and send it to dispatch_on_chunk to do all the multiprocessing
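
(To show what vectorizing models with the same shapes buys you, here's a NumPy stand-in that stacks m models along a leading axis and computes all their losses in one batched einsum. This is not how FunctionalEnsemble is actually implemented, just the idea; in PyTorch something like `torch.func.vmap` over stacked params could play the same role. Names are illustrative.)

```python
import numpy as np

def ensemble_losses(encoders, biases, decoders, l1_alphas, batch):
    """Losses of m same-shaped sparse autoencoders, with no Python loop over models.

    encoders, decoders: (m, n, d); biases: (m, n); l1_alphas: (m,); batch: (b, d).
    Returns an (m,) array of reconstruction + l1 losses.
    """
    c = np.einsum("mnd,bd->mbn", encoders, batch) + biases[:, None, :]
    c = np.maximum(c, 0.0)                                    # ReLU
    normed = decoders / np.linalg.norm(decoders, axis=-1, keepdims=True)
    x_hat = np.einsum("mnd,mbn->mbd", normed, c)
    mse = ((x_hat - batch[None]) ** 2).mean(axis=(1, 2))
    l1 = np.abs(c).sum(axis=-1).mean(axis=-1)
    return mse + l1_alphas * l1
```

Models with different dict sizes can't share a stack, which is why each ensemble groups models with the same internal dimensions.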

#

it's weird but it works and ensembling is good

#

especially for our small dict sizes

keen pivot
#

@manic wind Did you ever work on your idea of contrastive learning of features and their effect on the output?

bitter turtle
#

Ooh, what was that

keen pivot
#

If you really want to do good model editing based off internals, it makes sense to just directly learn which neurons have which effect on the output.

So in this case, optimize for encodings of neuron activations and their respective logits to be similar.

manic wind
keen pivot
bitter turtle
manic wind
#

But would be excited to know if it has any hope

bitter turtle
keen pivot
bitter turtle
#

I guess that's kind of what I meant maybe slightly

keen pivot
bitter turtle
#

anyway on a different tack what hyperparameter ranges should we look at; I am pretty close to being able to start off a run on the pod @pallid current @bronze wraith

keen pivot
#

For which setting?

keen pivot
bitter turtle
#

like sparse feature extraction is a type of contrastive learning

keen pivot
bitter turtle
#

cool

keen pivot
#

The thing to look at here is the features/datapoint. Shouldn’t be <1 or >100

manic wind
#

I was considering using a package for a particular contrastive learning method called CEBRA that was actually designed for neuroscience experiments. That may or may not be the right move though

keen pivot
#

Totally don’t know about weight decay

bitter turtle
bitter turtle
keen pivot
#

Weight decay is kind of weird because of the distribution of neuron activations, but I agree that something like this should work

bitter turtle
#

Oh, also, does anyone know if the encoder turns out to be some scaling of the transpose of the decoder? it seems reasonable that it should, and i'm wondering if we can get away with doing (x * D.T) @ batch instead of E @ batch

manic wind
bitter turtle
#

could you explain more?

#

maybe in a different channel

manic wind
#

Sure! I actually don't have time now but can explain later if you start a channel (or DM)

keen pivot
#

The median activation of neurons in layer 2 (pythia post GeLU) has a few outliers, w/ the majority being negative because of the GeLU.

Weight decay would interfere w/ this reconstruction, but not if we had the data normalized (I think)

keen pivot
keen pivot
bitter turtle
#

Like if you normalise the columns/rows or whatever and then compare are they similar

keen pivot
bitter turtle
#

So like take pre-GELU data? Hmm idk might as well residual stream it

bitter turtle
keen pivot
bitter turtle
#

Normalise how here?

#

Like, centering, what?

keen pivot
#

Though maybe a weight decay & bias in decoder as you mentioned earlier?

bitter turtle
# keen pivot centered mean & std of 1

that seems like not a very good idea for MLP activations, especially since we don't have a bias on the decoder, and the directions probably don't start at the mean of the data. Also not sure what you gain from whitening the data here. Centering makes sense for the residual stream for sure. Also, not sure it makes much sense for pre-GeLU, since it probably severely impacts model performance and we wouldn't be looking at model activations anymore

bitter turtle
keen pivot
#

And in case I swapped the row & columns

keen pivot
#

But if the bias on decoder helps, then that may just be easier to do.

bitter turtle
#

and train on that

#

since we won't be searching for a dictionary the model 'uses'

#

like, the data is intrinsically not centered

bitter turtle
#

maybe look at mean cosine similarity? Like, check if E @ D where they're both normalised has ~ones on the diagonal?
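
(Sketch of that check, assuming encoder and decoder are both stored as (n_features, dim) with one feature per row, as in the FunctionalSAE snippet earlier. With that layout the product to look at is E_hat @ D_hat.T, and this computes just its diagonal. The function name is made up.)

```python
import numpy as np

def encoder_decoder_alignment(E, D):
    """Cosine similarity between each encoder row and the matching decoder row.

    E, D: (n_features, dim). Equals the diagonal of E_hat @ D_hat.T, where
    E_hat and D_hat are the row-normalised matrices.
    """
    E_hat = E / np.linalg.norm(E, axis=1, keepdims=True)
    D_hat = D / np.linalg.norm(D, axis=1, keepdims=True)
    return np.einsum("nd,nd->n", E_hat, D_hat)
```

Values near 1 across the board would suggest the encoder is roughly a per-feature rescaling of the decoder; dead features would probably need filtering out first.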

#

I just don't have any dict/encoder pairs to hand unfortunately:(

bronze wraith
#

I'm throwing in the towel with KSVD. I found it worked very well if you know ahead of time how many features there are, but it struggled if you increase that number by just a bit. In the attached diagram, the 10 standard basis vectors were taken as features in 10D space, and we randomly chose 3 to be active at the same time. The horizontal axis is the number of features you told it to find in each OMP search, ranging from 3 (the correct value) to 6. At x=3, it converges in 23 epochs to the correct features, but for x>3 it takes >100 epochs to converge and at 100 epochs the MMACS (mean max absolute cosine sim) is bad. (I ran these til convergence and the MMACS never became great.)

keen pivot
#

One reason we may want a changing l1 value is because larger dictionaries have higher feature-activations/datapoint.

We could instead vary the l1-set point: start it low & increase it until the feature-activations/datapoint are at most X (and maybe decrease it if it goes below, but I don't expect that). Then we can just vary this set point when doing parameter sweeping
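
(One way the set-point idea could look, assuming a larger l1 coefficient means fewer active features per datapoint. The multiplicative factor and function name are placeholders, not something we've tuned.)

```python
def adjust_l1(l1_alpha, active_per_datapoint, target, factor=1.1):
    """Nudge l1_alpha toward a target number of active features per datapoint.

    Too many active features -> raise the penalty; too few -> lower it.
    """
    if active_per_datapoint > target:
        return l1_alpha * factor
    if active_per_datapoint < target:
        return l1_alpha / factor
    return l1_alpha
```

This would be called every so many batches with the measured features/datapoint, and the sweep would then be over `target` rather than over l1 directly.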

keen pivot
#

@pallid current @bitter turtle @bronze wraith Getting really good results looking at residual stream layer 2 (thanks for pushing residual stream Aiden!)

Like types of features (and ablation effect):

  1. Single token detectors (strong effect on bigrams)
  2. German detectors (strong effect on specific German words)
  3. Words after places (other statistical bi/tri-gram stuff? Unsure)

I'm also getting like 1k 2.5k features.

#

For the image, it's the effect of ablating text. So ablating the location-token makes the next token's feature activation go to 0.

#

Importantly, this has a much stronger & intuitive effect when ablating the direction.
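
(One common reading of "ablating the direction" is projecting the feature's direction out of the residual stream activations before they continue through the model. A minimal sketch of that projection, with illustrative names and no claim that this is exactly what produced the plots.)

```python
import numpy as np

def ablate_direction(acts, direction):
    """Remove the component of each activation vector along `direction`.

    acts: (n_tokens, dim); direction: (dim,), normalised internally.
    """
    d = direction / np.linalg.norm(direction)
    return acts - np.outer(acts @ d, d)
```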

#

Notably I made two changes at once (which I need to untangle)

  1. switched to residual
  2. Added a bias to decoder (which may affect the logit-diff ablation effect)
keen pivot
#

Like look at these kids

#

German one

bronze wraith
keen pivot
#

Just not the underlying linear decomposition in your results

bronze wraith
keen pivot
#

It'd be a weird coding error for it to work at the right number of features and not the others

bronze wraith
#

yep!

#

it works shockingly well with the right number of features too!

keen pivot
#

Brass tacks though: I think we’re done. Like these results are huge. There’s a few due diligence things to do, better codebase, and applications, but everyday research can handle that.

bitter turtle
#

we might want to try a kind of hybrid approach where we use OMP/FISTA for the encoder and just regular optimisation/least-squares regression for the dictionary

#

hoagy was saying Anthropic was looking into that for some reason

#

in a similar vein we could tie the weights of the encoder to the decoder

bitter turtle
# keen pivot Brass tacks though: I think we’re done. Like these results are huge. There’s a f...

This seems an overly strong claim. I'd still like to see

  • more thorough analysis of good hyperparameters (inc. bias on decoder, bias norm etc)
  • large-scale (auto)interp on (all/sufficiently-representative sample of high MCS features)
  • causal scrubbing with learnt features on specific algorithmic tasks
  • an analysis of how complete the feature set we learn is (for example, performance when using idk only features > 0.9; might be a bit pointless because it's heavily overcomplete, but could equivalently do 'performance loss when replacing layer X with reconstruction from sparse dictionary')
  • also, if features are truly correctly represented by the TMOS model, we should probably have a sparseish covariance matrix when rotated into the top-k features for k<dimensionality of residual stream. not sure we have this
#

like, ideally, we should be able to identify circuits using this, and we haven't explored this yet

#

I'm not even convinced that residual stream representations are 'mostly' linear

#

Something something ROME edit failures or whatever

keen pivot
keen pivot
bitter turtle
#

wdym

bitter turtle
#

I guess I'm using the dictionary as a proxy for feature activation here which might be problematic idk feel like that should cancel out

keen pivot
# bitter turtle wdym

The residual stream carries information from layer 1 to 2 so I expect dictionaries learned for both will have a large amount of overlap

bitter turtle
keen pivot
bitter turtle
keen pivot
#

Though I saw several different types in layer 2 of Pythia

bitter turtle
#

oh, right, that, yeah there's going to be some difference but it's a continuum and for neighbouring layers I'd expect high similarity

bitter turtle
keen pivot
errant nova
# keen pivot Made my much better post! https://www.lesswrong.com/posts/wqRqb7h6ZC48iDgfK/tent...

I’m enjoying reading this! If you want us to promote this or any other materials y’all put out on the EleutherAI blog or social media just ask.

I noticed a couple weirdnesses in this paragraph:

To be clear, I am running datapoints through Pythia-70M, grabbing the activations mid-way through at layer 2's MLP after the GeLU, & running that through the autoencoder, grabbing the feature magnitudes ie latent activations

  1. ”Pythia 70M” is actually named Pythia 160M. I know having the model names change is annoying, but it’s much less annoying than having different models in the suite follow different naming conventions!
  2. The MLP and Attention layers in Pythia are computed in parallel. Does “after Layer 2’s MLP” mean “before Layer 2’s attention layer writes to the residual stream”?

Finally I was wondering if you had tried applying the Tuned Lens and if the Logit Lens was giving you better results, or if you hadn’t tried the Tuned Lens yet. IIRC the Logit Lens does work reasonably well for Pythia, but you should expect its behavior to fall apart when examining other models like BLOOM and GPT-Neo

#

Also, @keen pivot you now have the Research Lead role. The primary change this brings is the ability to pin and un-pin posts, edit channel descriptions, delete posts, and assign low-level roles.

Please use this power primarily to manage this channel, but if you feel like spending some time doing miscellaneous moderation tasks and cleaning up spam we’ll hardly complain. You don’t have the power to ban people; if someone needs banning that will need to be referred to a Staff member.

bitter turtle
#

(keep aiming to look into this and never doing it)

#

(maybe I will eventually, but not particularly confident about actually getting any useful results)

keen pivot
# errant nova I’m enjoying reading this! If you want us to promote this or any other materials...

Thanks!:) I'll probably take you up on getting it promoted, probably w/ a better post this week.

  1. Updated the Pythia 70M to 160M on both posts; thanks! [Edit: Looking at the table of the bottom of https://huggingface.co/EleutherAI/pythia-160m, pythia-70M is 6 layers & 160M is 12. I'm currently using the 6 layer one, so unable to square this]

  2. Correct. Last post was "mid-MLP" as in after the first linear layer & activation function. The latest post is the residual stream w/ much better results (link: https://www.lesswrong.com/posts/Q76CpqHeEMykKpFdB/really-strong-features-found-in-residual-stream)

  3. Regarding Tuned/Logit Lens: the latest post does get much, much better results here, this is for both the logit lens & ablating the feature direction. I would like to integrate Tuned Lens here though, especially for larger models & if applying to BLOOM/OPT/GPT-Neo.

[I'm writing this quickly because the results are really strong. I still need to do due diligence & compare to baselines, but it's really exciting!] …

pallid current
keen pivot
errant nova
bitter turtle
keen pivot
keen pivot
pallid current
#

something i'd be interested to test in the upcoming sweep is what would happen if for the MLP activations we constrained the features to be positive only.

#

you'd need to allow a bias to help it account for negative activations

#

but i wonder if it would help it towards correct solutions given that we expect most features to be pretty much entirely positive - it's kinda strange to think about what a negative valued feature would look like in the MLP, like working only in the negative range of the GELU would make it extremely sensitive to interference. i suppose you could have a 'feature' which cancels out the activation of other features at certain times, though maybe that would better be understood as just being a part of how the activation conditions for the positive features are defined

#

will check later today whether we're seeing significant positive activations in our current dicts

#

sent an email to wes gurnee about getting the distributed features that they found in early pythia layers which respond to particular n-grams

#

am now on PST time btw! and will be properly back in the swing of things on monday

bitter turtle
#

should be equivalent mod dead neurons ig

pallid current
#

why would preprocessing with relu have the same effect as constraining features to be positive?

bitter turtle
#

no incentive to learn negative features

pallid current
#

the kind of negative interference i'm imagining could still happen with positive-only activations i think

bitter turtle
#

might be misunderstanding what you mean by positive here

pallid current
#

just that each entry in the decoder matrix would be positive only
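
(The simplest way i can think of to enforce that is a projection after each optimizer step: clamp negatives to zero, then renormalise rows. Sketch below, names made up. A row that was entirely negative would end up all zeros and stay dead, so that case would need handling.)

```python
import numpy as np

def project_decoder_nonnegative(decoder, eps=1e-8):
    """Clamp negative decoder entries to zero, then renormalise each row to unit norm.

    decoder: (n_features, dim), one feature per row.
    """
    clamped = np.maximum(decoder, 0.0)
    norms = np.linalg.norm(clamped, axis=1, keepdims=True)
    return clamped / np.maximum(norms, eps)   # eps avoids dividing by zero for all-zero rows
```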

bitter turtle
#

if you mean 'only directions with +ve coefs' should be the same

#

well, no incentive to learn negative features

#

or, if negative features were learnt, they would be equivalent to 0

#

like, I don't see why the encoder should need to use the negatives in the activation at all

#

oh I am totally misunderstanding yeah constrain it to be positive ignore me 👍

#

like, I would see passing the activations through ReLU as "cleaning the GELU noise so the dictionary doesn't pay attention to it" and constraining the dict to have +ve coefs as "limiting the computation the dictionary can do", is that what you're getting at @pallid current?

pallid current
#

i'm wondering if there are activation vectors that the model learns to explain as (a * feature x - b * feature y) which doesn't fit my model of how features should work

#

but that also might be a problem for my understanding of features so i would only be doing it as a tentative test

pallid current
#

which bit? i'm really interested in what would make a good test for whether gelu helps with interference, probably along the lines of the test that you wrote up a couple of weeks ago

#

somewhat related but there's something i really want to test with that setup that i thought of last night. setup is you have multiple MLP layers in sequence which are trying to calculate n distinct features, where n is more than total number of neurons. question is whether, for each feature that is calculated, are they calculated in one layer only, or does the neuron e.g. do some preprocessing in layer 1 to calculate in layer 2, or perhaps use layer 2 to clean up interference from layer 1?

bitter turtle
#

total meaning 'more than both MLP layers combined' here?

keen pivot
#
    1. learnable l1 (based off features/activation)
    2. Re-init features when dead
    3. Perplexity difference
    • a. When replacing whole layer w/ reconstruction
    • b. Just high-MCS features (and potentially high-MCS datapoints only)
    4. Keep features if self-sim=0.9 after N_GB & not dead
    5. Decoder L2/simplicity term
    6. Affine decoder vs linear
    7. Changing toy model to better match current performance
    • a. write down current differences
    • b. Residual stream outperforms (maybe look at sae effectiveness with noisy data)
    8. compare perf over entire dataset Vs like QA or math @bitter turtle , what metric for performance did you mean
    9. Circuit finding & causal interventions (on algorithmic tasks?) @bitter turtle
    • a. How to find circuits if doing residual stream
    10. Tuned Lens
    11. Better wandb/aws setup
    • a. Easily get graphs on same page (Do we just manually do it every time? Groups mess it up)
    • b. naming scheme when uploading to aws isn't useful. Just timestamp, when model_name & layer would help
    12. Large model features
    • a. switch to bau-kit for >6B models
    • b. 1B-param features
    13. Auto-interp - good for hypothesis refining of input, but what about:
    • a. hypothesis testing on effect on output (ablating direction/logit lens)
    • b. marking "interesting" features (or categorizing features in general?)
    14. Aiden's TMOS k-covariance thing(?) @bitter turtle
    15. Talk to expert in dictionary learning
    • a. Anthropic
    • b. maybe MIT person Logan knows?
    16. Compare w/ Baselines: PCA & Reconstruction ICA
keen pivot
#

Aiden, could you clarify these things? (I think you'll need to explain some of them again, sorry. I can read back later, but currently pinning post. No hurry though)

bitter turtle
#

ok so the circuit/smaller dataset stuff is basically because I slightly feel like you'd get more linearity/truthful representation by a sparse basis on [some] limited algorithmic tasks, and I wanted to see if that was the case.

keen pivot
#

@pallid current when we do the big run to compare things, it’d be good to have a set seed for the data generation as opposed to just shuffling. Also, I believe the Pile is already shuffled by default

pallid current
#

agree on the seed, what does it mean to shuffle by default? just the parameter that goes into load_dataset(shuffle=True)?

bitter turtle
keen pivot
keen pivot
bitter turtle
keen pivot
#

If the big run handles all hyperparams we care about it should be good, but it’d be good to still have a set seed for the data shuffle if we care about replicating later or think of some other setting we’d like to compare to.
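
(Minimal sketch of a seeded shuffle over chunk indices, so a later run can replay the exact same data order. The function name is illustrative. If the data goes through Hugging Face `datasets`, `Dataset.shuffle(seed=...)` should give the same kind of guarantee.)

```python
import random

def seeded_order(n_items, seed):
    """Return a reproducible permutation of range(n_items) for a given seed."""
    order = list(range(n_items))
    random.Random(seed).shuffle(order)   # private RNG, leaves the global random state alone
    return order
```

Logging the seed alongside each run's config would be enough to reproduce the order later.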

keen pivot
#

We got recommended a paper written a few years ago ( @pallid current , I forgot the person's name?) for variational sparse encoding: http://proceedings.mlr.press/v115/tonolini20a/tonolini20a.pdf.

Section 2 gives a nice overview of related work to contextualize it, but I'm confused on the claim of what normal sparse coding is missing that VAE's help w/ (or why not just an AE like our work?). Haven't read more than 10 min, but would appreciate if someone else could look at it!

bitter turtle
#

@keen pivot willing to go through this with you so we can bounce thoughts off each other; first thoughts are that

  • we get more control over the latent space by using a VAE, which might result in better learning/convergence if we set our hyperparams right
  • typically VAEs use a more powerful recognition model than our current encoders; probably useful if transformer representations are fundamentally nonlinear/better denoising is needed than a simple RELU layer
    I believe @blazing yoke was talking about this in #eliciting-latent-knowledge at one point, they might have been the one to recommend you the paper. I'd be very interested in pursuing this.
bitter turtle
keen pivot
#

If you'd like to take ownership of implementing it, that'd be great. I'm currently doing different sparsity constraints atm, and maybe even tied embeddings if we're not doing that in the big sweep

bitter turtle
#

We can definitely do it in the big sweep.

#

Or, a sweep.

kind scroll
# keen pivot We got recommended a paper written a few years ago ( @pallid current , I f...

Hey - I mentioned the paper to Lee a few days ago so may have been me. I've been loosely keeping tabs on this thread. It's a good point that the paper didn't compare with normal sparse coding. A friend of mine wrote the paper, so I'd be very happy going through it with you. The answer aidan posted is pretty accurate. There are many principled differences between VAEs (which aren't really auto-encoders), and auto-encoders.

bitter turtle
kind scroll
#

Generally, and in my experience, autoencoders need a lot more hacks for learning the kinds of representations you want. They suffer from things like mode collapse, and it turns out the isotropic gaussian latent space is kind of an okay choice in VAEs.

#

VAEs are super easy to implement though - I could show you the ML/code side in < hour, and the probabilistic theory in a couple hours.

keen pivot
kind scroll
bitter turtle
#

oh was about to say found one but would still value a summary paha

kind scroll
#

Not sure if you're still looking for thoughts on the paper @bitter turtle - I hadn't seen it before but I've skimmed through it. Could you share the summary you found?

#

Intuitively, from my perspective, it follows a different probabilistic derivation than the VAE paper

bitter turtle
#

not the summary

bitter turtle
#

one sec let me bring up the messages you sent originally

#

I guess you were approaching it from a slightly different perspective

blazing yoke
#

Yeah, I ended up deciding the right thing to implement was probably this.

#

But I ended up doing other projects first.

bitter turtle
blazing yoke
#

There don't exist a lot of language VAE architectures.

#

But um, Katherine made a flow encoder thing you could add to it that would make it implement the right inductive bias for language.

blazing yoke
blazing yoke
#

So you have to use like, an iVAE or one with hyperbolic geometry

#

We were going to combine Katherine's flow model with Optimus.

#

You'd probably be better off asking her for the technical details, since I didn't implement any of the flow model.

#

@opal basin

opal basin
#

Oh, the normalizing flow part was to allow the autoencoder to learn distributions with arbitrary/weird shapes while retaining the ability to sample latents

#

Because with a VAE your posterior is normally diagonal Gaussian which is not good for language

#

We haven't actually tried it on text yet

#

If you don't care about the ability to sample from the distribution of latents, or determine the information content/likelihood of a latent, you can use a normal autoencoder (not VAE) which also lets it take on arbitrary shapes

blazing yoke
#

Implementing this was still on our todo so, would be very happy if you did.

bitter turtle
#

we'd be doing it for transformer internal representations not text, not sure if the non-gaussianity thingy still holds there

opal basin
#

ahhh

bitter turtle
#

why don't gaussians work for text

opal basin
#

tbh i'm not clear on the exact details but they impose a Euclidean geometry on the space and for text you want the ability to stuff trees into the latents, which means you want hyperbolic geometry. this is kind of conjecture

#

"Adding Gaussian noise imposes Euclidean geometry and this is empirically not good for text" is the part that isn't conjecture i think.

kind scroll
bitter turtle
#

riiiight so would the conjecture imply that transformer internal reps are well described by something with hyperbolic geometry, so maybe it still applies here? we might end up using it if it works better, im adding it to the evergrowing list of things to test at some point in the future

opal basin
#

nods

#

I wanted to be able to sample from the distribution of latents and determine their likelihood so I took a normal autoencoder and added an RNODE (https://arxiv.org/abs/2002.02798) that converted from its latent space to N(0, I) and back.

blazing yoke
#

But um, to my memory I said a couple different things about VAEs and ELK. The most relevant one is probably that the use of a decoder only transformer makes ELK harder because you can't easily get embeddings and explore the latent space of the model and characterize it. A VAE is useful because it lets you estimate the information content of the latents.

opal basin
#

The sampled fakes actually looked like real images, which you usually don't get fully with a VAE!

blazing yoke
#

Which for ELK is important because it lets you figure out if your translator is hiding detail. If there's a mismatch between the amount of information in the latent and the explanation, you know something funny is going on.

bitter turtle
blazing yoke
#

Wasn't there some complication you ran into with scaling that caused you not to use it for text right away?

opal basin
#

With a Gaussian VAE you try to approximate the posterior by sampling from the N(0, I) prior and decoding that, which you have tried to make the posterior (encoder output) resemble, and this kind of works

bitter turtle
#

yeah but the RNODE thing

opal basin
#

With my flow autoencoder you can sample from N(0, I) and run it through the flow model to obtain totally new latents to decode.

#

RNODE is a continuous normalizing flow, an invertible map between two arbitrary probability distributions.

#

In this case it maps between N(0, I) and whatever the learned distribution of encoder outputs is.

bitter turtle
#

oh shit ok 👍 magic black box to convert distributions gotcha

kind scroll
opal basin
blazing yoke
#

(Hers does though)

kind scroll
#

for sure

blazing yoke
bitter turtle
kind scroll
#

Had you heard of the spike and slab distribution before?

bitter turtle
#

nope

kind scroll
#

When Francesco brought it up for his paper he said it was an (obscure?) Physics thing.

bitter turtle
#

but it seems close to what superposition is saying; toy data for testing SAE methods is generated using ~that distribution

errant nova
#

I’m familiar with spike-and-slab regression

bitter turtle
#

oh?

errant nova
#

[deleted article because it’s not very accessible, let me look for a better one]

bitter turtle
#

Is that just linreg with a spike+slab prior on the coefs

errant nova
#

Yeah

#

It’s spiritually similar to ridge and lasso

#

It’s used to quickly work through a large number of mostly useless variables
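
A minimal sketch of what sampling from a spike-and-slab distribution could look like, for anyone who hasn't met it before. The function name, the Gaussian slab, and the parameter values (`p_active`, `slab_scale`) are illustrative assumptions, not taken from the paper or from the toy-data code mentioned above.

```python
import numpy as np

def sample_spike_and_slab(n_samples, n_features, p_active=0.05,
                          slab_scale=1.0, seed=0):
    """Each coefficient is exactly zero (the spike) with probability
    1 - p_active, and drawn from a Gaussian (the slab) otherwise."""
    rng = np.random.default_rng(seed)
    # Bernoulli mask deciding which coefficients are "on"
    active = rng.random((n_samples, n_features)) < p_active
    # Slab values; only kept where the mask is on
    slab = rng.normal(0.0, slab_scale, size=(n_samples, n_features))
    return active * slab

codes = sample_spike_and_slab(10_000, 64)
frac_nonzero = (codes != 0).mean()  # close to p_active
```

The resulting codes are mostly exact zeros with occasional large values, which is the same shape of sparsity the superposition toy models assume for their features.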

#
bitter turtle
errant nova
#

At the end of the day it’s like quibbling over the names of different types of knives. Maybe useful for people who really care about knives but you probably just want to make sure it’ll work and not cut you.

bitter turtle
#

pahaha fair enough that's mildly encouraging

bitter turtle
bitter turtle
#

sound

pallid current
# bitter turtle Also a note on AEs vs VAEs: https://stats.stackexchange.com/questions/324340/whe...

bit confused about this in the answer: 'VAEs are known to give representations with disentangled factors [1] This happens due to isotropic Gaussian priors on the latent variables. Modeling them as Gaussians allows each dimension in the representation to push themselves as farther as possible from the other factors.' as i understand it, if you have an isotropic gaussian prior then the distribution should be invariant to rotation, which means that there's nothing distinct about the particular basis. where am i going wrong?

pallid current
#

also finding it difficult to get an intuition for the role of the pseudoinputs in the VAEs. @kind scroll if you've got time to walk us through the paper some time in the next week that'd be brilliant!

opal basin
#

I suspect the answer is wrong and it's actually due to the posterior being diagonal and thus not rotation invariant

bitter turtle
#

instead of having a purely Gaussian prior you have a set of learned pseudoinputs which you feed into the latent space predictor to get your prior(s) which you then mix like mixture of gaussian

#

This is better because (?) and it incentivises high-variance posteriors/latents/eh

bitter turtle
kind scroll
cosmic moon
bitter turtle
kind scroll
#

I'll link a couple papers later today. The short answer is: it doesn't really. It's partly just nice for an analytical form of the evidence lower bound.

cosmic moon
#

re PCA as a baseline, fitting a GLoRA decomposition might be interesting to investigate as well. https://arxiv.org/abs/2306.07967

keen pivot
#
  • Review of different methods:
    • l_1/2 - Nothing noticeably different from l_1 (no better sparsity for the same reconstruction loss). Worse high MCS (though maybe optimizing for an even better alpha would work). Didn't look at individual features.
    • Adding noise - can't compare (sparsity, reconstruction loss) because it's at a strict disadvantage to the non-noise one (would need to compare on clean data for both). Learns maybe 1/3 number of features (boo) but 0 dead features (yaaay).

Looking at individual features, many appear just as meaningful as l1, however, the logit lens is much worse for some of them (Logit lens being worse isn't as meaningful as ablated direction being just the same). Additionally, there are like 10 features for (beginning and end of first sentence), whereas l1 only has 1 of those.

Could try tied embedding for both l1 & noise to compare.

#

Though, I'm unsure how normalizing the weights of the decoder (which are now the encoder) will affect things?

bitter turtle
#

awesome

#

I've just about got the big sweep code debugged we can set it off tomorrow

keen pivot
#

I guess I'm confused about implementing the tying. Like I want them to be in the same direction, but I'd be okay w/ different biases. Same w/ only normalizing the decoder

keen pivot
bitter turtle
#

can do either not sure which one we should do first

bitter turtle
keen pivot
#

For tracking, I want features/datapoints ('sparsity') & dead features:

    dead_features = (dict_levels.detach().mean(dim=0) == 0).count_nonzero().item()
#

Default lr has been good to me.

pallid current
keen pivot
pallid current
#

ok fair, in that case i think we'll want to measure the average activation, or total number of non-zero activations to understand whether they're dying

#

maybe even also average activation, given that it's active

keen pivot
#

and average activation for dead features will be 0

pallid current
#

like, we want to distinguish between 99.99% 0 activity and something a tiiiiny bit, vs usually 0 with occasional strong activation, which could be a healthy feature
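
The metrics discussed above could be collected together roughly as follows. This is a sketch using NumPy rather than the project's actual logging code; `feature_activation_stats` and its dictionary keys are hypothetical names, and `codes` is assumed to be a (datapoints, features) array of post-ReLU activations.

```python
import numpy as np

def feature_activation_stats(codes):
    """codes: (n_datapoints, n_features) array of non-negative activations."""
    active = codes > 0
    n_active = active.sum(axis=0)  # how many datapoints each feature fires on
    return {
        # average number of features active per datapoint ("sparsity")
        "features_per_datapoint": active.sum(axis=1).mean(),
        # features that never fire on this data
        "n_dead": int((n_active == 0).sum()),
        "activation_frequency": n_active / codes.shape[0],
        "mean_activation": codes.mean(axis=0),
        # mean over active datapoints only; 0 for features that never fire
        "mean_when_active": codes.sum(axis=0) / np.maximum(n_active, 1),
    }

# Feature 0 is dead, feature 1 is always weakly on,
# feature 2 is rare but strong when it fires.
codes = np.array([
    [0.0, 0.01, 0.0],
    [0.0, 0.01, 0.0],
    [0.0, 0.01, 0.0],
    [0.0, 0.01, 4.0],
])
stats = feature_activation_stats(codes)
```

Looking at `activation_frequency` and `mean_when_active` together separates a rare-but-strong feature from one that is barely active, which the plain mean alone would blur.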

keen pivot
#

I'm unsure, could be very rare features

pallid current
keen pivot
#

Update: Ya, adding noise (instead of an l1) does produce sparsity & sometimes meaningful features, but not as good as l1 & importantly the logit lens just sucks w/ it (compared to l1)

keen pivot
keen pivot
#

@bitter turtle, early results are in & tied embedding looks quite good. I believe you're the one that suggested both tied embeddings & residual, both of which have been really good, so thanks!:)

Additionally, tied embeddings may allow it to work on the MLPs if we want.

#

Of biggest note: I'm getting ~1.4x as many features w/ the same amount of data. Plus the added benefit of reading in from the same direction we're writing out (though I have two different biases because they're different shapes)

Of also big note: average features per token went from 100 (untied embedding) to ~20 (tied embedding) when optimizing over L1_alpha for high-MCS.

keen pivot
pallid current
#

sweet, how many you getting on what activation dim?

keen pivot
#

pythia-70M, so 500

#

Looks like 1k for the 2k dict

#

Probably more if I went larger

#

And the features & logit lens makes a lot of sense.

pallid current
#

fiiiiiinally managed to replicate @keen pivot's interpretations of high MCS neurons using the openai autointerp system

#

no idea what the bug was but yeah should be able to have more confident autointerp results coming along in the next few days and can actually help with the main bulk of the work

#

i'm a bit behind but i think the first thing to do is a comparison of high MCS / low MCS? most important is to do sparse coding vs neuron basis, but i think we need to adjust for the negative biases before we can call this a fair comparison

keen pivot
bitter turtle
#

For sure you would be less able to view the directions discovered by the SAE as wholly meaningful if it turned out the bias mattered a lot but I'm thinking that even in that eventuality you could still get some mileage out of the representation.

pallid current
pallid current
#

am more excited by just comparing neuron basis with negative bias + relu applied, but will also compute the raw comparison

pallid current
#

the features are so much richer now that it's fixed 😊

#
Feature 8, explanation='phrases and keywords related to the legal system, law and legislature.'
Feature 8, score=0.29
Feature 9, explanation=' underscore characters, especially in the context of code or programming syntax.'
Feature 9, score=0.26
Feature 10, explanation='terms related to astronomy and cosmology.'
Feature 10, score=0.41
pallid current
#

this is without the aforementioned adjustments to the neuron basis so take with a pinch of salt

#

though also, i'm not seeing a strong relationship btwn MCS and autointerp score

bitter turtle
#

How many tests is that, just wondering what levels would give less noisy data

pallid current
#

roughly 60, 60, 120. can scale up easy tho i think the difference is very clearly significant

#

neuron basis vs sparse code is like 5 sig diff in means

#

missed a few actually so new one is

keen pivot
#

That’s awesome!

#

I also feel proud of the “logan_ae”, haha

pallid current
pallid current
#

hmmmm first indication is that adding the relu(activation + bias) makes the neuron_basis interp worse, good if true but very surprising to me

#

i suppose it could cut off legit activations, making the simulation less accurate

#

still, need to do a bit of checking before i feel confident

pallid current
#

ok i think it's working as intended !

#

graph's a mess but it's looking really good! more hyped about sparse coding than i've been in ages!

keen pivot
#

How'd you do neuron-basis-bias?

pallid current
#

just took random biases from the encoder and added them to the neuron output, then added a relu

#

all biases were negative so it makes some level of sense

#

i realised that the bias should be scaled by the norm of the encoder tho, might implement that tonight, otherwise tomorrow morn
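
A rough sketch of the neuron-basis-bias baseline as described: borrow biases from the learned encoder at random, add them to the raw neuron activations, and apply a ReLU. The function name and arguments are hypothetical, and dividing the bias by the encoder row norm is only one reading of "scaled by the norm".

```python
import numpy as np

def biased_neuron_basis(neuron_acts, encoder_bias, encoder_row_norms=None, seed=0):
    """neuron_acts: (n_tokens, n_neurons); encoder_bias: (n_dict,)."""
    rng = np.random.default_rng(seed)
    n_neurons = neuron_acts.shape[1]
    # Pick one learned encoder bias at random for each neuron
    idx = rng.choice(len(encoder_bias), size=n_neurons, replace=True)
    bias = encoder_bias[idx]
    if encoder_row_norms is not None:
        # Assumed form of the rescaling: divide by the matching encoder row norm
        bias = bias / encoder_row_norms[idx]
    return np.maximum(neuron_acts + bias, 0.0)

acts = np.array([[0.5, 2.0], [3.0, -1.0]])
out = biased_neuron_basis(acts, encoder_bias=np.array([-1.0]))
```

With a negative bias, small activations are clipped to zero, which is the same denoising the dictionary features get from their own biases.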

#

wonder if there's a more principled way. in the openai autointerp paper they get the gradient of the feature wrt the interp score, but i think that would then be unfair in the other direction, as well as sounding like a pain

#

you could also target a particular sparsity of activation

bitter turtle
#

Would love to see the ICA and PCA baselines, but that looks crazy good!

keen pivot
#

@pallid current the autointerp develops hypotheses from max-activating examples, right? For me it's misleading because the max-activating are sometimes too specific, so I look at a uniform distribution.

#

Also, is this the MLP dictionary?

#

Do you have a list of goals or milestones for auto-interp?

pallid current
pallid current
#

meeting lee in a sec so will try use that to write a proper plan of things to do but i plan to:

  • do a bit more work to try and make sure that the comparison with e.g. neuron basis is a fair one
  • write a quick post showing the results
  • also, if the big sweep goes well, getting autointerp scores for sparse coding and a few baselines on lots of different layers would be worthy of a major writeup, possibly a paper imo, i think it would get a lot of people interested
#

will try to have the preliminary writeup soon and then talk to anthropic with that in hand

keen pivot
#

These are the things I thought of for auto-interp. I can try implementing them in a week or two (though looks like you've got PCA/ICA covered!):

  • Improving prediction of input:
    • Include ablated context one-token-at-a-time effect
  • Predicting ablated output
  • PCA & ICA (what do top components look like?)
  • "Interesting" directions (like accents, medical-speak, SE-speak) & in general categorizing features
pallid current
# pallid current Will run this morning

getting OOM errors on PCA 😦 not sure what's changed, might just be using more data now, will try both switching to @bitter turtle's batched version and just reducing the amount of data fed in.

bitter turtle
#

Re: dead neurons, I'd also like to get around to doing the dead-neuron reinitialisation (and maybe low-MCS reinitialisation?) at some point

pallid current
bitter turtle
pallid current
#

what's the measure that would tell us whether MCS is good? just whether reinitialization of low-MCS produces better dicts?

bitter turtle
#

sure

#

definitely sketchy, I think we should also look into just accumulating tons of other possible (non-ai-supervised?) metrics

pallid current
#

in current runs, are we still seeing recon loss and l1 loss rising through the run?

#

i feel like that should be a bigger part of our metrics than it is

bitter turtle
#

^

pallid current
#

suppose at large intervals we could run the perplexity check that we spoke about before, run some comps to see how well the model is able to function with different dicts

#

btw recently tried the method of applying biases from the encoder but where the bias is scaled down by the norm of the autoencoder, doesn't help the autointerp on the neuron basis at all

keen pivot
#

Regarding bias, the tied embedding has this for the encoder:

#

I looked at a few in the right cluster & they're typically dead (~0 non-zero activations). This is ~400/2k features. So maybe just originally init bias of encoder to uniform[-1, -3]

#

I will note, I've noticed the one odd positive-bias feature before. It kind of looks like a feature regarding the input, but doesn't have a meaningful effect on the output nor a meaningful logit lens.

#

I was going to look at a data-centric viewpoint: given a datapoint, how many features activate? Do those features make sense? For example, if most of them make sense, but this positive bias one always activates, then that's a clue that there's funny business going on.

bitter turtle
keen pivot
bitter turtle
#

well, ideally we want to find directions corresponding with useful features, and the point of the bias + relu is to act as a bit of a noise-reducer to cancel out the interference effect of other features, which shouldn't* be anything too significant

*we should check the variance of activations, but if they are even like less than 100 (or 1000?) or something (ballpark orders of magnitudes) then the bias is doing something more than basic denoising

pallid current
#

agree that initializing negative biases doesnt sound good for similar reasons to aidan. like, you find a direction, and then make the bias negative to remove the noise, but if you just start with large negative bias you're quite likely to just find nothing at all

pallid current
keen pivot
keen pivot
keen pivot
pallid current
#

only applies to resid stream

#

which are the graphs from?

pallid current
keen pivot
#

Residual stream, tied

bitter turtle
#

So short term probably not is my take

pallid current
#

i'm still interested in this on theoretical grounds because it seems like we should expect features to be composed of a small-ish number of neurons if it is using the non-lin in the way we expect. it got semi-shelved when we found that the features sparse coding was learning were weirdly less sparse than random vectors. but i understand that seems to have been an artefact of working with the nanoGPT model?

bitter turtle
#

uh. how are you defining sparse here?

#

im just confused how they can be less sparse than random vectors, which should just be not sparse

pallid current
#

maximally nonsparse by this metric would be every element being equal

#

comes out basically the same as entropy
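
One way to write down an entropy-style sparsity measure of this kind, as a sketch. The exact metric used in the project isn't given here, so treating the normalised absolute weights as a probability distribution is an assumption, and `weight_entropy` is a hypothetical name.

```python
import numpy as np

def weight_entropy(v, eps=1e-12):
    """Entropy of |v| normalised to sum to 1.
    High when weight is spread evenly, near 0 when one element dominates."""
    p = np.abs(v) / (np.abs(v).sum() + eps)
    return float(-(p * np.log(p + eps)).sum())

n = 512
uniform = np.ones(n)                                  # every element equal
one_hot = np.zeros(n)
one_hot[3] = 1.0                                      # a single neuron
random_dir = np.random.default_rng(0).normal(size=n)  # random Gaussian direction

h_uniform = weight_entropy(uniform)    # approximately log(n), the maximum
h_one_hot = weight_entropy(one_hot)    # approximately 0
h_random = weight_entropy(random_dir)  # in between
```

Under this measure a random Gaussian direction scores below the all-equal vector, so a learned feature can come out "less sparse than random" just by spreading its weight more evenly.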

bitter turtle
#

hmm. is that the definition we want for sparse here?

pallid current
#

maybe not exactly but i think it would give a strong signal if the features were focussing on only a few vectors

#

what dyou think's missing there?

bitter turtle
#

some sort of centering at zero, but yeah would give out strong signals for sure

#

don't think you can read too much into the sparse coding vs random vectors thing tho, it seems that random vectors should be close to totally unsparse by a better definition

#

I mean, I literally think 'normal sparsity but close-to-zero' seems reasonable. (or maybe if some coef contributes say 1/100*n_dims of a vectors norm or something arbitrary like that)

pallid current
pallid current
#

current residual stream results. here the '''neuron basis''' seems to be good?? and matches the sparse-coded features.. but this time the top MCS features are significantly better, and both far out perform random. the green is the sparse coded features after 1 epoch which is somehow worse than random (low sample size)

keen pivot
#

Are you able to look at specific examples? There's 3 in the neuron basis that have a score of 0.5 (maybe 6 for .5 & 5.5 in total?)

pallid current
#

i think these results are basically positive and show that we are getting something legitimate from our dictionaries, and we need to both scale up to more layers, and refine our learning process

keen pivot
#

Also, would -1 mean reversing its answer would give us 1? ie reverse_score = abs(10-oldscore) or something like that?

pallid current
#

yeah like the explanation is perfectly anticorrelated with the activations

keen pivot
pallid current
# keen pivot Are you able to look at specific examples? There's 3 in the neuron basis that ha...

yeah i can give you the ids and explanations. ids are [1, 7, 37, 66, 69, 90] and explanations are all pretty similar:

  • numeric values, sequences, and lists.
  • dates, particularly those written in the format of month and year
  • numeric values and codes, including year dates and programming syntax.
  • numerical data and product identifiers
  • numerical values and sequences
  • numeric values, including single digits, multi-digit numbers, and percentages.

keen pivot
#

Something else I've noticed in the residual stream is we're picking up the weird language model stuff, like the 8bit guy? Oh here: https://timdettmers.com/2022/08/17/llm-int8-and-emergent-features/

#

For Pythia 1.4b, there's a HUGE activation of 1.5k. Typical activations are in the 1-80 range, w/ maybe 5 as the median max value

#

One hypothesis is we're picking up on what Tim called these outlier dimensions that models have, which coordinate in the 6Billion range & get really big. Smaller models have lots of these, but they're not coordinated, so you get several of them in the 60 activation range.

keen pivot
#

I'm curious how it does regarding the more style-based ones. Like the SE style or Chemistry, etc

pallid current
keen pivot
#

Which is exactly what I noticed: Pythia-70M has several "beginning & ending of sentence" in the range 60 activations. Pythia 1.4b has 1 of the same type that's 1500 & another that's in the 60 activation range.

#

Two things of due diligence for myself:

  1. Check that the direction mainly comes from one dimension (which I believe Tim is claiming)
  2. Check the amount of features of this type for both models, plus their activation range
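
The first check could be sketched as the fraction of a direction's squared norm carried by its largest few dimensions. The function name and the synthetic vectors below are purely illustrative; the real check would use the learned decoder rows.

```python
import numpy as np

def top_k_norm_fraction(direction, k=2):
    """Fraction of the squared L2 norm carried by the k largest-magnitude dims."""
    sq = np.asarray(direction, dtype=float) ** 2
    top = np.sort(sq)[::-1][:k]
    return float(top.sum() / sq.sum())

rng = np.random.default_rng(0)
# Synthetic "outlier" direction: two large dimensions plus small noise
outlier = 0.01 * rng.normal(size=512)
outlier[[10, 200]] = [3.0, -2.0]
# Synthetic ordinary direction: weight spread over all dimensions
ordinary = rng.normal(size=512)

frac_outlier = top_k_norm_fraction(outlier, k=2)
frac_ordinary = top_k_norm_fraction(ordinary, k=2)
```

A value near 1 for small `k` would support the claim that the feature is mostly one or two residual dimensions.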
pallid current
# keen pivot A bit more subtle than that. He claims that these outlier dimensions exist in mo...

ahh cheers yes i should have kept reading. your 1600 values is way off the charts still tho:
"The phase shift happens around 6.7B, where 100% of layers use the same dimension for outliers. At this point, a couple of things happen rapidly:

Outliers become very large quickly. They grow from about 15 for a 6B model to about 60 for a 13B model. OPT-66B has outliers of size around 95, which indicates this growth phase is temporary."

#

can imagine these things being very sensitive to training setups tho

keen pivot
#

Ya this first one (the 1500 one) is mainly two dimensions:

#

For comparison, here's top-MCS 100 which is the feature "the [noun]"

pallid current
#

plotting the contribution of each dimension?

keen pivot
#

The weight of the decoder & it's tied

#

This is the other one which is activation 50 for only first words (as opposed to beginning & ending):

keen pivot
keen pivot
#

Though looking back at the google sheet, It (ie the feature "beginning & ending of sentences" w/ high activations) is also the top-MCS feature for layer 2,3,4 MLP

#

Additionally there's a high-positive bias one w/ similar properties:

#

Notably there are several overlapping residual dimensions (e.g. 568, 516, 468, 1326, 934, etc) w/ the first high-MCS & high-activations one.

#

Notably notably, they seem like opposite of each other when you look at activating examples.

bitter turtle
#

Wrt getting disentangled features for VAEs: https://arxiv.org/abs/1907.04809

keen pivot
bitter turtle
keen pivot
pallid current
#

more graph horror but finding that the sparse coding directions strongly beat ica and pca, as well as all others, on pythia layer1 mlp

#

principal components start interpretable but fade quickly

pallid current
#

also started separating out the top vs random scoring. results are generally good, the score goes down, but for any where the top score is >0.2, the random score is also almost always solidly positive

#

average drop of maybe 0.1

keen pivot
#

Oh that's awesome! Glad you've been working on this!

keen pivot
#

Their predicted text is usually pretty easy too

#

Oh, would be good to rerun toy example stuff on tied embedding autoencoder.

#

I think I'm getting dead neurons by overtraining.

pallid current
#

i'm not seeing a correlation between activation level and interp score 🤔, also not with interp and MCS

keen pivot
keen pivot
#

Things are looking good on the MCS & MMCS front. For residual, we're getting MMCS of 0.8 for 4k (d_model=500), 50% above 0.9, and the histograms look really good.
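
For reference, a sketch of how MCS and MMCS could be computed, assuming MCS is each feature's maximum cosine similarity to any feature in a second dictionary and MMCS is the mean of those. The function names and the random dictionaries are illustrative only.

```python
import numpy as np

def max_cosine_similarities(dict_a, dict_b):
    """Rows are features. For each row of dict_a, return the highest
    cosine similarity with any row of dict_b."""
    a = dict_a / np.linalg.norm(dict_a, axis=1, keepdims=True)
    b = dict_b / np.linalg.norm(dict_b, axis=1, keepdims=True)
    return (a @ b.T).max(axis=1)

def mmcs(dict_a, dict_b):
    """Mean of the per-feature max cosine similarities."""
    return float(max_cosine_similarities(dict_a, dict_b).mean())

rng = np.random.default_rng(0)
dict_a = rng.normal(size=(8, 16))
# Same features, shuffled and rescaled: every feature has a perfect match
dict_b = 3.0 * dict_a[rng.permutation(8)]
# Unrelated dictionary: matches are only as good as chance allows
dict_c = rng.normal(size=(8, 16))

mcs_ab = max_cosine_similarities(dict_a, dict_b)
frac_above_09 = float((mcs_ab > 0.9).mean())  # the "50% above 0.9" style statistic
```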

#

I'm getting the same results for layers 1-4, will additionally do layers 5 (& 0 for the heck of it). Usually I'd need a lot more data to get almost these good results for untied embedding, and the features/token here is ~20 where previously it was ~100

#

20 also seems right because you have features like "a-letter words" & "after a-letter words" along w/ a bunch of low-activation "noise" which also tends to make sense on its own, but also might point towards a problem.

keen pivot
#

@bronze wraith , would you be able to think through the math of a tied embedding & what would be best to best reconstruct the original data?

As an example, suppose the weights are [1,2] & the latent feature is 10. Then the reconstructed feature is [10,20].

But if the input was [10,20] then the latent feature would be 50. I guess you could have a negative bias of -40 for the encoder for it to work on this example, but that's as far as I got & thought you'd be better at this kind of thing.

#

(additionally, we're normalizing the decoder weights because of the l1 penalty for typical dictionary learning, but that now means we're also normalizing the encoder weights)
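
Working the example through numerically, as a sketch for a single feature with no bias and no other features interfering: with the raw weights the tied encoder overshoots, but once the direction has unit norm, reading in and writing out along it are consistent.

```python
import numpy as np

w = np.array([1.0, 2.0])    # un-normalised dictionary direction
x = np.array([10.0, 20.0])  # input lying exactly along w

# Tied weights without normalisation: encode with w, decode with w
latent_raw = x @ w          # 10*1 + 20*2 = 50
recon_raw = latent_raw * w  # [50, 100], overshoots by ||w||^2 = 5

# Same direction with unit norm, as the decoder normalisation enforces
w_unit = w / np.linalg.norm(w)
latent_unit = x @ w_unit            # equals ||x|| for an input along w
recon_unit = latent_unit * w_unit   # recovers [10, 20]
```

So in this one-feature case the normalisation removes the need for the -40 bias. With several non-orthogonal features active at once the projections interfere, and that is presumably where the encoder bias still matters.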

bronze wraith
#

Also, I'll give it a shot, but I'm about to leave for a weekend trip, so I might not have much to say until next week!

keen pivot
#

The encoder and decoder are the same linear transformation, but transposed

keen pivot
#

Here are the relevant bits of the model definition.

        self.decoder = nn.Linear(n_dict_components, activation_size, bias=True)
        # Create a bias layer
        self.encoder_bias= nn.Parameter(torch.zeros(n_dict_components))

        # Encoder is a Sequential with the ReLU activation
        # No need to define a Linear layer for the encoder as its weights are tied with the decoder
        self.encoder = nn.Sequential(nn.ReLU())

    def forward(self, x):
        c = self.encoder(x @ self.decoder.weight + self.encoder_bias)
        # Apply unit norm constraint to the decoder weights
        self.decoder.weight.data = nn.functional.normalize(self.decoder.weight.data, dim=0)

        # Decoding step as before
        x_hat = self.decoder(c)
        return x_hat, c

bronze wraith
#

One thing I think I can say about tied embeddings: if your encoding matrix is M=[I_n, -I_n], I think you get a perfect reconstruction (i.e. M^T ReLU(Mx)=x for all input vectors x).

#

This works because Mx is [x, -x], whose negative terms get zero'd out by the ReLU, and then when it is multiplied by M^T you get back the original.

#

This is a solution to "given that our embeddings will be tied, what dictionary features could we learn to get a good reconstruction", but doesn't account for a) the L1 penalty, or b) the noisiness of the training process. And this solution isn't unique: you can get a perfect reconstruction with tied embeddings M=[U, -U] for any unitary matrix U (https://en.wikipedia.org/wiki/Unitary_matrix).
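
The claim is easy to check numerically. The sketch below uses a random real orthogonal matrix as U (obtained from a QR decomposition) and no biases; the sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
# Random orthogonal U: the Q factor of a Gaussian matrix
U, _ = np.linalg.qr(rng.normal(size=(n, n)))
M = np.vstack([U, -U])            # tied encoder/decoder, shape (2n, n)

x = rng.normal(size=(100, n))     # batch of arbitrary inputs, one per row
codes = np.maximum(x @ M.T, 0.0)  # ReLU(Mx): keeps the positive parts of Ux and -Ux
x_hat = codes @ M                 # M^T c, written for row vectors
max_err = np.abs(x_hat - x).max()
```

The error comes out at floating-point precision because ReLU(Ux) - ReLU(-Ux) = Ux, and U^T U is the identity.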

#

Those are some examples I thought through before, but Logan I'm sure you had some other questions in mind too. What other angles do you want to think through this from?

keen pivot
bronze wraith
keen pivot
#

Also, no problem if you just enjoy your trip!

bitter turtle
# bronze wraith This is a solution to "given that our embeddings will be tied, what dictionary f...

@keen pivot generally we should note that the L1 penalty basically solves this rotation-invariance if the underlying latents are sparse (rotation doesn't preserve L1 norms, and L1 norms are minimised when the rotation produces as sparse data as possible). Not certain how interference affects this when we have an overcomplete basis, but for e.g. binary features with some constant maximum interference between distinct features you get minimum necessary bias for total denoising when the rotation is aligned to the underlying overcomplete basis (this is basically the reason that I think L2 norm on the bias is a good idea; also in this case - perfect denoising - you get perfect reconstruction with the tied weights)

Of course, real life is More Complicated than this, but I guess this kind of explains the intuition for the guess of 'hmm tying weights seems like a goodish idea'
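
A small 2D illustration of why the L1 penalty breaks the rotation symmetry: rotating sparse codes keeps their L2 norm but increases their L1 norm, so the L1 term is lowest when the basis is aligned with the sparse one. The toy codes below are made up for the example.

```python
import numpy as np

def rotation(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

# Toy codes with exactly one active feature each
sparse_codes = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 0.0]])

thetas = np.linspace(0.0, np.pi / 2, 91)  # 0 to 90 degrees in 1-degree steps
mean_l1 = np.array([
    np.abs(sparse_codes @ rotation(t).T).sum(axis=1).mean() for t in thetas
])
mean_l2 = np.array([
    np.linalg.norm(sparse_codes @ rotation(t).T, axis=1).mean() for t in thetas
])
best_angle_index = int(np.argmin(mean_l1))  # 0 or 90: an axis-aligned basis
```

The mean L1 norm is 2 at the aligned angles and peaks at 45 degrees, while the mean L2 norm stays at 2 throughout.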

pallid current
#

also is there a theoretical reason why in the tied autoencoder we have a bias on the decoder? i don't have strong opinions either way but i remember @worldly hinge being quite anti it, cant remember why

keen pivot
pallid current
#

i promise i'll fix these graphs soon but......

#

separating out top and random scoring, sparse coding outperforms all baselines for both top and random on the residual stream!!

bitter turtle
bitter turtle
pallid current
#

yeah adding the bias back on makes sense to me too

#

@bitter turtle what's the status of 'the big run'? are we ready to try out multiple layers * multiple dictionary learning approaches

bitter turtle
#

Can go whenever code works I think

#

Should currently be set up to save every chunk

#

@pallid current currently set up to just test different parameters (i.e. do a big grid search over L1, dict size, L2 reg)

#

Could set up to test tied weights Vs not, etc, wasn't sure what would be useful ig

pallid current
#

splitting out only top scores also makes the value of sparse coding in MLP more clear:

bitter turtle
#

Sorry don't understand the graph: top scores by what metric? Which label corresponds to those scores?

pallid current
#

right sorry i never explained this at all. so what the autointerp does is to take 5 out of the 20 fragments of 64 tokens which have the highest average feature activation, from a pool of 50000 fragments. these are the 'top' fragments. it uses those to generate a hypothesis for what the feature 'is'. then, it takes another 5 random fragments. it uses the explanation to generate a guess for what the activations will be across both the top and random fragments. it then scores the explanation based on the correlation between predicted and actual activations.

#

so the score that i've been reporting previously is the correlation across those 10 fragments, called 'top and random' scoring.

#

but we found that this was a bit misleading for some of the residual stream neurons, because the explanation was able to distinguish clearly between the top fragments and the random fragments at the fragment level - ie high in the top fragments, low in the randoms, but it couldn't predict any of the variation within the fragments

#

so instead i'm now showing scores for correlation within the top fragments and within the random fragments separately
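
To make the difference between the scores concrete, here is a sketch with synthetic numbers (not real autointerp output). The "explanation" knows that top fragments are high and random fragments are low, but its token-level predictions are otherwise unrelated to the real activations. All names and values are hypothetical.

```python
import numpy as np

def correlation(pred, actual):
    """Pearson correlation between predicted and actual activations."""
    return float(np.corrcoef(pred, actual)[0, 1])

rng = np.random.default_rng(0)
n_tokens = 5 * 64  # 5 fragments of 64 tokens per group

# Actual activations: high in top fragments, low in random ones
actual_top = 5.0 + rng.normal(size=n_tokens)
actual_random = 0.5 + rng.normal(scale=0.1, size=n_tokens)
# Predictions: right fragment-level average, independent token-level variation
pred_top = 5.0 + rng.normal(size=n_tokens)
pred_random = 0.5 + rng.normal(scale=0.1, size=n_tokens)

top_and_random = correlation(np.concatenate([pred_top, pred_random]),
                             np.concatenate([actual_top, actual_random]))
top_only = correlation(pred_top, actual_top)
random_only = correlation(pred_random, actual_random)
```

Here `top_and_random` comes out high purely from the gap between the two groups, while `top_only` and `random_only` stay near zero, which is the failure mode the separate scores are meant to expose.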

#

and on both of these measures, the sparse-coded features come out very clearly ahead, for both residual stream and mlp

bitter turtle
#

yeah code works afaict; trained a few dicts for a very small amount of time on a small amount of data and loss did expected things, and it was faster I think

#

Can push to main if you want

#

On mobile ATM tho

bitter turtle
#

sorry not search

#

big grid training thingy

pallid current
#

oh right gotcha

bitter turtle
#

yeah mb badly worded

#

'I can try a bunch of different hyperparams for tied and not tomorrow'

pallid current
#

yeah that sounds great

#

what was the conclusion wrt reconstruction ICA in the end?

bitter turtle
#

no idea haven't tried it never got round to it

#

can also try that

#

Not particularly expecting it to be much different from tied weights

pallid current
#

yeah is there actually any difference (except maybe the smooth_l1 loss)?

#

no bias

#

no bias would be kinda interesting from an autointerp point of view (though could just remove the biases manually) just because it means that the found directions are on a totally even footing with the baselines

bitter turtle
#

yeah don't think it's particularly interesting. More interested generally looking at explicitly nonlinear things, or better approaches at sparse coding (like FISTA or OMP or something equivalent that is nicely parallelised) and still using linear dictionaries

pallid current
#

have you looked at them at all?

bitter turtle
#

slightly but just to the point of 'aaargh this is a nightmare to get to run fast'

pallid current
#

i think they're good things to do at some point but unless they're super easy i'm leaning towards just running a really good sweep + auto-interp + additional analysis of the resulting data and aiming to publish based on that

bitter turtle
#

Yeah that sounds good I agree with that plan

pallid current
#

is there anything i can help with on the infra side today?

bitter turtle
#

Could implement tied weight stuff, better logging, saving to long term storage, that kind of thing? Also forgot but I am hosting a friend's birthday tomorrow I probably won't be able to work on it until sun

#

It's a bit cursed, sorry about that, but it was the least hacky way I could think to implement proper ensembling

pallid current
#

sure its better than whatever i'd have hacked up. tho i havent looked yet 😅

bitter turtle
#

yeah basically short explanation is

  • torch optimisers are not vectorisable
  • defaulted to not using autograd at all and instead used torchopt for stateless + vectorisable optimisers
#

i.e. basically Jax in pytorch

pallid current
#

time to scale up ! 🦾

coarse flint
#

Hey I am fairly new to interp stuff so sorry for asking but could this be dumbed down a bit: 'We give a feature an interpretability score by first generating a natural language explanation for the feature, which is expected to explain how strongly a feature will be active in a certain context, for example ‘the feature activates on legal terminology’. Then, we give this explanation to an LLM and ask it to predict the feature for hundreds of different contexts, so if the tokens are [‘the’ ‘lawyer’ ‘went’ ‘to’ ‘the’ ‘court’] the predicted activations might be [0, 10, 0, 0, 8]. The score is defined as the correlation between the true and predicted activations.'

#

I guess the main confusion I have is how you end up measuring the true activations of a feature and the predicted activations by the LLM based. Like what do these look like and how can you compare them? Maybe you've explained this already in the LW post and I've missed it.

keen pivot
# pallid current time to scale up ! 🦾

Ya, I'm trying to transition entirely to baukit for both training and interventions, which would be worth it for scaling to 66B parameters.

I can also handle the perplexity check & maybe work w/ someone to do the activation adding/subtracting part? There's team shard folks & nina (who also does SERI, maybe also turntrout's mentee?), who can work on activation engineering using our found directions. I'm reaching out to them now.

pallid current
#

oh, yeah i know nina, i liked the activation addition stuff she was doing

#

not, like, well, but she's probably just down the hall lol

pallid current
# keen pivot I would look at the previous posts for dictionary learning & openAI's blog post ...

yeah agree with logan that the best thing to do is to read the resources i linked in the sparse coding summary, i think people have found robert huben's explainer (https://www.lesswrong.com/posts/a4oPE4xJqkYSz6jMS/explaining-taking-features-out-of-superposition-with-sparse) the most helpful. the basic picture is that you learn a simple autoencoder where the activations in latent space of the autoencoder are the feature activation levels

[Thanks to Logan Riggs and Hoagy for their help writing this post.] …
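
A rough sketch of that basic picture, in case it helps: the latent vector `c` of the autoencoder is what gets called the feature activations, and each row of the dictionary is a feature direction. This is plain Python for readability, and the names (`W_enc`, `b_enc`, `dictionary`, `l1_coef`) are placeholders rather than the project's actual code.

```python
def encode(x, W_enc, b_enc):
    """Feature activations c = ReLU(W_enc x + b_enc)."""
    pre = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W_enc, b_enc)]
    return [max(0.0, p) for p in pre]

def decode(c, dictionary):
    """Reconstruction x_hat = sum_i c_i * dictionary[i]."""
    dim = len(dictionary[0])
    return [sum(ci * d[j] for ci, d in zip(c, dictionary)) for j in range(dim)]

def loss(x, W_enc, b_enc, dictionary, l1_coef):
    """Mean squared reconstruction error plus an L1 penalty on c."""
    c = encode(x, W_enc, b_enc)
    x_hat = decode(c, dictionary)
    mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    return mse + l1_coef * sum(abs(ci) for ci in c)

# Toy example: 2-d activations, 3 dictionary features, tied weights
# (the encoder reuses the dictionary), zero bias.
dictionary = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
bias = [0.0, 0.0, 0.0]
x = [1.0, 0.0]

c = encode(x, dictionary, bias)      # per-feature activation levels
x_hat = decode(c, dictionary)        # reconstruction of x
total = loss(x, dictionary, bias, dictionary, l1_coef=0.01)
```

The 'true activations' in the autointerp score are these `c` values computed token by token; the LLM's predicted activations are compared against them.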

bronze wraith
# bitter turtle <@360082080975290369> generally we should note that the L1 penalty basically sol...

Is it true that the EV of the L1 norm is minimized when the learned features are aligned with the real features? I know it seems to work empirically, but in the one example I worked it doesn't pan out theoretically. The example: the real features are the standard basis vectors in R^2, so your data is sampled uniformly from the unit square. Take two choices of learned dictionaries: the canonical 2-element dictionary {(0,1), (1,0)}, and a 3-element dictionary which is at 45 degree angles to the canonical one {(1,1), (-1,1), (1,-1)}. Both of these can learn a perfect reconstruction of the data, and when you do that, the 3-element dictionary actually has a lower L1 penalty term (by ~8%). I'm legitimately confused about why a rotated basis is optimal here, but the experiments seem to find the canonical basis. It might be some combination of 1) learning canonical features requires fewer features, 2) even if rotated beats canonical when constrained by perfect reconstruction, if you trade off with reconstruction error canonical is better, 3) it's sensitive to the sampling space, and something in the many correlated dimensions/activations matters, 4) canonical features are easier to learn for some reason, and the training gets stuck in this configuration which is "suboptimal" from a loss function perspective but optimal for our actual goals

bronze wraith
# keen pivot <@748975058415910923> , would you be able to think through the math of a tied em...

Had some thoughts about tied embeddings: if there is no bias term, they are piecewise-positive transformations. Meaning that when you partition the domain by the hyperplanes where the ReLU terms switch on/off, on each subset of the domain they are given by x-> (M^T)Mx, and the M matrix in each section will be the tied embedding matrix with rows zeroed out depending on which ReLU terms are active. Positive transformations are nice because they have an orthonormal basis with respect to which they are a diagonal scaling. However, I think the orthonormal bases of different pieces don't have to agree, nor do the bases have to align with the planes which separate the pieces. Finally, you can roll the bias term into the learned tied embedding in the usual way (by replacing the vector x=(x_1, ..., x_n) with (x_1, ..., x_n, 1)), and when you tie that into the matrix, you need to not score the L1 activation or the reconstruction loss of the last component, but otherwise you can store everything in a big tied matrix (sorry if thats unclear).
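
A small numeric check of the piecewise claim, as a sketch only: for a tied, bias-free autoencoder the output at x equals (M^T)Mx, where M is the tied matrix with the rows of inactive ReLUs zeroed. The matrix `D` and the point `x` below are made-up toy values.

```python
def tied_forward(D, x):
    """Tied autoencoder without bias: x -> D^T ReLU(D x)."""
    acts = [max(0.0, sum(d * xi for d, xi in zip(row, x))) for row in D]
    return [sum(a * row[j] for a, row in zip(acts, D)) for j in range(len(x))]

def masked_matrix(D, x):
    """M: copy of D with rows zeroed wherever the ReLU is off at x."""
    return [
        list(row) if sum(d * xi for d, xi in zip(row, x)) > 0 else [0.0] * len(row)
        for row in D
    ]

def gram(M):
    """M^T M, which is symmetric positive semidefinite."""
    n = len(M[0])
    return [[sum(row[i] * row[j] for row in M) for j in range(n)] for i in range(n)]

D = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 1.0]]
x = [2.0, 1.0]  # the last row of D is inactive here, since -2 + 1 < 0

G = gram(masked_matrix(D, x))
via_gram = [sum(g * xi for g, xi in zip(row, x)) for row in G]
via_network = tied_forward(D, x)
quadratic_form = sum(xi * yi for xi, yi in zip(x, via_gram))  # x^T G x
```

Moving x across a hyperplane where a ReLU switches changes which rows are zeroed, so M^T M differs between pieces; this check only covers the single piece containing `x`.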

pallid current
#

i think this might be because you need to rescale (1,1), (1,-1) and (-1, 1) to have unit norm

bronze wraith
# pallid current how does this work? this seems to me to be like the claim that you can do a shor...

Every representation has an advantage in encoding things closer to its basis vectors. For instance, with the 3-element dictionary, you have a lower L1 cost to represent (1,1), since you can take the (1,1) vector directly there (after normalizing that vector, you end up with a L1 cost of sqrt(2), in contrast to an L1 cost of 2 if you go along the canonical basis vectors). The canonical dictionary elements are more cost-effective near the axes, whereas the 3-element dictionary is more cost-effective near the line y=x. And given this sampling space (uniform across the unit square), the 3-element dictionary just barely squeaks it out.

bronze wraith
# pallid current i think this might be because you need to rescale (1,1), (1,-1) and (-1, 1) to h...

I was already including that in the calculation (I gave the unnormalized vectors for ease of writing, but I should have said that). To represent (x,y) with the standard vectors, your L1 cost is x+y, but to represent it with the 3-element dictionary the L1 cost is max(x,y)*sqrt(2). Here's a spreadsheet showing the 3-element dictionary having lower average L1 cost (sampling points from the unit square): https://docs.google.com/spreadsheets/d/1rsEbKy_16qwOGguw0Vbxf60Nogkmqmyw6961w4ytbfE/edit#gid=0
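
A quick way to reproduce the comparison without the spreadsheet, assuming the two cost formulas stated in this message (x+y for the canonical dictionary, max(x,y)*sqrt(2) for the normalized 3-element one) and x, y iid uniform on [0,1]. The closed forms use E[x+y] = 1 and E[max(x,y)] = 2/3; the function names are illustrative.

```python
import math
import random

def l1_canonical(x, y):
    """L1 cost of (x, y) in the dictionary {(1,0), (0,1)}."""
    return x + y

def l1_rotated(x, y):
    """L1 cost of (x, y) in the normalized 3-element dictionary."""
    return math.sqrt(2) * max(x, y)

def monte_carlo(cost, n=200_000, seed=0):
    """Average cost over n points sampled uniformly from the unit square."""
    rng = random.Random(seed)
    return sum(cost(rng.random(), rng.random()) for _ in range(n)) / n

exact_canonical = 1.0                   # E[x] + E[y] = 1/2 + 1/2
exact_rotated = math.sqrt(2) * 2 / 3    # sqrt(2) * E[max(x, y)], about 0.943

est_canonical = monte_carlo(l1_canonical)
est_rotated = monte_carlo(l1_rotated)
```

Under these formulas the 3-element dictionary comes out ahead by roughly 6% in expectation, which is somewhat smaller than the ~8% figure quoted earlier.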

bitter turtle
#

like p probability of being 0 otherwise uniform

#

I've set off a couple runs testing all combinations of the following parameters:

  • tied vs not tied
  • l1 coef \in [0.0031622776601683794, 0.01, 0.03162277660168379, 0.1]
  • bias l2 decay \in [0.0, 0.05, 0.1]
  • dict ratio \in [2, 4, 8]
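
For scale, a sketch of enumerating that grid with `itertools.product`. The dictionary keys are made-up names, not the ones in the sweep code; the l1 coefs listed are half-decade steps from 10^-2.5 to 10^-1.

```python
import itertools

# Half-decade steps: 10^-2.5, 10^-2, 10^-1.5, 10^-1
l1_coefs = [10 ** e for e in (-2.5, -2.0, -1.5, -1.0)]

grid = {
    "tied": [True, False],
    "l1_coef": l1_coefs,
    "bias_l2_decay": [0.0, 0.05, 0.1],
    "dict_ratio": [2, 4, 8],
}

# One config dict per combination of the four settings.
configs = [dict(zip(grid, values)) for values in itertools.product(*grid.values())]
n_configs = len(configs)  # 2 * 4 * 3 * 3 combinations
```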
#

should be good in a couple hours or so (guessing?)?

pallid current
#

if there was a high likelihood of only having X or Y active (which would happen if they were sparse) then you'd recover (0,1) (1,0) i think
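
A sketch of how sparsity changes the comparison, assuming the distribution suggested above (each coordinate is 0 with probability p, otherwise uniform on [0,1]) and the same perfect-reconstruction costs as before: x+y for the canonical dictionary and sqrt(2)*max(x,y) for the 3-element one. It compares expected L1 cost only and says nothing about what training actually recovers.

```python
import math
import random

def expected_l1_canonical(p):
    # E[x + y] with each coordinate zero w.p. p, else U[0,1]
    return 1.0 - p

def expected_l1_rotated(p):
    # P(max <= t) = (p + (1 - p) t)^2, which gives E[max] = (1 - p)(2 + p) / 3
    return math.sqrt(2) * (1.0 - p) * (2.0 + p) / 3.0

def simulate(p, n=200_000, seed=0):
    """Monte Carlo estimate of both expected costs at sparsity p."""
    rng = random.Random(seed)

    def draw():
        return 0.0 if rng.random() < p else rng.random()

    canonical = rotated = 0.0
    for _ in range(n):
        x, y = draw(), draw()
        canonical += x + y
        rotated += math.sqrt(2) * max(x, y)
    return canonical / n, rotated / n

# Sparsity level at which the two expected costs are equal (about 0.12).
crossover_p = 3.0 / math.sqrt(2) - 2.0
```

Below the crossover the 3-element dictionary has the lower expected L1 cost; above it the canonical one does, so even fairly mild sparsity favours (0,1), (1,0) on this measure.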

pallid current
#

if it's only a couple of hours, can you run the same sweep across all layers and mlp vs residual?

#

also i worry those l1 coefs might be a bit high, at least without any reinitialization

keen pivot
#

I would nail down the l1 first, which is typically the same across layers (though a unique l1 for mlp and another for residual)

#

Oh, I think the L1 is different for tied & untied. I agree that these l1 values are ~~too~~ a bit high.

#

For
MLP tied: 6e-4 (1e-4 is identity)
Residual tied: 3e-3 (which Aidan is indeed checking, though I'm unsure how the bias will interact)

bitter turtle
#

ah, ok!

#

I'll let this run conclude then do one looking for L1 values

#

also my time guess was off by a factor of 2, probably can rewrite for more speed somewhere

#

I think I'm bottlenecking on one GPU looking into it

keen pivot
#

It looks pretty rad so far. Thanks for working on it!

bitter turtle
#

it's overusing 1 and it has to wait periodically to sync kinda

pallid current
#

doing some experiments to see how ability to use superposition varies with residual dimension

pallid current
#

lol i was gonna follow that message up with a graph but then i realised it had a big flaw 😅, hopefully will have something tomorrow

pallid current
#

bit odd that we're seeing periodic spikes, i guess at the beginning of a chunk..

pallid current
#

@bitter turtle is there any way of taking the saved dictionaries from your run yesterday back into the original class?

keen pivot
#

Getting some pretty bad perplexities for replacing the model w/ the dictionary reconstruction for pythia-70m-tied layer 3:

Dict Size | Perplexity | Reconstruction Loss
512 | 180.98 | 0.0964
1024 | 152.60 | 0.0870
2048 | 127.85 | 0.0804
4096 | 111.12 | 0.0763
8192 | 104.80 | 0.0753
full model | 25.11 | 0.000

#

And the perplexity code is pretty simple, so I think I coded it right (& pythia 410m got ~11 perplexity, which makes directional sense)
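
For reference, the usual definition as a sketch: perplexity is the exponential of the mean negative log-likelihood per token. Here `token_log_probs` is assumed to hold natural-log probabilities of the correct next tokens, taken from the model with the layer's activations swapped for the dictionary reconstruction; how those are obtained is not shown.

```python
import math

def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood (natural log) per token."""
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

# Sanity check: a model that assigns probability 1/4 to every correct
# token is exactly as uncertain as a uniform choice among 4 options.
uniform_four = perplexity([math.log(0.25)] * 8)
```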

pallid current
#

ok that's interesting. not super surprising though it would have been great if we didnt see this

#

which dicts are you using?

keen pivot
#

Oh, sorry I sent it then (jk jk)

keen pivot
#

Oh, it's actually layer 3, I can link the aws location. Any other identifying info?

pallid current
#

just seeing if they were the ones you trained or aidan's recent ones

pallid current
#

ok thats v interesting because just eyeballing some of aidans runs, we're seeing recon loss about an OOM lower

keen pivot
#

It is MLP, but I'm unsure what else would be different besides the bias l2 decay

pallid current
#

i don't know what would be different either (tho i think its possible bias l2 is actually v important for preventing dead feats) but the loss curves look waay more stable, i remember seeing that they seemed to plateau pretty hard and even start rising

keen pivot
#

Probably the untied parts? Nope, just checked & it's also low

#

It might just be the MLP. I expect reconstruction to hurt perplexity less in an MLP layer too, but I would definitely like this same run for residual stream!

#

Yo @bitter turtle , could you set off a run for residual stream? I can also look through your code and try if you're not able to.

keen pivot
#

Aidan ran "big_sweep.py", and I think I just need to set "use_residual" to True

pallid current
#

ok so reconstruction loss already seems to be like half of the ones you quoted above??

keen pivot
#

I haven't figured out the naming scheme for l1 values yet

#

Slight hitch: I didn't delete the activations_data folder, so it re-used that and used the mlp data & maybe even the mlp-sized model(?) Re-running

keen pivot
#

It ran 7 times (out of 30 from the MLP). Then I got:
[W CudaIPCTypes.cpp:15] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]

#

The reconstruction losses still look good, though I'm still confused about what the l1 values correspond to.

shell mural
#

hey, this work looks unbelievably high value. I look forward to grokking it and seeing if there's any way i can contribute, even if it's just by getting more people looking at it

pallid current
#

instead of going through the hassle of matching encoder to cfg, this eve i ran an experiment suggested by @worldly hinge where i clustered the directions and then ran autointerp on the features in that cluster. ran on 10 clusters, 3 clusters just never activated, 1 seemed to consistently give v similar explanations ('hyphens') and the others were at least somewhat varied

#

most fun one was a cluster of number features ! :

421: 'instances of the number one.' 
559: "the digit '3' in the text."
1112: 'instances of the number 5.'
1459: 'numbers, particularly two-digit numbers where the second digit is a high number.' 
1503: 'numbers, particularly single digit numbers.' 
1744: ' numerical values, especially those used in a counting or sequence context.'
1954: 'the number 4 in the text.'
#

@keen pivot for the similar looking cluster, the hyphen one and also the one that centres around 'was'/'are'/'is' where there are some near identical autoexplanations i'd be interested to see if you find that they have distinct meanings

bitter turtle
#

Oh that's actually a sick idea

#

@pallid current did you manage to run it?

#

@keen pivot I'll rewrite the saving code today, so that it saves in a more intuitive format!

bitter turtle
#

Or maybe 'what part of the unit sphere does each cluster enclose' or something

#

Just trying to get a feel for how spaced out they are
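
One possible way to put a number on that, as a sketch: normalize the cluster's directions, take their mean direction, and report the largest angle from it to any member. The function names are made up, and it assumes the cluster's directions do not cancel to a zero mean.

```python
import math

def normalise(v):
    """Scale v to unit length."""
    norm = math.sqrt(sum(a * a for a in v))
    if norm == 0:
        raise ValueError("zero vector has no direction")
    return [a / norm for a in v]

def cluster_spread_degrees(directions):
    """Largest angle between any member and the cluster's mean direction."""
    units = [normalise(d) for d in directions]
    dim = len(units[0])
    centre = normalise([sum(u[j] for u in units) for j in range(dim)])
    cosines = [sum(a * b for a, b in zip(u, centre)) for u in units]
    # Clamp before acos to guard against floating-point overshoot.
    return max(math.degrees(math.acos(max(-1.0, min(1.0, c)))) for c in cosines)

# Two orthogonal directions sit 45 degrees either side of their mean.
spread = cluster_spread_degrees([[1.0, 0.0], [0.0, 1.0]])
```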

bitter turtle
#

right got better saving (it's seemingly very slow now for some reason?) should I run a residual stream run @keen pivot? What parameter settings?

bitter turtle
#

oh ok!

keen pivot
#

I think Hoagy wants an MLP one

bitter turtle
#

what layer?

#

residual will be shorter I'll do that first

keen pivot
#

I’m fine with layer 2

#

Unsure about Hoagy

bitter turtle
#

pahaha

#

Hmm I only get 8 chunks, I think we just might be reaching max data for pile10k?

#

@keen pivot can you remember how many chunks you normally get for residual stream?

keen pivot
#

We could do the pile’s first shard

bitter turtle
#

current format is each model is saved as a dictionary {"params": {"encoder": ..., "decoder": ..., ...}, "buffers": {...}}, and you can check hyperparameters in a JSON file hyperparams.json saved with the models @keen pivot

#

accidentally goofed, 10m

keen pivot
#

a problem (maybe): I’ve noticed a discrepancy between transformer lens and transformers library precision of Pythia model residual stream, and it's unclear how much it’d affect our results: https://github.com/neelnanda-io/TransformerLens/issues/346#issuecomment-1641576171

GitHub

Describe the bug cache[f"blocks.{x}.hook_resid_pre"] doesn't match hidden states (or only up to a set decimal place). Hidden states is from transformer's model(tokens, output_hidd...

kind scroll
#

Lee pointed me here and I have a bit more time free to contribute. I plan to read up on your posts so far, but as I understand it, the high-level idea is to take a trained LLM which has features which may or may not be in superposition (have you thought of how you'd measure this?), and then training sparse autoencoders to recover features. Is that the gist of it?

#

If you're still thinking of trying a sparse VAE I'm happy to contribute there, too!

#
  • if there's currently any problems/challenges you have I'd love to hear about them
bitter turtle
#

@keen pivot should be in /mnt/ssd-cluster/resid_layer_2_19_07

keen pivot
# kind scroll Lee pointed me here and I have a bit more time free to contribute. I plan to rea...

Yep! I'm unsure how to measure the amount of superposition, but there's definitely feature packing (which may be superposition).

One argument: there's 50k vocab items which are then embedded into 500 dimensional space; probably feature packing there.

Another: optimality - more optimal to pack features as long as you have sparse features.

Best argument: wes found features in superposition in his paper (neurons in a haystack)

bitter turtle
#

honestly don't see much point measuring it 'directly' since that would probably amount to 'see how well something shown to fit on superposed data fits on activation distributions' which is just training SAEs

#

but then again rough thoughts would be interested to hear proposals

kind scroll
#

I think the idea I had behind measuring it is more for scientific purposes - being able to quantitatively measure the degree of superposition would let you say a particular technique for sure takes something out of superposition, or reduces superposition by X quanta in this LLM. Then you'd also be able to look at, for example, how autointerp methods relate to superposition, and go on to using it in your model selection criteria

#

i.e. we've trained these two models but this one has higher superposition and we can't understand it as well, let's not deploy

#

(also spitballing here)

bitter turtle
#

I mean, afaict 'superposition' isn't a sufficiently rigorous definition to be used like that; what we are really testing here isn't 'does a model do TMOS-style superposition' but rather 'is the abstraction detailed in TMOS a useful one'.

kind scroll
#

agreed

bitter turtle
#

Like you could measure 'max spike-and-slabbiness of the distribution of activations over rotations' but that feels unfounded and too-many-holes-having.

#

holes being degenerate cases

#

dunno why I said holes

#

anyway my take is that the metric for deployness or whatever will be something like 'how accurately can we abstractly describe functionality' or something which isn't neccesarily directly superposition-related. Like, I feel like if we can measure superposition there's a good chance that that mesurement method also has an unsuperpositonification mechanism as an immediate corollory

#

spelling

kind scroll
#

I think this is the most compelling real-world example of TMOS-style superposition I've seen so far which I'm sure you've seen too (if you have more please send!) https://distill.pub/2020/circuits/zoom-in/#claim-2-superposition. I think it's important because it relates inductive biases in model representations to downstream tasks, which is the real measure

#

(and I agree, I think superposition is a useful concept but doesn't directly relate to an atomic, measureable phenomenon)

bitter turtle
#

I'm not sure how strongly to take that evidence. I vaguely remember hearing somewhere that there is some nuance in how they generate those images. Also, that seems to mostly be entanglement (as in, viewing the activations in the 'wrong' basis) which is something you also see in e.g. VAEs

kind scroll
#

Point taken! Though wdym with entanglement?

bitter turtle
#

like, rotation as opposed to 'compression'

#

In general when I'm thinking about superposition I'm thinking about it more as a useful lens to view activations rather than something stronger and natural-abstraction-hypothesis-assuming

#

so like, stronger than Nora's views and less strong than Olah's I guess.

bitter turtle
#

Ok, I'm not sure how you'd measure that without also measuring superposition.

#

Could you expand more on what you envision?

keen pivot
#

One grounding of amount of superposition is how many features our dictionary learns w/ eps-diff in perplexity.
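
One possible reading of that, as a sketch: pick a tolerance eps and report the smallest dictionary whose reconstruction keeps perplexity within eps of the full model. The numbers are the pythia-70m layer 3 perplexities reported earlier; the function name and the eps values are illustrative.

```python
# Dict size -> perplexity with the layer replaced by its reconstruction
perplexity_by_size = {
    512: 180.98,
    1024: 152.60,
    2048: 127.85,
    4096: 111.12,
    8192: 104.80,
}
full_model_perplexity = 25.11

def smallest_size_within(eps, by_size, baseline):
    """Smallest dict size whose perplexity is within eps of the baseline."""
    ok = [size for size, ppl in by_size.items() if ppl - baseline <= eps]
    return min(ok) if ok else None

loose = smallest_size_within(100.0, perplexity_by_size, full_model_perplexity)
tight = smallest_size_within(1.0, perplexity_by_size, full_model_perplexity)
```

With these numbers no dictionary gets within a tight eps of the full model, so the measure is only informative once reconstruction is much better.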

bitter turtle
#

perplexity? Like perplexity under intervention with reconstructed features?

#

Not sure how that would work; surely if there is some subspace our SAEs fail to describe then there is a minimum perplexity gain

bitter turtle
#

Phew

keen pivot
#

Is there a way to easily download the model given the .pt file?

bitter turtle
#

it's just saved as a dict

keen pivot
# bitter turtle it's just saved as a dict

I have torch.load()-ed it, but was hoping for a one-liner for
autoencoder = ...

atm, I can define an autoencoder and assign each relevant part to the part in the dictionary, which is doable, but I may be missing the intended way to load it.
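
A sketch of what such a one-liner could wrap, assuming only the saved layout described earlier ({"params": {"encoder": ..., "decoder": ..., ...}, "buffers": {...}}). `LoadedAutoencoder` and the `encoder_bias` key are hypothetical, and plain lists stand in for the tensors that `torch.load` would return.

```python
class LoadedAutoencoder:
    """Hypothetical container; not the project's autoencoder class."""

    def __init__(self, encoder, decoder, other_params, buffers):
        self.encoder = encoder
        self.decoder = decoder
        self.other_params = other_params  # anything else saved under "params"
        self.buffers = buffers

    @classmethod
    def from_saved(cls, saved):
        params = dict(saved["params"])  # shallow copy, input stays untouched
        encoder = params.pop("encoder")
        decoder = params.pop("decoder")
        return cls(encoder, decoder, params, dict(saved.get("buffers", {})))

# Stand-in for `saved = torch.load(path)`; "encoder_bias" is a made-up key.
saved = {
    "params": {
        "encoder": [[1.0, 0.0]],
        "decoder": [[1.0, 0.0]],
        "encoder_bias": [0.0],
    },
    "buffers": {},
}

autoencoder = LoadedAutoencoder.from_saved(saved)
```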

kind scroll
#

I was referring there to the authors hypothesising that models find it useful to store some less-important features in superposition rather than dedicating e.g. an axis-aligned dimension to it

bitter turtle
#

I think people generally call non-axis-alignment 'entanglement' and the compression thing 'superposition', and also where are you referring to here? The distill.pub paper?

kind scroll
#

ye

kind scroll
#

I'm used to representation learning-lingo and there's lots of terms being used in the alignment space w.r.t disentanglement/representations that I'm not fully used to haha