Sparse Coding | EleutherAI | Page 2

pallid current Jun 27, 2023, 11:03 AM

#

oh duh lol soz

keen pivot Jun 27, 2023, 11:16 AM

#

Where’d you hear this?

pallid current Jun 27, 2023, 11:56 AM

#

ml engineer said it's a common thing, tryna find a paper or something that shows it but no luck

torn star Jun 27, 2023, 12:40 PM

#

It's because the model has already updated its weights based on this data. So you should expect it to be better.

pallid current Jun 27, 2023, 12:43 PM

#

torn star It's because the model has already updated its weights based on this data. So yo...

i mean sure, but the idea that there's carry over from like the first 1B tokens when the model is still only learning absolute basics, and which persists through the next like TB of training tokens is super surprising to me!

bitter turtle Jun 27, 2023, 1:10 PM

#

couple things,
is Z(i) supposed to be {x \in domain : ReLU(Wx + b) is 0 in the ith component}?
If you start with two features with Z(i)=Z(j) how do you get that the Z(i) are still disjoint from the domain?

bronze wraith Jun 27, 2023, 1:16 PM

#

bitter turtle couple things, is Z(i) supposed to be `{x \in domain : ReLU(Wx + b) is 0 in the ...

re 1: Z(i) is the hyperplane in which the ith relu switches over from activating to not activating, and it's useful to talk about that in general, even if it doesn't intersect the domain of reconstruction
re 2: if Z(i)=Z(j), it may be the case that the Z(i) intersect the domain. however, I conjecture that when this happens, you can "cancel" these features and improve your L1 loss term. As a trivial example: take any reconstruction, then add on two features f_1 and f_2 which are negatives of each other, and which activate equal amounts (ie their rows in the W and b matrices are equal, but their rows in the D matrix are negatives). These have the same Z(i) planes, which can be anywhere, and they completely cancel out in the reconstruction, but they add unnecessarily to the L1 loss on the c term, so they should be trained out

bitter turtle Jun 27, 2023, 1:18 PM

#

cool that clarifies a bunch thanks

bronze wraith Jun 27, 2023, 1:49 PM

#

Okay, bad news: I found a simple example where you can get a perfect reconstruction with better L1 loss by using non-canonical features than by using canonical features.
The example:

Sample x and y coordinates iid from the uniform distribution on [0,1], so your domain is the unit square in R^2 (and thus your canonical features are (0,1) and (1,0)). Rediscovering those features, you get a perfect reconstruction, and the L1 loss at (x,y) is |x|+|y|, which has an EV of 1 over this distribution of points.
Alternatively, learn these weight and bias matrices:
W = 1/sqrt(2) [1, 1 ]
[1, -1]
[-1, 1]
b=0
D= W^T (the transpose)
Then this also gives a perfect reconstruction, and L1 loss at (x,y) is ((x+y)+|x-y|)/sqrt(2) = sqrt(2)*max(x,y), which has an EV of 2sqrt(2)/3≈.943<1 over this distribution of points.

#

(This is basically Hoagy's example from before. Note that my previous theorem doesn't apply here, because Z(2)=Z(3))
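A quick numerical check of the example above, as a minimal sketch: it samples points from the unit square, applies the W, b, D given in the message, and compares the mean L1 penalty on the codes against the canonical |x|+|y| penalty. The sample size and seed are arbitrary choices, not from the thread.

```python
import numpy as np

# Encoder/decoder from the message above (b = 0, D = W^T).
W = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0]]) / np.sqrt(2)
D = W.T

rng = np.random.default_rng(0)                 # arbitrary seed
X = rng.uniform(0.0, 1.0, size=(200_000, 2))   # points in the unit square

C = np.maximum(X @ W.T, 0.0)   # codes c = ReLU(W x), shape (N, 3)
X_hat = C @ D.T                # reconstruction D c, shape (N, 2)

max_recon_err = np.abs(X_hat - X).max()
l1_noncanonical = C.sum(axis=1).mean()        # codes are non-negative, so sum = L1 norm
l1_canonical = np.abs(X).sum(axis=1).mean()   # |x| + |y| for the canonical features

print(max_recon_err, l1_noncanonical, l1_canonical)
```

The reconstruction error should be at floating-point level, and the non-canonical mean L1 should land near 2sqrt(2)/3 while the canonical one lands near 1.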

pallid current Jun 27, 2023, 2:07 PM

#

got autointerp working with the nano model on ICA (which works well) and random directions (which doesnt)

bronze wraith Jun 27, 2023, 2:16 PM

#

bronze wraith Okay, bad news: I found a simple example where you can get a perfect reconstruct...

More generally, if you have an invertible matrix M, and W consists of M stacked on top of -M, and D consists of M^-1 stacked on -M^-1, that also gives a perfect reconstruction. This works because you're "cheating" the ReLU via x=ReLU(x)+ -1*ReLU(-x).
However, this can be detected by looking at minimum cosine similarities between the features in the dict. If the minimum cosine similarity is -1, that means this sort of cheating is happening.
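A minimal sketch of both halves of this message: building the "cheating" dictionary from an invertible M, and flagging it via the minimum pairwise cosine similarity. The layout (decoder columns as features, built with `hstack`) and the helper name `min_cosine_similarity` are assumptions for illustration, not the project's code.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
M = rng.normal(size=(n, n))      # a random Gaussian matrix is invertible with probability 1
M_inv = np.linalg.inv(M)

W = np.vstack([M, -M])           # encoder, shape (2n, n)
D = np.hstack([M_inv, -M_inv])   # decoder, shape (n, 2n); each column is one feature

X = rng.normal(size=(1000, n))
C = np.maximum(X @ W.T, 0.0)     # ReLU(Mx) and ReLU(-Mx) side by side
X_hat = C @ D.T                  # M^-1 (ReLU(Mx) - ReLU(-Mx)) = M^-1 M x = x
recon_err = np.abs(X_hat - X).max()

def min_cosine_similarity(features):
    """features: (num_features, dim). Returns the smallest pairwise cosine similarity."""
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    return (unit @ unit.T).min()

min_cs = min_cosine_similarity(D.T)   # -1 here, since every feature has an exact negative
print(recon_err, min_cs)
```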

bronze wraith Jun 27, 2023, 2:32 PM

#

bronze wraith More generally, if you have an invertible matrix M, and W consists of M stacked ...

Okay, I ran this test on the data I had from Hoagy previously, and these MCS=-1 features aren't showing up. That's one bullet dodged!

keen pivot Jun 27, 2023, 4:03 PM

#

pallid current got autointerp working with the nano model on ICA (which works well) and random ...

You got details? & ICA is the baseline to compare the dictionary with?

pallid current Jun 27, 2023, 4:13 PM

#

keen pivot You got details? & ICA is the baseline to compare the dictionary with?

no details yet, just running on 10 features, and also i totally bodged the random directions calculation the first time (it was just analysing random noise lol) , it's weirdly good now!

#

i think over the 10 features, random might actually have scored the best.. 🤔 🤔 small sample size etc but still.. odd. i would expect it to still do quite well at finding a pattern in top activations, but would expect very poor scores in the random part of top-and-random scoring. need to make sure that random part is actually happening

#

if random isn't low then it puts into question the validity of finding additional interpretable features with sparse coding. might need to move straight to pythia

upper basin Jun 27, 2023, 4:21 PM

#

keen pivot Where’d you hear this?

I think we have a few pictures of it happening around here somewhere

keen pivot Jun 27, 2023, 4:33 PM

#

pallid current if random isn't low then it puts into question the validity of finding additiona...

W/ pythia, I could plot the top activating examples for a random direction in neurons to check if it looks interpretable.

keen pivot Jun 27, 2023, 6:05 PM

#

@bronze wraith I need your math help. If I want to ablate a feature, it makes sense to ablate the direction specified by the decoder; however, for the encoder, there's a negative bias, ReLU & mostly-positive neurons (because of GeLU in MLP).

This means there isn't a "neuron direction" for the encoder. Ex. If I have a bias of -1, 2 positive neurons, then (2,0), (1, 1), (0,2) all activate it, but this isn't a direction (or vector I can ablate)

bronze wraith Jun 27, 2023, 6:10 PM

#

keen pivot @bronze wraith I need your math help. If I want to ablate a feature, it m...

There is at least a direction specified by the encoder, right? I think in the example you're giving, the neuron activation is 1*x+1*y-1, so in the encoder you'll see a row that is [1,1]. Is that sufficient for your purposes?

keen pivot Jun 27, 2023, 6:13 PM

#

bronze wraith There is at least a direction specified by the encoder, right? I think in the ex...

There's a direction specified by the encoder, but I'd like to force the neurons to never enter a possible configuration that causes that feature to activate w/o ruining other things.

In the example, I think it'd be sufficient to project along the line: -x + 1.
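One possible reading of this proposal, as a hedged sketch: move any activation vector that would fire the feature the shortest distance back onto the boundary w.x + b = 0 (the line y = -x + 1 in the example), and leave everything else alone. The function name `project_below_threshold` is hypothetical. Note the projected points can have negative neuron values, which may or may not be acceptable for post-GeLU activations.

```python
import numpy as np

def project_below_threshold(x, w, b):
    """Shift each row of x the minimal distance so that w.x + b <= 0."""
    x = np.asarray(x, dtype=float)
    pre_act = x @ w + b                 # encoder pre-activation per row
    excess = np.maximum(pre_act, 0.0)   # only rows that would activate get moved
    return x - np.outer(excess, w) / (w @ w)

# The example from the thread: encoder row [1, 1], bias -1.
w = np.array([1.0, 1.0])
b = -1.0
points = np.array([[2.0, 0.0], [1.0, 1.0], [0.0, 2.0], [0.2, 0.3]])

projected = project_below_threshold(points, w, b)
pre_acts_after = projected @ w + b
print(projected)
print(pre_acts_after)
```

The first three points all activate the feature and end up on the line x + y = 1; the last one never activated it and is unchanged.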

bronze wraith Jun 27, 2023, 6:26 PM

#

keen pivot There's a direction specified by the encoder, but I'd like to force the neurons ...

What exactly is the goal here? What you describe should work for a single feature, but if your features aren't orthogonal, those kinds of projections may not commute.

keen pivot Jun 27, 2023, 6:43 PM

#

I'd like to see the causal role of a feature. One way is to ablate it.

My current method is to subtract (feature vector * feature activation), w/ the vector specified by the decoder.

#

This may just be the best method, but it does rely on the reconstruction being high-quality, which I haven't checked yet.
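The subtraction method described above could look roughly like this. It is a sketch under assumptions: the decoder `D` stores one feature vector per row, the encoder is a single linear layer plus ReLU, and `ablate_feature` is a hypothetical name rather than a function from the repo.

```python
import numpy as np

def ablate_feature(acts, W_enc, b_enc, D, feature_idx):
    """Subtract one feature's decoder contribution from the activations.

    acts:  (batch, dim) model activations
    W_enc: (n_features, dim), b_enc: (n_features,)
    D:     (n_features, dim), one decoder vector per row (assumed layout)
    """
    codes = np.maximum(acts @ W_enc.T + b_enc, 0.0)   # feature activations
    contribution = np.outer(codes[:, feature_idx], D[feature_idx])
    return acts - contribution

# Toy check with an identity dictionary: feature 1 is exactly neuron 1.
W_enc = np.eye(3)
b_enc = np.zeros(3)
D = np.eye(3)
acts = np.array([[1.0, 2.0, 0.0]])

ablated = ablate_feature(acts, W_enc, b_enc, D, feature_idx=1)
print(ablated)
```

Because only the decoder contribution is removed, the result is only a clean ablation to the extent that the reconstruction is good, which is the caveat raised in the message.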

bronze wraith Jun 27, 2023, 6:47 PM

#

keen pivot I'd like to see the causal role of a feature. One way is to ablate it. My curr...

and what are you measuring after removing the feature? the reconstruction quality?

keen pivot Jun 27, 2023, 6:48 PM

#

Logit difference in output true tokens

#

Input: vol. 1 at www.boost.com
feature activates on "www"
Effect on output: Lowers the logit assigned to token "boost"

keen pivot Jun 28, 2023, 4:22 AM

#

While I'm doing this ablation stuff, I'm going to investigate just training on lots of data over lots of epochs.

pallid current Jun 28, 2023, 1:43 PM

#

anyone know anything about using pandas with big data? main bottleneck atm is that it's taking an outrageously long amount of time to either save df to csv, or to convert df columns which are lists into strings which then save fast

torn star Jun 28, 2023, 3:35 PM

#

pallid current anyone know anything about using pandas with big data? main bottleneck atm is th...

A few years ago I recall having issues with this, I think I ended up just converting to numpy arrays. Here are some GPT4 suggestions

Screen_Shot_2023-06-28_at_4.33.31_pm.png
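A small sketch of the "convert to numpy arrays" route mentioned above, assuming every list in the column has the same length. The column names and the helper `split_list_column` are made up for illustration; the screenshot's actual suggestions are not reproduced here.

```python
import numpy as np
import pandas as pd

def split_list_column(df, column):
    """Pull a list-valued column out into a dense 2-D array.

    Assumes all lists have equal length; the remaining scalar columns
    can then be written to CSV without any list-to-string conversion.
    """
    array = np.array(df[column].tolist())
    return df.drop(columns=[column]), array

df = pd.DataFrame({
    "token": ["a", "b", "c"],
    "acts": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
})
scalars, acts = split_list_column(df, "acts")

# The two pieces could then be saved separately, e.g.
#   scalars.to_csv("features.csv", index=False)
#   np.save("acts.npy", acts)
print(scalars.columns.tolist(), acts.shape)
```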

pallid current Jun 28, 2023, 4:24 PM

#

got some results comparing neuron basis and random transform on the tiny model and got this 😮

#

basically i think this means that there's no use doing work on the nano model

#

unless i've bodged something, which is obv v possible

#

replacing with graph with means

#

#

these are autointerp scores using gpt4

bitter turtle Jun 28, 2023, 4:37 PM

#

eek

pallid current Jun 28, 2023, 4:38 PM

#

I'll run the same test on Pythia tomorrow and see what we get, I don't expect the same results

keen pivot Jun 28, 2023, 5:54 PM

#

pallid current got some results comparing neuron basis and random transform on the tiny model a...

I would definitely look at specific examples that activate high MCS features and those that activate random ones.

At least for Pythia, they look quite monosemantic and I could explain them!

bitter turtle Jun 28, 2023, 5:56 PM

#

Ok so i ran a test on non-repeated data on the residual stream of pythia and hesitantly I don't see the MCS dropoff you see when training on the post-activation-in-mlp dataset

#

mini-batch increases as you go rightwards. each mini-run is ~2M datapoints = 2GB data (residual stream width = 512)

#

#

also running a repeated-data run atm for comparison

keen pivot Jun 28, 2023, 6:01 PM

#

I believe I was running at 3e-4

#

You seem to be getting best results at 1e-3

#

Is this the default layer 2 for Pythia?

bitter turtle Jun 28, 2023, 6:02 PM

#

residual stream post-computation at layer 2

#

so like, data coming out of layer 2/into layer 3

keen pivot Jun 28, 2023, 6:03 PM

#

Okay, so this is great. Like if the only reason it didn’t scale is because I picked an awful l1 value, then we’re good!

bitter turtle Jun 28, 2023, 6:04 PM

#

I'm very uncertain about this, but I don't think we should expect a priori for SAEs to work on MLP-post-activation data, TMOS assumptions might not hold well and SAEs are fairly heavily predicated on those to work

keen pivot Jun 28, 2023, 6:05 PM

#

What are the assumptions?

#

Also, it’s very cheap to just run two experiments on the pre and post

bitter turtle Jun 28, 2023, 6:06 PM

#

just the whole 'data mostly explainable as a bunch of sparsely-activated features which interfere slightly' thing

keen pivot Jun 28, 2023, 6:06 PM

#

It’s in the utils.py file

#

I can run a few to check if there’s a better l1 than 1e-3 for both pre and post

#

And which one is better.

#

Would still be interested in links for SAE assumptions, but again the test is pretty cheap! (a couple hours run)

bitter turtle Jun 28, 2023, 6:09 PM

#

keen pivot Also, it’s very cheap to just run two experiments on the pre and post

I meant around MLP layers at all; I don't have a clear, complete model of how MLPs in transformers actually interact with said sparsely-activated features in the residual stream @pallid current wanted to test this

#

same thing for repeated data; it's pretty much the same afaict

#

bitter turtle Jun 28, 2023, 6:14 PM

#

keen pivot I can run a few to check if there’s a better l1 than 1e-3 for both pre and post

yeah I think pre would work better, but probably still worse than the residual stream by a little bit

#

in general I think we should be looking at the residual stream

keen pivot Jun 28, 2023, 6:15 PM

#

The residual stream would also be cheap, though harder to compare apples to apples with the MLP

bitter turtle Jun 28, 2023, 6:18 PM

#

but like, I don't see why we should expect SAEs to work at all on the MLP since I don't think the hypothesis 'MLP activations look like a sparsely activated overcomplete feature set' is particularly good or true, nor do I see why we should expect that a priori

#

like, obviously it has to have some similarities since it is writing to/from a sparse overcomplete feature set (the residual stream) via a linear map but I can't tell how similar it should be

#

it's also not clear to me why we should expect the MLPs to 'output' significantly more features than the dimensionality of the MLP

#

Might get to work implementing @pallid current's toy model tomorrow

keen pivot Jun 28, 2023, 6:32 PM

#

I'm currently replicating your results & gonna try to find a better l1 value if it exists, & train on that for lots of data.

Additionally, will run one on the residual stream

bitter turtle Jun 28, 2023, 6:34 PM

#

Also I think this is kind of max-data for pile-10k for residual stream activations (when it's truncated to 256 tokens/line + standard max-sentences or whatever, I don't really know how the dataset is configured)

#

Like I got 8x2GB chunks out of it max

#

you could mess around with the truncation or do a different dataset

#

(since the residual stream dataset should be 1/4 the size (in bytes) of the activation dataset @keen pivot)

keen pivot Jun 28, 2023, 6:46 PM

#

bitter turtle (since the residual stream dataset should be 1/4 the size (in bytes) of the acti...

To make it work w/ residual stream, you have to change both cfg.mlp_width & the tensor_name for transformer lens. (Both in utils.py under make_tensor_name)

keen pivot Jun 28, 2023, 6:48 PM

#

bitter turtle

Ah, we're one l1-alpha value off.

bitter turtle Jun 28, 2023, 6:50 PM

#

keen pivot To make it work w/ residual stream, you have to change both cfg.mlp_width & the ...

I did this

keen pivot Jun 28, 2023, 6:50 PM

#

Oh I understand everything now

#

Sorry

bitter turtle Jun 28, 2023, 6:50 PM

#

moderately confused

keen pivot Jun 28, 2023, 6:51 PM

#

I thought you were doing mlp the whole time. I just mis-read your message initially.

#

Replicating now: Residual stream:

#

I do think there may be a slightly better l1 value. The sparsity of 1e-03 is ~0 and 3e-3 is ~20 (so 20 features per datapoint on average, which still seems high)

#

Looks like 3e-3 is great! I can run a more fine-grained experiment later to see if it should be 2 or 4, but mostly looks good. Running a larger run w/ larger dicts on that l1-value.

keen pivot Jun 28, 2023, 8:58 PM

#

@bitter turtle There is a degradation in residual stream (top one is later). This is after 8 epochs of 8 chunks (=16GB) each w/ refreshed data every time

#

Also, I'd expect it to not really plateau here at 30% of 512 (the residual stream size), because the toy model didn't. But this may again be a data diversity thing, or a "we should train it on multiple epochs" thing, or something else.

bitter turtle Jun 28, 2023, 9:42 PM

#

keen pivot @bitter turtle There is a degradation in residual stream (top one is late...

I think the data is repeated here. Afaict, you can only get 16GB of 512-dim activations from pile10k with our current data processing setup

bitter turtle Jun 28, 2023, 9:42 PM

#

keen pivot Also, I'd expect it to not really plateau here at 30% of 512 (the residual strea...

Yeah, what did the toy model plateau at?

#

Also, I used batch-size 1024

keen pivot Jun 28, 2023, 9:44 PM

#

bitter turtle I think the data is repeated here. Afaict, you can only get 16GB of 512-dim acti...

I'm doing just pile, which ~~should be default.~~ nope. pile-10k is default. Thanks for mentioning this!

keen pivot Jun 28, 2023, 9:45 PM

#

bitter turtle Yeah, what did the toy model plateau at?

Usually ~1, but sometimes lower for larger sizes (I vaguely remember 80%, but never something like 30% which is what we're getting. Additionally, the toy result may just be caused by the way we scale feature frequency which would make additional features even less likely)

bitter turtle Jun 28, 2023, 10:12 PM

#

keen pivot I'm doing just pile, which ~~should be default.~~ nope. pile-10k is default. Tha...

Yeah pile is fucking big like the first shard is 500GB of text

keen pivot Jun 28, 2023, 10:15 PM

#

I may or may not be downloading it on 4 different rented GPU's to learn dicts for all Pythia layers.

#

I am currently committing to the MLP post activation for Neel stuff, but next week, I'd like to investigate more into residual stream if you haven't solved it all yourself by then!

bitter turtle Jun 28, 2023, 10:16 PM

#

keen pivot @bitter turtle There is a degradation in residual stream (top one is late...

I'm also not sure how significant the 2nd SF is.

keen pivot Jun 28, 2023, 10:17 PM

#

bitter turtle I'm also not sure how significant the 2nd SF is.

What's 2nd SF? Also this was pile-10k I think, so it was just the same 16GB

bitter turtle Jun 28, 2023, 10:18 PM

#

What I meant is I'm not sure how significant a change the 2.95 to 2.91 is

bitter turtle Jun 28, 2023, 10:19 PM

#

keen pivot Also, I'd expect it to not really plateau here at 30% of 512 (the residual strea...

Anyway wrt this I'm going to try and get these results with synthetic data and see what happens with that

keen pivot Jun 29, 2023, 1:58 AM

#

Notice something interesting. It seems the best l1 value changes. Here the top one is later & 3e-5 appears to do better than 3e-4 (which starts to degrade/stagnate). This is running on 20GB of pile for Pythia-layer-4, repeated for 5 mini-runs. The above is between mini-run3 & 4 (w/ 0 indexing).

#

This is important because one way to pick an l1 value is to just run it & see which one does better, but here, you don't see it until 20GB*5 repeats in! This replicated across layers 1 & 5 (I'm still waiting on layer 3, which is just taking 2x as long as the others 🤷) Edit: Layer3 also has the same behavior.

#

In response, I'm running them all again for 60GB repeated 5 times (3x the GB) to see if optimal L1 shifts any more left. The idea here is to just spend a lot of compute & maybe backtrack how we could determine optimal L1 from other corollaries.

#

I'm also intending on saving to aws every mini-run, so any of y'all can see the dict. Additionally, we can check MCS between dicts at different L1's to see if there is a better l1 value for every size dict.

#

Thanks @pallid current for coding up the mini-run code, wandb, & saving to aws. You're amazing!:)

keen pivot Jun 29, 2023, 4:21 AM

#

#

Super weird jump to 70% of 2k features in layer 1. That's the most features I've seen.

#

Just wait until I wake up & these runs are done. This isn't even my final form.

pallid current Jun 29, 2023, 8:06 AM

#

bitter turtle Ok so i ran a test on non-repeated data on the residual stream of pythia and hes...

those look really good! looks worth really cranking up the n_mini_runs, and also the dict size and possibly the l1 if it's not a problem at the highest val you tested

#

i can try testing those runs on our new stronger cluster soon, though need to write some parallelism code.

#

@bitter turtle could you add a PR to add an option to use residual stream please?

pallid current Jun 29, 2023, 8:12 AM

#

bitter turtle but like, I don't see why we should expect SAEs to work _at all_ on the MLP sinc...

here's a recent paper which uses sparse probing on the MLP output to find basic features, i think this is the closest thing to motivation for the MLP work that i know of https://arxiv.org/abs/2305.01610

arXiv.org

Finding Neurons in a Haystack: Case Studies with Sparse Probing

Despite rapid adoption and deployment of large language models (LLMs), the
internal computations of these models remain opaque and poorly understood. In
this work, we seek to understand how high-level human-interpretable features
are represented within the internal neuron activations of LLMs. We train
$k$-sparse linear classifiers (probes) on th...

pallid current Jun 29, 2023, 8:35 AM

#

bitter turtle yeah I think pre would work better, but probably still worse than the residual s...

pre is just a rotation of a subspace of the residual stream so i agree residual stream > pre_non_lin but i still think there's good reason this might work on postnonlin. those recent results make it look like we should focus on residual stream

#

if we're doing that we should also really think about how what we're doing builds on, rather than just replicates, the transformer factors paper that i'm always posting https://arxiv.org/abs/2103.15949

arXiv.org

Transformer visualization via dictionary learning: contextualized e...

Transformer networks have revolutionized NLP representation learning since
they were introduced. Though a great effort has been made to explain the
representation in transformers, it is widely recognized that our understanding
is not sufficient. One important reason is that there lack enough visualization
tools for detailed analysis. In this pap...

pallid current Jun 29, 2023, 8:41 AM

#

bitter turtle Yeah pile is fucking big like the first shard is 500GB of _text_

it's big but not that big, the first shard is more like 30GB

pallid current Jun 29, 2023, 8:51 AM

#

keen pivot

wtf are these results?? 70% in 2k but 0 in 4k?

keen pivot Jun 29, 2023, 9:15 AM

#

pallid current wtf are these results?? 70% in 2k but 0 in 4k?

RIGHT!???

keen pivot Jun 29, 2023, 9:17 AM

#

pallid current wtf are these results?? 70% in 2k but 0 in 4k?

Lololololol

#

This is layer 1 though. I've normally been doing layer 2, so I don't know the effect there.

#

Layer 4

#

Layer 5

pallid current Jun 29, 2023, 9:19 AM

#

waaaaait, is the MMCS where you remove the high MCS feats screwing something

#

like you've removed all the feats in the 4K dict?

keen pivot Jun 29, 2023, 9:20 AM

#

This is just our normal code

pallid current Jun 29, 2023, 9:20 AM

#

yaaaaa but like, no way that first graph you sent is right

#

suuuurely not

keen pivot Jun 29, 2023, 9:22 AM

#

I do do (lol) 1024 batch_size

#

So 🤷

pallid current Jun 29, 2023, 9:22 AM

#

ok i see you're not like removing the feats from the dict in the MMCS code (i dont understand the hungarian thing)

#

but like......... wtf, can we see the histograms?

keen pivot Jun 29, 2023, 9:23 AM

#

pallid current ok i see you're not like removing the feats from the dict in the MMCS code (i do...

hungarian thing just does the best matching thing given all vectors Cosine_sim
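For anyone else unsure about the Hungarian step, here is a minimal sketch of what "best matching given all cosine sims" can mean: a one-to-one assignment between two dictionaries that maximises total cosine similarity, using SciPy's `linear_sum_assignment`. The function name `hungarian_match` and the rows-are-features layout are assumptions; this is not the project's actual MMCS code.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_match(dict_a, dict_b):
    """One-to-one matching of features (rows) by cosine similarity."""
    a = dict_a / np.linalg.norm(dict_a, axis=1, keepdims=True)
    b = dict_b / np.linalg.norm(dict_b, axis=1, keepdims=True)
    sims = a @ b.T
    # linear_sum_assignment minimises cost, so negate to maximise similarity.
    rows, cols = linear_sum_assignment(-sims)
    return rows, cols, sims[rows, cols]

# Toy check: dict_b is just dict_a with its rows shuffled.
rng = np.random.default_rng(0)
dict_a = rng.normal(size=(5, 8))
perm = np.array([2, 0, 4, 1, 3])
dict_b = dict_a[perm]

rows, cols, matched_sims = hungarian_match(dict_a, dict_b)
print(cols, matched_sims)
```

Unlike taking a plain max over the other dictionary, each feature in `dict_b` can be claimed by at most one feature in `dict_a`.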

#

Layer 4:

#

Layer 1:

#

Like, I think this works then. Of course I need to do more checks on the actual features, and we'd need to figure out how to get consistent results, but like: just use a lot of compute to overcome our ignorance for proof-of-concept

#

Layer 3 is still like veeeery slow. I'll let it finish out, but man, it's only done 2/5 mini-runs & everything else is done:

pallid current Jun 29, 2023, 9:29 AM

#

keen pivot Like, I think this works then. Of course I need to do more checks on the actual ...

lol i dont know how you can look at that and think it works, to me it screams that something is broken

#

what do the other metrics look like for the 8k and 16k dicts?

keen pivot Jun 29, 2023, 9:29 AM

#

pallid current lol i dont know how you can look at that and think it works, to me it screams th...

Just unstable

pallid current Jun 29, 2023, 9:30 AM

#

so that's 2000 high MCS feats for a 2k dimensional MLP?

keen pivot Jun 29, 2023, 9:31 AM

#

pallid current what do the other metrics look like for the 8k and 16k dicts?

which ones? I also think I overwrite our normal variables using wandb 😢

#

Like every mini-run

keen pivot Jun 29, 2023, 9:36 AM

#

pallid current so that's 2000 high MCS feats for a 2k dimensional MLP?

Yep! & The 4k might have more! I'm saying it might be unstable because the 4k relies on the 8k to be good, and the larger dict just dies? Ya still so weird

pallid current Jun 29, 2023, 9:37 AM

#

yeah agreed

bitter turtle Jun 29, 2023, 10:19 AM

#

keen pivot Notice something interesting. It seems the best l1 value changes. Here the top o...

Given how noisy this is, I don't want to update too heavily on this. We should probably do a bunch of runs and average them. Also, yeah, didn't Dan say something about annealing L1 over time?

bitter turtle Jun 29, 2023, 10:20 AM

#

keen pivot

Which one is which?

bitter turtle Jun 29, 2023, 10:28 AM

#

pallid current lol i dont know how you can look at that and think it works, to me it screams th...

Real

keen pivot Jun 29, 2023, 10:31 AM

#

bitter turtle Which one is which?

This one is the first two mini_runs for layer 1

bitter turtle Jun 29, 2023, 10:31 AM

#

Left is run 1, right is run 2?

keen pivot Jun 29, 2023, 10:32 AM

#

bitter turtle Left is run 1, right is run 2?

Sorry. Left is later (run 2, though technically 1 w/ 0-indexing)

keen pivot Jun 29, 2023, 10:33 AM

#

bitter turtle Given how noisy this is, I don't want to update too heavily on this. We should p...

Oh ya, it's unclear which l1 value is ultimately best & how to determine that w/o running all the experiments.

#

One thing to note w/ residual & MLP is the sparsity of the MLP w/ tiny l1s (where layer 1 did best) is ~800 or ~400 for 2k & 4k respectively. Sparsity is calculated as average features per token. That's just crazy. Not real.

Residual stream has like 2-3, lol
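The sparsity number quoted here (average features per token) can be sketched as a count of non-zero codes per row, averaged over tokens. The name `mean_active_features` and the `eps` threshold are illustrative assumptions.

```python
import numpy as np

def mean_active_features(codes, eps=0.0):
    """Average number of features with activation > eps per token (row)."""
    return (codes > eps).sum(axis=1).mean()

# Three tokens, four features: 2, 0 and 3 active features respectively.
codes = np.array([
    [0.0, 1.2, 0.0, 0.3],
    [0.0, 0.0, 0.0, 0.0],
    [0.5, 0.1, 0.7, 0.0],
])
sparsity = mean_active_features(codes)
print(sparsity)
```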

pallid current Jun 29, 2023, 10:42 AM

#

high features per token is (unsurprisingly) connected to the low l1 val, it goes way lower for the higher l1 vals

keen pivot Jun 29, 2023, 10:46 AM

#

Ya, but like, high MCS? So there's at least a converged way to learn to "sparsely" reconstruct between the 2k & 4k dictionaries.
Edit: They could both learn the identity. Especially w/ such a low l1 value. I can check this!

#

Also, hoagy, you know how to download aws bucket stuff from command line? I tried
aws s3api get-object --bucket DOC-EXAMPLE-BUCKET1 --key dir/my_images.tar.bz2 my_downloaded_image.tar.bz2, equivalent for mine, but it gave me a Syntax error in their code 😢 (theirs uses python2).

I can look it up more later (atm just manually downloading locally, then uploading); not a bottleneck!

pallid current Jun 29, 2023, 11:50 AM

#

keen pivot Also, hoagy, you know how to download aws bucket stuff from command line? I trie...

no, whenever i've done it i do it via my python utils (can just spin up a python instance)

pallid current Jun 29, 2023, 2:19 PM

#

i also ran 50 auto interps random vs neuron_basis on pythia70M, still no distinction 🤔

#

now very confused, might be a layer thing, might be a bug i dunno

#

anyone got a good recent pythia sparse coding run to do a comparison to?

pallid current Jun 29, 2023, 2:44 PM

#

honestly im so confused by the recent set of results, i'd really like to talk it through with someone soon

bitter turtle Jun 29, 2023, 2:58 PM

#

pallid current anyone got a good recent pythia sparse coding run to do a comparison to?

i've a couple for the residual stream on aws

bitter turtle Jun 29, 2023, 3:20 PM

#

@pallid current I've submitted a PR for vectorised ensemble training for SAEs, should be more performant and gpu-utilising, the merge looks a bit horrific, I'll let you deal with that 😬

#

I'm also going to do the transformers toy model in a different repo, i cba to deal with the merge conflicts atm

#

@keen pivot if you have the energy to change the training loop code (I don't unfortunately), the vectorised code should be a lot better for hyperparameter tuning

keen pivot Jun 29, 2023, 3:29 PM

#

bitter turtle @keen pivot if you have the energy to change the training loop code (I...

I definitely can't this week, but it looks good!

bitter turtle Jun 29, 2023, 6:36 PM

#

Hi @keen pivot @bronze wraith @pallid current, I think we should have a chat about standardising some metrics at some point in future (I want to establish some sort of principled way to reason about goodness-of-extraction-procedure, and maybe implement a testbed for comparing extraction procedures robustly); also @pallid current said something about standardising PRs, tests, etc

keen pivot Jun 30, 2023, 1:06 AM

#

Found a good example to show the "features get lost over time" thing

bitter turtle Jun 30, 2023, 2:01 AM

#

What's this on?

keen pivot Jun 30, 2023, 3:01 AM

#

Layer 1 Pythia. 20GB*20 times (fresh data though)

bitter turtle Jun 30, 2023, 10:54 AM

#

MLP?

pallid current Jun 30, 2023, 2:42 PM

#

@keen pivot what's your situation with neel and final projects?

#

i think we should do a meeting on monday to plan what we need to do next because it's starting to feel scattered again

#

my tentative plan is that we should identify all of the variables of interest, do a bit of work to make the code more parallelizable, and then use the eleutherAI cluster to do a much broader sweep than we've done so far, so that we can have a single pile of data to look at and draw conclusions from (and then presumably branch off a bit)

keen pivot Jun 30, 2023, 3:03 PM

#

pallid current @keen pivot what's your situation with neel and final projects?

I’ve got a few awesome features to show and a few failures (like logit lens on feature direction)

#

Will be done today (interview is tomorrow, hear back on acceptance Monday)

pallid current Jun 30, 2023, 3:04 PM

#

good luck!

bitter turtle Jun 30, 2023, 6:36 PM

#

pallid current i think we should do a meeting on monday to plan what we need to do next because...

I might be a bit pressed for time on Monday (moving house) might be available in the evening tho

bitter turtle Jun 30, 2023, 6:36 PM

#

keen pivot I’ve got a few awesome features to show and a few failures (like logit lens on f...

Oh, what happened with logit lens on feature direction?

keen pivot Jun 30, 2023, 6:38 PM

#

bitter turtle Oh, what happened with logit lens on feature direction?

It usually just shows nonsense tokens. Tuned lens might help?

Ablating that feature direction does tend to have a meaningful effect (e.g. ablating the "www." feature affects urls)

bitter turtle Jun 30, 2023, 6:39 PM

#

On gpt-2 or pythia?

keen pivot Jun 30, 2023, 8:01 PM

#

bitter turtle On gpt-2 or pythia?

pythia

keen pivot Jun 30, 2023, 9:15 PM

#

I think I'm done. I can't make changes past 9pm my time, but feel free to look at it!
https://docs.google.com/document/d/1KqHe9NL9NuJ_yaKJc__eX6kjBtRcROePoLApjOhGBUU/edit?usp=sharing

Google Docs

Interpreting Sparse Dictionaries

Interpreting Sparse Dictionaries Sparse dictionaries can learn meaningful features, w/ max cosine similarity (MCS) correlating w/ meaning & monosemanticity. Note: MCS is between two different dictionaries (ie rows in the decoder’s linear layer) w/ the intuition being if two dicts learn the same ...

keen pivot Jul 1, 2023, 9:33 AM

#

Okay, I've got a few sources of evidence that the Pythia dictionary that learned ~2k features actually did converge on identity:

Several features do look polysemantic like normal neuron interp
Looking at one feature, it only has 1 neuron that activates >0.3 for the same max-activating datapoints & 1 outlier positive weight in the decoder (ie activating it only reconstructs one neuron)
The L1 was way lower than expected
Sparsity was crazy high (400 or 800 features/datapoint)
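One way to turn the "1 outlier weight in the decoder" observation into a dictionary-wide number, as a rough sketch: for each decoder row, take the cosine similarity with its nearest neuron axis, and report the fraction of rows above a cutoff. The 0.9 cutoff and the name `single_neuron_fraction` are arbitrary assumptions, and rows-as-features is an assumed layout.

```python
import numpy as np

def single_neuron_fraction(D, threshold=0.9):
    """Fraction of decoder rows that point (almost) along a single neuron axis."""
    unit = D / np.linalg.norm(D, axis=1, keepdims=True)
    top = np.abs(unit).max(axis=1)   # |cos| with the closest neuron basis vector
    return (top > threshold).mean()

rng = np.random.default_rng(0)
near_identity = np.eye(16) + 0.01 * rng.normal(size=(16, 16))   # "learned the identity"
random_dict = rng.normal(size=(16, 64))                         # generic dense directions

frac_identity = single_neuron_fraction(near_identity)
frac_random = single_neuron_fraction(random_dict)
print(frac_identity, frac_random)
```

A value close to 1 would be consistent with the dictionary having converged on the neuron basis; it does not by itself say anything about whether the features are meaningful.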

keen pivot Jul 1, 2023, 10:51 AM

#

@pallid current @bronze wraith @bitter turtle I'm getting meaningful, monosemantic features for even low-MCS features in Pythia-mlp-layer-1. This was trained on 60GB*5 (repeating data). For MCS-above 0.9, it went from 55% to 45% during training, but looking at several features, they all seem meaningful.

pallid current Jul 1, 2023, 10:55 AM

#

Oh sweet, if you send me the dict I'll run a comparison on autointerp with random and neuron basis

keen pivot Jul 1, 2023, 10:55 AM

#

How many GB is pile-10k?

pallid current Jul 1, 2023, 10:55 AM

#

In activations?

keen pivot Jul 1, 2023, 10:57 AM

#

I think. I want to compare when I do n-chunks=30 for Pile, I think that's 60GB.

keen pivot Jul 3, 2023, 6:22 PM

#

pallid current my tentative plan is that we should identify all of the variables of interest, d...

@bitter turtle @bronze wraith , We could have a voice call again tomorrow, same time as last (GMT-16:00, 12pm eastern, 5pm UK)

I can write a few topics-to-discuss beforehand, so the call is short & useful. Please include what you'd like to talk about!

#

From Hoagy earlier:

my tentative plan is that we should identify all of the variables of interest, do a bit of work to make the code more parallelizable, and then use the eleutherAI cluster to do a much broader sweep than we've done so far, so that we can have a single pile of data to look at and draw conclusions from (and then presumably branch off a bit)
Notably, we do already have a few dictionaries on the bucket to look at now, which may inform what type of information we're missing (to code and get for the next iteration of dicts)

bitter turtle Jul 3, 2023, 6:25 PM

#

5pm is maybe a little early for me unfortunately, how is GMT-17:00?

keen pivot Jul 3, 2023, 6:31 PM

#

For me, top of my head:

Have evidence that the features we learned are indeed meaningful (intend to write an LW post)
Evidence that dicts on pythia-layer-1 learns meaningful features for even low-MCS

Proposal: Can look at dicts that go low-to-high-to-low MCS, see if it ends w/ meaningful features & what happens

General Proposal: Look at "features across checkpoints" pythia style, w/ more frequent checkpoints earlier. I'm suggesting spending ~20 minutes/feature (after time spent writing functions) .

bitter turtle Jul 3, 2023, 6:35 PM

#

How are you currently evaluating the meaningfulness of features (in the case without autointerp)?

keen pivot Jul 3, 2023, 7:56 PM

#

bitter turtle How are you currently evaluating the meaningfulness of features (in the case wit...

Like that doc I sent earlier, let me re-link: https://docs.google.com/document/d/1KqHe9NL9NuJ_yaKJc__eX6kjBtRcROePoLApjOhGBUU/edit?usp=sharing

Google Docs

Interpreting Sparse Dictionaries

Interpreting Sparse Dictionaries Sparse dictionaries can learn meaningful features, w/ max cosine similarity (MCS) correlating w/ meaning & monosemanticity. Note: MCS is between two different dictionaries (ie rows in the decoder’s linear layer) w/ the intuition being if two dicts learn the same ...

#

I mainly focused on gathering a lot of information that helps narrow down the hypothesis of what property of inputs that feature is activating on.

This can be repeated for the output (as in, we think the model has developed a discriminator for property X, which was useful for predicting Y, so gain info & test for that), as well as intermediate layer's features.

#

Btw. the doc is like 15 pages, mostly images, only 1k words. I can answer any questions about it!

bronze wraith Jul 3, 2023, 10:56 PM

#

Hi @keen pivot , read the doc, very cool! Some thoughts:

1. I had a bit of trouble following what was being shown in each image. Captions might help?
2. One way we could test robustness of our understandings of the features: after getting the text description (e.g. "period after www"), we/GPT write new text we think will activate the neuron, then run it through the LM again and test if we can activate the neuron on demand.
3. I'm worried by the fact that our features are ~half punctuation/math/urls, it makes me think we're picking up just a certain kind of neuron, but "useful" semantic meaning is either not in this part of the LM or is less able to be found by our method.

keen pivot Jul 3, 2023, 11:38 PM

#

bronze wraith Hi @keen pivot , read the doc, very cool! Some thoughts: 1. I had a bi...

Thanks Robert!

1. Agreed. Writing a post and clarifying pics better currently.
2. This is covered in the “created examples” and the token search section
3. This is Pythia-70M in an earlier layer, so maybe it doesn’t have such high level features. I could train one on mid layer 13 Billion though

pallid current Jul 4, 2023, 12:51 PM

#

from looking at the neuron-in-haystack paper (https://arxiv.org/pdf/2305.01610.pdf) they also used pythia70M and found a lot of sparse, meaning-related action in the early layers, exactly the kind of thing that we'd want to be able to find

#

so i dont think it's just a scale issue

keen pivot Jul 4, 2023, 1:16 PM

#

pallid current so i dont think it's just a scale issue

maybe a skill issue instead?

keen pivot Jul 4, 2023, 1:19 PM

#

pallid current from looking at the neuron-in-haystack paper (https://arxiv.org/pdf/2305.01610.p...

Can you give an example?
If it's:

> but neuron 1B.L6.N3108
> activates on return if and only if it is in the context of Go code
I found two like this, w/ one being the opening $ in latex

pallid current Jul 4, 2023, 1:20 PM

#

keen pivot Can you give an example? If it's: > but neuron 1B.L6.N3108 > activates on retur...

looking at section 5.1

keen pivot Jul 4, 2023, 1:21 PM

#

Like general french neuron or Go (programming language) neuron?

keen pivot Jul 4, 2023, 1:24 PM

#

pallid current looking at section 5.1

5.1 is compound words, which were also found (but not necessarily the same ones)

pallid current Jul 4, 2023, 1:41 PM

#

keen pivot 5.1 is compound words, which were also found (but not necessarily the same ones)

which ones are you thinking of? my general feeling looking at that is that they're showing superposition at a much much more granular level, whereas the things we're finding, or at least managing to interpret are much broader

#

i'm wondering whether we should use the neuron in a haystack approach to find some directions which we reckon are exactly the kind of thing we would hope to find, and try to check that sparse coding is actually something that's capable of finding it

keen pivot Jul 4, 2023, 1:53 PM

#

pallid current which ones are you thinking of? my general feeling looking at that is that they'...

https://docs.google.com/spreadsheets/d/1p1Wu4vJ1fKYsMtjrXFboQpIOl_sKue-dgabFt2EYw0s/edit?usp=sharing

Google Docs

Neurons

Top MCS

neuron_id,MCS,Features,Monosematic like:,Repeats
0,0.9983,Newlines, periods, unclear,0
1,0.9895,Unclear,0
2,0.9873,www, decreased by http://,1
3,0.9859,period after www,1
4,0.9852,quotation mark after another quotation mark, but maybe a period before?,1
5,0.9802,the word type after a hyp...

#

I think some of these are pretty specific? (edit: e.g. just "x", & I know there are other char-level features like "w", & ", and") I also expect other re-tokenization words (like Harvard) to be here in layer 2 & also layer 1.

keen pivot Jul 4, 2023, 1:56 PM

#

pallid current i'm wondering whether we should use the neuron in a haystack approach to find so...

Though I'm all for this! Wes sent me the link to the datasets used (https://www.dropbox.com/scl/fo/sb5jwfki7t4kvk2rr38t0/h?dl=0&rlkey=gor3lctozovy8417p8k27zbun).

Which should be easy to check once you've trained a k-sparse probe.

bronze wraith Jul 4, 2023, 2:32 PM

#

bitter turtle 5pm is maybe a little early for me unfortunately, how is GMT-17:00?

Just to confirm, we've moved the meeting to this time slot?

bronze wraith Jul 4, 2023, 2:36 PM

#

pallid current from looking at the neuron-in-haystack paper (https://arxiv.org/pdf/2305.01610.p...

reading this paper now, are we doing anything about this?

> This example also underscores the dangers of “interpretability illusions” caused by interpreting neurons using just the maximum activating dataset examples

pallid current Jul 4, 2023, 2:54 PM

#

bronze wraith reading this paper now, are we doing anything about this? > This example also un...

not really, the autointerp stuff does top-and-random scoring which tests ability to work on random as well as top

#

but it doesn't do anything to rule out the possibility of it also responding to other things

bronze wraith Jul 4, 2023, 3:29 PM

#

pallid current not really, the autointerp stuff does top-and-random scoring which tests ability...

if autointerp does both top-and-random then that should give us some insurance against that "interpretability illusion"

bitter turtle Jul 4, 2023, 4:55 PM

#

bronze wraith Just to confirm, we've moved the meeting to this time slot?

I think so yeah. @keen pivot @pallid current ?

bitter turtle Jul 4, 2023, 4:55 PM

#

bronze wraith reading this paper now, are we doing anything about this? > This example also un...

This is my primary concern with current things

keen pivot Jul 4, 2023, 4:56 PM

#

bronze wraith if autointerp does both top-and-random then that should give us some insurance a...

I'd recommend reading my doc. I know the images aren't good & I'm making a better one not optimized for Neel. I do uniform examples

bitter turtle Jul 4, 2023, 4:57 PM

#

Haven't really been able to read through your doc unfortunately logan (just got back from a school meetup thing actually)

keen pivot Jul 4, 2023, 4:58 PM

#

bitter turtle I think so yeah. @keen pivot @pallid current ?

Hoagy also is going. I chatted w/ him earlier today

bitter turtle Jul 4, 2023, 5:00 PM

#

Now is GMT-17:00 right?

bronze wraith Jul 4, 2023, 5:11 PM

#

An approach I saw for dictionary learning: https://en.wikipedia.org/wiki/K-SVD

K-SVD

In applied mathematics, k-SVD is a dictionary learning algorithm for creating a dictionary for sparse representations, via a singular value decomposition approach. k-SVD is a generalization of the k-means clustering method, and it works by iteratively alternating between sparse coding the input data based on the current dictionary, and updating ...

pallid current Jul 4, 2023, 6:00 PM

#

reposting the meeting notes doc here https://docs.google.com/document/d/1J7tGoAhlTqrFfqHgAjehe3HANX2wmb7RlxeVwiEHslM/edit?usp=sharing

Google Docs

Sparse Dictionary Notes: 7/04/2023

Results: Concerns: Dictionary has to reconstruct negative activations & noise? Learned features that are low-MCS Experiments: (logan) - Provide rigorous explanation of low-MCS meaningful features Provide random directions to compare (logan ask Robert) Example of a feature that could be spread ...

#

also i wanted to say before logan left, lee's pushing to get a meeting with olah and anthropic types over the next week or two

#

i'll try to get you guys included which should be possible, if not then i can grab questions from you here

bitter turtle Jul 4, 2023, 8:42 PM

#

For targeting specific features we could literally just do feature learning on e.g. truthful QA activations

keen pivot Jul 4, 2023, 9:55 PM

#

Nora convinced me that one metric to use is model editing metrics. Specifying which model editing metrics would be clarifying on its own, and also help compare different ways of training.

keen pivot Jul 4, 2023, 9:55 PM

#

bitter turtle For targeting specific features we could literally just do feature learning on e...

There are a few different baselines to compare against here which would be nice to contrast!

keen pivot Jul 4, 2023, 9:57 PM

#

bitter turtle For targeting specific features we could literally just do feature learning on e...

Though I will say some TruthfulQA answers aren’t the more truthful, but the more weird answer. I’ll need to give more details later.

bitter turtle Jul 5, 2023, 1:34 AM

#

keen pivot Nora convinced me that one metric to use is model editing metrics. Specifying wh...

Wdym

keen pivot Jul 5, 2023, 1:43 AM

#

bitter turtle Wdym

Any sort of model steering (like make the model more honest or only perform circuit X) can be compared with previous work on model editing.

bitter turtle Jul 5, 2023, 1:44 AM

#

Oh right sure

bitter turtle Jul 5, 2023, 1:46 AM

#

bitter turtle For targeting specific features we could literally just do feature learning on e...

@keen pivot would definitely like to do that with this

#

So like train on activations generated on e.g. truthful QA then select+edit features with the most variance between+consistency within true/false classes (basically VINC but manually on directions restricted to those found by sparse dicts)
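A rough sketch of the selection step described above: score each dictionary feature by how much its mean activation differs between true/false examples relative to its spread within each class. This is a plain Fisher-style ratio used as a stand-in, not VINC itself; the function name, the toy data, and the planted feature index are all assumptions for illustration.

```python
import numpy as np

def class_separation_scores(codes, labels):
    """Per-feature between-class variance divided by within-class variance.

    codes:  (n_examples, n_features) sparse dictionary activations
    labels: (n_examples,) array of 0 (false) / 1 (true)
    """
    true_codes, false_codes = codes[labels == 1], codes[labels == 0]
    between = (true_codes.mean(axis=0) - false_codes.mean(axis=0)) ** 2
    within = true_codes.var(axis=0) + false_codes.var(axis=0)
    return between / (within + 1e-8)  # small constant avoids division by zero

rng = np.random.default_rng(0)
labels = np.arange(200) % 2
codes = np.abs(rng.normal(size=(200, 8)))
codes[:, 3] += 5.0 * labels  # plant one feature that tracks the label

scores = class_separation_scores(codes, labels)
best_feature = int(np.argmax(scores))  # candidate direction to edit
```

The top-scoring features would then be the ones to try editing, with the usual caveat that a high score on one dataset does not show the feature means "truth".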

bitter turtle Jul 5, 2023, 1:55 AM

#

bronze wraith An approach I saw for dictionary learning: https://en.wikipedia.org/wiki/K-SVD

this seems functionally equivalent to sparse autoencoders

pallid current Jul 5, 2023, 9:46 AM

#

bitter turtle this seems functionally equivalent to sparse autoencoders

agreed in spirit it's similar, though the approach i think is quite different in the fact that it can learn the dictionary activations however it likes, rather than being Relu(Linear(Y)), and the practice of optimizing dict elements 1 by 1 i imagine would lead to very different results

bronze wraith Jul 5, 2023, 2:33 PM

#

bitter turtle this seems functionally equivalent to sparse autoencoders

+1 to hoagy's reply: it has the same inputs and outputs (activations and a dictionary of features, respectively), but the internal process is different. If we're lucky, k-SVD would be some combination of faster/make more accurate reconstructions/find "more meaningful" features compared to the autoencoder.
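To make the "internal process is different" point concrete, here is a minimal NumPy sketch of one k-SVD iteration: a sparse coding step (greedy orthogonal matching pursuit, which allows signed coefficients) followed by updating dictionary atoms one at a time via a rank-1 SVD. It is a textbook-style illustration, not the nel215/ksvd library or anything the project ran; `omp`, `ksvd_step`, and the toy sizes are assumed names and values.

```python
import numpy as np

def omp(D, x, k):
    """Greedy orthogonal matching pursuit: code x with at most k atoms (columns of D), k >= 1."""
    residual, support = x.copy(), []
    for _ in range(k):
        j = int(np.argmax(np.abs(D.T @ residual)))  # atom most correlated with residual
        if j not in support:
            support.append(j)
        sol, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)  # refit on chosen atoms
        residual = x - D[:, support] @ sol
    coef = np.zeros(D.shape[1])
    coef[support] = sol
    return coef

def ksvd_step(D, X, k):
    """One k-SVD iteration. X: (d, n) data columns, D: (d, m) unit-norm atoms."""
    D = D.copy()
    C = np.stack([omp(D, X[:, i], k) for i in range(X.shape[1])], axis=1)  # (m, n)
    for j in range(D.shape[1]):
        users = np.nonzero(C[j])[0]  # samples whose code uses atom j
        if users.size == 0:
            continue
        # Reconstruction error on those samples with atom j's contribution removed.
        E = X[:, users] - D @ C[:, users] + np.outer(D[:, j], C[j, users])
        U, S, Vt = np.linalg.svd(E, full_matrices=False)
        D[:, j] = U[:, 0]           # best rank-1 fit gives the new atom...
        C[j, users] = S[0] * Vt[0]  # ...and its new coefficients
    return D, C

rng = np.random.default_rng(0)
D0 = rng.normal(size=(8, 12))
D0 /= np.linalg.norm(D0, axis=0)
X = rng.normal(size=(8, 50))
D1, C1 = ksvd_step(D0, X, k=3)
```

Note there is no encoder here at all: the codes come from a per-sample search rather than Relu(Linear(Y)), and sparsity is a hard limit of k atoms rather than an L1 penalty.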

pallid current Jul 5, 2023, 3:02 PM

#

still not sure why me and logan are getting different interp results but i am finding a difference between neuron_basis and sparse coding on pythia layer 1

#

#

dotted lines are 2SD from mean

#

will increase sample size later, its busy trying to perform ICA atm

#

on the other hand, not much evidence of relationship between MCS and autointerp score

#

#

relationship might eventually prove 'significant' but high correlation looks unlikely

keen pivot Jul 5, 2023, 3:04 PM

#

pallid current on the other hand, not much evidence of relationship between MCS and autointerp ...

For this layer 1, ya I think lower MCS was still meaningful to me.

#

I can show that hopefully today or tomorrow!

bitter turtle Jul 5, 2023, 3:12 PM

#

bronze wraith +1 to hoagy's reply: it has the same inputs and outputs (activations and a dicti...

I mean, since it has the same objective, I guess I'd be surprised if it didn't converge on the same dist of dictionaries. The activation thing hoagy pointed out is a very good point tho, I guess maybe we should look at testing more powerful encoders/making the decoder an affine transformation of the sparse dictionary for the big sweep.

pallid current Jul 5, 2023, 3:13 PM

#

> making the decoder an affine transformation
what dya mean?

bitter turtle Jul 5, 2023, 3:13 PM

#

pallid current

do you still think this is a bug on your end or a more fundamental thing?

bitter turtle Jul 5, 2023, 3:14 PM

#

pallid current > making the decoder an affine transformation what dya mean?

currently, our setup looks like
dict = ReLU(Affine(x))
output = Linear(dict)

maybe we should make the output = Affine(dict)

#

my intuition is that that should 'strengthen' the abilities of the encoder slightly
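The two setups above, written out as a small NumPy sketch. Shapes, names, and the exact loss form (summed squared error plus an L1 term on the codes) are assumptions for illustration; the project's training code may differ.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec=None):
    """x: (n, d); W_enc, W_dec: (m, d); b_enc: (m,); optional b_dec: (d,)."""
    codes = np.maximum(0.0, x @ W_enc.T + b_enc)  # dict = ReLU(Affine(x))
    recon = codes @ W_dec                         # output = Linear(dict)
    if b_dec is not None:
        recon = recon + b_dec                     # output = Affine(dict)
    return codes, recon

def sae_loss(x, codes, recon, l1_coef):
    """Mean per-example squared reconstruction error plus L1 penalty on the codes."""
    l2 = np.mean(np.sum((x - recon) ** 2, axis=1))
    l1 = np.mean(np.sum(np.abs(codes), axis=1))
    return l2 + l1_coef * l1

rng = np.random.default_rng(0)
n, d, m = 8, 4, 16  # toy sizes
x = rng.normal(size=(n, d))
W_enc, b_enc = rng.normal(size=(m, d)), np.zeros(m)
W_dec = rng.normal(size=(m, d))

codes, recon = sae_forward(x, W_enc, b_enc, W_dec)                        # linear decoder
codes_b, recon_b = sae_forward(x, W_enc, b_enc, W_dec, b_dec=np.ones(d))  # affine decoder
loss = sae_loss(x, codes, recon, l1_coef=1e-3)
```

The only difference between the two variants is the constant vector added after decoding; the codes are identical.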

bitter turtle Jul 5, 2023, 3:16 PM

#

bronze wraith +1 to hoagy's reply: it has the same inputs and outputs (activations and a dicti...

also good point that it might converge faster

bronze wraith Jul 5, 2023, 3:16 PM

#

bitter turtle I mean, since it has the same objective, I guess I'd be surprised if it didn't c...

One possible reason they'd come to different results is the L1 penalty, which is in the SA but not in k-SVD. The SA is willing to accept a poor reconstruction (even sending every vector to 0) if it minimizes activations enough. In contrast k-SVD doesn't have an L1 penalty, so its reconstructions dont skew towards 0.

bitter turtle Jul 5, 2023, 3:16 PM

#

oh shit yeah my fault i am blind

#

@bronze wraith where was the implementation you mentioned for K-SVD?

bronze wraith Jul 5, 2023, 3:29 PM

#

bitter turtle @bronze wraith where was the implementation you mentioned for K-SVD?

It’s a python library, ksvd! https://github.com/nel215/ksvd

GitHub

GitHub - nel215/ksvd: A ksvd implementation written in python.

A ksvd implementation written in python. Contribute to nel215/ksvd development by creating an account on GitHub.

#

Right now I’m playing with it and I’m finding it a little fiddly (it throws errors if your dimensionality is off)

bitter turtle Jul 5, 2023, 3:32 PM

#

right, we might need to come up with a batched, streaming implementation of that if we want to scale it @bronze wraith

keen pivot Jul 5, 2023, 3:35 PM

#

bitter turtle currently, our setup looks like dict = ReLU(Affine(x)) output = Linear(dict) ma...

It's unclear to me if adding a bias to the decoder would be good or bad (I really don't know!)

bitter turtle Jul 5, 2023, 3:37 PM

#

keen pivot It's unclear to me if adding a bias to the decoder would be good or bad (I reall...

same! one way it might be better is that it might allow the encoder more flexibility in denoising the input signal, but ofc (for MLP activations at least) we already see large biases, and it might just exacerbate those.

bronze wraith Jul 5, 2023, 3:46 PM

#

keen pivot It's unclear to me if adding a bias to the decoder would be good or bad (I reall...

I think adding a decoder bias is approximately equivalent to centering the original dataset at (0,0), so that might help you build intuition.

keen pivot Jul 5, 2023, 3:48 PM

#

For auto-interp, atm it's useful for detecting monosemanticity in the input distribution (ie we assume GPT-4 coming up w/ hypotheses & GPT-3.5 creating accurate predictions in held-out text correlates w/ the underlying feature having a simple description across the entire feature activation range). This can be repeated w/ the features effect on the output & other layer's features.

But, this still leaves open the "interestingness" of features. Two desirable properties are:

The features explain all behavior on the data distribution we care about (ie low reconstruction loss as discussed yesterday)
We can simply express any feature we actually care about (e.g. deception, honesty, australian accent, etc) using these features (w/ "simply" maybe just meaning sparse)

keen pivot Jul 5, 2023, 3:54 PM

#

bronze wraith I think adding a decoder bias is approximately equivalent to centering the origi...

When I look at decoder weights, there's a lot used for reconstructing the negative neuron activations from the GeLU. This seems like it'd be generally true for all features learned, but would cause a problem if multiple features activate at the same time because they'd try to reconstruct the original distribution, but would overlap w/ each other.

The learning process would probably learn correlations for features so each feature only handles 1/N% of the job for reconstructing normal neurons, but there will be noise. I was thinking the bias might help here, but I don't think so.

bitter turtle Jul 5, 2023, 3:57 PM

#

bronze wraith It’s a python library, ksvd! https://github.com/nel215/ksvd

I've found a paper discussing a GPU implementation of matching pursuit (~the sparse approximation algo that the original K-SVD algo used (I think it actually used OMP, can't remember)): https://arxiv.org/pdf/0809.1833.pdf, and the other part of K-SVD seems to be just SVD so I think we're good on the parallelisation front

bitter turtle Jul 5, 2023, 3:59 PM

#

bronze wraith I think adding a decoder bias is approximately equivalent to centering the origi...

for residual stream data yeah, different for MLP thingies maybe?

bitter turtle Jul 5, 2023, 4:00 PM

#

keen pivot When I look at decoder weights, there's a lot used for reconstructing the negati...

what do you mean 'negative neuron activations'?

#

oh, you mean the model uses GELU instead of ReLU and the decoder devotes features to those small negative parts?

bronze wraith Jul 5, 2023, 4:04 PM

#

bitter turtle for residual stream data yeah, different for MLP thingies maybe?

Let me clarify why i said that: fix some bias v in the decoder. Let's say we: 1. shift all inputs back by v (x -> x-v), 2. update the encoder biases to undo this, and 3. remove the bias vector from the decoder. These transformations together should not change the l2 loss term (both your inputs and outputs are shifted back by v, which cancels out), nor the l1 loss term (the encoder activations are exactly the same). So for any decoder-with-a-bias, you can make an exactly equivalent sparse autoencoder without a bias, on a shifted input set, which has the exact same loss. (Epistemic confidence: high)
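A quick numerical check of the three-step argument, as a sketch with random toy weights. All names and sizes are assumptions; "update the encoder biases" is taken to mean b -> b + Wv, which is what makes the pre-activations identical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m, l1_coef = 32, 6, 10, 1e-3
x = rng.normal(size=(n, d))
W, b = rng.normal(size=(m, d)), rng.normal(size=m)  # encoder weights and bias
D, v = rng.normal(size=(m, d)), rng.normal(size=d)  # decoder weights and bias

def loss(inputs, enc_bias, dec_bias):
    codes = np.maximum(0.0, inputs @ W.T + enc_bias)
    recon = codes @ D + dec_bias
    l2 = np.sum((inputs - recon) ** 2)
    l1 = np.sum(np.abs(codes))
    return l2 + l1_coef * l1, codes

# Original: decoder has bias v.
loss_with_bias, codes_with_bias = loss(x, b, v)
# Transformed: inputs shifted by v, encoder bias compensates, decoder bias removed.
loss_shifted, codes_shifted = loss(x - v, b + W @ v, np.zeros(d))

same_codes = np.allclose(codes_with_bias, codes_shifted)
same_loss = np.isclose(loss_with_bias, loss_shifted)
```

Both checks come out true up to floating-point error, since W(x - v) + b + Wv = Wx + b and (x - v) - Dc = x - (Dc + v).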

#

(Epistemic confidence: medium-low): I assumed that centering the dataset at (0,0) minimizes the overall amount of l1 activations needed to make a reconstruction, and therefore this centering would be optimal. But I might be wrong about this

bitter turtle Jul 5, 2023, 4:10 PM

#

bronze wraith (Epistemic confidence: medium-low): I assumed that centering the dataset at (0,0...

I'm not sure this is true for the MLP activation case, where the activation mean is far from the origin. If the toy model assumptions are true, you have a solution with ~zero (ignoring noise) l2 reconstruction loss and 1/k l1 loss for k-sparse activations when the data is not centered.

bitter turtle Jul 5, 2023, 4:11 PM

#

bronze wraith Let me clarify why i said that: fix some bias v in the decoder. Let's say we: 1....

agree with this tho

bitter turtle Jul 5, 2023, 4:11 PM

#

bitter turtle I've found a paper discussing a GPU implementation of matching pursuit (~the spa...

also: https://github.com/ariellubonja/omp-parallel-gpu-python

GitHub

GitHub - ariellubonja/omp-parallel-gpu-python: Orthogonal Matching ...

Orthogonal Matching Pursuit, implemented in a parallelized way - GitHub - ariellubonja/omp-parallel-gpu-python: Orthogonal Matching Pursuit, implemented in a parallelized way

keen pivot Jul 5, 2023, 4:15 PM

#

bitter turtle oh, you mean the model uses GELU instead of ReLU and the decoder devotes feature...

Almost! The decoder doesn't devote features, but for a given feature, some of the weights are negative to account for this. Maybe that's ideal. idk

keen pivot Jul 5, 2023, 4:17 PM

#

keen pivot Almost! The decoder doesn't devote features, but for a given feature, some of th...

Here's a hist of weight*max-activation for that feature. The minimal negative value in-distribution is -0.17 for the neuron basis because of the GeLU
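The -0.17 floor can be checked directly from the GELU definition. This sketch assumes the exact erf-based form x * Phi(x); a tanh approximation would give a very slightly different number. The grid search is just for illustration.

```python
import math

def gelu(x):
    """Exact GELU: x times the standard normal CDF at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

# Scan x in [-5, 0] in steps of 0.001; GELU is non-negative for x >= 0.
grid = [i / 1000.0 for i in range(-5000, 1)]
x_min = min(grid, key=gelu)
gelu_min = gelu(x_min)  # about -0.17, reached near x = -0.75
```

So any post-GeLU neuron activation is bounded below by roughly -0.17, which is the small negative range the decoder weights have to account for.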

bronze wraith Jul 5, 2023, 4:17 PM

#

Oh, sorry one more thing about having a decoder with a bias: you can implement them in autoencoders w/o bias, though there is an L1 penalty. In particular, since the encoder has a bias, you have your encoder learn a feature that always activates exactly 1, and the dictionary element corresponding to that will be your bias. There is an l1 cost since you're always activating that internal feature, but if we specified some features as not having an L1 penalty, this would bypass that (i.e. instead of our L1 penalty being ||y||_1, we make it ||Py||_1, where P is a diagonal matrix of 0s and 1s).
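A small sketch of that construction: append one feature with zero encoder weights and encoder bias 1 (so it always activates exactly 1), use the bias vector as its dictionary element, and mask it out of the L1 term with a 0/1 diagonal P. The weights are random toy values and every name here is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 16, 5, 7
x = rng.normal(size=(n, d))
W, b = rng.normal(size=(m, d)), rng.normal(size=m)
D, v = rng.normal(size=(m, d)), rng.normal(size=d)

# Reference: autoencoder with an explicit decoder bias v.
codes = np.maximum(0.0, x @ W.T + b)
recon_with_bias = codes @ D + v

# Bias-free version with one extra always-on feature.
W_aug = np.vstack([W, np.zeros((1, d))])  # zero weights: ignores the input
b_aug = np.append(b, 1.0)                 # ReLU(0 + 1) = 1 on every input
D_aug = np.vstack([D, v])                 # its dictionary element is the bias
codes_aug = np.maximum(0.0, x @ W_aug.T + b_aug)
recon_aug = codes_aug @ D_aug

# Masked L1 penalty ||P y||_1: P zeroes out the always-on feature.
P = np.diag(np.append(np.ones(m), 0.0))
l1_masked = np.abs(codes_aug @ P).sum()
l1_plain = np.abs(codes).sum()
```

With the mask, both the reconstruction and the L1 term match the explicit-bias version; without it, the extra feature would add a constant 1 per example to the penalty.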

bitter turtle Jul 5, 2023, 4:21 PM

#

bronze wraith One possible reason they'd come to different results is the L1 penalty, which is...

it does have a minimum required sparsity level though afaict, which might turn out to be equivalent? who knows. will be fun to implement.

bronze wraith Jul 5, 2023, 4:22 PM

#

bitter turtle it does have a minimum required sparsity level though afaict, which might turn o...

Yeah, right now I'm running tests to compare speed/reconstruction accuracy/mmcs between SA and kSVD on the toy data. I think I'll have results today!

bitter turtle Jul 5, 2023, 4:22 PM

#

sick

keen pivot Jul 5, 2023, 4:24 PM

#

@bronze wraith , could you give a concrete example of a feature that may be spread across two MLP layers in a Transformer? This is based off the "concatenate multiple layer's activations as input" which both you (& Neel) brought up.

bronze wraith Jul 5, 2023, 4:35 PM

#

keen pivot @bronze wraith , could you give a concrete example of a feature that may ...

How about "BERT Rediscovers the Classical NLP Pipeline" (https://arxiv.org/pdf/1905.05950.pdf )? They used linear probes on the internal activations of BERT to try to extract stuff like "part of speech", and find that this info is spread across multiple layers. E.g. this part of section 3.2:

> We would like to estimate at which layer in the
> encoder a target (s1, s2, label) can be correctly
> predicted... A
> naive classifier at a single layer cannot either, because information about a particular span may be
> spread out across several layers, and as observed
> in Peters et al. (2018b) the encoder may choose to
> discard information at higher layers.
(emphasis added; link to Peters et al, which I haven't read: https://aclanthology.org/N18-1202.pdf). In the attached image (part of Figure 2), the blue bars are showing which layers are important for correctly probing part of speech, and you can see that info is spread across several layers

Screenshot_2023-07-05_at_12.31.17_PM.png

keen pivot Jul 5, 2023, 4:53 PM

#

Made my much better post! https://www.lesswrong.com/posts/wqRqb7h6ZC48iDgfK/tentatively-found-600-monosemantic-features-in-a-small-lm
Update from last: I coded the quantile thing wrong, so now it does look like ~ 60 neurons (still maybe)

Also: Compared w/ the identity dictionary with specific feature seeming to represent only 1 neuron

(tentatively) Found 600+ Monosemantic Features in a Small LM Using ...

Using a sparse autoencoder, I present evidence that the resulting decoder (aka "dictionary") learned 600+ features for Pythia-70M layer_2's MLP (after the GeLU), although I expect around 8k-16k featu…

keen pivot Jul 5, 2023, 5:34 PM

#

bronze wraith How about "BERT Rediscovers the Classical NLP Pipeline" (https://arxiv.org/pdf/1...

Looks like pretty strong evidence there, especially the K_delta, thanks!

I do think then that there will be features more simply represented across multiple layers & doing it only by each layer will have the same representational capacity, but require multiple individual features to do so.

keen pivot Jul 5, 2023, 6:38 PM

#

One problem noticed when training the dictionaries at different layers is that later layers learned way less features & had more dead features (e.g. even 50% of 2k). Data:
Dead neurons (layer 1,3,4,5) (skipped 2 because I already had a dict for it when I trained, but trained on different data, so not including it):
0, 1.7k, 1.3k, 1.2k (out of 2k)

High-MCS features:
46%, 4%, 14%, 13% (out of 2k)
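For clarity on what is being counted, a minimal sketch of the two statistics used in this discussion: the number of dead features (never active over a dataset) and the mean number of active features per datapoint. The function name, the activity threshold, and the toy codes are assumptions, not the actual logging code.

```python
import numpy as np

def dead_and_density(codes, eps=0.0):
    """codes: (n_datapoints, n_features) dictionary activations.

    Returns (number of features never above eps, mean active features per datapoint).
    """
    active = codes > eps
    n_dead = int((~active.any(axis=0)).sum())
    mean_active = float(active.sum(axis=1).mean())
    return n_dead, mean_active

codes = np.array([[0.0, 1.2, 0.0, 0.3],
                  [0.0, 0.0, 0.0, 0.5],
                  [0.0, 0.7, 0.0, 0.0]])
n_dead, mean_active = dead_and_density(codes)  # features 0 and 2 never fire
```

A feature only counts as dead relative to the data it was evaluated on, so the count can change if the evaluation set differs from the training set.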

#

Possible explanations:

Other dictionaries need a different l1 value (not likely imo. I swept l1 values from 1e-5 to 1e-3 & MCS > 0.9 features plummeted for 1e-3)

Note: it is max for 3e-05, but I expect that's because of learning identity because features/token = 700.

#

There are many more features in early layers (focused on grouping tokens & re-tokenizing certain pairs) than middle layers (higher level features?) and later layers (re-tokenizing features). The dead neurons are caused by iterating over the same dataset. Evidence for this here: https://wandb.ai//sparse_coding/sparse coding/reports/Layer-3--Vmlldzo0ODA5ODA1?accessToken=w6c775vurw0wtu80n3617dl41ts9wvqjfnm41uc4xj1pi0yc7vd009lu4ntu8zvr

W&B

Layer 3

This is Pythia layer 3 mlp trained on 60GB repeated 5 times.
Note that the dead neurons are 0 after the first run & the far right l1 value stays ~ the same for MCS features >0.9.

#

If (2), then just running mini-runs w/ fresh data should show increasing number of features in layer 3 at that l1 value. I think I do a fresh-data comparison for layer-1, so I can check that now at least.

Update: can't tell. layer-1 has 0 dead neurons even when you repeat data, so can't tell generalization behavior to layer-3, which is the one I linked. Additionally, this is the layer that when trained on new data goes up to 40% (MCS above 0.9), then down to 3%. This also needs to be explained.

keen pivot Jul 5, 2023, 6:59 PM

#

keen pivot If (2), then just running mini-runs w/ fresh data should show increasing number ...

Possible explanation: The smaller & larger model are simply learning different features because there are so many, especially at such an early layer & w/ so much data.

Additionally may be a thing where dicts of different sizes are biased to learn different features (which was brought up before here), so a previously proposed test was to learn two dicts of same size w/ different initializations.

Additionally, I could look at the features learned by the dictionary. If low-MCS features are meaningful, then seems true.

bronze wraith Jul 5, 2023, 7:41 PM

#

Update on ksvd: when I compared it head-to-head with sparse autoencoders on the toy data, it failed to reconstruct the original features. The MMCS was something like .266 instead of .999 for the sparse autoencoders. Tomorrow I'll try tinkering with it to see if there's a fix!

keen pivot Jul 5, 2023, 7:55 PM

#

@bitter turtle , I'm going to get to it soon (like tomorrow or the next?), but if you'd like to see if the currently learned features across layers have meaningful connections/circuits, it'd be great to have another set of eyes here.

keen pivot Jul 5, 2023, 8:01 PM

#

pallid current still not sure why me and logan are getting different interp results but i am fi...

Glad it's working!:)

I'm curious how well you or I'd do on the task we're giving GPT-4/3.5. Like number normalized activations given a hypothesis (or form the hypothesis). How hard would it be to set that up?

bitter turtle Jul 5, 2023, 8:19 PM

#

keen pivot @bitter turtle , I'm going to get to it soon (like tomorrow or the next?)...

Sure can do

#

Just send over the dicts you're looking at ig

bitter turtle Jul 5, 2023, 8:38 PM

#

bronze wraith Update on ksvd: when I compared it head-to-head with sparse autoencoders on the ...

Have you checked the dictionaries K-SVD is learning? OMP learns positive and negative entries for the sparse dictionaries; it might be something to do with this maybe.

bronze wraith Jul 5, 2023, 8:41 PM

#

bitter turtle Have you checked the dictionaries K-SVD is learning? OMP learns positive and neg...

I think you're right, it's using sparse (positive and negative) combinations, instead of sparse positive combinations, and the issue is probably there. I'll try looking into positive versions of ksvd or see if there's a workaround. Oh, and one thing I should do is look at absolute value of cosine similarity, because it might be learning -1*feature (which would be great but would be ignored by mmcs)
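A toy illustration of the sign issue: if a learned dictionary recovers every ground-truth feature but with flipped sign, plain MMCS scores it as a failure while the absolute-value version scores it as perfect. MMCS is taken here to mean the average, over ground-truth features, of the max cosine similarity to any learned feature; that definition, the function name, and the identity-matrix ground truth are assumptions.

```python
import numpy as np

def mmcs(learned, ground_truth, use_abs=False):
    """Mean over ground-truth rows of the max cosine similarity to any learned row."""
    a = learned / np.linalg.norm(learned, axis=1, keepdims=True)
    b = ground_truth / np.linalg.norm(ground_truth, axis=1, keepdims=True)
    sims = b @ a.T
    if use_abs:
        sims = np.abs(sims)  # treat f and -f as the same direction
    return float(sims.max(axis=1).mean())

truth = np.eye(4)  # four orthogonal ground-truth features
flipped = -truth   # every feature recovered with the wrong sign

signed_score = mmcs(flipped, truth)                  # sign flips are not credited
unsigned_score = mmcs(flipped, truth, use_abs=True)  # sign flips are credited
```

Here the signed score is 0 and the absolute-value score is 1, so comparing the two on the k-SVD output would show whether sign flips explain part of the low MMCS.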

bitter turtle Jul 5, 2023, 8:43 PM

#

bronze wraith I think you're right its using sparse (positive and negative) combinations, inst...

yeah there's a thing called 'nonnegative OMP', i think scikit-learn probably has an implementation

#

ok @bronze wraith cancel that it doesn't, here's an implementation: https://github.com/davebiagioni/pyomp

GitHub

GitHub - davebiagioni/pyomp: Orthogonal Matching Pursuit (Python)

Orthogonal Matching Pursuit (Python). Contribute to davebiagioni/pyomp development by creating an account on GitHub.

#

just replace the calls to orthogonal_mp_gram with calls to this in the ksvd thingy

pallid current Jul 5, 2023, 10:08 PM

#

more results from autointerp, this time on gpt2 small, still the neuron basis totally failing to beat the random baseline but much higher scores overall. i only recently noticed they use layer 10 of gpt-2 small for their autointerp comparison so i'll run that next

#

it's not clear why they used layer 10 of small when the rest uses XL, might well be cherry picked

bitter turtle Jul 5, 2023, 10:32 PM

#

Gah, I've spent the entire day trying to implement a nonnegative version of K-SVD for the GPU/PyTorch, only to learn that most non-negative least squares algorithms are like deeply not designed for GPUs. Some people have written CUDA kernels for them, and I could do it with GD, but I'm probably going to throw in the towel for the moment and try and do more useful things tomorrow.

#

If @pallid current could set me up with the 8xA40 rig maybe I could get to setting up some generic parallelised code for the big sweep?

keen pivot Jul 5, 2023, 10:40 PM

#

Checking the claim "Low-MCS features are meaningful":
Layer 1: maybe slight correlation w/ MCS & meaningfulness by Logan's standard. Important difference here is layer 1 has ~50% features learned. Also, the low-MCS features felt lamer. https://docs.google.com/spreadsheets/d/1BnFaqn8W9aM1rlosiYFG64VVeeYQNJWFYM-Qt4qj-0o/edit?usp=sharing

Layer 2: (in the LW post, clear correlation, but dominated by dead features)

Layer 4: Also dominated by dead features, but clearer correlation w/ MCS. For the top-MCS features, 75% monosemanticity, whereas low-MCS features are 25%.
https://docs.google.com/spreadsheets/d/1DaPl4sm7KvKr2eVtf2DGSFXLWQyxZaommXm6F1uSnRw/edit?usp=sharing

Google Docs

Layer 1 Features

Sheet1

Link to autoencoder: ? (need to double-check w/ wandb)
dict id,Id,MCS,Feature,Monosemantic? ,autointerp expl,auto interp score,autointerp match,Note
1910,0,1,=" especially after ref-type,1,elements of physical addresses and names.,-0.03,0
1597,1,0.99,big spaces then usepackage,0,the word...

Google Docs

Layer 4 Features

Sheet1

Id,MCS,Feature,Monosemantic? ,Note
0,0.99,Begining & end of first sentence (like a header?),1,0.7619047619
1,0.99,grammar? Uhhh???,0,1/5 tokens activate it, predominately 3 neurons
2,0.99,grammar? Uhhh???,0,1/2 tokens activate it, predominately 2 neurons
3,0.99,Opening ( for english, but ...

#

I also may have noticed a pattern in feature vectors that make sense vs those that don't. Ones that don't have a fairly symmetric weight distribution, whereas "real" features have longer tails.

~~I could show this by plotting MCS by mean & std of the weights.~~, lol nope. No correlation at all.

I believe this will be mostly correct, but misleading because different neurons have different activation distributions, so I'm plotting weight histograms by quantile instead. I'll also need to multiply the weight by the max-activation of that neuron I think.

bitter turtle Jul 5, 2023, 10:58 PM

#

bitter turtle Gah, I've spent the entire day trying to implement a nonnegative version of K-SV...

Actually cancel this I can probably just use LASSO and hope for the best. Same number of hyperparams to tune so it's probably *fine*.
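
A rough sketch of what a nonnegative LASSO step could look like if done with plain gradient steps: proximal (ISTA-style) gradient descent with a projection onto c >= 0. This is not K-SVD, not nonnegative OMP, and not any library's implementation; the function name, the row-per-atom dictionary layout, and the fixed step count are assumptions.

```python
import numpy as np

def nonneg_lasso_codes(x, D, l1, n_steps=200):
    """Approximately minimise 0.5*||x - c @ D||^2 + l1*sum(c) subject to c >= 0.

    x: (d,) signal; D: (n_atoms, d) dictionary with one atom per row (assumed layout).
    """
    step = 1.0 / np.linalg.norm(D, 2) ** 2     # 1 / Lipschitz constant of the gradient
    c = np.zeros(D.shape[0])
    for _ in range(n_steps):
        grad = (c @ D - x) @ D.T               # gradient of the reconstruction term
        c = np.maximum(0.0, c - step * (grad + l1))   # shrink by l1, then project onto c >= 0
    return c

codes = nonneg_lasso_codes(np.array([1.0, -1.0, 0.5]), np.eye(3), l1=0.1)
```

Everything here is matrix products and elementwise ops, so the same loop should batch over many signals on a GPU without custom kernels.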

#

God I am falling down so many rabbit holes for different sparse factorization mechanisms.

pallid current Jul 6, 2023, 8:24 AM

#

more graph posting, here are the runs from layer 10 gptsmall, this one really should match the results in the direction-finding part of the openai autointerp paper. instead the results are much better, but also dont show the same size of difference between neurons and random directions

#

#

@bitter turtle let me know when you're free today and i'll get you set up on the eleuther compute, i'll be in the office in about an hour

bitter turtle Jul 6, 2023, 8:29 AM

#

cool

#

free most times

bitter turtle Jul 6, 2023, 10:08 AM

#

pallid current @bitter turtle let me know when you're free today and i'll get you set up...

@pallid current are you available to do this

pallid current Jul 6, 2023, 10:24 AM

#

Hey Aidan, sorry had a meeting earlier so stayed home for a bit, on the train in now, will be there around 12 o'clock

keen pivot Jul 6, 2023, 12:38 PM

#

pallid current more graph posting, here are the runs from layer 10 gptsmall, this one really sh...

Here, the neuron-basis has a median(?) score of like .29 & random is .23, and in the paper it's 0.15 & 0.037?

pallid current Jul 6, 2023, 12:39 PM

#

keen pivot Here, the neuron-basis has a median(?) score of like .29 & random is .23, and in...

mean not median, and 0.06 for random not 0.03 but essentially yes

keen pivot Jul 6, 2023, 12:42 PM

#

Gotcha. Where are you getting 0.06 from?

#

(Like a ctrl-f in the paper)

pallid current Jul 6, 2023, 12:43 PM

#

just before the graph in https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html#sec-direction-finding

#

'We find that the average top-and-random score after 10 iterations is 0.718, substantially higher than the average score for random neurons in this layer (0.147), and higher than the average score for random directions before any optimization (0.061).'

keen pivot Jul 6, 2023, 12:44 PM

#

Isn't that optimizing a direction that's explainable?

pallid current Jul 6, 2023, 12:45 PM

#

0.7 is post optimizing, 0.147 is neuron basis, 0.061 is random direction

keen pivot Jul 6, 2023, 12:45 PM

#

Thanks for explaining:)

#

So it looks like you're finding really awesome random directions then, lol

#

Like this is their random-only scoring.

keen pivot Jul 6, 2023, 12:49 PM

#

pallid current

Is it legal for this to go below 0?

pallid current Jul 6, 2023, 12:50 PM

#

keen pivot Like this is their random-only scoring.

careful, it's confusing, that's random-only scoring which means that they evaluate their ability to predict the activations on random samples

#

not that it's a random direction

#

and also that's layer 10 for gpt2 XL, not small

pallid current Jul 6, 2023, 12:51 PM

#

keen pivot Is it legal for this to go below 0?

but yeah it is possible to go <0, im not sure why they cut it off
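
A small illustration of why the score can dip below zero, assuming (this is my reading of the autointerp setup, not something stated in the thread) that the score is a correlation between real and simulated activations. The helper name and the toy numbers are hypothetical.

```python
import math

def correlation_score(real, simulated):
    """Pearson correlation between real and simulated activations."""
    n = len(real)
    mean_r = sum(real) / n
    mean_s = sum(simulated) / n
    cov = sum((r - mean_r) * (s - mean_s) for r, s in zip(real, simulated))
    var_r = sum((r - mean_r) ** 2 for r in real)
    var_s = sum((s - mean_s) ** 2 for s in simulated)
    return cov / math.sqrt(var_r * var_s)

real = [0.0, 1.0, 2.0, 3.0]
good = correlation_score(real, [0.1, 0.9, 2.2, 2.8])   # explanation tracks the activations
bad = correlation_score(real, [3.0, 2.0, 1.0, 0.0])    # explanation is anti-correlated
```

An explanation that predicts high activations exactly where the real ones are low gets a negative score, so clipping at 0 is a plotting choice rather than a property of the metric.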

keen pivot Jul 6, 2023, 12:54 PM

#

pallid current and also that's layer 10 for gpt2 XL, not small

Lol. Okay, read the paper some more and got it!

#

Have you tried looking at the random directions that scored highly yourself?

pallid current Jul 6, 2023, 1:05 PM

#

well i've looked at the activations that i feed into gpt-4 and those make sense, and also match up to me to the feature activations i got out when i got the activations on their own

#

my guess for why i'm seeing surprisingly high scores is that i'm working with 50k fragments from openwebtext, but those fragments are from the very beginning of the corpus, which means there are multiple fragments from the same bit of text

#

and so the top-scoring fragments in the validation set are quite likely to come from the same paragraph as the top-scoring fragments in the train set, which makes the task quite a lot easier

#

changing it now to only take max 1 fragment from each sentence. i think sentences are uncorrelated but will check
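
A minimal sketch of that kind of de-duplication, keeping at most one fragment per source unit so that train and validation fragments cannot come from the same passage. The `(source_id, text)` layout and the function name are hypothetical; the real pipeline may track provenance differently.

```python
import random

def one_fragment_per_source(fragments, seed=0):
    """Keep at most one fragment per source id.

    fragments: list of (source_id, text) pairs; source_id is whatever unit
    should not be shared between train and validation (hypothetical field).
    """
    rng = random.Random(seed)
    shuffled = fragments[:]
    rng.shuffle(shuffled)                  # avoid always keeping the first fragment
    kept = {}
    for source_id, text in shuffled:
        kept.setdefault(source_id, text)   # first fragment seen for this source wins
    return sorted(kept.items())

fragments = [(0, "a1"), (0, "a2"), (1, "b1"), (2, "c1"), (2, "c2")]
deduped = one_fragment_per_source(fragments)
```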

keen pivot Jul 6, 2023, 1:47 PM

#

pallid current and so the top-scoring fragments in the validation set are quite likely to come ...

But would that help for random directions? I would expect a random direction to represent many features as you move along it, so it would only do well when predicting other high-activating examples, and fail terribly for random-activating ones.

pallid current Jul 6, 2023, 1:49 PM

#

keen pivot But would that help for random directions? I would expect a random direction to ...

yeah i think something like this is going on, but we're doing random-and-top scoring and i guess most of the score is coming from top

#

(i think neuron basis is also pretty bad for random samples)

bitter turtle Jul 6, 2023, 1:58 PM

#

Just a heads-up: pbbly gonna switch to safetensors for the big sweep, it allows memory-mapped/'lazy' tensors which is pretty important for not eating memory when we're doing parallel runs

pallid current Jul 6, 2023, 1:58 PM

#

ok sounds good

bitter turtle Jul 6, 2023, 2:14 PM

#

hmm i guess you might not need to idk yet

pallid current Jul 6, 2023, 2:22 PM

#

i dont know anything about safetensors but seems like it's on the up so interested to play around with it

pallid current Jul 6, 2023, 2:58 PM

#

pallid current changing it now to only take max 1 fragment from each sentence. i think sentence...

this solves the problem of surprisingly high scores. unfortunately, it also makes the neuron basis completely useless, so i guess all previous results are spurious and still work to do. ah well.

keen pivot Jul 6, 2023, 5:00 PM

#

https://docs.google.com/spreadsheets/d/1BnFaqn8W9aM1rlosiYFG64VVeeYQNJWFYM-Qt4qj-0o/edit?usp=sharing

Looking at the features of layer 1 across training (for 2k dictionary). My understanding of feature tracks w/ self-similarity.

#

Green means monosemantic, yellow means idk, red means polysemantic
There are 3/6 examples that mostly stay the same feature.
2/6 appear to represent meaningful, monosemantic features, but change over time.

1/6 might represent some meaningful features; it's unclear, but it does clearly change meaning across time

keen pivot Jul 6, 2023, 6:42 PM

#

For the 4k dict, I checked how many features have a self-similarity of 0.9 (after 60GB of data), which then dropped below 0.6:
"The number of vectors that reach at least 0.9 and then drop below 0.6 is 1707."

So plotting all of them, there are 20-40% of features that "changed" (though I haven't proved that high self-similarity implies a meaningful, monosemantic feature)

#

For the 2k dict, it's like 600/2k, so 25% of features that change drastically.

#

One solution may be to train the dict again, but if a feature's self-sim> 0.9 after each mini-run, we learn a gradient mask (Hoagy's suggestion) to keep those features.

This also lends itself to a stopping point: if all features are frozen (because they were >0.9 self sim after a minirun), then we're done.
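
A sketch of the bookkeeping described above: counting features whose self-similarity reaches 0.9 and later falls below 0.6, and building the freeze mask with its stopping rule. The `history[t][i]` layout and both function names are assumptions; applying the mask to gradients is left out.

```python
def count_rise_then_drop(history, high=0.9, low=0.6):
    """Count features whose self-similarity reaches `high` and later falls below `low`.

    history: list of per-checkpoint lists, history[t][i] = self-similarity of
    feature i at checkpoint t (hypothetical layout).
    """
    n_features = len(history[0])
    count = 0
    for i in range(n_features):
        reached_high = False
        for sims in history:
            if sims[i] >= high:
                reached_high = True
            elif reached_high and sims[i] < low:
                count += 1
                break
    return count

def frozen_mask(latest_sims, threshold=0.9):
    """True for features that would be frozen (gradient masked out) in the next mini-run."""
    return [s > threshold for s in latest_sims]

history = [
    [0.95, 0.50, 0.92],
    [0.97, 0.70, 0.55],   # feature 2 was above 0.9, now below 0.6
    [0.96, 0.93, 0.80],
]
changed = count_rise_then_drop(history)
mask = frozen_mask(history[-1])
done = all(mask)          # stopping rule: every feature frozen
```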

keen pivot Jul 6, 2023, 7:46 PM

#

On the topic of "Other ways to do dictionary learning", this suggests an adaptive l1 parameter. Additionally, the previous section discusses using fast ISTA (FISTA) for the alternating updates (like K-SVD does), which lecun papers also use.

#

paper link: https://arxiv.org/pdf/2108.11730.pdf

keen pivot Jul 6, 2023, 9:32 PM

#

I was trying to figure out if this sentence related to the "an" feature, and the translation was just a bit surprising.

Turns out though that "ad" is italian for "to", but only when the following word starts w/ a vowel (else it's "a"). So very much like "an"!

#

One big confusion I have is that the max-activating logit diff doesn't ever make sense, even in the last layer, where I'm just directly unembedding.

For example, in distribution, the "an" feature affects the log-prob prediction of vowel-starting-words, but only 1/30 max-diff tokens start w/ vowels, and the one that does is "ural", with no beginning space, so doesn't really count (I think, though I'd count "est" in "c'est")

keen pivot Jul 6, 2023, 10:07 PM

#

Note: Transformer Lens uses twice the GPU memory of just loading in the model normally. I think baukit is the way to go for larger models & we'll just need to multiply by GeLU for the activations.

keen pivot Jul 7, 2023, 3:59 AM

#

Oh yes! Oh oh! ahaha Oh...yes

bitter turtle Jul 7, 2023, 10:43 AM

#

right got some good parallelized training code set up, it's a bit weird and finicky to use since I wanted to do vectorized model ensembling and that doesn't vibe well with multiprocessing/shared memory sometimes so I came up with a bunch of hacks to get around it

#

brief walkthrough of current system if you want to use it (code on my gh)
current setup is you write some functions like this

import torch
import torch.nn as nn
import torch.nn.functional as F

class FunctionalSAE:
    @staticmethod
    def init(activation_size, n_dict_components, l1_alpha, bias_decay=0.0, device=None):
        params = {}
        buffers = {}

        # encoder and decoder both have shape (n_dict_components, activation_size)
        params["encoder"] = torch.empty((n_dict_components, activation_size), device=device)
        nn.init.orthogonal_(params["encoder"])

        # torch.empty leaves memory uninitialised, so start the bias at zero
        params["encoder_bias"] = torch.zeros((n_dict_components,), device=device)

        params["decoder"] = torch.empty((n_dict_components, activation_size), device=device)
        nn.init.orthogonal_(params["decoder"])

        buffers["l1_alpha"] = torch.tensor(l1_alpha, device=device)
        buffers["bias_decay"] = torch.tensor(bias_decay, device=device)

        return params, buffers

    @staticmethod
    def loss(params, buffers, batch):
        # batch: (b, d) -> feature activations c: (b, n)
        c = torch.einsum("nd,bd->bn", params["encoder"], batch)
        c = c + params["encoder_bias"]
        c = F.relu(c)

        # unit-normalise each dictionary atom (one atom per row of the decoder)
        normed_weights = F.normalize(params["decoder"], dim=1)

        x_hat = torch.einsum("nd,bn->bd", normed_weights, c)

        l_reconstruction = F.mse_loss(x_hat, batch)
        l_l1 = buffers["l1_alpha"] * torch.norm(c, 1, dim=1).mean()
        # bias decay is an L2 penalty on the encoder bias
        l_bias_decay = buffers["bias_decay"] * torch.norm(params["encoder_bias"], 2)

        return l_reconstruction + l_l1 + l_bias_decay, (c, l_reconstruction, l_l1)

and make a bunch of instances of it like:

all_models = []
for dict_size in dict_sizes:
    models = [FunctionalSAE.init(activation_size, dict_size, l1_alpha) for l1_alpha in l1_alphas]
    all_models.append(models)

then, vectorize all the models with the same internal dimensions with FunctionalEnsemble (each ensemble on a different GPU) and send it to dispatch_on_chunk to do all the multiprocessing

#

it's weird but it works and ensembling is good

#

especially for our small dict sizes

keen pivot Jul 7, 2023, 1:50 PM

#

@manic wind Did you ever work on your idea of contrastive learning of features and their effect on the output?

bitter turtle Jul 7, 2023, 2:08 PM

#

Ooh, what was that

keen pivot Jul 7, 2023, 2:21 PM

#

If you really want to do good model editing based off internals, it makes sense to just directly learn which neurons have which effect on the output.

So in this case, optimize for encodings of neuron activations and their respective logits to be similar.

manic wind Jul 7, 2023, 2:35 PM

#

keen pivot @manic wind Did you ever work on your idea of contrastive learning of ...

Ah no, ended up working on causal models and relating them to transformers, but would be interested if there's any ideas with the contrastive learning. It was just an idea, didn't do anything towards implementing

keen pivot Jul 7, 2023, 2:38 PM

#

manic wind Ah no, ended up working on causal models and relating them to transformers, but ...

I think this’d be a great project to at least try.

I could get some initial results and then try to find contrastive learning people to work on it.

bitter turtle Jul 7, 2023, 2:40 PM

#

keen pivot If you really want to do good model editing based off internals, it makes sense ...

what, so, find directions that correspond to max logit change or something?

manic wind Jul 7, 2023, 2:41 PM

#

keen pivot I think this’d be a great project to at least try. I could get some initial res...

Go for it. I unfortunately don't have time to help out, most likely

#

But would be excited to know if it has any hope

bitter turtle Jul 7, 2023, 2:43 PM

#

bitter turtle what, so, find directions that correspond to max logit change or something?

isn't that like, one cross-covariance matrix calculation (assuming ~linearity)

keen pivot Jul 7, 2023, 2:43 PM

#

bitter turtle what, so, find directions that correspond to max logit change or something?

I don’t think so. For images and captions, there’s structure to what’s an image of a cat that will link all cat things in the latent direction.

Here, we’d be learning which neuron directions correspond to which changes in logits.

bitter turtle Jul 7, 2023, 2:44 PM

#

I guess that's kind of what I meant maybe slightly

keen pivot Jul 7, 2023, 2:44 PM

#

bitter turtle isn't that like, one cross-covariance matrix calculation (assuming ~linearity)

Ya, I’m unsure on the current literature what people tend to do. Linearity of input might not work here.

bitter turtle Jul 7, 2023, 2:45 PM

#

anyway on a different tack what hyperparameter ranges should we look at; I am pretty close to being able to start off a run on the pod @pallid current @bronze wraith

keen pivot Jul 7, 2023, 2:45 PM

#

For which setting?

bitter turtle Jul 7, 2023, 2:45 PM

#

keen pivot Ya, I’m unsure on the current literature what people tend to do. Linearity of in...

what literature?

keen pivot Jul 7, 2023, 2:46 PM

#

bitter turtle what literature?

Contrastive learning

bitter turtle Jul 7, 2023, 2:46 PM

#

keen pivot For which setting?

all of them, predominantly dict ratio, l1 coef, l2 bias weight decay coef

bitter turtle Jul 7, 2023, 2:47 PM

#

keen pivot Contrastive learning

that seems unspecific

#

like sparse feature extraction is a type of contrastive learning

keen pivot Jul 7, 2023, 2:49 PM

#

bitter turtle all of them, predominantly dict ratio, l1 coef, l2 bias weight decay coef

Typically 3e-4 works for the Pythia models so around there? (Though I think a little higher may be better)

Not more than 0.1, not less than 1e-6

bitter turtle Jul 7, 2023, 2:50 PM

#

cool

keen pivot Jul 7, 2023, 2:50 PM

#

The thing to look at here is the features/datapoint. Shouldn’t be <1 or >100
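
A quick sketch of the features/datapoint check, i.e. the average number of active dictionary features per input. The function name and the `eps` threshold are assumptions; the 1 and 100 bounds are the ones suggested above.

```python
import numpy as np

def active_features_per_datapoint(codes, eps=0.0):
    """Mean number of dictionary features with activation > eps per datapoint.

    codes: (batch, n_dict) array of post-ReLU feature activations.
    """
    return float((codes > eps).sum(axis=1).mean())

codes = np.array([[0.0, 1.2, 0.0, 0.3],
                  [0.0, 0.0, 0.0, 2.0]])
l0 = active_features_per_datapoint(codes)
in_range = 1 <= l0 <= 100      # sanity bounds from the discussion
```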

manic wind Jul 7, 2023, 2:50 PM

#

I was considering using a package for a particular contrastive learning method called CEBRA that was actually designed for neuroscience experiments. That may or may not be the right move though

keen pivot Jul 7, 2023, 2:50 PM

#

Totally don’t know about weight decay

bitter turtle Jul 7, 2023, 2:51 PM

#

manic wind I was considering using a package for a particular contrastive learning method c...

what is the time-series you're considering here?

bitter turtle Jul 7, 2023, 2:51 PM

#

keen pivot Totally don’t know about weight decay

yeah im probably only going to look at 2-3 decay settings (including 0). If it is good, we can look more into it later

keen pivot Jul 7, 2023, 2:52 PM

#

Weight decay is kind of weird because of the distribution of neuron activations, but I agree that something like this should work

bitter turtle Jul 7, 2023, 2:54 PM

#

Oh, also, does anyone know if the encoder turns out to be some scaling of the transpose of the decoder? it seems reasonable that it should, and i'm wondering if we can get away with doing (x * D.T) @ batch instead of E @ batch

manic wind Jul 7, 2023, 2:55 PM

#

bitter turtle what is the time-series you're considering here?

The thought was to replace time with logit values or something like that. It really depends on the details of what you want

bitter turtle Jul 7, 2023, 2:55 PM

#

could you explain more?

#

maybe in a different channel

manic wind Jul 7, 2023, 2:56 PM

#

Sure! I actually don't have time now but can explain later if you start a channel (or DM)

keen pivot Jul 7, 2023, 3:00 PM

#

The median activation of neurons in layer 2 (pythia post GeLU) has a few outliers, w/ the majority being negative because of the GeLU.

Weight decay would interfere w/ this reconstruction, but not if we had the data normalized (I think)

keen pivot Jul 7, 2023, 3:02 PM

#

bitter turtle Oh, also, does anyone know if the encoder turns out to be some scaling of the tr...

Oh ya, a tied encoder/decoder for the linear layer would be interesting. This would get around the problem I have of there not being a specific direction to ablate in the encoder.

keen pivot Jul 7, 2023, 3:07 PM

#

bitter turtle Oh, also, does anyone know if the encoder turns out to be some scaling of the tr...

I don't think so. Just divided one by the other, so the resulting ratio should be a near constant if so. Could've coded it wrong, but both are size 4k,2k & div was same shape

bitter turtle Jul 7, 2023, 3:07 PM

#

keen pivot The median activation of neurons in layer 2 (pythia post GeLU) has a few outlier...

Normalise how here?

bitter turtle Jul 7, 2023, 3:08 PM

#

keen pivot I don't think so. Just divided one by the other, so the resulting ratio should b...

Oh, I meant to include scaling by a vector

#

Like if you normalise the columns/rows or whatever and then compare are they similar

keen pivot Jul 7, 2023, 3:09 PM

#

bitter turtle Normalise how here?

probably pre-GeLU

bitter turtle Jul 7, 2023, 3:09 PM

#

So like take pre-GELU data? Hmm idk might as well residual stream it

bitter turtle Jul 7, 2023, 3:11 PM

#

keen pivot The median activation of neurons in layer 2 (pythia post GeLU) has a few outlier...

Maybe we should just take ReLU of the activations and blindly guess that no useful information is stored in the negatives

keen pivot Jul 7, 2023, 3:11 PM

#

bitter turtle So like take pre-GELU data? Hmm idk might as well residual stream it

Oh I meant normalize pre-GeLU, then apply it. Though maybe normalizing post-GeLU is equivalent

bitter turtle Jul 7, 2023, 3:11 PM

#

Normalise how here?

#

Like, centering, what?

keen pivot Jul 7, 2023, 3:12 PM

#

bitter turtle Maybe we should just take ReLU of the activations and blindly guess that no usef...

Could be tested by doing that & checking the CE loss. I believe Noa Nabeshima did that & did see an effect on like early layers or something

keen pivot Jul 7, 2023, 3:12 PM

#

bitter turtle Like, centering, what?

centered mean & std of 1

#

Though maybe a weight decay & bias in decoder as you mentioned earlier?

bitter turtle Jul 7, 2023, 3:18 PM

#

keen pivot centered mean & std of 1

that seems like not a very good idea for MLP activations, especially since we don't have a bias on the decoder, and the directions probably don't start at the mean of the data. Also not sure what you gain from whitening the data here. Centering makes sense for the residual stream for sure. Also, not sure it makes much sense for pre-GeLU, since it probably severely impacts model performance and we wouldn't be looking at model activations anymore

bitter turtle Jul 7, 2023, 3:18 PM

#

bitter turtle Oh, I meant to include scaling by a vector

Do you have any results on this front

keen pivot Jul 7, 2023, 3:22 PM

#

bitter turtle Do you have any results on this front

Ah, thanks for reminding me!

#

#

And in case I swapped the row & columns

keen pivot Jul 7, 2023, 3:23 PM

#

bitter turtle that seems like not a very good idea for MLP activations, especially since we do...

If the concern is model performance, then you can save the mean/std you use to normalize & then undo it when you do model editing

#

But if the bias on decoder helps, then that may just be easier to do.

bitter turtle Jul 7, 2023, 4:01 PM

#

keen pivot If the concern is model performance, then you can save the mean/std you use to n...

I mean like I highly doubt we'll learn useful sparse dictionaries if we center the MLP activations

#

and train on that

#

since we won't be searching for a dictionary the model 'uses'

#

like, the data is intrinsically not centered

bitter turtle Jul 7, 2023, 4:59 PM

#

keen pivot

hmm

#

maybe look at mean cosine similarity? Like, check if E @ D where they're both normalised has ~ones on the diagonal?
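
A sketch of that check, assuming the encoder and decoder are both stored as `(n_dict, activation_size)` matrices as in the FunctionalSAE snippet earlier. Row-normalising removes any per-feature scale, so values near 1 mean the encoder row is just a rescaled decoder row. The function name and the random test matrices are hypothetical.

```python
import numpy as np

def encoder_decoder_alignment(E, D):
    """Cosine similarity between matching rows of the encoder and decoder.

    E, D: (n_dict, d) arrays. Row-normalising first removes per-feature scaling.
    """
    En = E / np.linalg.norm(E, axis=1, keepdims=True)
    Dn = D / np.linalg.norm(D, axis=1, keepdims=True)
    return np.sum(En * Dn, axis=1)      # diagonal of En @ Dn.T without forming the full matrix

rng = np.random.default_rng(0)
D = rng.normal(size=(5, 8))
scales = np.array([0.5, 1.0, 2.0, 3.0, 4.0])
E_tied = scales[:, None] * D            # encoder = per-feature scaling of the decoder
E_rand = rng.normal(size=(5, 8))
tied = encoder_decoder_alignment(E_tied, D)
untied = encoder_decoder_alignment(E_rand, D)
```

Unlike dividing one matrix by the other elementwise, this is insensitive to a different scale on every feature.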

#

I just don't have any dict/encoder pairs to hand unfortunately:(

bronze wraith Jul 7, 2023, 8:22 PM

#

I'm throwing in the towel with KSVD. I found it worked very well if you know ahead of time how many features there are, but it struggled if you increase that number by just a bit. In the attached diagram, the 10 standard basis vectors were taken as features in 10D space, and we randomly chose 3 to be active at the same time. The horizontal axis is the number of features you told it to find in each OMP search, ranging from 3 (the correct value) to 6. At x=3, it converges in 23 epochs to the correct features, but for x>3 it takes >100 epochs to converge and at 100 epochs the MMACS (mean max absolute cosine sim) is bad. (I ran these til convergence and the MMACS never became great.)
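
For anyone wanting to reproduce the toy setup, a sketch of the data generation as described: 10 standard basis vectors as features, 3 active per sample. The coefficient range, the seed, and the function name are assumptions; the original experiment may have sampled coefficients differently (including negative ones).

```python
import numpy as np

def toy_sparse_data(n_samples, n_features=10, n_active=3, seed=0):
    """Each sample is a combination of `n_active` randomly chosen standard basis vectors."""
    rng = np.random.default_rng(seed)
    features = np.eye(n_features)                            # ground-truth dictionary
    codes = np.zeros((n_samples, n_features))
    for row in codes:                                        # rows are views, so this fills `codes`
        active = rng.choice(n_features, size=n_active, replace=False)
        row[active] = rng.uniform(0.1, 1.0, size=n_active)   # coefficient range is an assumption
    return codes @ features, codes, features

X, codes, features = toy_sparse_data(1000)
```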

keen pivot Jul 7, 2023, 10:08 PM

#

bronze wraith I'm throwing in the towel with KSVD. I found it worked very well if you know ahe...

Thanks for trying! Do you think this is irrecoverable? Generally methods like this would have the same problem?

keen pivot Jul 7, 2023, 10:38 PM

#

#

One reason we may want a changing l1 value is because larger dictionaries have high feature-activations/datapoint.

We could instead vary the l1-set point: start it low & increase it until the feature-activations/datapoint drop to at most X (and maybe decrease it if they go below that, but I don't expect that). Then we can just vary this set point when doing parameter sweeping
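
One way the set-point idea could be sketched is a simple multiplicative controller on the l1 coefficient. It assumes a larger l1 gives sparser codes (fewer active features per datapoint); the update rate and the function name are hypothetical, and the clamp bounds just reuse the rough 1e-6 to 0.1 range mentioned earlier in the thread.

```python
def update_l1(l1, features_per_datapoint, target, rate=1.1, l1_min=1e-6, l1_max=0.1):
    """Nudge l1 so that features/datapoint moves toward `target`.

    Assumes a larger l1 makes codes sparser (fewer active features per datapoint).
    """
    if features_per_datapoint > target:
        l1 *= rate        # too many active features: penalise more
    elif features_per_datapoint < target:
        l1 /= rate        # too few active features: relax the penalty
    return min(max(l1, l1_min), l1_max)

raised = update_l1(1e-3, features_per_datapoint=50, target=20)
lowered = update_l1(1e-3, features_per_datapoint=5, target=20)
capped = update_l1(0.1, features_per_datapoint=50, target=20)
```

A sweep would then vary `target` instead of the raw l1 value.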

keen pivot Jul 7, 2023, 11:17 PM

#

@pallid current @bitter turtle @bronze wraith Getting really good results looking at residual stream layer 2 (thanks for pushing residual stream Aidan!)

Like types of features (and ablation effect):

Single token detectors (strong effect on bigrams)
German detectors (strong effect on specific German words)
Words after places (other statistical bi/tri-gram stuff? Unsure)

I'm also getting like 1k 2.5k features.

#

For the image, it's the ablation of text effect. So ablating the location-token makes the next token's feature activation go to 0.

#

Importantly, this has a much stronger & intuitive effect when ablating the direction.

#

Notably I made two changes at once (which I need to untangle)

switched to residual
Added a bias to decoder (which may affect the logit-diff ablation effect)

keen pivot Jul 8, 2023, 3:33 AM

#

Like look at these kids

#

German one

bronze wraith Jul 8, 2023, 11:53 AM

#

keen pivot Thanks for trying! Do you think this is irrecoverable? Generally methods like th...

I'm somewhat confused by these results, because this is literally a textbook algorithm, so it's weird to me that it wouldn't perform particularly well. So either I'm doing something wrong (and aidan and the package implementing this also did that stuff wrong), or there's some magic sauce in sparse autoencoders that I don't understand

keen pivot Jul 8, 2023, 11:57 AM

#

bronze wraith I'm somewhat confused by these results, because this is literally a textbook ...

Gotcha. Is this expected if you don't get the underlying number of features right? I would expect it should work to produce at least a linear decomposition.

#

Just not the underlying linear decomposition in your results

bronze wraith Jul 8, 2023, 12:03 PM

#

keen pivot Gotcha. Is this expected if you don't get the underlying number of features righ...

yeah, it isn't converging to a highly accurate reconstruction (it gets within about 1e-4, but stops improving, whereas with the exact right number of features it gets within 1e-6 quickly and keeps improving). and I could believe that this would recover some features, just not the canonical ones, but if thats the case, why do sparse autoencoders succeed at that task?

keen pivot Jul 8, 2023, 12:12 PM

#

bronze wraith yeah, it isn't converging to a highly accurate reconstruction (it gets within ab...

This is the only mention of K-SVD in that dictionary paper I linked, but it only says it's not useful for large datasets. I would've expected it to work here though.

#

It'd be a weird coding error for it to work at the right number of features and not the others

bronze wraith Jul 8, 2023, 12:13 PM

#

yep!

#

it works shockingly well with the right number of features too!

keen pivot Jul 8, 2023, 2:47 PM

#

Brass tacks though: I think we’re done. Like these results are huge. There’s a few due diligence things to do, better codebase, and applications, but everyday research can handle that.

bitter turtle Jul 8, 2023, 5:22 PM

#

we might want to try a kind of hybrid approach where we use OMP/FISTA for the encoder and just regular optimisation/least-squares regression for the dictionary

#

hoagy was saying Anthropic was looking into that for some reason

#

in a similar vein we could tie the weights of the encoder to the decoder

bitter turtle Jul 8, 2023, 6:46 PM

#

keen pivot Brass tacks though: I think we’re done. Like these results are huge. There’s a f...

This seems an overly strong claim. I'd still like to see

- more thorough analysis of good hyperparameters (inc. bias on decoder, bias norm etc)
- large-scale (auto)interp on (all/sufficiently-representative sample of high MCS features)
- causal scrubbing with learnt features on specific algorithmic tasks
- an analysis of how complete the feature set we learn is (for example, performance when using idk only features > 0.9; might be a bit pointless because it's heavily overcomplete, but could equivalently do 'performance loss when replacing layer X with reconstruction from sparse dictionary')
- also, if features are truly correctly represented by the TMOS model, we should probably have a sparseish covariance matrix when rotated into the top-k features for k<dimensionality of residual stream. not sure we have this

#

like, ideally, we should be able to identify circuits using this, and we haven't explored this yet

#

I'm not even convinced that residual stream representations are 'mostly' linear

#

Something something ROME edit failures or whatever

keen pivot Jul 8, 2023, 7:09 PM

#

bitter turtle This seems an overly strong claim. I'd still like to see - more thorough analysi...

I also agree on all of these (though unclear on the meaning of the last one). Like work does still need to be done, but maybe I'm more confident that this will just end up working.

keen pivot Jul 8, 2023, 7:10 PM

#

bitter turtle like, ideally, we should be able to identify circuits using this, and we haven't...

This is one of my questions: Residual stream seems really good, but I'm expecting a large amount of redundancy if you learn dictionaries across multiple layers, which makes circuit like stuff hard.

bitter turtle Jul 8, 2023, 7:21 PM

#

wdym

bitter turtle Jul 8, 2023, 7:22 PM

#

keen pivot I also agree on all of these (though unclear on the meaning of the last one). Li...

so like if features are sparse and mostly represent independent things then we should see that the covariance of activated features reflects this
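
A rough sketch of one way to test this: measure how many feature pairs have strongly correlated activations. It uses the correlation of feature activations rather than a covariance matrix rotated into the dictionary basis, so it is a simplified stand-in; the threshold, the function name, and the synthetic data are assumptions.

```python
import numpy as np

def correlated_pair_fraction(codes, threshold=0.1):
    """Fraction of distinct feature pairs with |activation correlation| > threshold.

    codes: (n_datapoints, n_features) feature activations; dead (constant)
    features should be dropped first, since their correlation is undefined.
    """
    corr = np.corrcoef(codes, rowvar=False)
    off_diag = corr[~np.eye(corr.shape[0], dtype=bool)]
    return float((np.abs(off_diag) > threshold).mean())

rng = np.random.default_rng(0)
# independent sparse activations: each feature fires ~20% of the time
codes = rng.random((5000, 6)) * (rng.random((5000, 6)) < 0.2)
independent = correlated_pair_fraction(codes)

coupled = codes.copy()
coupled[:, 1] = coupled[:, 0]          # make two features fire together
dependent = correlated_pair_fraction(coupled)
```

If learned features really are mostly independent, the fraction on real activations should sit close to the independent case.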

#

I guess I'm using the dictionary as a proxy for feature activation here which might be problematic idk feel like that should cancel out

keen pivot Jul 8, 2023, 7:29 PM

#

bitter turtle wdym

The residual stream carries information from layer 1 to 2 so I expect dictionaries learned for both will have a large amount of overlap

#

Lecun's paper actually shows differences across early, mid, & late layers: https://arxiv.org/pdf/2103.15949.pdf

bitter turtle Jul 8, 2023, 7:30 PM

#

keen pivot The residual stream carries information from layer 1 to 2 so I expect dictionari...

surely that makes circuit stuff easier? like, if we can treat some directions in both dictionaries as equivalent, we have some of the work done for us

keen pivot Jul 8, 2023, 7:30 PM

#

bitter turtle so like if features are sparse and mostly represent independent things then we s...

covariance of activated features in one layer?

bitter turtle Jul 8, 2023, 7:30 PM

#

keen pivot Lecun's paper actually shows differences across early, mid, & late layers: https...

I'm on data, don't want to load pdf; what differences?

keen pivot Jul 8, 2023, 7:31 PM

#

bitter turtle surely that makes circuit stuff _easier_? like, if we can treat some directions ...

Ah, I agree. It just seemed wasteful, but we can totally be wasteful & inefficient if it works.

keen pivot Jul 8, 2023, 7:31 PM

#

bitter turtle I'm on data, don't want to load pdf; what differences?

Early layers were more like single meaning words & later layers were higher level features (I'd assume like the german one or repeated tokens one?)

#

Though I saw several different types in layer 2 of Pythia

bitter turtle Jul 8, 2023, 7:32 PM

#

oh, right, that, yeah there's going to be some difference but it's a continuum and for neighbouring layers I'd expect high similarity

bitter turtle Jul 8, 2023, 7:33 PM

#

bitter turtle I guess I'm using the dictionary as a proxy for feature activation here which mi...

I guess I want better denoising encoders for this

keen pivot Jul 8, 2023, 7:40 PM

#

Posted recent results on the residual stream:
https://www.lesswrong.com/posts/Q76CpqHeEMykKpFdB/really-strong-features-found-in-residual-stream

Really Strong Features Found in Residual Stream — LessWrong

[I'm writing this quickly because the results are really strong. I still need to do due diligence & compare to baselines, but it's really exciting!] …

errant nova Jul 9, 2023, 4:20 AM

#

keen pivot Made my much better post! https://www.lesswrong.com/posts/wqRqb7h6ZC48iDgfK/tent...

I’m enjoying reading this! If you want us to promote this or any other materials y’all put out on the EleutherAI blog or social media just ask.

I noticed a couple weirdnesses in this paragraph:

To be clear, I am running datapoints through Pythia-70M, grabbing the activations mid-way through at layer 2's MLP after the GeLU, & running that through the autoencoder, grabbing the feature magnitudes ie latent activations

“Pythia 70M” is actually named Pythia 160M. I know having the model names change is annoying, but it’s much less annoying than having different models in the suite follow different naming conventions!
The MLP and Attention layers in Pythia are computed in parallel. Does “after Layer 2’s MLP” mean “before Layer 2’s attention layer writes to the residual stream”?

Finally I was wondering if you had tried applying the Tuned Lens and if the Logit Lens was giving you better results, or if you hadn’t tried the Tuned Lens yet. IIRC the Logit Lens does work reasonably well for Pythia, but you should expect its behavior to fall apart when examining other models like BLOOM and GPT-Neo

#

Also, @keen pivot you now have the Research Lead role. The primary change this brings is the ability to pin and un-pin posts, edit channel descriptions, delete posts, and assign low-level roles.

Please use this power primarily to manage this channel, but if you feel like spending some time doing miscellaneous moderation tasks and cleaning up spam we’ll hardly complain. You don’t have the power to ban people; if someone needs banning that will need to be referred to a Staff member.

bitter turtle Jul 9, 2023, 7:47 AM

#

bitter turtle This seems an overly strong claim. I'd still like to see - more thorough analysi...

Add to this list:

- more toy model stuff trying to fit the behaviour of the actual data
- testing hypotheses about why the residual stream outperforms MLP stuff using toy models (maybe look at sae effectiveness with noisy data)
- compare perf over entire dataset vs like QA or math

#

(keep aiming to look into this and never doing it)

#

(maybe I will eventually, but not particularly confident about actually getting any useful results)

keen pivot Jul 9, 2023, 11:57 AM

#

errant nova I’m enjoying reading this! If you want us to promote this or any other materials...

Thanks!:) I'll probably take you up on getting it promoted, probably w/ a better post this week.

Updated the Pythia 70M to 160M on both posts; thanks! [Edit: Looking at the table at the bottom of https://huggingface.co/EleutherAI/pythia-160m, pythia-70M is 6 layers & 160M is 12. I'm currently using the 6 layer one, so unable to square this]
Correct. Last post was "mid-MLP" as in after the first linear layer & activation function. The latest post is the residual stream w/ much better results (link: https://www.lesswrong.com/posts/Q76CpqHeEMykKpFdB/really-strong-features-found-in-residual-stream)
Regarding Tuned/Logit Lens: the latest post does get much, much better results here, this is for both the logit lens & ablating the feature direction. I would like to integrate Tuned Lens here though, especially for larger models & if applying to BLOOM/OPT/GPT-Neo.

pallid current Jul 9, 2023, 12:12 PM

#

bitter turtle Oh, also, does anyone know if the encoder turns out to be some scaling of the tr...

late to this sorry, but i asked Lee and he said that he tried it, but that he couldn't get it to learn well (low MMCS), but that he thinks it should work, and is interested to see it tried again
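
For anyone reading back: MMCS here is mean max cosine similarity between two learned dictionaries. A minimal sketch of how it could be computed, assuming each row is one feature direction; the function name `mmcs` and the random test dictionaries are illustrative and not taken from any repo mentioned in this channel.

```python
import numpy as np

def mmcs(dict_a, dict_b):
    """Mean max cosine similarity: for each row (feature) of dict_a, take the
    best cosine match among the rows of dict_b, then average those maxima."""
    a = dict_a / np.linalg.norm(dict_a, axis=1, keepdims=True)
    b = dict_b / np.linalg.norm(dict_b, axis=1, keepdims=True)
    cos = a @ b.T                      # (n_a, n_b) pairwise cosine similarities
    return float(cos.max(axis=1).mean())

rng = np.random.default_rng(0)
d1 = rng.normal(size=(8, 16))
d2 = d1[rng.permutation(8)] * 3.0      # same directions, shuffled and rescaled
same = mmcs(d1, d2)                    # close to 1: every feature has an exact match
unrelated = mmcs(d1, rng.normal(size=(8, 16)))
```

Because the rows are normalised first, the score ignores feature order and scale, so a low MMCS points at the directions themselves not matching.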

keen pivot Jul 9, 2023, 12:17 PM

#

bitter turtle Add to this list: - more toy model stuff trying to fit the behaviour of the actu...

The strongest effect of residual stream is a different set of features (maybe a lot of embedding ones?, but also some overlap) & much stronger direction ablation effect/logit lens.

I still haven't trained a non-affine decoder on residual stream yet to disentangle that part.

errant nova Jul 9, 2023, 12:26 PM

#

keen pivot Thanks!:) I'll probably take you up on getting it promoted, probably w/ a better...

Ugh I should never say things after midnight. Apparently I just had a brain fart and you were right about model sizes

bitter turtle Jul 9, 2023, 12:31 PM

#

keen pivot The strongest effect of residual stream is a different set of features (maybe a ...

Not sure what you mean by 'ablation effect' here; could you elaborate?

keen pivot Jul 9, 2023, 12:31 PM

#

errant nova Ugh I should never say things after midnight. Apparently I just had a brain fart...

Happens. Thanks for the update!

keen pivot Jul 9, 2023, 12:32 PM

#

bitter turtle Not sure what you mean by 'ablation effect' here; could you elaborate?

Both logit lens & ablating the feature direction rely on a good feature direction.

If we have a feature direction, we can project onto its orthogonal complement every time it has non-zero activation. What I actually do is subtract that direction*magnitude.
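
A rough sketch of the two variants described above, assuming activations of shape (tokens, d_model) and one feature magnitude per token. The names `ablate_feature` and `project_out` are made up for illustration and the arrays are random stand-ins, not real Pythia activations.

```python
import numpy as np

def ablate_feature(resid, direction, magnitude):
    """Subtract magnitude * unit(direction) from each activation vector.

    resid:     (n_tokens, d_model) activations
    direction: (d_model,) dictionary feature direction
    magnitude: (n_tokens,) feature activation per token (0 where inactive)
    """
    unit = direction / np.linalg.norm(direction)
    return resid - magnitude[:, None] * unit

def project_out(resid, direction):
    """Remove the whole component along the direction (orthogonal projection)."""
    unit = direction / np.linalg.norm(direction)
    return resid - (resid @ unit)[:, None] * unit

rng = np.random.default_rng(0)
resid = rng.normal(size=(4, 6))
direction = rng.normal(size=6)
magnitude = np.array([0.0, 1.5, 0.0, 0.7])

ablated = ablate_feature(resid, direction, magnitude)

# Projection variant, applied only where the feature is active.
active = magnitude != 0
projected = np.where(active[:, None], project_out(resid, direction), resid)
```

The two differ whenever the activation has a component along the direction that the dictionary did not attribute to this feature: projection removes all of it, subtraction removes only the attributed part.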

pallid current Jul 9, 2023, 1:10 PM

#

something i'd be interested to test in the upcoming sweep is what would happen if for the MLP activations we constrained the features to be positive only.

#

you'd need to allow a bias to help it account for negative activations

#

but i wonder if it would help it towards correct solutions given that we expect most features to be pretty much entirely positive - it's kinda strange to think about what a negative valued feature would look like in the MLP, like working only in the negative range of the GELU would make it extremely sensitive to interference. i suppose you could have a 'feature' which cancels out the activation of other features at certain times, though maybe that would better be understood as just being a part of how the activation conditions for the positive features are defined

#

will check later today whether we're seeing significant positive activations in our current dicts

#

sent an email to wes gurnee about getting the distributed features that they found in early pythia layers which respond to particular n-grams

#

am now on PST time btw! and will be properly back in the swing of things on monday

bitter turtle Jul 9, 2023, 2:59 PM

#

pallid current something i'd be interested to test in the upcoming sweep is what would happen i...

yea I was thinking about literally just slapping ReLU on the activations when we're preprocessing

#

should be equivalent mod dead neurons ig

pallid current Jul 9, 2023, 3:01 PM

#

why would preprocessing with relu have the same effect as constraining features to be positive?

bitter turtle Jul 9, 2023, 3:02 PM

#

no incentive to learn negative features

pallid current Jul 9, 2023, 3:02 PM

#

the kind of negative interference i'm imagining could still happen with positive-only activations i think

bitter turtle Jul 9, 2023, 3:02 PM

#

might be misunderstanding what you mean by positive here

pallid current Jul 9, 2023, 3:02 PM

#

just that each entry in the decoder matrix would be positive only

bitter turtle Jul 9, 2023, 3:03 PM

#

if you mean 'only directions with +ve coefs' should be the same

#

well, no incentive to learn negative features

#

or, if negative features were learnt, they would be equivalent to 0

#

like, I don't see why the encoder should need to use the negatives in the activation at all

#

oh I am totally misunderstanding yeah constrain it to be positive ignore me 👍

#

like, I would see passing the activations through ReLU as "cleaning the GELU noise so the dictionary doesn't pay attention to it" and constraining the dict to have +ve coefs as "limiting the computation the dictionary can do", is that what you're getting at @pallid current?
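
To make the two options concrete, here is a small sketch, assuming a plain ReLU autoencoder with a decoder bias. `preprocess_relu` and `clamp_decoder` are hypothetical helpers (clamping after each optimiser step is one possible way to enforce the constraint, not something already in our training loop).

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def preprocess_relu(acts):
    """Option 1: clip the negative GELU outputs before the dictionary sees them."""
    return relu(acts)

def clamp_decoder(decoder):
    """Option 2: keep every decoder entry non-negative (projection step)."""
    return np.maximum(decoder, 0.0)

def reconstruct(acts, encoder, enc_bias, decoder, dec_bias):
    codes = relu(acts @ encoder + enc_bias)   # non-negative feature activations
    return codes @ decoder + dec_bias         # dec_bias can absorb negative activations

rng = np.random.default_rng(0)
acts = rng.normal(size=(5, 8))                # stand-in for MLP activations
encoder = rng.normal(size=(8, 12))
decoder = clamp_decoder(rng.normal(size=(12, 8)))

# With a zero decoder bias, non-negative codes times a non-negative decoder
# can never produce a negative output.
recon_no_bias = reconstruct(acts, encoder, np.zeros(12), decoder, np.zeros(8))
```

This is why the positive-only version needs a bias: without one the reconstruction is stuck at zero or above, while the real post-GELU activations dip slightly below zero.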

pallid current Jul 9, 2023, 3:14 PM

#

bitter turtle like, I would see passing the activations through ReLU as "cleaning the GELU noi...

yeah i think this is about right. i don't think it's right to call GELU negatives noise (i actually wonder if it helps reduce interference) because i highly doubt that the model would work well if you added the extra relu, but yeah 'limiting the computation' is about right

#

i'm wondering if there are activation vectors that the model learns to explain as (a * feature x - b * feature y) which doesn't fit my model of how features should work

#

but that also might be a problem for my understanding of features so i would only be doing it as a tentative test

bitter turtle Jul 9, 2023, 3:16 PM

#

pallid current yeah i think this is about right. i don't think it's right to call GELU negative...

Can probably test this

pallid current Jul 9, 2023, 3:27 PM

#

which bit? i'm really interested in what would make a good test for whether gelu helps with interference, probably along the lines of the test that you wrote up a couple of weeks ago

#

somewhat related but there's something i really want to test with that setup that i thought of last night. setup is you have multiple MLP layers in sequence which are trying to calculate n distinct features, where n is more than total number of neurons. question is whether each feature is calculated in one layer only, or whether the network e.g. does some preprocessing in layer 1 to calculate it in layer 2, or perhaps uses layer 2 to clean up interference from layer 1?

bitter turtle Jul 9, 2023, 4:05 PM

#

total meaning 'more than both MLP layers combined' here?

pallid current Jul 9, 2023, 4:05 PM

#

bitter turtle total meaning 'more than both MLP layers combined' here?

yup

keen pivot Jul 9, 2023, 7:31 PM

#

1. learnable l1 (based off features/activation)
2. Re-init features when dead
3. Perplexity difference
- a. When replacing whole layer w/ reconstruction
- b. Just high-MCS features (and potentially high-MCS datapoints only)
4. Keep features if self-sim=0.9 after N_GB & not dead
5. Decoder L2/simplicity term
6. Affine decoder vs linear
7. Changing toy model to better match current performance
- a. write down current differences
- b. Residual stream outperforms (maybe look at sae effectiveness with noisy data)
8. compare perf over entire dataset vs like QA or math (@bitter turtle, what metric for performance did you mean?)
9. Circuit finding & causal interventions (on algorithmic tasks?) @bitter turtle
- a. How to find circuits if doing residual stream
10. Tuned Lens
11. Better wandb/aws setup
- a. Easily get graphs on same page (Do we just manually do it every time? Groups mess it up)
- b. naming scheme when uploading to aws isn't useful. Just timestamp, when model_name & layer would help
12. Large model features
- a. switch to bau-kit for >6B models
- b. 1B-param features
13. Auto-interp - good for hypothesis refining of input, but what about:
- a. hypothesis testing on effect on output (ablating direction/logit lens)
- b. marking "interesting" features (or categorizing features in general?)
14. Aiden's TMOS k-covariance thing(?) @bitter turtle
15. Talk to expert in dictionary learning
- a. Anthropic
- b. maybe MIT person Logan knows?
16. Compare w/ Baselines: PCA & Reconstruction ICA

keen pivot Jul 9, 2023, 7:31 PM

#

keen pivot - 1. learnable l1 (based off features/activation) - 2. Re-init features when dea...

#

Aiden, could you clarify these things? (I think you'll need to explain some of them again, sorry. I can read back later, but currently pinning post. No hurry though)

bitter turtle Jul 9, 2023, 7:41 PM

#

ok so the circuit/smaller dataset stuff is basically because I slightly feel like you'd get more linearity/truthful representation by a sparse basis on [some] limited algorithmic tasks, and I wanted to see if that was the case.

keen pivot Jul 9, 2023, 11:20 PM

#

@pallid current when we do the big run to compare things, it’d be good to have a set seed for the data generation as opposed to just shuffling. Also, I believe the Pile is already shuffled by default

pallid current Jul 9, 2023, 11:23 PM

#

agree on the seed, what does it mean to shuffle by default? just the parameter that goes into load_dataset(shuffle=True)?

bitter turtle Jul 9, 2023, 11:53 PM

#

keen pivot @pallid current when we do the big run to compare things, it’d be good to ...

in the current setup, they all get fed the exact same data at the same time, just need to specify the seed at the start

keen pivot Jul 10, 2023, 1:55 AM

#

bitter turtle in the current setup, they all get fed the exact same data at the same time, jus...

True over the course of one run, but not two runs, right?

keen pivot Jul 10, 2023, 1:57 AM

#

pallid current agree on the seed, what does it mean to shuffle by default? just the parameter t...

I think shuffled in the shards by default

bitter turtle Jul 10, 2023, 2:10 AM

#

keen pivot True over the course of one run, but not two runs, right?

What do you mean by 'run' here? Also I was talking about the current setup for the big run on the pod sorry for not specifying that

keen pivot Jul 10, 2023, 2:37 AM

#

If the big run handles all hyperparams we care about it should be good, but it’d be good to still have a set seed for the data shuffle if we care about replicating later or think of some other setting we’d like to compare to.
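
A tiny sketch of what a set seed buys us, assuming we shuffle by permuting example indices. `shuffled_indices` is a made-up helper, not how the current data pipeline is written.

```python
import numpy as np

def shuffled_indices(n_examples, seed):
    """Deterministic permutation of dataset indices for a given seed."""
    return np.random.default_rng(seed).permutation(n_examples)

order_run_1 = shuffled_indices(10, seed=0)
order_run_2 = shuffled_indices(10, seed=0)   # a later, separate run
order_other = shuffled_indices(10, seed=1)   # different seed, (almost surely) different order
```

With the seed logged alongside the run, a later sweep over some new setting can be fed the same examples in the same order, which is the replication case mentioned above.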

keen pivot Jul 10, 2023, 6:58 PM

#

We got recommended a paper written a few years ago ( @pallid current , I forgot the person's name?) for variational sparse encoding: http://proceedings.mlr.press/v115/tonolini20a/tonolini20a.pdf.

Section 2 gives a nice overview of related work to contextualize it, but I'm confused on the claim of what normal sparse coding is missing that VAEs help w/ (or why not just an AE like our work?). Haven't read more than 10 min, but would appreciate if someone else could look at it!

bitter turtle Jul 10, 2023, 10:12 PM

#

@keen pivot willing to go through this with you so we can bounce thoughts off each other; first thoughts are that

- we get more control over the latent space by using a VAE, which might result in better learning/convergence if we set our hyperparams right
- typically VAEs use a more powerful recognition model than our current encoders; probably useful if transformer representations are fundamentally nonlinear/better denoising is needed than a simple ReLU layer

I believe @blazing yoke was talking about this in #eliciting-latent-knowledge at one point, they might have been the one to recommend you the paper. I'd be very interested in pursuing this.

keen pivot Jul 10, 2023, 10:15 PM

#

bitter turtle @keen pivot willing to go through this with you so we can bounce thoug...

Thanks!:)

bitter turtle Jul 10, 2023, 10:15 PM

#

Also a note on AEs vs VAEs: https://stats.stackexchange.com/questions/324340/when-should-i-use-a-variational-autoencoder-as-opposed-to-an-autoencoder is probably fine.

Cross Validated

When should I use a variational autoencoder as opposed to an autoen...

I understand the basic structure of variational autoencoder and normal (deterministic) autoencoder and the math behind them, but when and why would I prefer one type of autoencoder to the other? Al...

keen pivot Jul 10, 2023, 10:15 PM

#

If you'd like to take ownership of implementing it, that'd be great. I'm currently doing different sparsity constraints atm, and maybe even tied embeddings if we're not doing that in the big sweep

bitter turtle Jul 10, 2023, 10:16 PM

#

We can definitely do it in the big sweep.

#

Or, a sweep.

kind scroll Jul 10, 2023, 10:20 PM

#

keen pivot We got recommended a paper written a few years ago ( @pallid current , I f...

Hey - I mentioned the paper to Lee a few days ago so may have been me. I've been loosely keeping tabs on this thread. It's a good point that the paper didn't compare with normal sparse coding. A friend of mine wrote the paper, so I'd be very happy going through it with you. The answer aidan posted is pretty accurate. There's many principled differences between VAEs (which aren't really auto-encoders), and auto-encoders.

bitter turtle Jul 10, 2023, 10:21 PM

#

kind scroll Hey - I mentioned the paper to Lee a few days ago so may have been me. I've bee...

would absolutely love that, I have no formal grounding in VAEs or pretty much any principled machine learning, would find this v useful

kind scroll Jul 10, 2023, 10:22 PM

#

Generally, and in my experience, autoencoders need a lot more hacks for learning the kinds of representations you want. They suffer from things like mode collapse, and it turns out the isotropic gaussian latent space is kind of an okay choice in VAEs.

#

VAEs are super easy to implement though - I could show you the ML/code side in < hour, and the probabilistic theory in a couple hours.

keen pivot Jul 10, 2023, 10:23 PM

#

kind scroll Hey - I mentioned the paper to Lee a few days ago so may have been me. I've bee...

Yep, it was from Lee. Thanks!

kind scroll Jul 10, 2023, 10:24 PM

#

Ignore most of this code that I haven't touched in a long time, but this is the de-facto implementation. https://github.com/SalmanMohammadi/odd-one-out-representation-learning/blob/7989b74de0aa76f5a63dabda7baf1c0105adfd5a/models/models_disentanglement.py#L111

bitter turtle Jul 10, 2023, 10:33 PM

#

oh was about to say found one but would still value a summary paha

kind scroll Jul 10, 2023, 10:34 PM

#

Not sure if you're still looking for thoughts on the paper @bitter turtle - I hadn't seen it before but I've skimmed through it. Could you share the summary you found?

#

Intuitively, from my perspective, it follows a different probabilistic derivation than the VAE paper

bitter turtle Jul 10, 2023, 10:34 PM

#

https://arxiv.org/pdf/1705.07120.pdf

#

not the summary

#

but was cited by http://proceedings.mlr.press/v115/tonolini20a/tonolini20a.pdf and im confused about the intuition

blazing yoke Jul 10, 2023, 10:35 PM

#

bitter turtle @keen pivot willing to go through this with you so we can bounce thoug...

Yes, I was.

bitter turtle Jul 10, 2023, 10:36 PM

#

blazing yoke Yes, I was.

did you get anywhere with this?

#

one sec let me bring up the messages you sent originally

#

I guess you were approaching it from a slightly different perspective

blazing yoke Jul 10, 2023, 10:40 PM

#

https://arxiv.org/abs/2004.04092

arXiv.org

Optimus: Organizing Sentences via Pre-trained Modeling of a Latent ...

When trained effectively, the Variational Autoencoder (VAE) can be both a
powerful generative model and an effective representation learning framework
for natural language. In this paper, we propose the first large-scale language
VAE model, Optimus. A universal latent embedding space for sentences is first
pre-trained on large text corpus, and t...

#

Yeah, I ended up deciding the right thing to implement was probably this.

#

But I ended up doing other projects first.

#

Currently working on https://github.com/JD-P/minihf

GitHub

GitHub - JD-P/minihf: MiniHF is a web based interface to the (upcom...

MiniHF is a web based interface to the (upcoming) StableLM model. It is intended to let users collect their own feedback datasets using a simple lightweight interface. - GitHub - JD-P/minihf: Mini...

bitter turtle Jul 10, 2023, 10:41 PM

#

blazing yoke Yeah, I ended up deciding the right thing to implement was probably this.

could you elaborate on why?

blazing yoke Jul 10, 2023, 10:41 PM

#

There don't exist a lot of language VAE architectures.

#

But um, Katherine made a flow encoder thing you could add to it that would make it implement the right inductive bias for language.

bitter turtle Jul 10, 2023, 10:44 PM

#

blazing yoke But um, Katherine made a flow encoder thing you could add to it that would make ...

?

blazing yoke Jul 10, 2023, 10:44 PM

#

Which was inspired by https://arxiv.org/pdf/1908.11527.pdf

blazing yoke Jul 10, 2023, 10:45 PM

#

bitter turtle ?

Normal VAEs don't really work well for language

#

So you have to use like, an iVAE or one with hyperbolic geometry

#

We were going to combine Katherine's flow model with Optimus.

#

You'd probably be better off asking her for the technical details, since I didn't implement any of the flow model.

#

@opal basin

opal basin Jul 10, 2023, 10:48 PM

#

Oh, the normalizing flow part was to allow the autoencoder to learn distributions with arbitrary/weird shapes while retaining the ability to sample latents

#

Because with a VAE your posterior is normally diagonal Gaussian which is not good for language

#

We haven't actually tried it on text yet

#

If you don't care about the ability to sample from the distribution of latents, or determine the information content/likelihood of a latent, you can use a normal autoencoder (not VAE) which also lets it take on arbitrary shapes

blazing yoke Jul 10, 2023, 10:51 PM

#

Implementing this was still on our todo so, would be very happy if you did.

bitter turtle Jul 10, 2023, 10:52 PM

#

we'd be doing it for transformer internal representations not text, not sure if the non-gaussianity thingy still holds there

opal basin Jul 10, 2023, 10:53 PM

#

ahhh

bitter turtle Jul 10, 2023, 10:53 PM

#

why don't gaussians work for text

opal basin Jul 10, 2023, 10:54 PM

#

tbh i'm not clear on the exact details but they impose a Euclidean geometry on the space and for text you want the ability to stuff trees into the latents, which means you want hyperbolic geometry. this is kind of conjecture

#

"Adding Gaussian noise imposes Euclidean geometry and this is empirically not good for text" is the part that isn't conjecture i think.

kind scroll Jul 10, 2023, 10:57 PM

#

Euclidean geometry can have a bunch of things wrong with it https://arxiv.org/pdf/2002.05227.pdf

bitter turtle Jul 10, 2023, 10:57 PM

#

riiiight so would the conjecture like imply that transformer internal reps are well described by something on hyperbolic geometry, so maybe it still applies here? we might end up using it if it works better, im adding it to the ever-growing list of things to test at some point in the future

opal basin Jul 10, 2023, 10:57 PM

#

nods

#

I wanted to be able to sample from the distribution of latents and determine their likelihood so I took a normal autoencoder and added an RNODE (https://arxiv.org/abs/2002.02798) that converted from its latent space to N(0, I) and back.

arXiv.org

How to train your neural ODE: the world of Jacobian and kinetic reg...

Training neural ODEs on large datasets has not been tractable due to the
necessity of allowing the adaptive numerical ODE solver to refine its step size
to very small values. In practice this leads to dynamics equivalent to many
hundreds or even thousands of layers. In this paper, we overcome this apparent
difficulty by introducing a theoretical...

blazing yoke Jul 10, 2023, 10:59 PM

#

But um, to my memory I said a couple different things about VAEs and ELK. The most relevant one is probably that the use of a decoder only transformer makes ELK harder because you can't easily get embeddings and explore the latent space of the model and characterize it. A VAE is useful because it lets you estimate the information content of the latents.

opal basin Jul 10, 2023, 10:59 PM

#

opal basin I wanted to be able to sample from the distribution of latents and determine the...

It worked quite well for CIFAR-10.

#

The sampled fakes actually looked like real images, which you usually don't get fully with a VAE!

blazing yoke Jul 10, 2023, 11:00 PM

#

Which for ELK is important because it lets you figure out if your translator is hiding detail. If there's a mismatch between the amount of information in the latent and the explanation, you know something funny is going on.

bitter turtle Jul 10, 2023, 11:02 PM

#

opal basin I wanted to be able to sample from the distribution of latents and determine the...

don't understand, could you say this in slightly longer form, also what's this for?

blazing yoke Jul 10, 2023, 11:02 PM

#

Wasn't there some complication you ran into with scaling that caused you not to use it for text right away?

opal basin Jul 10, 2023, 11:03 PM

#

bitter turtle don't understand, could you say this in slightly longer form, also what's this f...

With a normal autoencoder you can't sample from the distribution of latents to be able to generate completely new fake images that resemble the training set, because it can take on any arbitrary distribution.

#

With a Gaussian VAE you try to approximate the posterior by sampling from the N(0, I) prior and decoding that, which you have tried to make the posterior (encoder output) resemble, and this kind of works

bitter turtle Jul 10, 2023, 11:04 PM

#

yeah but the RNODE thing

opal basin Jul 10, 2023, 11:04 PM

#

With my flow autoencoder you can sample from N(0, I) and run it through the flow model to obtain totally new latents to decode.

#

RNODE is a continuous normalizing flow, an invertible map between two arbitrary probability distributions.

#

In this case it maps between N(0, I) and whatever the learned distribution of encoder outputs is.
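
A toy version of the sampling path, to show where the flow sits. Big assumption: the flow here is just an invertible affine (whitening) map fitted to the latents, standing in for the RNODE, which learns a nonlinear continuous flow. `AffineFlow` and the fake latents are illustrative only.

```python
import numpy as np

class AffineFlow:
    """Toy invertible map between latents and N(0, I): a whitening transform.
    Stand-in only; an RNODE learns a nonlinear continuous flow instead."""

    def fit(self, latents):
        self.mean = latents.mean(axis=0)
        cov = np.cov(latents, rowvar=False)
        self.chol = np.linalg.cholesky(cov)        # cov = chol @ chol.T
        return self

    def to_normal(self, latents):
        # Solve chol @ z = (h - mean) for every latent h.
        return np.linalg.solve(self.chol, (latents - self.mean).T).T

    def from_normal(self, z):
        return self.mean + z @ self.chol.T

rng = np.random.default_rng(0)
# Pretend encoder outputs: correlated, off-centre, clearly not N(0, I).
mix = np.array([[2.0, 0.5, 0.0],
                [0.0, 1.0, 0.3],
                [0.0, 0.0, 0.2]])
latents = rng.normal(size=(2000, 3)) @ mix + 5.0

flow = AffineFlow().fit(latents)
new_latents = flow.from_normal(rng.normal(size=(4, 3)))   # these would go to the decoder
round_trip = flow.from_normal(flow.to_normal(latents))
```

The autoencoder itself is untouched: it keeps whatever latent distribution it learned, and only the flow is responsible for matching that distribution to N(0, I).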

bitter turtle Jul 10, 2023, 11:05 PM

#

oh shit ok 👍 magic black box to convert distributions gotcha

kind scroll Jul 10, 2023, 11:05 PM

#

opal basin With a Gaussian VAE you try to approximate the posterior by sampling from the N(...

+1, but in practice a VAE works by learning the mu and sigma for N(mu, sigma), and you sample from that rather than N(0, 1)

opal basin Jul 10, 2023, 11:06 PM

#

kind scroll +1, but in practice a VAE works by learning the mu and sigma for N(mu, sigma), ...

But you'd like to be able to sample from the prior and get things that look like the training set but aren't. It just doesn't usually work that well in practice.
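
The two sampling modes being contrasted, as a minimal sketch of a standard diagonal-Gaussian VAE. The `mu` and `log_var` arrays are hand-written stand-ins for encoder outputs, and the function names are illustrative.

```python
import numpy as np

def sample_posterior(mu, log_var, rng):
    """Reparameterisation trick: z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)

rng = np.random.default_rng(0)
mu = np.array([[0.0, 0.0], [2.0, -1.0]])        # stand-in encoder means
log_var = np.array([[0.0, 0.0], [-1.0, 0.5]])   # stand-in encoder log-variances

z_train = sample_posterior(mu, log_var, rng)    # training: sample around the encoded input
z_gen = rng.normal(size=(3, 2))                 # generation: sample the N(0, I) prior directly
kl = kl_to_standard_normal(mu, log_var)         # the term pulling posteriors toward the prior
```

The KL term is zero only when the posterior already equals the prior, so in practice the per-input posteriors stay away from N(0, I), which is one way to see why decoded prior samples can look off.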

blazing yoke Jul 10, 2023, 11:06 PM

#

(Hers does though)

kind scroll Jul 10, 2023, 11:06 PM

#

for sure

blazing yoke Jul 10, 2023, 11:07 PM

#

blazing yoke Wasn't there some complication you ran into with scaling that caused you not to ...

But yeah, she was trying to figure out how to scale it and that's where we got sidetracked by other things I think.

bitter turtle Jul 10, 2023, 11:11 PM

#

Initially I don't think we care that much about sampling from the training distribution, plus superposition kind of conjectures that features follow ~a spike and slab distribution (https://www.tandfonline.com/doi/abs/10.1080/01621459.1988.10478694) so I'm guessing that http://proceedings.mlr.press/v115/tonolini20a/tonolini20a.pdf will end up higher on our list of priorities

kind scroll Jul 10, 2023, 11:13 PM

#

Had you heard of the spike and slab distribution before?

bitter turtle Jul 10, 2023, 11:13 PM

#

nope

kind scroll Jul 10, 2023, 11:13 PM

#

When Francesco brought it up for his paper he said it was an (obscure?) Physics thing.

bitter turtle Jul 10, 2023, 11:14 PM

#

but it seems close to what superposition is saying; toy data for testing SAE methods is generated using ~that distribution
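
Roughly what sampling from that kind of distribution looks like, as a sketch. Assumptions: a Gaussian slab, independent features, and a random matrix standing in for ground-truth feature directions; the actual toy data generator may differ (e.g. non-negative magnitudes or correlated features).

```python
import numpy as np

def sample_spike_and_slab(n_samples, n_features, p_active, rng):
    """Each coefficient is exactly 0 with probability 1 - p_active (the spike),
    otherwise drawn from a broad distribution (the slab; Gaussian by assumption)."""
    active = rng.random((n_samples, n_features)) < p_active
    slab = rng.normal(loc=0.0, scale=1.0, size=(n_samples, n_features))
    return np.where(active, slab, 0.0)

rng = np.random.default_rng(0)
codes = sample_spike_and_slab(n_samples=1000, n_features=64, p_active=0.05, rng=rng)
ground_truth = rng.normal(size=(64, 16))   # hypothetical feature directions
toy_data = codes @ ground_truth            # 64 sparse features superposed into 16 dims
frac_active = float((codes != 0).mean())   # should sit near p_active
```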

errant nova Jul 10, 2023, 11:16 PM

#

I’m familiar with spike-and-slab regression

bitter turtle Jul 10, 2023, 11:16 PM

#

oh?

errant nova Jul 10, 2023, 11:18 PM

#

[deleted article because it’s not very accessible, let me look for a better one]

bitter turtle Jul 10, 2023, 11:18 PM

#

Is that just linreg with a spike+slab prior on the coefs

errant nova Jul 10, 2023, 11:20 PM

#

Yeah

#

It’s spiritually similar to ridge and lasso

#

It’s used to quickly work through a large number of mostly useless variables

#

http://www.batisengul.co.uk/post/spike-and-slab-bayesian-linear-regression-with-variable-selection/

Batı Şengül

Spike and slab: Bayesian linear regression with variable selection

Spike and slab is a Bayesian model for simultaneously picking features and doing linear regression. Spike and slab is a shrinkage method, much like ridge and lasso regression, in the sense that it shrinks the “weak” beta values from the regression towards zero. Don’t worry if you have never heard of any of those terms, we will explore all of the...

bitter turtle Jul 10, 2023, 11:23 PM

#

errant nova It’s spiritually similar to ridge and lasso

yeah honestly given my complete lack of background in this it's all very much a blur and they morph into one

errant nova Jul 10, 2023, 11:24 PM

#

At the end of the day it’s like quibbling over the names of different types of knives. Maybe useful for people who really care about knives but you probably just want to make sure it’ll work and not cut you.

bitter turtle Jul 10, 2023, 11:24 PM

#

pahaha fair enough that's mildly encouraging

bitter turtle Jul 10, 2023, 11:37 PM

#

keen pivot If you'd like to take ownership of implementing it, that'd be great. I'm current...

If it's ok given we have so many random encoder/decoder things to try, I might spin some off to the AIS group I corun at my uni, we're trying to upskill ourselves atm with research projects. People would be @coarse flint and maybe @thorny cypress

keen pivot Jul 10, 2023, 11:44 PM

#

bitter turtle If it's ok given we have so many random encoder/decoder things to try, I might s...

Oh please do!

bitter turtle Jul 10, 2023, 11:45 PM

#

sound

pallid current Jul 11, 2023, 2:01 AM

#

bitter turtle Also a note on AEs vs VAEs: https://stats.stackexchange.com/questions/324340/whe...

bit confused about this in the answer: 'VAEs are known to give representations with disentangled factors [1] This happens due to isotropic Gaussian priors on the latent variables. Modeling them as Gaussians allows each dimension in the representation to push themselves as farther as possible from the other factors.' as i understand it, if you have an isotropic gaussian prior then the distribution should be invariant to rotation, which means that there's nothing distinct about the particular basis. where am i going wrong?

pallid current Jul 11, 2023, 2:25 AM

#

also finding it difficult to get an intuition for the role of the pseudoinputs in the VAEs. @kind scroll if you've got time to walk us through the paper some time in the next week that'd be brilliant!

opal basin Jul 11, 2023, 2:52 AM

#

pallid current bit confused about this in the answer: 'VAEs are known to give representations w...

the prior is isotropic, but the posterior is diagonal - the encoder outputs a mean and a log variance per dimension.

#

I suspect the answer is wrong and it's actually due to the posterior being diagonal and thus not rotation invariant
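
A quick numeric check of that suspicion, as a sketch in 2D: the KL term against an isotropic prior does not change under rotation, but a rotated axis-aligned posterior stops being diagonal, so a diagonal-posterior encoder cannot express it. `kl_full` is an illustrative helper using the closed-form Gaussian KL.

```python
import numpy as np

def kl_full(mu, cov):
    """KL( N(mu, cov) || N(0, I) ) for a full covariance matrix."""
    k = mu.shape[0]
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (np.trace(cov) + mu @ mu - k - logdet)

theta = np.pi / 4
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])

mu = np.array([1.0, 0.0])
cov = np.diag([1.0, 0.01])        # axis-aligned posterior: fine for a diagonal encoder

mu_rot = rot @ mu
cov_rot = rot @ cov @ rot.T       # the same posterior rotated by 45 degrees

kl_before = kl_full(mu, cov)
kl_after = kl_full(mu_rot, cov_rot)   # identical: the prior term has no preferred basis
off_diagonal = cov_rot[0, 1]          # non-zero: not representable as a diagonal Gaussian
```

So if any basis preference shows up, the isotropic prior by itself cannot be the source; the restriction to diagonal posteriors is the part that is tied to the coordinate axes.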

opal basin Jul 11, 2023, 2:59 AM

#

pallid current also finding it difficult to get an intuition for the role of the pseudoinputs i...

what are pseudoinputs?

bitter turtle Jul 11, 2023, 3:13 AM

#

instead of having a purely Gaussian prior you have a set of learned pseudoinputs which you feed into the latent space predictor to get your prior(s) which you then mix like mixture of gaussian

#

This is better because (?) and it incentivises high-variance posteriors/latents/eh
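
My reading of the pseudo-input prior from 1705.07120, as a sketch: the prior is an equal-weight mixture of the encoder's own posteriors evaluated at K learned pseudo-inputs. The linear `encoder`, the weight matrices and the random pseudo-inputs below are stand-ins; in training the pseudo-inputs would be learned parameters.

```python
import numpy as np

def encoder(x, w_mu, w_logvar):
    """Stand-in linear recognition model: mean and log-variance per latent dim."""
    return x @ w_mu, x @ w_logvar

def log_normal_diag(z, mu, log_var):
    """log N(z; mu, diag(exp(log_var))), summed over latent dimensions."""
    return -0.5 * np.sum(np.log(2 * np.pi) + log_var
                         + (z - mu) ** 2 / np.exp(log_var), axis=-1)

def log_mixture_prior(z, pseudo_inputs, w_mu, w_logvar):
    """log p(z) = log (1/K) sum_k q(z | u_k) over K pseudo-inputs u_k."""
    mu_k, log_var_k = encoder(pseudo_inputs, w_mu, w_logvar)             # (K, d)
    log_q = log_normal_diag(z[:, None, :], mu_k[None], log_var_k[None])  # (n, K)
    m = log_q.max(axis=1, keepdims=True)                                 # stable log-sum-exp
    lse = (m + np.log(np.exp(log_q - m).sum(axis=1, keepdims=True))).squeeze(1)
    return lse - np.log(len(pseudo_inputs))

rng = np.random.default_rng(0)
w_mu = rng.normal(size=(5, 2))
w_logvar = 0.1 * rng.normal(size=(5, 2))
pseudo_inputs = rng.normal(size=(3, 5))    # K = 3 pseudo-inputs in input space
z = rng.normal(size=(4, 2))
log_p = log_mixture_prior(z, pseudo_inputs, w_mu, w_logvar)
```

Since the prior is built from the encoder, it can take a multi-modal shape that follows the data instead of being fixed to a single Gaussian.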

bitter turtle Jul 11, 2023, 3:15 AM

#

pallid current also finding it difficult to get an intuition for the role of the pseudoinputs i...

The paper that introduced it (?) https://arxiv.org/pdf/1705.07120.pdf does a decent job but I'm still confused

opal basin Jul 11, 2023, 3:15 AM

#

bitter turtle instead of having a purely Gaussian prior you have a set of learned pseudoinputs...

ohh

kind scroll Jul 11, 2023, 8:55 AM

#

pallid current bit confused about this in the answer: 'VAEs are known to give representations w...

This is correct, I think vanilla VAEs don't have true disentanglement because their latent spaces can be rotated arbitrarily

kind scroll Jul 11, 2023, 8:55 AM

#

pallid current also finding it difficult to get an intuition for the role of the pseudoinputs i...

Be happy to!

cosmic moon Jul 11, 2023, 3:02 PM

#

kind scroll This is correct, I think vanilla VAEs don't have true disentanglement because th...

i believe beta-VAEs are supposed to be the "extra disentangled" flavor

bitter turtle Jul 11, 2023, 3:03 PM

#

cosmic moon i believe beta-VAEs are supposed to be the "extra disentangled" flavor

yes, but they only have isotropic gaussian priors, and we are wondering why that prior encourages disentanglement

kind scroll Jul 11, 2023, 3:04 PM

#

I'll link a couple papers later today. The short answer is: it doesn't really. It's partly just nice for an analytical form of the evidence lower bound.

cosmic moon Jul 11, 2023, 3:10 PM

#

re PCA as a baseline, fitting a GLoRA decomposition might be interesting to investigate as well. https://arxiv.org/abs/2306.07967

arXiv.org

One-for-All: Generalized LoRA for Parameter-Efficient Fine-tuning

We present Generalized LoRA (GLoRA), an advanced approach for universal
parameter-efficient fine-tuning tasks. Enhancing Low-Rank Adaptation (LoRA),
GLoRA employs a generalized prompt module to optimize pre-trained model weights
and adjust intermediate activations, providing more flexibility and capability
across diverse tasks and datasets. More...

keen pivot Jul 11, 2023, 5:09 PM

#

Review of different methods:
l_1/2 - Nothing noticeably different than l_1 (no better sparsity for the same reconstruction loss). Worse high MCS (though maybe optimizing for an even better alpha would work). Didn't look at individual features.
Adding noise - can't compare (sparsity, reconstruction loss) because it's at a strict disadvantage to the non-noise one (would need to compare on clean data for both). Learns maybe 1/3 number of features (boo) but 0 dead features (yaaay).

Looking at individual features, many appear just as meaningful as l1, however, the logit lens is much worse for some of them (Logit lens being worse isn't as meaningful as ablated direction being just the same). Additionally, there are like 10 features for (beginning and end of first sentence), whereas l1 only has 1 of those.

Could try tied embedding for both l1 & noise to compare.

#

Though, I'm unsure how normalizing the weights of the decoder (which are now the encoder) will affect things?

bitter turtle Jul 11, 2023, 5:15 PM

#

awesome

#

I've just about got the big sweep code debugged we can set it off tomorrow

#

https://github.com/Baidicoot/sparse_coding/blob/main/big_sweep.py <- someone think of good hyperparam settings

GitHub

sparse_coding/big_sweep.py at main · Baidicoot/sparse_coding

Work on sparse coding, replicating and extending the sparse coding approach to taking transformer features out of superposition. - sparse_coding/big_sweep.py at main · Baidicoot/sparse_coding

keen pivot Jul 11, 2023, 5:17 PM

#

I guess I'm confused about implementing the tying. Like I want them to be in the same direction, but I'd be okay w/ different biases. Same w/ only normalizing the decoder

keen pivot Jul 11, 2023, 5:18 PM

#

bitter turtle https://github.com/Baidicoot/sparse_coding/blob/main/big_sweep.py <- someone thi...

Residual stream or MLP? & any specific hyperparams?

bitter turtle Jul 11, 2023, 5:19 PM

#

can do either not sure which one we should do first

bitter turtle Jul 11, 2023, 5:19 PM

#

keen pivot Residual stream or MLP? & any specific hyperparams?

I guess just good lr values, good values to test, also not a hyperparam but whatever we want to keep track of

keen pivot Jul 11, 2023, 5:24 PM

#

For tracking, I want features/datapoints ('sparsity') & dead features:

dead_features = (dict_levels.detach().mean(dim=0)==0).count_nonzero().item()
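
A small runnable sketch of the two tracking numbers. The `dead_features` line is the one from the message; the `sparsity` line is my assumption of what "features/datapoints" means (mean count of non-zero codes per datapoint), and the toy `dict_levels` tensor is made up.

```python
import torch

# Hypothetical batch of dictionary activations, shape (n_datapoints, n_features).
# Codes are post-ReLU, so they are non-negative.
dict_levels = torch.tensor([[0.0, 1.5, 0.0, 0.2],
                            [0.0, 0.0, 0.0, 3.0],
                            [0.0, 0.7, 0.0, 0.0]])

# Assumed definition: average number of active features per datapoint
sparsity = dict_levels.detach().count_nonzero(dim=1).float().mean().item()

# From the message: a feature is dead if its mean activation over the batch is
# exactly zero (valid because activations cannot be negative)
dead_features = (dict_levels.detach().mean(dim=0) == 0).count_nonzero().item()

print(sparsity, dead_features)
```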

#

Default lr has been good to me.

pallid current Jul 11, 2023, 5:27 PM

#

keen pivot For tracking, I want features/datapoints ('sparsity') & dead features: ```spars...

what length of time are you calculating dead_features over? if i remember correctly, if you look over a long enough period of time you rarely see any dead features, it just drops off over time

keen pivot Jul 11, 2023, 5:30 PM

#

pallid current what length of time are you calculating dead_features over? if i remember correc...

I think pretty long. Looks like 1000 batches

pallid current Jul 11, 2023, 5:38 PM

#

ok fair, in that case i think we'll want to measure the average activation, or total number of non-zero activations to understand whether they're dying

#

maybe even also average activation, given that it's active

keen pivot Jul 11, 2023, 5:39 PM

#

and average activation for dead features will be 0

pallid current Jul 11, 2023, 5:39 PM

#

like, we want to distinguish between 99.99% 0 activity and something a tiiiiny bit, vs usually 0 with occasional strong activation, which could be a healthy feature
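
One way to get at that distinction, sketched under the assumption that codes arrive as a stream of `(batch, n_features)` tensors: accumulate activation frequency, overall mean activation, and mean activation given that the feature is active. The function name and the toy batches are hypothetical.

```python
import torch

def feature_activation_stats(batches):
    """Accumulate per-feature statistics over an iterable of code tensors."""
    n_seen = 0
    active_count = None
    activation_sum = None
    for c in batches:
        c = c.detach()
        if active_count is None:
            active_count = torch.zeros(c.shape[1], device=c.device)
            activation_sum = torch.zeros(c.shape[1], device=c.device)
        active_count += (c > 0).sum(dim=0).float()
        activation_sum += c.sum(dim=0)
        n_seen += c.shape[0]
    frequency = active_count / n_seen                 # fraction of datapoints active
    mean_activation = activation_sum / n_seen         # includes the zeros
    # clamp avoids 0/0 for features that never fired
    mean_when_active = activation_sum / active_count.clamp(min=1)
    return frequency, mean_activation, mean_when_active

# Feature 0 is dead, feature 1 is rare but strong, feature 2 is frequent but weak
batches = [
    torch.tensor([[0.0, 0.0, 0.1], [0.0, 0.0, 0.1]]),
    torch.tensor([[0.0, 8.0, 0.1], [0.0, 0.0, 0.0]]),
]
frequency, mean_activation, mean_when_active = feature_activation_stats(batches)
print(frequency, mean_when_active)
```

A rare-but-strong feature shows up as low `frequency` with high `mean_when_active`, which a plain mean over all datapoints would blur together with a nearly dead one.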

keen pivot Jul 11, 2023, 5:39 PM

#

I'm unsure, could be very rare features

pallid current Jul 11, 2023, 5:40 PM

#

keen pivot I'm unsure, could be very rare features

yeah possibly tho i think we found that it was super rare to have high MCS at those low activation frequencies. not conclusive for sure tho cos you'd expect rare feats to be found less often

keen pivot Jul 11, 2023, 6:55 PM

#

Update: Ya, adding noise (instead of an l1) does produce sparsity & sometimes meaningful features, but not as good as l1 & importantly the logit lens just sucks w/ it (compared to l1)

keen pivot Jul 11, 2023, 6:59 PM

#

pallid current yeah possibly tho i think we found that it was super rare to have high MCS at th...

Okay, I now don't feel strongly about dead-features/average_activations.

keen pivot Jul 11, 2023, 9:09 PM

#

@bitter turtle, early results are in & tied embedding looks quite good. I believe you're the one that suggested both tied embeddings & residual, both of which have been really good, so thanks!:)

Additionally, tied embeddings may allow it to work on the MLPs if we want.

#

Of biggest note: I'm getting ~1.4x as many features w/ the same amount of data. Plus the added benefit of reading in from the same direction we're writing out (though I have two different biases because they're different shapes)

Of also big note: average features per token went from 100 (untied embedding) to ~20 (tied embedding) when optimizing over L1_alpha for high-MCS.

pallid current Jul 11, 2023, 9:28 PM

#

keen pivot Of biggest note: I'm getting ~1.4x as many features w/ the same amount of data. ...

features = MCS > 0.9?

keen pivot Jul 11, 2023, 9:35 PM

#

pallid current features = MCS > 0.9?

Yep, though maybe it should be 0.8 as the heuristic
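
For reference, a sketch of the MCS > threshold count, assuming MCS means the maximum cosine similarity between each feature in one dictionary and any feature in a second, separately trained dictionary. The function name and toy dictionaries are illustrative.

```python
import numpy as np

def max_cosine_similarity(dict_a, dict_b):
    """For each row (feature) of dict_a, the best cosine match among rows of dict_b."""
    a = dict_a / np.linalg.norm(dict_a, axis=1, keepdims=True)
    b = dict_b / np.linalg.norm(dict_b, axis=1, keepdims=True)
    return (a @ b.T).max(axis=1)

# Toy dictionaries: rows are feature directions
dict_a = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
dict_b = np.array([[2.0, 0.0], [0.1, 1.0]])

mcs = max_cosine_similarity(dict_a, dict_b)
n_features = int((mcs > 0.9).sum())   # the "features = MCS > 0.9" heuristic
mmcs = float(mcs.mean())              # mean MCS over the dictionary
print(mcs, n_features, mmcs)
```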

pallid current Jul 11, 2023, 9:35 PM

#

sweet, how many you getting on what activation dim?

keen pivot Jul 11, 2023, 9:36 PM

#

pythia-70M, so 500

#

Looks like 1k for the 2k dict

#

Probably more if I went larger

#

And the features & logit lens makes a lot of sense.

pallid current Jul 11, 2023, 10:32 PM

#

fiiiiiinally managed to replicate @keen pivot's interpretations of high MCS neurons using the openai autointerp system

#

no idea what the bug was but yeah should be able to have more confident autointerp results coming along in the next few days and can actually help with the main bulk of the work

#

i'm a bit behind but i think the first thing to do is a comparison of high MCS / low MCS? most important is to do sparse coding vs neuron basis, but i think we need to adjust for the negative biases before we can call this a fair comparison

keen pivot Jul 11, 2023, 11:17 PM

#

pallid current no idea what the bug was but yeah should be able to have more confident autointe...

I should have a better dictionary to link you to on the bucket by tomorrow.

bitter turtle Jul 11, 2023, 11:18 PM

#

pallid current i'm a bit behind but i think the first thing to do is a comparison of high MCS /...

I think the comparison is fine as long as we make a note of the concerns when we report on it; conditional on the SAE having some reconstruction loss below a threshold I don't see why we couldn't compare them directly.

#

For sure you would be less able to view the directions discovered by the SAE as wholly meaningful if it turned out the bias mattered a lot, but I'm thinking that even in that eventuality you could still get some mileage out of the representation.

pallid current Jul 11, 2023, 11:25 PM

#

keen pivot I should have a better dictionary to link you to on the bucket by tomorrow.

oh lit yeah that'd be good

pallid current Jul 11, 2023, 11:30 PM

#

bitter turtle I think the comparison is fine as long as we make a note of the concerns when we...

hmmm i think this would definitely be fair if we could show that performance degradation when replacing the activations with the reconstruction was fairly low, but i don't think that's likely any time soon, and i'm not sure what other threshold for sufficiently low recon loss we could use

#

am more excited by just comparing neuron basis with negative bias + relu applied, but will also compute the raw comparison

pallid current Jul 11, 2023, 11:52 PM

#

the features are so much richer now that it's fixed 😊

#

Feature 8, explanation='phrases and keywords related to the legal system, law and legislature.'
Feature 8, score=0.29
Feature 9, explanation=' underscore characters, especially in the context of code or programming syntax.'
Feature 9, score=0.26
Feature 10, explanation='terms related to astronomy and cosmology.'
Feature 10, score=0.41

pallid current Jul 12, 2023, 12:55 AM

#

this is without the aforementioned adjustments to the neuron basis so take with a pinch of salt

#

but:

#

though also, i'm not seeing a strong relationship btwn MCS and autointerp score

bitter turtle Jul 12, 2023, 12:57 AM

#

How many tests is that, just wondering what levels would give less noisy data

pallid current Jul 12, 2023, 1:13 AM

#

roughly 60, 60, 120. can scale up easy tho i think the difference is very clearly significant

#

neuron basis vs sparse code is like 5 sig diff in means

#

missed a few actually so new one is

keen pivot Jul 12, 2023, 1:21 AM

#

That’s awesome!

#

I also feel proud of the “logan_ae”, haha

pallid current Jul 12, 2023, 1:22 AM

#

keen pivot That’s awesome!

fun but shouldn't get excited until we apply the negative bias, i expect impact to be large

pallid current Jul 12, 2023, 2:06 AM

#

hmmmm first indication is that adding the relu(activation + bias) makes the neuron_basis interp worse, good if true but very surprising to me

#

i suppose it could cut off legit activations, making the simulation less accurate

#

still, need to do a bit of checking before i feel confident

pallid current Jul 12, 2023, 2:29 AM

#

ok i think it's working as intended!

#

graph's a mess but it's looking really good! more hyped about sparse coding than i've been in ages!

keen pivot Jul 12, 2023, 3:02 AM

#

How'd you do neuron-basis-bias?

pallid current Jul 12, 2023, 5:05 AM

#

just took random biases from the encoder and added them to the neuron output, then added a relu

#

all biases were negative so it makes some level of sense

#

i realised that the bias should be scaled by the norm of the encoder tho, might implement that tonight, otherwise tomorrow morn
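
A sketch of the unscaled version described above, assuming raw neuron activations of shape `(n_tokens, n_neurons)` and a vector of learned, all-negative encoder biases. The arrays are random stand-ins; the norm scaling mentioned in the last message is not included because its details aren't given.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the real tensors
neuron_acts = rng.normal(size=(5, 3))           # (n_tokens, n_neurons)
encoder_bias = -np.abs(rng.normal(size=10))     # learned biases, all negative

# Draw one encoder bias per neuron at random, add it, then apply a ReLU
sampled_bias = rng.choice(encoder_bias, size=neuron_acts.shape[1], replace=False)
neuron_basis_bias = np.maximum(neuron_acts + sampled_bias, 0.0)

print(neuron_basis_bias)
```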

#

wonder if there's a more principled way. in the openai autointerp paper they get the gradient of the feature wrt the interp score, but i think that would then be unfair in the other direction, as well as sounding like a pain

#

you could also target a particular sparsity of activation

bitter turtle Jul 12, 2023, 7:39 AM

#

Would love to see the ICA and PCA baselines, but that looks crazy good!

keen pivot Jul 12, 2023, 1:35 PM

#

@pallid current the autointerp develops hypotheses from max-activating examples, right? For me it's misleading because the max-activating are sometimes too specific, so I look at a uniform distribution.

#

Also, is this the MLP dictionary?

#

Do you have a list of goals or milestones for auto-interp?

pallid current Jul 12, 2023, 2:30 PM

#

bitter turtle Would love to see the ICA and PCA baselines, but that looks crazy good!

Will run this morning

pallid current Jul 12, 2023, 3:51 PM

#

keen pivot Do you have a list of goals or milestones for auto-interp?

hmm i mostly see autointerp in the near term as being something which can help give us a better signal on the quality of our learned dictionaries, so i'm keen to join back into the main effort and bring in autointerp when we need signal for what's working

#

meeting lee in a sec so will try use that to write a proper plan of things to do but i plan to:

do a bit more work to try and make sure that the comparison with e.g. neuron basis is a fair one
write a quick post showing the results
also, if the big sweep goes well, getting autointerp scores for sparse coding and a few baselines on lots of different layers would be worthy of a major writeup, possibly a paper imo, i think it would get a lot of people interested

#

will try to have the preliminary writeup soon and then talk to anthropic with that in hand

keen pivot Jul 12, 2023, 3:57 PM

#

These are the things I thought of for auto-interp. I can try implementing them in a week or two (though looks like you've got PCA/ICA covered!):

Improving prediction of input:
- Include ablated context one-token-at-a-time effect
Predicting ablated output
PCA & ICA (what do top components look like?)
"Interesting" directions (like accents, medical-speak, SE-speak) & in general categorizing features

pallid current Jul 12, 2023, 3:59 PM

#

pallid current Will run this morning

getting OOM errors on PCA 😦 not sure what's changed, might just be using more data now, will try both switching to @bitter turtle's batched version and just reducing the amount of data fed in.
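
One generic way around the OOM, sketched here and not to be confused with the batched version mentioned above (which I haven't seen): accumulate the sum and the outer-product sum batch by batch, so that only a `(d, d)` matrix is ever held in memory, then eigendecompose the covariance. The function name and toy data are assumptions.

```python
import numpy as np

def batched_pca(batches, n_components):
    """PCA from streaming batches; memory use is O(d^2), independent of dataset size."""
    n = 0
    total = None
    outer = None
    for x in batches:
        x = np.asarray(x, dtype=np.float64)
        if total is None:
            d = x.shape[1]
            total = np.zeros(d)
            outer = np.zeros((d, d))
        total += x.sum(axis=0)
        outer += x.T @ x
        n += x.shape[0]
    mean = total / n
    cov = outer / n - np.outer(mean, mean)       # population covariance
    eigvals, eigvecs = np.linalg.eigh(cov)       # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, order].T, eigvals[order]   # rows are principal directions

# Toy data whose variance is dominated by the first axis
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3)) * np.array([10.0, 1.0, 0.1])
components, variances = batched_pca(np.array_split(X, 10), n_components=2)
print(components[0], variances)
```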

bitter turtle Jul 12, 2023, 4:00 PM

#

Re: dead neurons, I'd also like to get around to doing the dead-neuron reinitialisation (and maybe low-MCS reinitialisation?) at some point

pallid current Jul 12, 2023, 6:00 PM

#

bitter turtle Re: dead neurons, I'd also like to get around to doing the dead-neuron reinitial...

i can have a look at this if you're not about to do it immediately, though i think we should look carefully at some autointerp stuff before doing it because i didn't see strong evidence that low-MCS were significantly less interpretable. logan might have more evidence on this? i want to check if there's a stronger pattern for activation frequency or magnitude

bitter turtle Jul 12, 2023, 6:05 PM

#

pallid current i can have a look at this if you're not about to do it immediately, though i thi...

I'm also kind of thinking of this as another method to test goodness of MCS as a metric

pallid current Jul 12, 2023, 6:07 PM

#

what's the measure that would tell us whether MCS is good? just whether reinitialization of low-MCS produces better dicts?

bitter turtle Jul 12, 2023, 6:07 PM

#

sure

#

definitely sketchy, I think we should also look into just accumulating tons of other possible (non-ai-supervised?) metrics

pallid current Jul 12, 2023, 6:10 PM

#

in current runs, are we still seeing recon loss and l1 loss rising through the run?

#

i feel like that should be a bigger part of our metrics than it is

bitter turtle Jul 12, 2023, 6:28 PM

#

^

pallid current Jul 12, 2023, 6:28 PM

#

suppose at large intervals we could run the perplexity check that we spoke about before, run some comps to see how well the model is able to function with different dicts

#

btw recently tried the method of applying biases from the encoder but where the bias is scaled down by the norm of the autoencoder, doesn't help the autointerp on the neuron basis at all

keen pivot Jul 12, 2023, 6:40 PM

#

pallid current i can have a look at this if you're not about to do it immediately, though i thi...

I haven't checked for the latest tied embedding, but if low-MCS feature is dead, then that matters, but not in the interesting way.

It does seem plausible that larger dicts may learn or retain different features than smaller ones. Larger ones tend to have slightly larger average features/tokens, so may overwrite different features even on the same dataset.

#

Regarding bias, the tied embedding has this for the encoder:

#

I looked at a few in the right cluster & they're typically dead (~0 non-zero activations). This is ~400/2k features. So maybe just originally init bias of encoder to uniform[-1, -3]

#

I will note, I've noticed before the one odd positive bias feature. It kind of looks like a feature regarding the input, but doesn't have a meaningful effect on the output or a meaningful logit lens.

#

I was going to look at a data-centric viewpoint: given a datapoint, how many features activate? Do those features make sense? For example, if most of them make sense, but this positive bias one always activates, then that's a clue that there's funny business going on.

bitter turtle Jul 12, 2023, 7:19 PM

#

keen pivot I looked at a few in the right cluster & they're typically dead (~0 non-zero act...

not convinced this is a good idea; want to wait until we do some runs with weight decay on the bias

keen pivot Jul 12, 2023, 7:39 PM

#

bitter turtle not convinced this is a good idea; want to wait until we do some runs with weigh...

Gotcha. Are you able to explain your intuition here?

bitter turtle Jul 12, 2023, 8:26 PM

#

well, ideally we want to find directions corresponding with useful features, and the point of the bias + relu is to act as a bit of a noise-reducer to cancel out the interference effect of other features, which shouldn't* be anything too significant

*we should check the variance of activations, but if they are even like less than 100 (or 1000?) or something (ballpark orders of magnitudes) then the bias is doing something more than basic denoising

pallid current Jul 12, 2023, 8:29 PM

#

keen pivot I looked at a few in the right cluster & they're typically dead (~0 non-zero act...

this is suuper weird to me, i wouldn't have thought it was possible to be dead without fairly big negative activation, though i spose the space is so large that it's possible to have lots of halves or large segments that are totally dead, like i remember there being some results about the clustering of embedding vectors in smallish parts of the residual stream

#

agree that initializing negative biases doesnt sound good for similar reasons to aidan. like, you find a direction, and then make the bias negative to remove the noise, but if you just start with large negative bias you're quite likely to just find nothing at all

pallid current Jul 12, 2023, 8:31 PM

#

keen pivot I looked at a few in the right cluster & they're typically dead (~0 non-zero act...

whats the diff btwn the two graphs?

keen pivot Jul 12, 2023, 8:54 PM

#

pallid current whats the diff btwn the two graphs?

y-axis. One is MCS & other is non-zero activations

keen pivot Jul 12, 2023, 8:56 PM

#

pallid current this is suuper weird to me, i wouldn't have thought it was possible to be dead w...

Wouldn't this just mean that the weights feeding into the ~0 bias tend to sum to negative (I want to say negative, but residual stream activations are negative and positive)

keen pivot Jul 12, 2023, 9:01 PM

#

bitter turtle not convinced this is a good idea; want to wait until we do some runs with weigh...

Are we also running weight decay on the weights themselves? This is referring more to the anthropic information metric here.

pallid current Jul 12, 2023, 9:08 PM

#

keen pivot Wouldn't this just mean that the weights feeding into the ~0 bias tend to sum to...

ohhh sorry yeah of course there's a huge reduction in the space post nonlin

#

only applies to resid stream

#

which are the graphs from?

pallid current Jul 12, 2023, 9:09 PM

#

keen pivot Are we also running weight decay on the weights themselves? This is referring mo...

not unless something's changed recently

keen pivot Jul 12, 2023, 9:20 PM

#

Residual stream, tied

bitter turtle Jul 12, 2023, 9:40 PM

#

keen pivot Are we also running weight decay on the weights themselves? This is referring mo...

Hmm I guess maybe we might want to, but I'm not sure how good their metric was or whether a norm on the matrix elements would improve it

#

So short term probably not is my take

pallid current Jul 12, 2023, 10:19 PM

#

i'm still interested in this on theoretical grounds because it seems like we should expect features to be composed of a small-ish number of neurons if it is using the non-lin in the way we expect. it got semi-shelved when we found that the features sparse coding was learning were weirdly less sparse than random vectors. but i understand that seems to have been an artefact of working with the nanoGPT model?

bitter turtle Jul 12, 2023, 10:57 PM

#

uh. how are you defining sparse here?

#

im just confused how they can be less sparse than random vectors, which should just be not sparse

pallid current Jul 12, 2023, 10:59 PM

#

bitter turtle uh. how are you defining sparse here?

a diversity metric on the vector, (simpson index in here https://en.wikipedia.org/wiki/Diversity_index)

#

maximally nonsparse by this metric would be every element being equal

#

comes out basically the same as entropy
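
A sketch of the metric as I read it, assuming the vector is turned into a distribution by normalising absolute values (the exact normalisation used isn't stated here). The Simpson index is the sum of squared proportions: 1 for a one-hot vector and 1/n when every element is equal.

```python
import numpy as np

def abs_distribution(v):
    """Normalise absolute values so they sum to 1 (assumed normalisation)."""
    a = np.abs(np.asarray(v, dtype=float))
    return a / a.sum()

def simpson_index(v):
    """Sum of squared proportions: 1.0 is maximally sparse, 1/n maximally dense."""
    p = abs_distribution(v)
    return float(np.sum(p ** 2))

def entropy(v):
    """Shannon entropy of the same distribution: 0 is sparse, log(n) is dense."""
    p = abs_distribution(v)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

one_hot = np.array([0.0, 0.0, 1.0, 0.0])
uniform = np.ones(4)
random_vec = np.random.default_rng(0).normal(size=1000)

print(simpson_index(one_hot), simpson_index(uniform), simpson_index(random_vec))
print(entropy(one_hot), entropy(uniform))
```

A dense Gaussian random vector lands within a small constant factor of the 1/n floor, which is why anything scoring as less sparse than random is surprising.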

bitter turtle Jul 12, 2023, 11:02 PM

#

hmm. is that the definition we want for sparse here?

pallid current Jul 12, 2023, 11:02 PM

#

maybe not exactly but i think it would give a strong signal if the features were focussing on only a few vectors

#

what dyou think's missing there?

bitter turtle Jul 12, 2023, 11:03 PM

#

some sort of centering at zero, but yeah would give out strong signals for sure

#

don't think you can read too much into the sparse coding vs random vectors thing tho, it seems that random vectors should be close to totally unsparse by a better definition

#

I mean, I literally think 'normal sparsity but close-to-zero' seems reasonable. (or maybe if some coef contributes say 1/100*n_dims of a vector's norm or something arbitrary like that)

pallid current Jul 13, 2023, 12:40 AM

#

bitter turtle don't think you can read too much into the sparse coding vs random vectors thing...

i think they are pretty much totally unsparse, so less sparse than that was shocking! either way tho i think its an outofdate result

pallid current Jul 13, 2023, 3:51 PM

#

current residual stream results. here the '''neuron basis''' seems to be good?? and matches the sparse-coded features.. but this time the top MCS features are significantly better, and both far outperform random. the green is the sparse coded features after 1 epoch which is somehow worse than random (low sample size)

keen pivot Jul 13, 2023, 3:54 PM

#

Are you able to look at specific examples? There's 3 in the neuron basis that have a score of 0.5 (maybe 6 for .5 & 5.5 in total?)

pallid current Jul 13, 2023, 3:55 PM

#

i think these results are basically positive and show that we are getting something legitimate from our dictionaries, and we need to both scale up to more layers, and refine our learning process

keen pivot Jul 13, 2023, 3:55 PM

#

Also, would -1 mean reversing its answer would give us 1? ie reverse_score = abs(10-oldscore) or something like that?

pallid current Jul 13, 2023, 3:56 PM

#

yeah like the explanation is perfectly anticorrelated with the activations
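
To make the -1 case concrete, a sketch assuming the score is a Pearson correlation between true activations and the activations simulated from the explanation, with simulated values on a 0-10 scale. The numbers are invented. Under that assumption, flipping the simulated activations (10 - value) flips the sign of the score rather than mapping it through abs(10 - oldscore).

```python
import numpy as np

def correlation_score(true_acts, simulated_acts):
    """Pearson correlation between true and simulated activations."""
    return float(np.corrcoef(true_acts, simulated_acts)[0, 1])

true_acts = np.array([0.0, 0.0, 1.0, 5.0, 0.0])

# A simulation that is exactly "backwards": high where the feature is off
simulated = 10.0 - 2.0 * true_acts
score = correlation_score(true_acts, simulated)

# Reversing the simulated activations on the 0-10 scale
reversed_score = correlation_score(true_acts, 10.0 - simulated)

print(score, reversed_score)
```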

keen pivot Jul 13, 2023, 3:56 PM

#

pallid current i think these results are basically positive and show that we are getting someth...

I liked your earlier statement of quickly being able to check if a new setting (ie bias or tied) actually helps improve.

pallid current Jul 13, 2023, 4:00 PM

#

keen pivot Are you able to look at specific examples? There's 3 in the neuron basis that ha...

yeah i can give you the ids and explanations, ids are [1, 7, 37, 66, 69, 90] and explanations are all pretty similar:
- numeric values, sequences, and lists.
- dates, particularly those written in the format of month and year
- numeric values and codes, including year dates and programming syntax.
- numerical data and product identifiers
- numerical values and sequences
- numeric values, including single digits, multi-digit numbers, and percentages.

keen pivot Jul 13, 2023, 4:08 PM

#

Something else I've noticed in the residual stream is we're picking up the weird language model stuff, like the 8bit guy? Oh here: https://timdettmers.com/2022/08/17/llm-int8-and-emergent-features/

Tim Dettmers

LLM.int8() and Emergent Features — Tim Dettmers

When I attended NAACL, I wanted to do a little test. I had two pitches for my LLM.int8() paper. One pitch is about how I use advanced quantization methods to achieve no performance degradation transformer inference at scale that makes large models more accessible. The other pitch talks about emergent outliers in transformers and how […]

#

For Pythia 1.4b, there's a HUGE activation of 1.5k. Typical activations are in the 1-80 range, w/ maybe 5 as the median max value

#

One hypothesis is we're picking up on what Tim called these outlier dimensions that models have, which coordinate in the 6Billion range & get really big. Smaller models have lots of these, but they're not coordinated, so you get several of them in the 60 activation range.

keen pivot Jul 13, 2023, 4:15 PM

#

pallid current current residual stream results. here the '''neuron basis''' seems to be good?? ...

Oh, the top_mcs ones are mostly simple ones. Of the top 40, maybe 30 are single category words (e.g. his/her/their) or simple patterns (e.g. the token right after any letter "L"-containing tokens)

#

I'm curious how it does regarding the more style-based ones. Like the SE style or Chemistry, etc

pallid current Jul 13, 2023, 5:18 PM

#

keen pivot Something else I've noticed in the residual stream is we're picking up the weird...

blog is v good. surprising though as he seems to have clear evidence that those features emerge at a level quite a way above what we'd expect out of pythia70M. he says it's a perplexity based threshold but i cant imagine py70m being capable of matching OPT6.7B

keen pivot Jul 13, 2023, 5:20 PM

#

pallid current blog is v good. surprising though as he seems to have clear evidence that those ...

A bit more subtle than that. He claims that these outlier dimensions exist in models as small as 160M, but they start aligning in the 6B region where they shoot up really huge

#

Which is exactly what I noticed: Pythia-70M has several "beginning & ending of sentence" in the range 60 activations. Pythia 1.4b has 1 of the same type that's 1500 & another that's in the 60 activation range.

#

Two things of due diligence for myself:

Check that the direction mainly comes from one dimension (which I believe Tim is claiming)
Check the amount of features of this type for both models, plus their activation range
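
For the first check, a sketch of one possible measure: the share of a feature direction's squared norm carried by its largest dimensions. The function name and the toy direction are hypothetical.

```python
import numpy as np

def top_dimension_share(direction, k=2):
    """Indices of the k largest-magnitude dimensions and their share of squared norm."""
    sq = np.asarray(direction, dtype=float) ** 2
    share = sq / sq.sum()
    top = np.argsort(share)[::-1][:k]
    return top, share[top]

# Toy feature direction dominated by two residual dimensions
direction = np.array([0.1, 5.0, 0.2, -4.0, 0.1])
top_dims, shares = top_dimension_share(direction, k=2)
print(top_dims, shares, shares.sum())
```

If a single dimension carries nearly all of the squared norm, the feature is essentially an outlier dimension; a flat profile points to a genuinely distributed direction.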

pallid current Jul 13, 2023, 5:24 PM

#

keen pivot A bit more subtle than that. He claims that these outlier dimensions exist in mo...

ahh cheers yes i should have kept reading. your 1600 values is way off the charts still tho:
"The phase shift happens around 6.7B, where 100% of layers use the same dimension for outliers. At this point, a couple of things happen rapidly:

Outliers become very large quickly. They grow from about 15 for a 6B model to about 60 for a 13B model. OPT-66B has outliers of size around 95, which indicates this growth phase is temporary."

#

can imagine these things being very sensitive to training setups tho

keen pivot Jul 13, 2023, 5:28 PM

#

Ya this first one (the 1500 one) is mainly two dimensions:

#

For comparison, here's top-MCS 100 which is the feature "the [noun]"

pallid current Jul 13, 2023, 5:30 PM

#

plotting the contribution of each dimension?

keen pivot Jul 13, 2023, 5:31 PM

#

The weight of the decoder & it's tied

#

This is the other one which is activation 50 for only first words (as opposed to beginning & ending):

pallid current Jul 13, 2023, 5:32 PM

#

keen pivot For comparison, here's top-MCS 100 which is the feature "the [noun]"

resid or mlp?

keen pivot Jul 13, 2023, 5:33 PM

#

pallid current resid or mlp?

residual

#

Though looking back at the google sheet, It (ie the feature "beginning & ending of sentences" w/ high activations) is also the top-MCS feature for layer 2,3,4 MLP

#

Additionally there's a high-positive bias one w/ similar properties:

#

Notably there are several overlapping residual dimensions (e.g. 568, 516, 468, 1326, 934, etc) w/ the first high-MCS & high-activations one.

#

Notably notably, they seem like opposite of each other when you look at activating examples.

bitter turtle Jul 13, 2023, 5:58 PM

#

Wrt getting disentangled features for VAEs: https://arxiv.org/abs/1907.04809

arXiv.org

Variational Autoencoders and Nonlinear ICA: A Unifying Framework

The framework of variational autoencoders allows us to efficiently learn deep
latent-variable models, such that the model's marginal distribution over
observed variables fits the data. Often, we're interested in going a step
further, and want to approximate the true joint distribution over observed and
latent variables, including the true prior ...

keen pivot Jul 13, 2023, 5:59 PM

#

pallid current ahh cheers yes i should have kept reading. your 1600 values is way off the chart...

Oh ya, the 1600 value part may be caused by our normalized decoder(?) Haven't thought about it much

bitter turtle Jul 13, 2023, 6:31 PM

#

keen pivot For Pythia 1.4b, there's a HUGE activation of 1.5k. Typical activations are in t...

The fuck

bitter turtle Jul 13, 2023, 6:34 PM

#

keen pivot Oh ya, the 1600 value part may be caused by our normalized decoder(?) Haven't th...

Yeah how are you measuring this

keen pivot Jul 13, 2023, 6:35 PM

#

bitter turtle Yeah how are you measuring this

Measuring what? The 1.5k activation is the top-MCS feature activation for the learned dictionary for the 1.4b model in residual stream

pallid current Jul 14, 2023, 12:00 AM

#

more graph horror but finding that the sparse coding directions strongly beat ica and pca, as well as all others, on pythia layer1 mlp

#

principal components start interpretable but fade quickly

pallid current Jul 14, 2023, 1:25 AM

#

also started separating out the top vs random scoring. results are generally good, the score goes down, but for any where the top score is >0.2, the random score is also almost always solidly positive

#

average drop of maybe 0.1

keen pivot Jul 14, 2023, 2:26 AM

#

Oh that's awesome! Glad you've been working on this!

keen pivot Jul 14, 2023, 4:00 AM

#

pallid current

Are the top PCA/ICA directions just these outlier dimensions?

#

Their predicted text is usually pretty easy too

#

Oh, would be good to rerun toy example stuff on tied embedding autoencoder.

#

I think I'm getting dead neurons by overtraining.

pallid current Jul 14, 2023, 5:49 AM

#

i'm not seeing a correlation between activation level and interp score 🤔, also not with interp and MCS

keen pivot Jul 14, 2023, 1:33 PM

#

pallid current i'm not seeing a correlation between activation level and interp score 🤔, also ...

Note this is tied residual, and my previous correlations were about untied (and MLP sometimes)

keen pivot Jul 14, 2023, 2:07 PM

#

Things are looking good on the MCS & MMCS front. For residual, we're getting MMCS of 0.8 for 4k (d_model=500), 50% above 0.9, and the histograms look really good.

#

I'm getting the same results for layers 1-4, will additionally do layers 5 (& 0 for the heck of it). Usually I'd need a lot more data to get almost these good results for untied embedding, and the features/token here is ~20 where previously it was ~100

#

20 also seems right because you have features like "a-letter words" & "after a-letter words" along w/ a bunch of low-activation "noise" which also tends to make sense on its own, but also might point towards a problem.

keen pivot Jul 14, 2023, 2:34 PM

#

@bronze wraith, would you be able to think through the math of a tied embedding & what would work best to reconstruct the original data?

As an example, suppose the weights are [1,2] & the latent feature is 10. Then the reconstructed feature is [10,20].

But if the input was [10,20] then the latent feature would be 50. I guess you could have a negative bias of -40 for the encoder for it to work on this example, but that's as far as I got & thought you'd be better at this kind of thing.

#

(additionally, we're normalizing the decoder weights because of the l1 penalty for typical dictionary learning, but that now means we're also normalizing the encoder weights)
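The [1,2] example above can be worked through numerically. This is a minimal single-feature sketch, assuming the tied encoder is ReLU(w·x + bias) and the decoder is c·w with no decoder bias; `tied_autoencode` is a hypothetical helper, not the model class from this thread. It shows that a bias of -40 only fixes the one input it was tuned for, whereas a unit-norm w reconstructs any positive multiple of the feature direction without a bias.

```python
import math

def relu(z):
    return max(z, 0.0)

def tied_autoencode(x, w, enc_bias=0.0):
    """Single-feature tied autoencoder: encode with w, decode with the same w."""
    c = relu(sum(wi * xi for wi, xi in zip(w, x)) + enc_bias)
    x_hat = [c * wi for wi in w]
    return c, x_hat

w = [1.0, 2.0]
x = [10.0, 20.0]

# Unnormalized, no bias: the latent is 50 and the reconstruction overshoots.
c_raw, x_hat_raw = tied_autoencode(x, w)

# A bias of -40 repairs this particular input...
c_biased, x_hat_biased = tied_autoencode(x, w, enc_bias=-40.0)
# ...but not a rescaled copy of it.
_, x_hat_biased_2x = tied_autoencode([20.0, 40.0], w, enc_bias=-40.0)

# With unit-norm weights the latent equals the input's length along w,
# so inputs along the feature direction are reconstructed with no bias.
norm = math.sqrt(sum(wi * wi for wi in w))
w_unit = [wi / norm for wi in w]
c_unit, x_hat_unit = tied_autoencode(x, w_unit)
_, x_hat_unit_2x = tied_autoencode([20.0, 40.0], w_unit)
```

This only covers a single feature with no interference from other dictionary elements, so it says nothing about what the bias should do in the overcomplete case.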

bronze wraith Jul 14, 2023, 2:37 PM

#

keen pivot @bronze wraith , would you be able to think through the math of a tied em...

I'm rather behind on this thread, can you explain what you mean by tied embeddings (or point me to an explanation)?

#

Also, I'll give it a shot, but I'm about to leave for a weekend trip, so I might not have much to say until next week!

keen pivot Jul 14, 2023, 2:37 PM

#

The encoder and decoder are the same linear transformation, but transposed

keen pivot Jul 14, 2023, 2:38 PM

#

bronze wraith Also, I'll give it a shot, but I'm about to leave for a weekend trip, so I might...

Oh, no problem. Enjoy your trip!:)

#

Here are the relevant bits of the model definition.

        self.decoder = nn.Linear(n_dict_components, activation_size, bias=True)
        # Create a bias layer
        self.encoder_bias= nn.Parameter(torch.zeros(n_dict_components))

        # Encoder is a Sequential with the ReLU activation
        # No need to define a Linear layer for the encoder as its weights are tied with the decoder
        self.encoder = nn.Sequential(nn.ReLU())

    def forward(self, x):
        c = self.encoder(x @ self.decoder.weight + self.encoder_bias)
        # Apply unit norm constraint to the decoder weights
        self.decoder.weight.data = nn.functional.normalize(self.decoder.weight.data, dim=0)

        # Decoding step as before
        x_hat = self.decoder(c)
        return x_hat, c

bronze wraith Jul 14, 2023, 2:41 PM

#

One thing I think I can say about tied embeddings: if your encoding matrix is M=[I_n, -I_n], I think you get a perfect reconstruction (i.e. M^T ReLU(Mx)=x for all input vectors x).

#

This works because Mx is [x, -x], whose negative terms get zero'd out by the ReLU, and then when it is multiplied by M^T you get back the original.

#

This is a solution to "given that our embeddings will be tied, what dictionary features could we learn to get a good reconstruction", but doesn't account for a) the L1 penalty, or b) the noisiness of the training process. And this solution isn't unique: you can get a perfect reconstruction with tied embeddings M=[U, -U] for any unitary matrix U (https://en.wikipedia.org/wiki/Unitary_matrix).
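A quick numerical check of the M=[U, -U] claim, as a sketch in plain Python. It assumes real-valued inputs, so "unitary" means orthogonal, and it stacks U on top of -U as rows of M. The helper names are made up for illustration. The identity behind it is ReLU(z) - ReLU(-z) = z, applied to z = Ux.

```python
import math

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def transpose(m):
    return [list(col) for col in zip(*m)]

def relu_vec(v):
    return [max(a, 0.0) for a in v]

def tied_reconstruct(u, x):
    """Compute M^T ReLU(M x) where M stacks U on top of -U."""
    m = u + [[-a for a in row] for row in u]
    return matvec(transpose(m), relu_vec(matvec(m, x)))

identity = [[1.0, 0.0], [0.0, 1.0]]
theta = 0.7  # arbitrary rotation angle
rotation = [[math.cos(theta), -math.sin(theta)],
            [math.sin(theta), math.cos(theta)]]

x = [3.0, -1.5]
recon_identity = tied_reconstruct(identity, x)
recon_rotation = tied_reconstruct(rotation, x)
```

Both reconstructions recover x up to floating-point error, which is why the L1 term is needed to pick out one rotation among the many that reconstruct perfectly.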

#

Those are some examples I thought through before, but Logan I'm sure you had some other questions in mind too. What other angles do you want to think through this from?

keen pivot Jul 14, 2023, 2:56 PM

#

bronze wraith Those are some examples I thought through before, but Logan I'm sure you had som...

Oh mostly if there's a better architecture to use & to gain clearer thinking about this in general. Like "we for sure need biases" or "normalizing the encoder is strictly worse than ..."

bronze wraith Jul 14, 2023, 3:21 PM

#

keen pivot Oh mostly if there's a better architecture to use & to gain clearer thinking abo...

I'll think about this over the weekend, and let you know if I come up with anything!

keen pivot Jul 14, 2023, 3:21 PM

#

bronze wraith I'll think about this over the weekend, and let you know if I come up with anyth...

Great, thanks!:)

#

Also, no problem if you just enjoy your trip!

bitter turtle Jul 14, 2023, 5:02 PM

#

bronze wraith This is a solution to "given that our embeddings will be tied, what dictionary f...

@keen pivot generally we should note that the L1 penalty basically solves this rotation-invariance if the underlying latents are sparse (rotation doesn't preserve L1 norms, and L1 norms are minimised when the rotation produces as sparse data as possible). Not certain how interference affects this when we have an overcomplete basis, but for e.g. binary features with some constant maximum interference between distinct features you get minimum necessary bias for total denoising when the rotation is aligned to the underlying overcomplete basis (this is basically the reason that I think L2 norm on the bias is a good idea; also in this case - perfect denoising - you get perfect reconstruction with the tied weights)

Of course, real life is More Complicated than this, but I guess this kind of explains the intuition for the guess of 'hmm tying weights seems like a goodish idea'
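The point that rotation preserves L2 norms but not L1 norms can be seen in two dimensions. A minimal sketch with a made-up one-hot "sparse" code:

```python
import math

def rotate(v, theta):
    """Rotate a 2D vector counter-clockwise by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [c * v[0] - s * v[1], s * v[0] + c * v[1]]

def l1(v):
    return sum(abs(a) for a in v)

def l2(v):
    return math.sqrt(sum(a * a for a in v))

sparse_code = [1.0, 0.0]                       # one active feature
rotated_code = rotate(sparse_code, math.pi / 4)  # same point, rotated basis

l1_sparse, l1_rotated = l1(sparse_code), l1(rotated_code)
l2_sparse, l2_rotated = l2(sparse_code), l2(rotated_code)
```

The L2 norm stays at 1 while the L1 norm grows from 1 to sqrt(2), so an L1 penalty prefers the basis in which the code is one-hot.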

pallid current Jul 14, 2023, 8:12 PM

#

bitter turtle @keen pivot generally we should note that the L1 penalty basically sol...

really like and agree on the first bit but why l2 norm on the bias? is the idea that we should try to remove only as much of the interference as is necessary, but no more? if so agree but then i think the algo should already do that? (though i suppose empirically seems the biases are just creeping more n more negative so yeah maybe necessary)

#

also is there a theoretical reason why in the tied autoencoder we have a bias on the decoder? i don't have strong opinions either way but i remember @worldly hinge being quite anti it, cant remember why

keen pivot Jul 14, 2023, 8:38 PM

#

pallid current also is there a theoretical reason why we in the tied autoencoder we have a bias...

The only reason is to allow the decoder to still reconstruct statistics of the data w/o dedicating weights to it. I haven't tried w/o it, but that's pretty easy to try

pallid current Jul 14, 2023, 9:35 PM

#

i promise i'll fix these graphs soon but......

#

separating out top and random scoring, sparse coding outperforms all baselines for both top and random on the residual stream!!

#

bitter turtle Jul 14, 2023, 9:41 PM

#

pallid current really like and agree on the first bit but why l2 norm on the bias? is the idea ...

Partially but also that we can obtain the minimum necessary bias to remove all noise by aligning the dictionary to the latents, so the L2 also encourages disentanglement. I think. Hopefully.

bitter turtle Jul 14, 2023, 9:42 PM

#

pallid current also is there a theoretical reason why we in the tied autoencoder we have a bias...

Would be interested to know. I think we should be adding the bias back on , surprised he is anti this. I am blindly guessing tho

pallid current Jul 14, 2023, 9:42 PM

#

yeah adding the bias back on makes sense to me too

#

@bitter turtle what's the status of 'the big run'? are we ready to try out multiple layers * multiple dictionary learning approaches

bitter turtle Jul 14, 2023, 9:52 PM

#

Can go whenever code works I think

#

Should currently be set up to save every chunk

#

@pallid current currently set up to just test different parameters (i.e. do a big grid search over L1, dict size, L2 reg)

#

Could set up to test tied weights Vs not, etc, wasn't sure what would be useful ig

pallid current Jul 14, 2023, 10:03 PM

#

splitting out only top scores also makes the value of sparse coding in MLP more clear:

#

bitter turtle Jul 14, 2023, 10:05 PM

#

Sorry don't understand the graph: top scores by what metric? Which label corresponds to those scores?

pallid current Jul 14, 2023, 10:11 PM

#

right sorry i never explained this at all. so what the autointerp does is to take 5 out of the 20 fragments of 64 tokens which have the highest average feature activation, from a pool of 50000 fragments. these are the 'top' fragments. it uses those to generate a hypothesis for what the feature 'is'. then, it takes another 5 random fragments. it uses the explanation to generate a guess for what the activations will be across both the top and random fragments. it then scores the explanation based on the correlation between predicted and actual activations.

#

so the score that i've been reporting previously is the correlation across those 10 fragments, called 'top and random' scoring.

#

but we found that this was a bit misleading for some of the residual stream neurons, because the explanation was able to distinguish clearly between the top fragments and the random fragments at the fragment level - ie high in the top fragments, low in the randoms, but it couldn't predict any of the variation within the fragments

#

so instead i'm now showing scores for correlation within the top fragments and within the random fragments separately

#

and on both of these measures, the sparse-coded features come out very clearly ahead, for both residual stream and mlp
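A toy illustration of why 'top and random' scoring can flatter an explanation. This sketch assumes the score is a plain Pearson correlation between predicted and actual per-token activations. All numbers are invented, and each list stands in for the flattened tokens of the top or random fragments.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Invented activations: the explanation predicts "high" everywhere in the
# top fragments and "low" in the random ones, but misses per-token detail.
top_actual = [8, 2, 9, 3, 7, 1]
top_predicted = [6, 6, 5, 5, 6, 6]
random_actual = [0, 1, 0, 0, 1, 0]
random_predicted = [0, 0, 1, 0, 0, 1]

top_and_random_score = pearson(top_actual + random_actual,
                               top_predicted + random_predicted)
top_only_score = pearson(top_actual, top_predicted)
random_only_score = pearson(random_actual, random_predicted)
```

The pooled score is strongly positive because it mostly measures the gap between the two groups, while both within-group scores are negative here, which is the mismatch the separate top and random scores are meant to expose.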

pallid current Jul 14, 2023, 10:23 PM

#

bitter turtle Can go whenever code works I think

ok, but is the code working? do you want to push to main and i can help debug if it's needed

bitter turtle Jul 14, 2023, 10:24 PM

#

yeah code works afaict; trained a few dicts for a very small amount of time on a small amount of data and loss did expected things, and it was faster I think

#

Can push to main if you want

#

On mobile ATM tho

pallid current Jul 14, 2023, 10:26 PM

#

bitter turtle On mobile ATM tho

cool no worries will have a look from your branch

pallid current Jul 14, 2023, 10:27 PM

#

bitter turtle Could set up to test tied weights Vs not, etc, wasn't sure what would be useful ...

and yeah i'm particularly keen to test different versions of the architecture, like i think they're going to be the most interesting kind of results, given that we're definitely seeing some level of signal

bitter turtle Jul 14, 2023, 10:27 PM

#

pallid current but we found that this was a bit misleading for some of the residual stream neur...

Wdym by this ('distinguish' particularly)?

bitter turtle Jul 14, 2023, 10:28 PM

#

pallid current and yeah i'm particularly keen to test different versions of the architecture, l...

Cool, I can start a grid search for tied weights and not and bias reg and not tomorrow?

#

sorry not search

#

big grid training thingy

pallid current Jul 14, 2023, 10:30 PM

#

oh right gotcha

bitter turtle Jul 14, 2023, 10:30 PM

#

yeah mb badly worded

#

'I can try a bunch of different hyperparams for tied and not tomorrow'

pallid current Jul 14, 2023, 10:31 PM

#

yeah that sounds great

#

what was the conclusion wrt reconstruction ICA in the end?

bitter turtle Jul 14, 2023, 10:34 PM

#

no idea haven't tried it never got round to it

#

can also try that

#

Not particularly expecting it to be much different from tied weights

pallid current Jul 14, 2023, 10:35 PM

#

yeah is there actually any difference (except maybe the smooth_l1 loss)?

#

no bias

#

no bias would be kinda interesting from an autointerp point of view (though could just remove the biases manually) just because it means that the found directions are on a totally even footing with the baselines

bitter turtle Jul 14, 2023, 10:39 PM

#

yeah don't think it's particularly interesting. More interested generally looking at explicitly nonlinear things, or better approaches at sparse coding (like FISTA or OMP or something equivalent that is nicely parallelised) and still using linear dictionaries

pallid current Jul 14, 2023, 10:39 PM

#

bitter turtle yeah don't think it's particularly interesting. More interested generally lookin...

yeah agreed

#

have you looked at them at all?

bitter turtle Jul 14, 2023, 10:40 PM

#

slightly but just to the point of 'aaargh this is a nightmare to get to run fast'

pallid current Jul 14, 2023, 10:40 PM

#

i think they're good things to do at some point but unless they're super easy i'm leaning towards just running a really good sweep + auto-interp + additional analysis of the resulting data and aiming to publish based on that

bitter turtle Jul 14, 2023, 10:41 PM

#

Yeah that sounds good I agree with that plan

pallid current Jul 14, 2023, 10:41 PM

#

is there anything i can help with on the infra side today?

bitter turtle Jul 14, 2023, 11:19 PM

#

Could implement tied weight stuff, better logging, saving to long term storage, that kind of thing? Also forgot but I am hosting a friend's birthday tomorrow I probably won't be able to work on it until sun

#

It's a bit cursed, sorry about that, but it was the least hacky way I could think to implement proper ensembling

pallid current Jul 14, 2023, 11:23 PM

#

sure its better than whatever i'd have hacked up. tho i havent looked yet 😅

bitter turtle Jul 14, 2023, 11:25 PM

#

yeah basically short explanation is

torch optimisers are not vectorisable
defaulted to not using autograd at all and instead used torchopt for stateless + vectorisable optimisers

#

i.e. basically Jax in pytorch
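The "stateless" idea can be sketched without any library: the optimiser is a pure function from (params, state, grads) to (new params, new state), so the same step can be mapped over every member of an ensemble. This is a plain-Python illustration with a made-up quadratic loss. It does not use or mirror the torchopt API, and with tensors the per-member loop would be replaced by a batched operation.

```python
def sgd_momentum_step(params, velocity, grads, lr=0.1, momentum=0.5):
    """Pure update: returns new values instead of mutating optimiser state."""
    new_velocity = [momentum * v + g for v, g in zip(velocity, grads)]
    new_params = [p - lr * v for p, v in zip(params, new_velocity)]
    return new_params, new_velocity

def quadratic_grad(params, target):
    """Gradient of sum((p - t)^2), standing in for a real loss gradient."""
    return [2.0 * (p - t) for p, t in zip(params, target)]

# Hypothetical ensemble: three members with different targets.
targets = [[1.0, -2.0], [0.5, 0.5], [-3.0, 4.0]]
ensemble_params = [[0.0, 0.0] for _ in targets]
ensemble_velocity = [[0.0, 0.0] for _ in targets]

for _ in range(100):
    grads = [quadratic_grad(p, t) for p, t in zip(ensemble_params, targets)]
    stepped = [sgd_momentum_step(p, v, g)
               for p, v, g in zip(ensemble_params, ensemble_velocity, grads)]
    ensemble_params = [s[0] for s in stepped]
    ensemble_velocity = [s[1] for s in stepped]
```

Because the step function holds no hidden state, nothing stops all members from being stacked along a leading dimension and updated in one call.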

pallid current Jul 17, 2023, 1:42 AM

#

posted my results using autointerpretation on LW here: https://www.lesswrong.com/posts/ursraZGcpfMjCXtnn/autointerp-finds-sparse-coding-beats-alternatives

AutoInterpretation Finds Sparse Coding Beats Alternatives — LessWro...

Produced as part of the SERI ML Alignment Theory Scholars Program - Summer 2023 Cohort …

#

time to scale up ! 🦾

coarse flint Jul 17, 2023, 10:48 AM

#

Hey I am fairly new to interp stuff so sorry for asking but could this be dumbed down a bit: 'We give a feature an interpretability score by first generating a natural language explanation for the feature, which is expected to explain how strongly a feature will be active in a certain context, for example ‘the feature activates on legal terminology’. Then, we give this explanation to an LLM and ask it to predict the feature for hundreds of different contexts, so if the tokens are [‘the’ ‘lawyer’ ‘went’ ‘to’ ‘the’ ‘court’] the predicted activations might be [0, 10, 0, 0, 8]. The score is defined as the correlation between the true and predicted activations.'

#

I guess the main confusion I have is how you end up measuring the true activations of a feature and the predicted activations by the LLM based. Like what do these look like and how can you compare them? Maybe you've explained this already in the LW post and I've missed it.

keen pivot Jul 17, 2023, 3:13 PM

#

pallid current time to scale up ! 🦾

Ya, I'm trying to transition entirely to baukit for both training and interventions, which would be worth it for scaling to 66B parameters.

I can also handle the perplexity check & maybe work w/ someone to do the activation adding/subtracting part? There's team shard folks & nina (who also does SERI, maybe also turntrout's mentee?), who can work on activation engineering using our found directions. I'm reaching out to them now.

pallid current Jul 17, 2023, 3:14 PM

#

oh, yeah i know nina, i liked the activation addition stuff she was doing

#

not, like, well, but she's probably just down the hall lol

keen pivot Jul 17, 2023, 3:16 PM

#

coarse flint I guess the main confusion I have is how you end up measuring the true activatio...

I would look at the previous posts for dictionary learning & openAI's blog post on it to more fully understand it. It took me a few weeks to actually grok this project when I first got into it.

pallid current Jul 17, 2023, 3:39 PM

#

keen pivot I would look at the previous posts for dictionary learning & openAI's blog post ...

yeah agree with logan that the best thing to do is to read the resources i linked in the sparse coding summary, i think people have found robert huben's explainer (https://www.lesswrong.com/posts/a4oPE4xJqkYSz6jMS/explaining-taking-features-out-of-superposition-with-sparse) the most helpful. the basic picture is that you learn a simple autoencoder where the activations in latent space of the autoencoder are the feature activation levels

Explaining "Taking features out of superposition with sparse autoen...

[Thanks to Logan Riggs and Hoagy for their help writing this post.] …

bronze wraith Jul 17, 2023, 3:44 PM

#

bitter turtle @keen pivot generally we should note that the L1 penalty basically sol...

Is it true that the EV of the L1 norms is minimized when the learned features are aligned with the real features? I know it seems to work empirically, but in the one example I worked through it doesn't pan out theoretically. The example: the real features are the standard basis vectors in R^2, so your data is sampled uniformly from the unit square. Take two choices of learned dictionaries: the canonical 2-element dictionary {(0,1), (1,0)}, and a 3-element dictionary which is at 45 degree angles to the canonical one {(1,1), (-1,1), (1,-1)}. Both of these can learn a perfect reconstruction of the data, and when you compute it, the 3-element dictionary actually has a lower L1 penalty term (by ~8%). I'm legitimately confused about why a rotated basis is optimal here, but the experiments seem to find the canonical basis. It might be some combination of 1) learning canonical features requires fewer features, 2) even if rotated beats canonical when constrained by perfect reconstruction, if you trade off with reconstruction error canonical is better, 3) it's sensitive to sampling space, and something in the many correlated dimensions/activations matters, 4) canonical features are easier to learn for some reason, and the training gets stuck in this configuration which is "suboptimal" from a loss function perspective but optimal for our actual goals

bronze wraith Jul 17, 2023, 3:57 PM

#

keen pivot @bronze wraith , would you be able to think through the math of a tied em...

Had some thoughts about tied embeddings: if there is no bias term, they are piecewise-positive transformations. Meaning that when you partition the domain by the hyperplanes where the ReLU terms switch on/off, on each subset of the domain they are given by x-> (M^T)Mx, and the M matrix in each section will be the tied embedding matrix with rows zeroed out depending on which ReLU terms are active. Positive transformations are nice because they have an orthonormal basis with respect to which they are a diagonal scaling. However, I think the orthonormal bases of different pieces don't have to agree, nor do the bases have to align with the planes which separate the pieces. Finally, you can roll the bias term into the learned tied embedding in the usual way (by replacing the vector x=(x_1, ..., x_n) with (x_1, ..., x_n, 1)), and when you tie that into the matrix, you need to not score the L1 activation or the reconstruction loss of the last component, but otherwise you can store everything in a big tied matrix (sorry if thats unclear).
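A small numerical check of the piecewise picture, as a sketch in plain Python. It takes "positive" to mean symmetric positive semidefinite, uses a made-up 3x2 matrix M, and the helper names are hypothetical. It confirms that on the region containing x the tied map equals (M_S^T)(M_S)x, with M_S being M with inactive rows zeroed, and that the bias can be folded in by appending a constant 1 to the input.

```python
def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

m = [[1.0, 0.0], [0.6, 0.8], [-1.0, 1.0]]  # made-up tied embedding
x = [2.0, -1.0]

# Tied map without bias: M^T ReLU(M x).
pre = matvec(m, x)
recon = matvec(transpose(m), [max(p, 0.0) for p in pre])

# Same thing as a linear map on this region: zero the inactive rows.
m_active = [row if p > 0 else [0.0] * len(row) for row, p in zip(m, pre)]
a = matmul(transpose(m_active), m_active)
recon_linear = matvec(a, x)

def quad_form(mat, v):
    """v^T A v, which equals |M_S v|^2 and so is never negative."""
    return sum(vi * wi for vi, wi in zip(v, matvec(mat, v)))

# Folding the bias in: append b as an extra column and 1 to the input.
b = [0.5, -1.0, 0.0]
m_aug = [row + [bi] for row, bi in zip(m, b)]
pre_aug = matvec(m_aug, x + [1.0])
```

The augmented pre-activations equal Mx + b, matching the usual trick described above. The extra input component is the one that should be left out of the reconstruction loss.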

bitter turtle Jul 17, 2023, 4:19 PM

#

bronze wraith Is it true that the EV of the L1 norms are minimized when the learned are aligne...

How does this look under spike and slab distributed basis vectors? Also, I guess what we are trying to find with sparse coding is the most sparse factorisation of the activation distribution, under the assumption that that is more interpretable. Interesting that the algo learns (0, 1) (1, 0)

pallid current Jul 17, 2023, 4:19 PM

#

bronze wraith Is it true that the EV of the L1 norms are minimized when the learned are aligne...

how does this work? this seems to me to be like the claim that you can do a shorter trip by taking 2 sides of a triangle than one

#

i think this might be because you need to rescale (1,1), (1,-1) and (-1, 1) to have unit norm

bronze wraith Jul 17, 2023, 5:06 PM

#

pallid current how does this work? this seems to me to be like the claim that you can do a shor...

Every representation has an advantage in encoding things closer to its basis vectors. For instance, with the 3-element dictionary, you have a lower L1 cost to represent (1,1), since you can take the (1,1) vector directly there (after normalizing that vector, you end up with a L1 cost of sqrt(2), in contrast to an L1 cost of 2 if you go along the canonical basis vectors). The canonical dictionary elements are more cost-effective near the axes, whereas the 3-element dictionary is more cost-effective near the line y=x. And given this sampling space (uniform across the unit square), the 3-element dictionary just barely squeaks it out.

bronze wraith Jul 17, 2023, 5:14 PM

#

pallid current i think this might be because you need to rescale (1,1), (1,-1) and (-1, 1) to h...

I was already including that in the calculation (I gave the unnormalized vectors for ease of writing, but I should have said that). To represent (x,y) with the standard vectors, your L1 cost is x+y, but to represent it with the 3-element dictionary the L1 cost is max(x,y)*sqrt(2). Here's a spreadsheet showing the 3-element dictionary having lower average L1 cost (sampling points from the unit square): https://docs.google.com/spreadsheets/d/1rsEbKy_16qwOGguw0Vbxf60Nogkmqmyw6961w4ytbfE/edit#gid=0

Google Docs

L1 cost Sampling

Uniform Frequency

Datapoint #,X,Y,L1 cost to reconstruct (canonical features),L1 cost to reconstruct (rotated features),Relative Frequency,Average L1 cost (canonical),Average L1 cost (rotated)
0,0,0,0,0,1,1,0.9642365198
1,0.1,0,0.1,0.1414213562,1
2,0.2,0,0.2,0.2828427125,1
3,0.3,0,0.3,0.42426406...
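The two cost formulas above (x+y for the canonical dictionary, max(x,y)*sqrt(2) for the unit-norm 3-element one) can be averaged directly. This sketch assumes the spreadsheet samples an 11x11 grid over the unit square in steps of 0.1. That is an inference from the preview rows, and the function names are made up.

```python
import math

def l1_canonical(x, y):
    """L1 cost of coding (x, y) with {(1,0), (0,1)}."""
    return x + y

def l1_rotated(x, y):
    """L1 cost with the unit-norm versions of (1,1), (-1,1), (1,-1)."""
    return max(x, y) * math.sqrt(2)

def average_costs(steps):
    """Average both costs over a (steps+1) x (steps+1) grid on [0,1]^2."""
    points = [(i / steps, j / steps)
              for i in range(steps + 1) for j in range(steps + 1)]
    canonical = sum(l1_canonical(x, y) for x, y in points) / len(points)
    rotated = sum(l1_rotated(x, y) for x, y in points) / len(points)
    return canonical, rotated

coarse_canonical, coarse_rotated = average_costs(10)   # assumed spreadsheet grid
fine_canonical, fine_rotated = average_costs(200)      # closer to uniform sampling

# Closed form for uniform sampling: E[max(x, y)] = 2/3 on the unit square.
continuous_rotated = math.sqrt(2) * 2 / 3
```

On the coarse grid the rotated average comes out at about 0.964 against 1 for the canonical dictionary. In the continuous limit it is about 0.943, so the size of the gap depends on how the square is sampled.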

bronze wraith Jul 17, 2023, 5:17 PM

#

bitter turtle How does this look under spike and slab distributed basis vectors? Also, I guess...

What do you mean by spike and slab distributed basis vectors?

bitter turtle Jul 17, 2023, 5:25 PM

#

like p probability of being 0 otherwise uniform

#

I've set off a couple runs testing all combinations of the following parameters:

tied vs not tied
l1 coef \in [0.0031622776601683794, 0.01, 0.03162277660168379, 0.1]
bias l2 decay \in [0.0, 0.05, 0.1]
dict ratio \in [2, 4, 8]
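For scale, the sweep above is 2 x 4 x 3 x 3 = 72 configurations, and the l1 values are half-decade steps from 10^-2.5 to 10^-1. A minimal sketch of enumerating it follows. The key names are hypothetical and this is not the actual sweep script.

```python
import itertools
import math

tied_options = [True, False]
l1_coefs = [10 ** (e / 2) for e in range(-5, -1)]  # 10^-2.5, 10^-2, 10^-1.5, 10^-1
bias_l2_decays = [0.0, 0.05, 0.1]
dict_ratios = [2, 4, 8]

configs = [
    {"tied": tied, "l1_coef": l1, "bias_l2_decay": decay, "dict_ratio": ratio}
    for tied, l1, decay, ratio in itertools.product(
        tied_options, l1_coefs, bias_l2_decays, dict_ratios)
]
```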

#

should be good in a couple hours or so (guessing?)?

#

https://wandb.ai/sparse_coding/sparse%20coding/runs/ybxhr7hf <- wandb for the run, a bit jank because im multiprocessing it improperly and not syncing, basically the indexes are ~meaningless

W&B

sparse_coding

Weights & Biases, developer tools for machine learning

pallid current Jul 17, 2023, 5:48 PM

#

bronze wraith I was already including that in the calculation (I gave the unnormalized vectors...

ah gotcha sorry. with the way you're generating the data, there's nothing particularly special or sparse about the basis dimensions (0, 1) or (1, 0) though so it makes sense to me that you wouldnt expect them to minimize l1 loss. if that were true we'd just learn the identity function at every point

#

if there was a high likelihood of only having X or Y active (which would happen if they were sparse) then you'd recover (0,1) (1,0) i think
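This can be made concrete with the spike-and-slab setup mentioned earlier. The sketch assumes each coordinate is independently 0 with probability p and otherwise uniform on [0,1], and uses the same two cost formulas as before (x+y for canonical, max(x,y)*sqrt(2) for the rotated 3-element dictionary). The function names are made up.

```python
import math

def expected_l1_canonical(p):
    """E[x + y]: each coordinate is active w.p. (1 - p) with mean 0.5."""
    return 2 * (1 - p) * 0.5

def expected_l1_rotated(p):
    """sqrt(2) * E[max(x, y)] under the same distribution."""
    both_active = (1 - p) ** 2 * (2 / 3)     # E[max of two uniforms] = 2/3
    one_active = 2 * p * (1 - p) * 0.5       # max is just the active one
    return math.sqrt(2) * (both_active + one_active)

# Setting the two expectations equal gives sqrt(2) * (2 + p) / 3 = 1.
crossover_p = 3 / math.sqrt(2) - 2

dense = (expected_l1_canonical(0.0), expected_l1_rotated(0.0))
sparse = (expected_l1_canonical(0.5), expected_l1_rotated(0.5))
```

Under these assumptions the rotated dictionary only wins when features are almost always active. Once each feature is off more than roughly 12% of the time, the canonical basis has the lower expected L1 cost.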

bronze wraith Jul 17, 2023, 5:53 PM

#

pallid current ah gotcha sorry. with the way you're generating the data, there's nothing partic...

Ah, sure. I added a third tab to that sheet where the frequency is increased along the axes, and the canonical basis performs better there. I think that's what aidan meant as well.

keen pivot Jul 17, 2023, 6:04 PM

#

bitter turtle I've set off a couple runs testing all combinations of the following parameters:...

Residual stream?

pallid current Jul 17, 2023, 6:11 PM

#

bitter turtle I've set off a couple runs testing all combinations of the following parameters:...

nice this sounds super good, let me know when and where and i'll look at auto interp on some

#

if it's only a couple of hours, can you run the same sweep across all layers and mlp vs residual?

#

also i worry those l1 coefs might be a bit high, at least without any reinitialization

keen pivot Jul 17, 2023, 6:28 PM

#

I would nail down the l1 first, which is typically the same across layers (though a unique l1 for mlp and another for residual)

#

Oh, I think the L1 is different for tied & untied. I agree that these l1 values are ~~too ~~a bit high.

#

For
MLP tied: 6e-4 (1e-4 is identity)
Residual tied: 3e-3 (which Aidan is indeed checking, though I'm unsure how the bias will interact)

bitter turtle Jul 17, 2023, 7:31 PM

#

keen pivot Residual stream?

MLP

#

ah, ok!

#

I'll let this run conclude then do one looking for L1 values

#

also my time guess was off by a factor of 2, probably can rewrite for more speed somewhere

#

I think I'm bottlenecking on one GPU looking into it

keen pivot Jul 17, 2023, 7:44 PM

#

It looks pretty rad so far. Thanks for working on it!

pallid current Jul 17, 2023, 7:45 PM

#

bitter turtle I think I'm bottlenecking on one GPU looking into it

as in it's not parallelizing across the 8 at all atm? or just it's over using 1

bitter turtle Jul 17, 2023, 7:52 PM

#

it's overusing 1 and it has to wait periodically to sync kinda

pallid current Jul 17, 2023, 11:54 PM

#

doing some experiments to see how ability to use superposition varies with residual dimension

pallid current Jul 18, 2023, 2:21 AM

#

lol i was gonna follow that message up with a graph but then i realised it had a big flaw 😅, hopefully will have something tomorrow

coarse flint Jul 18, 2023, 12:41 PM

#

pallid current yeah agree with logan that the best thing to do is to read the resources i linke...

Lovely! Thanks

keen pivot Jul 18, 2023, 3:12 PM

#

pallid current doing some experiments to see how ability to use superposition varies with resid...

I guess I'll wait until you have results, but interested in elaborations!

keen pivot Jul 18, 2023, 3:13 PM

#

bitter turtle https://wandb.ai/sparse_coding/sparse%20coding/runs/ybxhr7hf <- wandb for the ru...

How hard would it be to integrate the MCS plots? Or at least, I'm not seeing them in wandb

pallid current Jul 18, 2023, 4:25 PM

#

bitter turtle https://wandb.ai/sparse_coding/sparse%20coding/runs/ybxhr7hf <- wandb for the ru...

l1 and reconstruction loss seem to be monotonically decreasing which is better than i was often seeing before

#

bit odd that we're seeing periodic spikes, i guess at the beginning of a chunk..

pallid current Jul 18, 2023, 8:21 PM

#

@bitter turtle is there any way of taking the saved dictionaries from your run yesterday back into the original class?

keen pivot Jul 18, 2023, 9:06 PM

#

Getting some pretty bad perplexities for replacing the model w/ the dictionary reconstruction for pythia-70m-tied layer 3:

Dict Size | Perplexity | Reconstruction Loss
512 | 180.98 | 0.0964
1024 | 152.60 | 0.0870
2048 | 127.85 | 0.0804
4096 | 111.12 | 0.0763
8192 | 104.80 | 0.0753
full model | 25.11 | 0.000

#

And the perplexity code is pretty simple, so I think I coded it right (& pythia 410m got ~11 perplexity, which makes directional sense)
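For reference, a minimal sketch of the perplexity definition, assuming it is the exponential of the mean per-token negative log-likelihood in nats. This is not the actual evaluation code, and the token log-probabilities are invented. It also converts the numbers quoted above into a loss gap.

```python
import math

def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood (natural log)."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# Sanity check: a model that is uniform over 4 options has perplexity 4.
uniform_over_four = [math.log(0.25)] * 8
uniform_ppl = perplexity(uniform_over_four)

# Perplexities quoted above: full model vs. the 8192-feature dictionary.
full_model_ppl = 25.11
dict_8192_ppl = 104.80
loss_gap_nats = math.log(dict_8192_ppl) - math.log(full_model_ppl)
```

Read this way, swapping in the best dictionary's reconstruction costs roughly 1.4 nats of cross-entropy per token relative to the full model.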

pallid current Jul 18, 2023, 9:07 PM

#

ok that's interesting. not super surprising though it would have been great if we didnt see this

#

which dicts are you using?

keen pivot Jul 18, 2023, 9:07 PM

#

Oh, sorry I sent it then (jk jk)

keen pivot Jul 18, 2023, 9:08 PM

#

pallid current which dicts are you using?

It should be specified above(?)

#

Oh, it's actually layer 3, I can link the aws location. Any other identifying info?

pallid current Jul 18, 2023, 9:09 PM

#

just seeing if they were the ones you trained or aidan's recent ones

keen pivot Jul 18, 2023, 9:15 PM

#

pallid current just seeing if they were the ones you trained or aidan's recent ones

I updated my message to include the equivalent reconstruction loss, so we can compare w/ Aidan's runs (though his is MLP, and this is residual stream)

pallid current Jul 18, 2023, 9:17 PM

#

ok thats v interesting because just eyeballing some of aidans runs, we're seeing recon loss about an OOM lower

keen pivot Jul 18, 2023, 9:18 PM

#

It is MLP, but I'm unsure what else would be different besides the bias l2 decay

pallid current Jul 18, 2023, 9:19 PM

#

i don't know what would be different either (tho i think its possible bias l2 is actually v important for preventing dead feats) but the loss curves look waay more stable, i remember seeing that they seemed to plateau pretty hard and even start rising

keen pivot Jul 18, 2023, 9:20 PM

#

Probably the untied parts? Nope, just checked & it's also low

#

It might just be the MLP. I expect reconstruction to hurt perplexity less in an MLP layer too, but I would definitely like this same run for residual stream!

#

Yo @bitter turtle , could you set off a run for residual stream? I can also look through your code and try if you're not able to.

pallid current Jul 18, 2023, 9:30 PM

#

keen pivot Yo @bitter turtle , could you set off a run for residual stream? I can al...

we should really learn how to run the train loop haha

keen pivot Jul 18, 2023, 9:31 PM

#

pallid current we should really learn how to run the train loop haha

I think I've got it. I am copying over the output from the last run, cause I think it'll overwrite it

#

Aidan ran "big_sweep.py", and I think I just need to set "use_residual" to True

bitter turtle Jul 18, 2023, 9:33 PM

#

keen pivot Yo @bitter turtle , could you set off a run for residual stream? I can al...

Can do tomorrow I need to sleep, haven't worked on anything today sorry

keen pivot Jul 18, 2023, 9:34 PM

#

bitter turtle Can do tomorrow I need to sleep, haven't worked on anything today sorry

Nooo problem. I'll give it a try for an hour and give up if not.

#

@pallid current , I ran something at https://wandb.ai/sparse_coding/sparse%20coding/runs/h4zcf2jq

pallid current Jul 18, 2023, 9:41 PM

#

ok so reconstruction loss already seems to be like half of the ones you quoted above??

keen pivot Jul 18, 2023, 9:44 PM

#

pallid current ok so reconstruction loss already seems to be like half of he ones you quoted ab...

Yep, and there are different l1 values (from 0.003 to 0.1), which means that the higher l1 value should suck in reconstruction, but it doesn't?

#

I haven't figured out the naming scheme for l1 values yet

#

Slight hitch: I didn't delete the activations_data folder, so it re-used that and used the mlp data & maybe even the mlp-sized model(?) Re-running

keen pivot Jul 18, 2023, 10:28 PM

#

It ran for 7 times (out of 30 from the MLP) Then I got:
[W CudaIPCTypes.cpp:15] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]

#

The reconstruction losses still look good, though I'm still confused about what the l1 values correspond to.

shell mural Jul 18, 2023, 10:30 PM

#

hey, this work looks unbelievably high value. I look forward to grokking it and seeing if there's any way i can contribute, even if it's just by getting more people looking at it

shell mural Jul 18, 2023, 10:31 PM

#

keen pivot Ya, I'm trying to transition entirely to baukit for both training and interventi...

i had a call with alex turner. He's gonna look more into dictionary learning too

bitter turtle Jul 18, 2023, 11:47 PM

#

keen pivot It ran for 7 times (out of 30 from the MLP) Then I got: [W CudaIPCTypes.cpp:15] ...

this isn't the actual error message, this is just saying that it crashed unexpectedly. If you scroll up a bunch you might see a better error?

keen pivot Jul 19, 2023, 12:38 AM

#

bitter turtle this isn't the actual error message, this is just saying that it crashed unexpec...

lol, well it's gone now. We'll figure it out eventually!

I'm off for the night as well. I am having difficulty wrangling the learned dictionaries (since it's separated by encoder, decoder, encoder-bias).

I could code up something to handle it tomorrow, but if it's not too hard to save the whole autoencoder, that'd be preferred for me.

keen pivot Jul 19, 2023, 12:39 AM

#

shell mural hey, this work looks unbelievably high value. I look forward to grokking it and ...

Glad to see you here! Feel free to ask any questions here:)

pallid current Jul 19, 2023, 4:01 AM

#

instead of going through the hassle of matching encoder to cfg, this eve i ran an experiment suggested by @worldly hinge where i clustered the directions and then ran autointerp on the features in that cluster. ran on 10 clusters, 3 clusters just never activated, 1 seemed to consistently give v similar explanations ('hyphens') and the others were at least somewhat varied

#

most fun one was a cluster of number features ! :

421: 'instances of the number one.' 
559: "the digit '3' in the text."
1112: 'instances of the number 5.'
1459: 'numbers, particularly two-digit numbers where the second digit is a high number.' 
1503: 'numbers, particularly single digit numbers.' 
1744: ' numerical values, especially those used in a counting or sequence context.'
1954: 'the number 4 in the text.'

#

full results:

📎 top_clusters.txt

#

@keen pivot for the similar looking cluster, the hyphen one and also the one that centres around 'was'/'are'/'is' where there are some near identical autoexplanations i'd be interested to see if you find that they have distinct meanings

bitter turtle Jul 19, 2023, 8:18 AM

#

Oh that's actually a sick idea

#

@pallid current did you manage to run it?

#

@keen pivot I'll rewrite the saving code today, so that it saves in a more intuitive format!

bitter turtle Jul 19, 2023, 8:37 AM

#

pallid current instead of going through the hassle of matching encoder to cfg, this eve i ran a...

What's like the size of these clusters (in max cosine dim and number of directions or some other metric you think will be better)

bitter turtle Jul 19, 2023, 8:53 AM

#

Or maybe 'what part of the unit sphere does each cluster enclose' or something

#

Just trying to get a feel for how spaced out they are
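
(A rough sketch of one way to put numbers on "how spaced out" a cluster is, assuming each cluster is just a list of dictionary directions. It reports the number of directions, the smallest pairwise cosine, and the half-angle of the cap around the mean direction that contains every member. The function name and the returned keys are made up for illustration; they are not from the project's code.)

```python
import math
from itertools import combinations


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def angle_deg(u, v):
    """Angle in degrees between two unit vectors (clamped for float safety)."""
    cos = sum(a * b for a, b in zip(u, v))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos))))


def cluster_spread(directions):
    """Summarise how widely a cluster of >= 2 directions is spread.

    Assumes the directions do not cancel out exactly, so the mean
    direction is well defined.
    """
    unit = [normalize(v) for v in directions]
    # Smallest cosine between any two members = widest pair in the cluster.
    min_cos = min(
        sum(a * b for a, b in zip(u, v)) for u, v in combinations(unit, 2)
    )
    # Mean direction of the cluster, pushed back onto the unit sphere.
    centroid = normalize([sum(col) for col in zip(*unit)])
    return {
        "n_directions": len(unit),
        "min_pairwise_cosine": min_cos,
        "max_pairwise_angle_deg": math.degrees(
            math.acos(max(-1.0, min(1.0, min_cos)))
        ),
        # Half-angle of the spherical cap around the centroid that
        # contains every direction in the cluster.
        "cap_half_angle_deg": max(angle_deg(centroid, u) for u in unit),
    }


# Toy cluster of three nearly parallel directions.
print(cluster_spread([[1.0, 0.0, 0.0], [1.0, 0.1, 0.0], [1.0, 0.0, 0.1]]))
```

The cap half-angle is the closest of these to "what part of the unit sphere does each cluster enclose"; the min pairwise cosine is cheaper to compare across clusters.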

keen pivot Jul 19, 2023, 1:57 PM

#

pallid current @keen pivot for the similar looking cluster, the hyphen one and also ...

Which dictionary is this? If you can link the path on the instance/node (what’s the term?), I can load from that.

bitter turtle Jul 19, 2023, 2:21 PM

#

right got better saving (it's seemingly very slow now for some reason?) should I run a residual stream run @keen pivot? What parameter settings?

keen pivot Jul 19, 2023, 2:23 PM

#

bitter turtle right got better saving (it's seemingly very slow now for some reason?) should I ...

I do want a residual stream one. Thanks! Same parameter settings, but l1 shifted lower by 1 (which should translate to -5 instead of -4, if that makes sense?)

bitter turtle Jul 19, 2023, 2:24 PM

#

oh ok!

keen pivot Jul 19, 2023, 2:24 PM

#

I think Hoagy wants an MLP one

bitter turtle Jul 19, 2023, 2:24 PM

#

what layer?

#

residual will be shorter I'll do that first

keen pivot Jul 19, 2023, 2:25 PM

#

I’m fine with layer 2

#

Unsure about Hoagy

keen pivot Jul 19, 2023, 2:25 PM

#

bitter turtle residual will be shorter I'll do that first

It’s okay if you have favorites

bitter turtle Jul 19, 2023, 2:25 PM

#

pahaha

#

Hmm I only get 8 chunks, I think we just might be reaching max data for pile10k?

#

@keen pivot can you remember how many chunks you normally get for residual stream?

keen pivot Jul 19, 2023, 2:47 PM

#

We could do the pile’s first shard

keen pivot Jul 19, 2023, 2:49 PM

#

bitter turtle @keen pivot can you remember how many chunks you normally get for resi...

I can’t!

bitter turtle Jul 19, 2023, 2:53 PM

#

keen pivot We could do the pile’s first shard

also true

#

current format is each model is saved as a dictionary `{"params": {"encoder": ..., "decoder": ..., ...}, "buffers": {...}}`, and you can check hyperparameters in a JSON file `hyperparams.json` saved with the models @keen pivot

#

accidentally goofed, 10m

keen pivot Jul 19, 2023, 3:17 PM

#

a problem (maybe): I’ve noticed a precision discrepancy between TransformerLens and the transformers library in the Pythia model's residual stream, and it's unclear how much it’d affect our results: https://github.com/neelnanda-io/TransformerLens/issues/346#issuecomment-1641576171

GitHub

[Bug Report] hook_resid_pre doesn't match hidden_states · Issue #34...

Describe the bug cache[f"blocks.{x}.hook_resid_pre"] doesn't match hidden states (or only up to a set decimal place). Hidden states is from transformer's model(tokens, output_hidd...

kind scroll Jul 19, 2023, 3:32 PM

#

Lee pointed me here and I have a bit more time free to contribute. I plan to read up on your posts so far, but as I understand it, the high-level idea is to take a trained LLM which has features which may or may not be in superposition (have you thought of how you'd measure this?), and then train sparse autoencoders to recover features. Is that the gist of it?

#

If you're still thinking of trying a sparse VAE I'm happy to contribute there, too!

#

if there's currently any problems/challenges you have I'd love to hear about them

bitter turtle Jul 19, 2023, 3:35 PM

#

@keen pivot should be in /mnt/ssd-cluster/resid_layer_2_19_07

keen pivot Jul 19, 2023, 3:41 PM

#

kind scroll Lee pointed me here and I have a bit more time free to contribute. I plan to rea...

Yep! I'm unsure how to measure the amount of superposition, but there's definitely feature packing (which may be superposition).

One argument: there's 50k vocab items which are then embedded into 500 dimensional space; probably feature packing there.

Another: optimality - more optimal to pack features as long as you have sparse features.

Best argument: wes found features in superposition in his paper (neurons in a haystack)

bitter turtle Jul 19, 2023, 3:42 PM

#

kind scroll Lee pointed me here and I have a bit more time free to contribute. I plan to rea...

Yeah couple things about this (measuring superpositionness directly):

we don't really know what variances to expect for activation tails, and the variances for different features vary WILDLY
someone mentioned finding directions with high kurtosis (as a proxy for how spike-and-slabby the distributions of activations are)
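
(A minimal sketch of the kurtosis idea, on synthetic data rather than real activations. It assumes "spike-and-slab" means a coordinate that is exactly zero most of the time and Gaussian otherwise; the 5% activation rate and the helper names are illustrative assumptions, not project settings.)

```python
import random


def excess_kurtosis(xs):
    """Fourth standardised moment minus 3 (about 0 for a Gaussian)."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    return m4 / (m2 ** 2) - 3.0


def project(activations, direction):
    """Project each activation vector onto a candidate direction."""
    return [sum(a * d for a, d in zip(row, direction)) for row in activations]


rng = random.Random(0)
# Synthetic 2-d "activations": coordinate 0 is spike-and-slab (active 5% of
# the time), coordinate 1 is dense Gaussian noise.
acts = [
    [rng.gauss(0, 1) if rng.random() < 0.05 else 0.0, rng.gauss(0, 1)]
    for _ in range(20000)
]

sparse_k = excess_kurtosis(project(acts, [1.0, 0.0]))
dense_k = excess_kurtosis(project(acts, [0.0, 1.0]))
print(f"sparse direction: {sparse_k:.1f}, dense direction: {dense_k:.2f}")
```

The sparse direction comes out with a much larger excess kurtosis than the dense one, which is the property the proxy relies on. It says nothing about how to pick candidate directions in the first place, and the variance caveat above still applies.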

#

honestly don't see much point measuring it 'directly' since that would probably amount to 'see how well something shown to fit on superposed data fits on activation distributions' which is just training SAEs

#

but then again, these are rough thoughts; would be interested to hear proposals

kind scroll Jul 19, 2023, 3:46 PM

#

I think the idea I had behind measuring it is more for scientific purposes - being able to quantitatively measure the degree of superposition would let you say a particular technique for sure takes something out of superposition, or reduces superposition by X quanta in this LLM. Then you'd also be able to look at, for example, how autointerp methods relate to superposition, and go on to using it in your model selection criteria

#

i.e. we've trained these two models but this one has higher superposition and we can't understand it as well, let's not deploy

#

(also spitballing here)

bitter turtle Jul 19, 2023, 3:48 PM

#

I mean, afaict 'superposition' isn't a sufficiently rigorous definition to be used like that; what we are really testing here isn't 'does a model do TMOS-style superposition' but rather 'is the abstraction detailed in TMOS a useful one'.

kind scroll Jul 19, 2023, 3:49 PM

#

agreed

bitter turtle Jul 19, 2023, 3:50 PM

#

Like you could measure 'max spike-and-slabbiness of the distribution of activations over rotations' but that feels unfounded and too-many-holes-having.

#

holes being degenerate cases

#

dunno why I said holes

#

anyway my take is that the metric for deployness or whatever will be something like 'how accurately can we abstractly describe functionality' or something which isn't necessarily directly superposition-related. Like, I feel like if we can measure superposition there's a good chance that that measurement method also has an unsuperpositionification mechanism as an immediate corollary

#

spelling

kind scroll Jul 19, 2023, 3:54 PM

#

I think this is the most compelling real-world example of TMOS-style superposition I've seen so far which I'm sure you've seen too (if you have more please send!) https://distill.pub/2020/circuits/zoom-in/#claim-2-superposition. I think it's important because it relates inductive biases in model representations to downstream tasks, which is the real measure

#

(and I agree, I think superposition is a useful concept but doesn't directly relate to an atomic, measurable phenomenon)

bitter turtle Jul 19, 2023, 3:56 PM

#

I'm not sure how strongly to take that evidence. I vaguely remember hearing somewhere that there is some nuance in how they generate those images. Also, that seems to mostly be entanglement (as in, viewing the activations in the 'wrong' basis) which is something you also see in e.g. VAEs

kind scroll Jul 19, 2023, 3:57 PM

#

Point taken! Though wdym with entanglement?

bitter turtle Jul 19, 2023, 3:57 PM

#

like, rotation as opposed to 'compression'

#

In general when I'm thinking about superposition I'm thinking about it more as a useful lens to view activations rather than something stronger and natural-abstraction-hypothesis-assuming

#

so like, stronger than Nora's views and less strong than Olah's I guess.

bitter turtle Jul 19, 2023, 4:01 PM

#

kind scroll I think this is the most compelling real-world example of TMOS-style superpositi...

Inductive bias analysis thing is an interesting point though, let me think about that for a sec

#

Ok, I'm not sure how you'd measure that without also measuring superposition.

#

Could you expand more on what you envision?

keen pivot Jul 19, 2023, 4:06 PM

#

One grounding of amount of superposition is how many features our dictionary learns w/ eps-diff in perplexity.

bitter turtle Jul 19, 2023, 4:07 PM

#

perplexity? Like perplexity under intervention with reconstructed features?

#

Not sure how that would work; surely if there is some subspace our SAEs fail to describe then there is a minimum perplexity gain
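
(One possible reading of the eps-diff proposal, sketched under assumptions: perplexity is computed from per-token natural-log probabilities, once for the unmodified model and once with activations swapped for the dictionary's reconstruction, and the "grounding" is the smallest dictionary whose perplexity gap is at most eps. The function names and the numbers in the example are invented for illustration. Returning `None` covers the objection above, where a missed subspace puts a floor on the gap.)

```python
import math


def perplexity(token_log_probs):
    """exp(mean negative log-likelihood), with natural-log probabilities."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def smallest_dict_within_eps(baseline_ppl, runs, eps):
    """Smallest dictionary size whose perplexity gap is at most eps.

    runs: list of (n_features, perplexity measured with activations replaced
    by that dictionary's reconstruction). Returns None if no run gets
    within eps of the baseline.
    """
    ok = [n for n, ppl in runs if ppl - baseline_ppl <= eps]
    return min(ok) if ok else None


# Illustrative numbers only, not measurements.
baseline = 10.0
runs = [(512, 12.0), (1024, 10.4), (2048, 10.1)]
print(smallest_dict_within_eps(baseline, runs, eps=0.5))   # 1024
print(smallest_dict_within_eps(baseline, runs, eps=0.05))  # None
```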

keen pivot Jul 19, 2023, 4:10 PM

#

bitter turtle current format is each model is saved as a dictionary `{"params": {"encoder": .....

The hyperparams look great, thanks!

bitter turtle Jul 19, 2023, 4:10 PM

#

Phew

keen pivot Jul 19, 2023, 4:10 PM

#

Is there a way to easily download the model given the .pt file?

bitter turtle Jul 19, 2023, 4:10 PM

#

it's just saved as a dict

kind scroll Jul 19, 2023, 4:11 PM

#

bitter turtle Could you expand more on what you envision?

(a lot of papers in representation learning talk about this idea, https://arxiv.org/abs/1811.12359 is a good one, though the slant in representation learning is a bit different, imo it seems a bit nicer if all we want to do is understand the representations, rather than learn useful representations like world models)

keen pivot Jul 19, 2023, 4:13 PM

#

bitter turtle it's just saved as a dict

I have torch.load()-ed it, but was hoping for a one-liner for
autoencoder = ...

atm, I can define an autoencoder and assign each relevant part to the part in the dictionary, which is doable, but I may be missing the intended way to load it.
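
(A dependency-free sketch of the manual route described here, assuming the saved dict has the `{"params": {...}, "buffers": {...}}` layout mentioned earlier. The key name `encoder_bias`, the class name, and the shape convention (row i of both encoder and decoder belongs to feature i) are assumptions for illustration; check them against the actual file. In practice `saved` would come from `torch.load(...)` and hold tensors; plain lists are used so the sketch runs on its own.)

```python
from dataclasses import dataclass


@dataclass
class LoadedAutoencoder:
    encoder: list       # (n_features, d_activation); row i reads feature i
    encoder_bias: list  # (n_features,)
    decoder: list       # (n_features, d_activation); row i is direction i

    @classmethod
    def from_saved(cls, saved):
        """Build the autoencoder from the saved {"params": ..., "buffers": ...} dict."""
        params = saved["params"]
        # "encoder_bias" is a hypothetical key name; fail loudly if it differs.
        wanted = ("encoder", "encoder_bias", "decoder")
        missing = [k for k in wanted if k not in params]
        if missing:
            raise KeyError(f"missing {missing}; file has {sorted(params)}")
        return cls(*(params[k] for k in wanted))

    def encode(self, x):
        """Feature activations c = ReLU(W x + b)."""
        return [
            max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(self.encoder, self.encoder_bias)
        ]

    def decode(self, c):
        """Reconstruction: sum over features of c_i * decoder[i]."""
        d = len(self.decoder[0])
        return [sum(ci * row[j] for ci, row in zip(c, self.decoder)) for j in range(d)]


# Toy stand-in for the loaded file.
saved = {
    "params": {
        "encoder": [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]],
        "encoder_bias": [0.0, 0.0, 0.0],
        "decoder": [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]],
    },
    "buffers": {},
}
autoencoder = LoadedAutoencoder.from_saved(saved)
print(autoencoder.decode(autoencoder.encode([2.0, -3.0])))
```

A classmethod like `from_saved` is what would turn the manual assignment into the one-liner `autoencoder = ...` asked about.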

bitter turtle Jul 19, 2023, 4:13 PM

#

keen pivot I have torch.load()-ed it, but was hoping for a one-liner for autoencoder = ... ...

Ah, no, not yet sorry

kind scroll Jul 19, 2023, 4:13 PM

#

I was referring there to the authors hypothesising that models find it useful to store some less-important features in superposition rather than dedicating e.g. an axis-aligned dimension to it

bitter turtle Jul 19, 2023, 4:15 PM

#

I think people generally call non-axis-alignment 'entanglement' and the compression thing 'superposition', and also where are you referring to here? The distill.pub paper?

kind scroll Jul 19, 2023, 4:15 PM

#

ye

keen pivot Jul 19, 2023, 4:16 PM

#

bitter turtle Ah, no, not yet sorry

No problem! Just making sure I wasn't missing anything

kind scroll Jul 19, 2023, 4:16 PM

#

I'm used to representation learning-lingo and there's lots of terms being used in the alignment space w.r.t disentanglement/representations that I'm not fully used to haha

bitter turtle Jul 19, 2023, 4:16 PM

#

kind scroll I'm used to representation learning-lingo and there's lots of terms being used i...

Yep no idea same tbh
