Sparse Coding | EleutherAI | Page 6

bitter turtle Sep 7, 2023, 3:25 PM

#

I'm implementing one now which ablates features based on their class-prediction ability, or conversely their damage to class-prediction when ablated

keen pivot Sep 7, 2023, 3:31 PM

#

bitter turtle I'm implementing one now which ablates features based on their class-prediction...

I'd like to have the top-5 of those features found, then provide data on which inputs they activate on & which outputs they affect (including the direction found by LEACE if that's meaningful?).

Though the output of that's going to be kind of lame. Like to get the effect on the output, you run on a lot of data, which will include part of the distribution you'd be testing for.

bitter turtle Sep 7, 2023, 3:32 PM

#

yes, the activation frequency thing is basically just pointless here imo. output might be interesting.

keen pivot Sep 7, 2023, 3:32 PM

#

Though I think this is just the way it is & it's fine? Like dictionaries allow you to select from multiple different features, and alternative methods don't.

bitter turtle Sep 7, 2023, 3:33 PM

#

I did look at activations for ablated features in the previous run btw

keen pivot Sep 7, 2023, 3:33 PM

#

bitter turtle yes, the activation frequency thing is basically just pointless here imo. output...

How much it activates? Ya not useful!

bitter turtle Sep 7, 2023, 3:33 PM

#

keen pivot How much it activates? Ya not useful!

or rather, the token set it activates on

keen pivot Sep 7, 2023, 3:33 PM

#

But what words they activate on

#

Yep

bitter turtle Sep 7, 2023, 3:33 PM

#

still don't think this is that useful, for the reasons you call "lame"

#

output cool however

keen pivot Sep 7, 2023, 3:34 PM

#

I think the output is lame for the same reason: you're directly checking the test distribution (probably) when finding the effect on the output on e.g. Pile-10k.

bitter turtle Sep 7, 2023, 4:55 PM

#

also turns out i've been incorrectly plotting activation edit amount this entire time, it should be for the other experiment

#

which is weird because it's more surgical than LEACE under that measure

#

this is for feature selection based off linear erasure amount, i.e. how badly a logistic regression model performs when trying to linearly discriminate based off the activations under the projection sending a dictionary component to nullspace
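
A minimal sketch of this selection measure, assuming a tiny hand-rolled logistic-regression probe and 2-d toy data (both are illustrative stand-ins, not the actual experiment code): each candidate direction is projected out, a linear probe is refit, and lower probe accuracy means more linear erasure.

```python
import math

def project_out(x, d):
    """Remove the component of x along direction d (sends d to the nullspace)."""
    scale = sum(xi * di for xi, di in zip(x, d)) / sum(di * di for di in d)
    return [xi - scale * di for xi, di in zip(x, d)]

def probe_accuracy(xs, ys, steps=200, lr=0.5):
    """Fit a small logistic-regression probe by gradient descent; return train accuracy."""
    w, b = [0.0] * len(xs[0]), 0.0
    for _ in range(steps):
        gw, gb = [0.0] * len(w), 0.0
        for x, y in zip(xs, ys):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            gw = [g + (p - y) * xi for g, xi in zip(gw, x)]
            gb += p - y
        w = [wi - lr * g / len(xs) for wi, g in zip(w, gw)]
        b -= lr * gb / len(xs)
    preds = [1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0 for x in xs]
    return sum(p == y for p, y in zip(preds, ys)) / len(ys)

def erasure_scores(xs, ys, candidates):
    """Probe accuracy after ablating each candidate direction (lower = more erased)."""
    return [probe_accuracy([project_out(x, d) for x in xs], ys) for d in candidates]

# Toy data (hypothetical): the label lives entirely in coordinate 0.
xs = [[-1.0, 0.0], [-1.0, 1.0], [-1.0, -1.0], [-1.0, 2.0],
      [1.0, 0.0], [1.0, 1.0], [1.0, -1.0], [1.0, 2.0]]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
scores = erasure_scores(xs, ys, candidates=[[1.0, 0.0], [0.0, 1.0]])
```

Here ablating the first candidate leaves the probe at chance, while ablating the second leaves the classes perfectly separable, so the first candidate would be ranked as the better erasure direction.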

bitter turtle Sep 7, 2023, 5:08 PM

#

bitter turtle this is for feature selection based off linear erasure amount, i.e. how badly a ...

this being said, the optimal feature to ablate here is often the second or third best one under this measure of goodness

keen pivot Sep 7, 2023, 5:35 PM

#

bitter turtle this being said, the optimal feature to ablate here is often the second or third...

The optimal feature to ablate (optimal as in leads to reduced prediction ability when ablated) is 2nd or 3rd under measure for logistic regression?

bitter turtle Sep 7, 2023, 8:14 PM

#

tried out transfer for a Very Shit dataset which is basically a rehash of the other one for pronoun instead of gender prediction

#

original dataset

#

transferred results

#

nowhere near robust but maybe promising

bitter turtle Sep 7, 2023, 8:17 PM

#

bitter turtle this is for feature selection based off linear erasure amount, i.e. how badly a ...

this is 'best of top-4 best dictionaries for other dataset' as judged by this measure

#

@keen pivot I can dm you the dictionary indexes now

#

or in a few mins, sorry

#

Summarisation of current alg:

1. get 4 candidate directions via testing for erasure with linear binary classifiers
2. sort by perf on test dataset
3. measure perf on main and other dataset

bitter turtle Sep 7, 2023, 8:49 PM

#

This is currently a Very Shit and Uncool algorithm, but I have hope that something similar would work with a more sanely designed autoencoder

bitter turtle Sep 8, 2023, 1:31 PM

#

probably not that relevant anymore, but maybe useful for other people's stuff, I figured out that our previous attempts at larger/deeper autoencoders basically failed bc they had a fuckton of dead neurons at the final thresholding layer, but this can be remedied by using something like softplus instead of the ReLU
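
A small sketch of why the swap could help, under the usual reading of "dead neuron" as a unit whose pre-activation stays negative: ReLU passes exactly zero gradient there, while softplus always passes a small positive one, so the unit can still recover.

```python
import math

def relu_grad(z):
    """d/dz max(0, z): exactly zero for any negative pre-activation."""
    return 1.0 if z > 0 else 0.0

def softplus(z):
    """Smooth ReLU substitute log(1 + exp(z)), written to avoid overflow."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def softplus_grad(z):
    """d/dz softplus(z) = sigmoid(z): small for negative z, but never zero."""
    return 1.0 / (1.0 + math.exp(-z))
```

The trade-off is that softplus outputs are never exactly zero, so any sparsity measure that counts exact zeros would need a threshold instead.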

keen pivot Sep 8, 2023, 4:22 PM

#

@bitter turtle , do you know how we're downloading data now that the pile's down? I basically want like 2 billion tokens, but openwebtext is like 8M datapoints (want more like 100k), and I don't know how to download it or to get streaming=True to do what I intuitively want.
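
One way to get "only the first N documents" out of a streamed source is to slice the iterator lazily. The sketch below uses a stand-in generator (`fake_document_stream` is hypothetical); the same pattern should apply to the iterable returned by Hugging Face `datasets.load_dataset(..., streaming=True)`, which as far as I know also offers its own `take(n)` method.

```python
from itertools import islice

def take_first_n(stream, n):
    """Lazily yield at most n items from any iterable without materialising the rest."""
    return islice(stream, n)

def fake_document_stream():
    """Stand-in for a streamed dataset: an endless generator of records."""
    i = 0
    while True:
        yield {"text": f"document {i}"}
        i += 1

subset = list(take_first_n(fake_document_stream(), 5))
```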

bitter turtle Sep 8, 2023, 4:23 PM

#

I am honestly Suffering, and just firming downloading openwebtext

#

there is definitely a smarter way but I haven't Suffered enough to bother to look

keen pivot Sep 8, 2023, 4:42 PM

#

bitter turtle there is definitely a smarter way but I haven't Suffered enough to bother to loo...

How many tokens do we train our models on normally?

#

I could just host an openwebtext-100k, which'd be one solution

bitter turtle Sep 8, 2023, 4:44 PM

#

keen pivot How many tokens do we train our models on normally?

1 chunk is ~2M tokens I think, for 512-d resid

#

halves for double activation size
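
The arithmetic above, as a quick sketch. It assumes each chunk holds a fixed number of activation values, so tokens per chunk scale inversely with activation width; the ~2M figure is the approximate number quoted above, not a measured constant.

```python
def tokens_per_chunk(d_activation, base_tokens=2_000_000, base_dim=512):
    """Tokens per chunk if every chunk stores the same number of activation floats."""
    return base_tokens * base_dim // d_activation

def chunks_needed(total_tokens, d_activation):
    """Number of chunks required to cover total_tokens, rounded up."""
    per_chunk = tokens_per_chunk(d_activation)
    return -(-total_tokens // per_chunk)
```

Under these assumptions, 2 billion tokens of 512-d activations would come to about 1000 chunks.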

keen pivot Sep 8, 2023, 4:48 PM

#

https://api.wandb.ai/links/sparse_coding/pmqc0y4k

Weights & Biases

Gpt2 trained on perplexity & reconstruction

Gpt2 perplexity on this dataset is ~30, so trained on the Pile-10k (~250k using my dataloader), it reaches 34-35 perplexity.

bitter turtle Sep 8, 2023, 5:05 PM

#

Sick

keen pivot Sep 8, 2023, 5:21 PM

#

So I'm just setting off a run for training on openwebtext w/ gpt2 w/ KL & perplexity.

So maybe this will just converge to model's original perplexity and we're done as far as performance (everything else would just be converging more efficiently!)
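
For reference, a minimal sketch of the KL term for a single next-token distribution. The direction KL(original || patched) is an assumption here, and the logits would come from the unmodified model and from the model with the dictionary reconstruction spliced in.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return [l - log_z for l in logits]

def kl_divergence(original_logits, patched_logits):
    """KL(original || patched) over one next-token distribution, in nats."""
    log_p = log_softmax(original_logits)
    log_q = log_softmax(patched_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
```

A training loss would average this over token positions; it is zero exactly when the patched model reproduces the original distribution.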

#

Downloading the dataset took like 20 min. Chunking & tokenizing is like 1hr. I'll just check back on it tomorrow & hope there's no weird bug that happens (the wandb above is from running on pile-10k, so it works on that at least!)

bitter turtle Sep 8, 2023, 6:06 PM

#

> So maybe this will just converge to model's original perplexity and we're done as far as performance (everything else would just be converging more efficiently!)
Hmm, I wouldn't think so if we are trying to use these AEs to study the behaviour of a preexisting LM; I expect that it would be possible for the AE to improve perplexity by doing nontrivial computation rather than 'reconstructing activations to have the same semantic meaning' or whatever. This is fine for studying e.g. a model trained with many high-rank sparse disentanglement layers as part of its arch (like CRATE or something https://arxiv.org/abs/2306.01129), which is for sure something I want to do, but I worry it wouldn't be as useful for studying the non-sparse-expanded model @keen pivot

bitter turtle Sep 8, 2023, 6:44 PM

#

Sorry for hogging GPUs @keen pivot

#

doing acdc-type-stuff with dicts is expensive asf

keen pivot Sep 8, 2023, 6:51 PM

#

bitter turtle > So maybe this will just converge to model's original perplexity and we're done...

But it's trained on KL.

Though I do agree that something like CRATE would be cool cause functionally equivalent is really what we want.

bitter turtle Sep 8, 2023, 6:52 PM

#

keen pivot But it's trained on KL. Though I do agree that something like CRATE would be c...

even so, it could be doing something accursed, less likely with KL tho

keen pivot Sep 8, 2023, 6:52 PM

#

bitter turtle even so, it could be doing something accursed, less likely with KL tho

Wait, how would functionally equivalent be accursed?

bitter turtle Sep 8, 2023, 6:52 PM

#

internals might be different

#

seems significantly less likely/problematic tho

#

as in practically irrelevantly problematic

keen pivot Sep 8, 2023, 6:54 PM

#

Do you know what data the gpt2 models were trained on? Pile or openwebtext?

bitter turtle Sep 8, 2023, 6:54 PM

#

not a clue sorry

keen pivot Sep 8, 2023, 7:06 PM

#

Perplexity for gpt2: 28
Perplexity for dictionary: 32

This is before training on KL on openwebtext. That's pretty good! On Pile-10k it was (30, 40) for (gpt2, dict-base)
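
To make the perplexity numbers comparable, a short sketch of the standard definition (exp of the mean per-token negative log-likelihood, natural log assumed) plus the gap between two perplexities expressed in nats per token.

```python
import math

def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood over tokens."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

def nats_gap(ppl_a, ppl_b):
    """Difference in mean per-token loss implied by two perplexities."""
    return math.log(ppl_a / ppl_b)
```

By this measure, 32 versus 28 is a gap of roughly 0.13 nats per token.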

bitter turtle Sep 8, 2023, 8:56 PM

#

got these done for layers 4,6..18, took an insane amount of time (hours)

#

will probably use 12 because it looks the most aesthetic/illustrates what I want to illustrate the best, will include others in appendix or smthn maybe

bitter turtle Sep 8, 2023, 9:19 PM

#

@bronze wraith do these look better in B+W?

#

I'm worried that they aren't that distinguishable

#

(less bumpy one is normal dataset, more bumpy one is untrained transfer)

bronze wraith Sep 8, 2023, 9:34 PM

#

bitter turtle @bronze wraith do these look better in B+W?

yep, those are fine in black and white!

keen pivot Sep 9, 2023, 2:38 PM

#

keen pivot Perplexity for gpt2: 28 Perplexity for dictionary: 32 This is before training o...

After training directly on KL & reconstruction we went from 32.6->30.6 (it converged!) with 28 being the original perplexity. Very good! lol.

I want to check which datapoints are badly represented to see if there's a trend (though, it might be because it's not a big enough dictionary to capture all the features! currently doing a 6x dictionary)

The larger perplexity difference last time was (probably) due to the Pile-10k which was a different distribution than Openwebtext

keen pivot Sep 9, 2023, 3:16 PM

#

Running one on 16x ratio & checking # of dead features now for both 8x & 16x (currently 16x has 4k-5k dead, which is ~40% of the features. Definitely need to do that soft ReLU that @bitter turtle mentioned)
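
A sketch of one way to count dead features, assuming "dead" means a feature whose activation never exceeds a threshold on any token in the evaluation batch (the exact criterion used in the runs is not stated here).

```python
def dead_feature_fraction(activations, eps=0.0):
    """activations: list of per-token lists of feature activations.
    A feature counts as dead if it never exceeds eps on any token."""
    n_features = len(activations[0])
    alive = [False] * n_features
    for row in activations:
        for j, a in enumerate(row):
            if a > eps:
                alive[j] = True
    n_dead = alive.count(False)
    return n_dead, n_dead / n_features
```

With softplus outputs, which are never exactly zero, `eps` would need to be set above zero for the count to mean anything.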

bitter turtle Sep 9, 2023, 5:17 PM

#

mhmm would be interesting

glass tinsel Sep 9, 2023, 11:22 PM

#

Remark: if you LEACE away the gender concept in the residual stream at layer i, the tuned lens output at layer i will have no better than chance gender prediction accuracy. So if the final layer output is better than chance this has to mean that later layers are recovering the gender info in their residual updates.

#

You could track this process with the tuned lens

bitter turtle Sep 9, 2023, 11:58 PM

#

glass tinsel You could track this process with the tuned lens

This is a really cool suggestion, I'll try to look at this soonish!

#

I expect a linear probe to be able to discern gender under the dictionary ablation, but I'm not sure what tuned lens would do.

bitter turtle Sep 10, 2023, 12:31 PM

#

keen pivot Running one on 16x ratio & checking # of dead features now for both 8x & 16x (cu...

Yeah for sure think we should do a comparison

#

Between purely ReLU and softplus

#

For our single-layer models

bitter turtle Sep 10, 2023, 2:34 PM

#

@pallid current would it be possible to do an ICA run for pythia-410m at layers [0,2,4,6,8,10,12,14,16,18,20,22], or would that take too long to converge?

bitter turtle Sep 10, 2023, 2:37 PM

#

bitter turtle got these done for layers 4,6..18, took an insane amount of time (hours)

Also I realise this is maybe a slightly unfair comparison, do you think I should be comparing to PCA-but-its-doubled-for-only-positive-components as well @pallid current @keen pivot?
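
One reading of "PCA doubled for only-positive components", sketched under the assumption that each signed PCA direction v is split into two one-sided features, +v and -v, each passed through a ReLU so that it matches the non-negative activations of dictionary features. The components are taken as given; computing them is out of scope here.

```python
def double_for_positive_only(components):
    """Turn each signed direction v into two one-sided directions, +v and -v."""
    return [list(v) for v in components] + [[-a for a in v] for v in components]

def positive_activations(x, directions):
    """ReLU of the projection onto each direction, so every feature is non-negative."""
    return [max(0.0, sum(a * xi for a, xi in zip(d, x))) for d in directions]
```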

#

Trouble is no-one has really done decomp of residual stream measured like this so I am essentially making up baselines

keen pivot Sep 10, 2023, 6:19 PM

#

bitter turtle Also I realise this is maybe a slightly unfair comparison, do you think I should...

I think the PCA is fine as is.

bitter turtle Sep 10, 2023, 7:35 PM

#

Maybe it would be interesting to contrast both

bitter turtle Sep 12, 2023, 10:29 PM

#

putting this here for tomorrow

dusk hatch Sep 14, 2023, 9:19 AM

#

Hi 👋 I was pointed to sparse coding / superposition. I'm wondering if there is a canonical reference on the models people are fitting?
For context, I read some papers and may have some outsider insights. I don't want to wade in without checking how far people are doing the same things under sparse coding / superposition.

bronze wraith Sep 14, 2023, 11:28 AM

#

dusk hatch Hi 👋 I was pointed to sparse coding / superposition. I'm wondering if there is ...

The main approach is basically as it's described here: https://www.lesswrong.com/posts/z6QQJbtpkEAX3Aojj/interim-research-report-taking-features-out-of-superposition

[Interim research report] Taking features out of superposition with...

We're thankful for helpful comments from Trenton Bricken, Eric Winsor, Noa Nabeshima, and Sid Black. …

dusk hatch Sep 14, 2023, 11:58 AM

#

Thanks for clarifying!

keen pivot Sep 14, 2023, 1:54 PM

#

dusk hatch Thanks for clarifying!

For code versions (we've tried many model types!), look here: https://github.com/HoagyC/sparse_coding/blob/main/autoencoders/learned_dict.py#L124


bitter turtle Sep 15, 2023, 2:23 PM

#

@keen pivot

glass tinsel Sep 18, 2023, 2:00 AM

#

So, I'm still pretty worried / unconvinced by the experiments in Section 4 of the arXiv paper tbh

#

An alternative hypothesis that explains the data is something like "the dictionary ablation is messing up the model's capabilities more overall"

Captura_de_pantalla_2023-09-17_a_las_7.00.29_p.m..png

#

and the mean squared error doesn't necessarily capture this. one of the nice things about LEACE is that it actually minimizes a bunch of different squared error metrics all at once

keen pivot Sep 18, 2023, 2:02 AM

#

glass tinsel An alternative hypothesis that explains the data is something like "the dictiona...

#

It's sadly in the appendix, though I've wanted it in the main paper.

glass tinsel Sep 18, 2023, 2:02 AM

#

keen pivot

Yes I think it's quite important for your case.

keen pivot Sep 18, 2023, 2:03 AM

#

Maybe we can convince Aidan to hot-swap it in for ICLR submission.

glass tinsel Sep 18, 2023, 2:03 AM

#

And I mean, maybe I'm just misunderstanding something or I'm biased toward my own method, but on priors it just seems really bizarre that the reconstruction loss-based dictionary would Pareto dominate LEACE here. I don't see how this could happen, so I'm still worried something is wrong in the experimental setup

#

if you're using an end-to-end loss, it makes a lot of sense. you're directly minimizing the effect on the model when learning the dictionary

#

but this just seems like magic

#

On another note, it seems like the end-to-end loss is strictly better across the board, and you have it working, so I'm a bit confused why you didn't exclusively use that for the paper

keen pivot Sep 18, 2023, 2:06 AM

#

glass tinsel On another note, it seems like the end-to-end loss is strictly better across the...

~~Project ~~Future creep

glass tinsel Sep 18, 2023, 2:06 AM

#

I mean, do you have all the results for end to end?

#

You'll probably have a better chance to get in at ICLR with it!

#

There's nothing to lose!

#

It makes more sense, and is more in line with the tuned lens stuff we already did!

keen pivot Sep 18, 2023, 2:07 AM

#

glass tinsel I mean, do you have all the results for end to end?

Nope! Don't even have the code base properly set up to run it easily

#

But yes! It would be much better!

glass tinsel Sep 18, 2023, 2:08 AM

#

okay if I had me or someone else dedicate a day to fixing it

#

like

#

what would it take to just switch this lol

#

because I feel like this is fairly important

keen pivot Sep 18, 2023, 2:18 AM

#

I expect we could code it & re-do all results in a week, which is in time for the deadline.

#

In detail, imo:

1. slight rewrite of intro for loss function
2. (possibly) Re-run auto-interp results (lowest priority)
3. Rerun concept erasure (high priority)
4. Rerun IOI (med priority)
5. Rerun features on dictionary (med priority)
6. Rerun Auto-circuit (in the works already, so no big deal)
7. Extra section (or appendix cause 9 pages) showing perplexity-under-reconstruction.

#

Code:

(Pretrain on reconstruction only) function that loads in a model & runs on KL-divergence

Ya, I think separating a pretraining & KL is good because of the extra compute cost. Larger models might not even be possible to train on KL given GPU constraints (or a headache if it's across clusters).

keen pivot Sep 18, 2023, 2:25 AM

#

keen pivot In details, imo: 1. slight rewrite of intro for loss function 2. (possibly) Re-...

No commitment though! I'll mention it to them tomorrow morning when we meet up.

glass tinsel Sep 18, 2023, 2:39 AM

#

So I kind of think you're implementing LEACE wrong because I would expect both LEACE and diff-in-means to achieve basically 0.5 prediction ability by the last couple layers. I see that with diff-in-means but not LEACE.

Captura_de_pantalla_2023-09-17_a_las_7.37.27_p.m..png

#

More generally like, I wouldn't expect LEACE and diff in means to be dramatically different

#

@bitter turtle Did you ever sanity check that setting `method="orth"` and `affine=False` on `LeaceFitter` gives the same result as diff in means?
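
For comparison, a minimal sketch of the diff-in-means baseline itself, written without any library: take the difference of the two class means and orthogonally project it out. The toy data in the usage note is hypothetical.

```python
def mean(vectors):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def diff_in_means_direction(xs, ys):
    """Mean of class-1 activations minus mean of class-0 activations."""
    mu1 = mean([x for x, y in zip(xs, ys) if y == 1])
    mu0 = mean([x for x, y in zip(xs, ys) if y == 0])
    return [a - b for a, b in zip(mu1, mu0)]

def ablate(x, d):
    """Orthogonally project out direction d from activation x."""
    scale = sum(xi * di for xi, di in zip(x, d)) / sum(di * di for di in d)
    return [xi - scale * di for xi, di in zip(x, d)]
```

After ablating this direction from every activation, the two class means coincide, which is the property the sanity check against the orthogonal, non-affine setting relies on.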

#

the dict feature curve here is highly noisy. like if you compute the area under the curve it's not obvious that dict features beat diff in means ablation

Captura_de_pantalla_2023-09-17_a_las_7.44.55_p.m..png

#

also, LEACE doing super well at layer 2 but not at later layers also seems sus

#

in general I wouldn't really expect any of these methods to do well at all at early layers

#

because the model is going to recoup performance

glass tinsel Sep 18, 2023, 3:13 AM

#

if y'all have like, a particular script or notebook you're using to generate the results I'd be interested in inspecting it

glass tinsel Sep 18, 2023, 3:15 AM

#

glass tinsel in general I wouldn't really expect any of these methods to do well at all at ea...

otoh, I guess it depends on the prompt a bit

scenic bolt Sep 18, 2023, 6:12 AM

#

https://twitter.com/_akhaliq/status/1703600599722279400 is this the paper for this thread?

AK (@_akhaliq)

Sparse Autoencoders Find Highly Interpretable Features in Language Models

paper page: https://t.co/0zrBV222od

One of the roadblocks to a better understanding of neural networks' internals is polysemanticity, where neurons appear to activate in multiple, semantically distinct…

glass tinsel Sep 18, 2023, 7:39 AM

#

scenic bolt https://twitter.com/_akhaliq/status/1703600599722279400 is this the paper for th...

I gotta be honest it's stressing me out that AK tweeted this out when I feel really unconfident in the results

#

can't be retracted now

#

Going forward just... if y'all are going to put an EleutherAI affiliation on an interp paper I'd like to review it okay? I think there was some miscommunication about like, the extent to which I had seen this

bitter turtle Sep 18, 2023, 8:09 AM

#

glass tinsel @bitter turtle Did you ever sanity check that setting `method="orth"` and...

yes, it produced exactly the same curve as diff-in-means.

bitter turtle Sep 18, 2023, 8:09 AM

#

glass tinsel Going forward just... if y'all are going to put an EleutherAI affiliation on an ...

very sorry about this.

glass tinsel Sep 18, 2023, 8:10 AM

#

bitter turtle yes, it produced exactly the same curve as diff-in-means.

did you do orth and affine True?

#

just trying to get to the bottom of the actual difference

bitter turtle Sep 18, 2023, 8:12 AM

#

glass tinsel if y'all have like, a particular script or notebook you're using to generate the...

https://github.com/Baidicoot/sparse_coding/blob/main/erasure.py


#

maybe relevantly I did a comparison against a randomly-initialised dictionary baseline, (also will do one against an l1_alpha=0 baseline)

glass tinsel Sep 18, 2023, 8:15 AM

#

bitter turtle https://github.com/Baidicoot/sparse_coding/blob/main/erasure.py

ohhh my

bitter turtle Sep 18, 2023, 8:15 AM

#

it's a bit... h

glass tinsel Sep 18, 2023, 8:16 AM

#

glass tinsel did you do orth and affine True?

wait so did you do this experiment?

bitter turtle Sep 18, 2023, 8:16 AM

#

no, hotfixing it in now

glass tinsel Sep 18, 2023, 8:16 AM

#

okay cool thank you

bitter turtle Sep 18, 2023, 8:18 AM

#

affine does seem like the kind of thing which would explain the weirdness

glass tinsel Sep 18, 2023, 8:18 AM

#

yeah I mean

#

~~I know you did the KL plot~~

glass tinsel Sep 18, 2023, 8:20 AM

#

glass tinsel So I kind of think you're implementing LEACE wrong because I would expect both L...

honestly the weirdest plot is this one

#

because... how is the model getting 0.6 acc when you're making the genders linearly indistinguishable at layer 22

#

with leace

#

but then not with diff in means

#

how are you numbering the layers exactly? is zero the output of the first transformer layer or the embeddings

bitter turtle Sep 18, 2023, 8:22 AM

#

layer 0 is output of first layer

glass tinsel Sep 18, 2023, 8:22 AM

#

so layer 23 would be the thing that gets fed into the unembedding?

bitter turtle Sep 18, 2023, 8:25 AM

#

maybe important detail; we use few-shot prompting, and are only fitting LEACE to the last prompt (the one we actually want to intervene on, and have cleanish labelling for)

errant nova Sep 18, 2023, 8:26 AM

#

Why?

glass tinsel Sep 18, 2023, 8:27 AM

#

bitter turtle maybe important detail; we use few-shot prompting, and are only fitting LEACE to...

do you do the same for diff in means?

bitter turtle Sep 18, 2023, 8:27 AM

#

yes

#

and dict as well

bitter turtle Sep 18, 2023, 8:28 AM

#

errant nova Why?

why we use few-shot or why we only intervene on the final prompt

glass tinsel Sep 18, 2023, 8:29 AM

#

What do you mean by final prompt

bitter turtle Sep 18, 2023, 8:30 AM

#

so like the task might be pronoun prediction, and we go
Bob went to the store where he bought a cat. Carol went to the store where she bought a dog. Rob went to the store where [ask it for completion]
the final prompt begins at Rob
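
A small sketch of how such a few-shot prompt could be assembled while keeping track of where the final prompt begins. It uses character offsets for simplicity; the real setup would presumably track token positions instead, and `build_few_shot` is a hypothetical helper.

```python
def build_few_shot(examples, query):
    """Join solved examples and an unsolved query; return the text and the query's start index."""
    if not examples:
        return query, 0
    context = " ".join(examples)
    return f"{context} {query}", len(context) + 1

examples = [
    "Bob went to the store where he bought a cat.",
    "Carol went to the store where she bought a dog.",
]
prompt, start = build_few_shot(examples, "Rob went to the store where")
```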

glass tinsel Sep 18, 2023, 8:31 AM

#

okay so you're not talking about separate prompt templates or anything like that

bitter turtle Sep 18, 2023, 8:31 AM

#

nono

glass tinsel Sep 18, 2023, 8:41 AM

#

glass tinsel So I kind of think you're implementing LEACE wrong because I would expect both L...

okay so if you don't mind, my two main requests for empirical results are:

1) run this with method="orth", affine=True on LeaceFitter
2) show layer 23

bitter turtle Sep 18, 2023, 8:46 AM

#

can do

bitter turtle Sep 18, 2023, 9:03 AM

#

glass tinsel okay so if you don't mind, my two main requests for empirical results are: 1) ru...

#

this is normal, orth+not affine, orth+affine

glass tinsel Sep 18, 2023, 9:04 AM

#

I'm about to pass out bc it's 2 AM here but

glass tinsel Sep 18, 2023, 9:06 AM

#

bitter turtle

the mean prediction ability can't be right? the unembedding is a linear classifier and we just prove in the LEACE paper that you can't do better than chance for any convex loss anyway if the means of the classes are equal

#

and while acc isn't convex

#

it would be pretty bizarre

#

if it were that different
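
A numerical illustration of the equal-means point, using logistic loss as the convex loss and hypothetical XOR-style toy data: at the constant predictor (w = 0, b = 0) the gradient is proportional to the difference of class means when classes are balanced, so with equal means it vanishes and, by convexity, the constant predictor is already optimal.

```python
def logistic_grad_at_zero(xs, ys):
    """Gradient of summed logistic loss w.r.t. (w, b), evaluated at w = 0, b = 0.
    At zero every prediction is 0.5, so each term is (0.5 - y) * x."""
    gw = [0.0] * len(xs[0])
    gb = 0.0
    for x, y in zip(xs, ys):
        gw = [g + (0.5 - y) * xi for g, xi in zip(gw, x)]
        gb += 0.5 - y
    return gw, gb

# Both classes have mean (0, 0), yet they are not identical distributions.
xs = [[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]]
ys = [0, 0, 1, 1]
gw, gb = logistic_grad_at_zero(xs, ys)
```

This only covers a linear classifier reading the erased activations directly; once later layers or other token positions intervene, the map from activations to output is no longer linear and the guarantee does not apply.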

bitter turtle Sep 18, 2023, 9:06 AM

#

~~yep~~ position of intervention weirdness?

glass tinsel Sep 18, 2023, 9:07 AM

#

ok going to sleep now

bitter turtle Sep 18, 2023, 9:15 AM

#

OTOH if you only fit on & ablate the last token position instead of flattening & treating as IID & intervening on all, you have the expected behaviour @glass tinsel
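
The two fitting regimes being contrasted, as a sketch. It assumes activations are stored as `acts[b][t]` (sequence b, token t) with one label per sequence; both helper names are hypothetical.

```python
def flatten_all_positions(acts, labels):
    """Every token position inherits its sequence's label and is treated as IID."""
    xs = [vec for seq in acts for vec in seq]
    ys = [label for seq, label in zip(acts, labels) for _ in seq]
    return xs, ys

def last_position_only(acts, labels):
    """Keep only the final token's activation for each sequence."""
    return [seq[-1] for seq in acts], list(labels)
```

Whichever set is used to fit the eraser should also determine where it is applied, otherwise the comparison between methods mixes two different distributions.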

#

I feel like this could explain the difference

#

I did debate only intervening on the last position for a while ages ago, because of the LEACE performance weirdness

#

but I didn't because I am viewing this as a form of deep steering; this conceptualisation of it should be stressed more in the paper, this is an oversight on my part

fading valley Sep 18, 2023, 10:23 AM

#

@pallid current for this post: https://www.lesswrong.com/posts/wqRqb7h6ZC48iDgfK/tentatively-found-600-monosemantic-features-in-a-small-lm

Am I right in saying that dictionary 0 is just another portion of decoder weights but for a way smaller dictionary size?

(tentatively) Found 600+ Monosemantic Features in a Small LM Using ...

Using a sparse autoencoder, I present evidence that the resulting decoder (aka "dictionary") learned 600+ features for Pythia-70M layer_2's mid-MLP (…

#

Sorry just realized you weren't the author of that post

pallid current Sep 18, 2023, 10:25 AM

#

yeah logan wrote it, but yeah for that post dictionary 0 and dictionary 1 are just any two sets of decoder weights trained on activations for the same layer of the same model

#

often the pattern is that dict 0 is a smaller dict being compared to the next size up but there's nothing fundamental about that, we're just trying to understand whether multiple similar sparse autoencoders are learning the same direction as a proxy for that direction being a good feature
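
A sketch of that comparison: for each direction in one dictionary, find its best cosine match in the other. The toy dictionaries in the test are hypothetical, and rows are treated as decoder directions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def max_cosine_similarities(dict_a, dict_b):
    """For each direction in dict_a, the best cosine match among dict_b's directions."""
    return [max(cosine(u, v) for v in dict_b) for u in dict_a]
```

Directions with a match close to 1 in an independently trained dictionary are the ones treated as more likely to be real features.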

fading valley Sep 18, 2023, 10:26 AM

#

Makes sense, thanks

bitter turtle Sep 18, 2023, 10:48 AM

#

glass tinsel the mean prediction ability can't be right? the unembedding is a linear classifi...

yes, but as we do interventions on all token positions the classifier isn't strictly linear; the reason we do interventions at all token positions is because of the framing as deep guidance

bitter turtle Sep 18, 2023, 1:45 PM

#

this feels like more of a dataset/task issue than anything

glass tinsel Sep 18, 2023, 3:03 PM

#

bitter turtle OTOH if you only fit on & ablate the last token position instead of flattening &...

GOOOD I’m not going crazy

glass tinsel Sep 18, 2023, 3:06 PM

#

bitter turtle I feel like this could explain the difference

Absolutely

glass tinsel Sep 18, 2023, 3:08 PM

#

bitter turtle yes, but as we do interventions on all token positions the classifier isn't stri...

I don’t think I fully grokked this before. I know you said something about fitting on all token positions at one point but I didn’t know that’s how you were doing it for this particular experiment

#

I don’t think fitting and intervening on all residual stream positions is the way to do deep steering

#

If anything deep steering should be done in the key value cache

bitter turtle Sep 18, 2023, 3:38 PM

#

glass tinsel GOOOD I’m not going crazy

v happy i am not either tbh

bitter turtle Sep 18, 2023, 3:39 PM

#

glass tinsel I don’t think I fully grokked this before. I know you said something about fitti...

ah, sorry for miscommunication

bitter turtle Sep 18, 2023, 3:39 PM

#

glass tinsel If anything deep steering should be done in the key value cache

I think this is plausibly doable for the experiment? we have tried some dicts on non-residual-stream-stuff

glass tinsel Sep 18, 2023, 3:47 PM

#

bitter turtle I think this is plausibly doable for the experiment? we have tried some dicts on...

Yeah I'd support trying this

#

with the end to end loss

#

lol

glass tinsel Sep 18, 2023, 4:11 PM

#

is this confounded by the fact that dictionary features are "smaller" than PCA features bc they form an overcomplete basis?

Captura_de_pantalla_2023-09-18_a_las_9.10.12_a.m..png

bitter turtle Sep 18, 2023, 4:16 PM

#

glass tinsel is this confounded by the fact that dictionary features are "smaller" than PCA f...

Yes, but we are able to achieve better performance in terms of amount-of-activation-norm-changed

I'd like to rerun these results with positive-only-pca as well.

bitter turtle Sep 18, 2023, 4:17 PM

#

glass tinsel I don’t think fitting and intervening on all _residual stream_ positions is the ...

Could you explain more of your intuitions about this btw?

glass tinsel Sep 18, 2023, 4:18 PM

#

bitter turtle Could you explain more of your intuitions about this btw?

So for one thing, changing stuff in the residual stream means you are "directly" changing the token predictions through the identity branch, in addition to changing the transformer layer outputs

#

which is likely going to cause more changes than you really want

#

More generally I think we should be trying to figure out ways to only do an erasure or concept edit when you really need to

#

Perhaps using Mahalanobis distance from the training distribution of the concept eraser/editor
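
A rough sketch of that gating idea, assuming a diagonal covariance for simplicity (a full covariance would need a matrix inverse) and an arbitrary threshold of 3; `maybe_edit` and `edit_fn` are hypothetical names.

```python
import math

def fit_diagonal_gaussian(xs):
    """Per-coordinate mean and variance of the eraser's training activations."""
    n = len(xs)
    mu = [sum(col) / n for col in zip(*xs)]
    var = [sum((v - m) ** 2 for v in col) / n for col, m in zip(zip(*xs), mu)]
    return mu, var

def mahalanobis(x, mu, var, eps=1e-8):
    """Mahalanobis distance under a diagonal covariance."""
    return math.sqrt(sum((xi - m) ** 2 / (v + eps) for xi, m, v in zip(x, mu, var)))

def maybe_edit(x, mu, var, edit_fn, threshold=3.0):
    """Apply edit_fn only when x lies within `threshold` of the fitted distribution."""
    return edit_fn(x) if mahalanobis(x, mu, var) <= threshold else x
```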

bitter turtle Sep 18, 2023, 4:45 PM

#

glass tinsel More generally I think we should be trying to figure out ways to only do an eras...

mhmm, this seems reasonable; I guess with sparsely activating features you can kind of target this by only intervening when the activation is nonzero
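
A sketch of that gating for a single dictionary feature, assuming a ReLU encoder and that the edit subtracts the feature's decoded contribution (one possible choice, as opposed to a projection). All parameter names are hypothetical.

```python
def feature_activation(x, encoder_row, bias):
    """ReLU(encoder_row . x + bias): zero on most inputs for a sparse feature."""
    return max(0.0, sum(w * xi for w, xi in zip(encoder_row, x)) + bias)

def ablate_if_active(x, encoder_row, bias, decoder_dir):
    """Subtract the feature's contribution only on inputs where it fires."""
    act = feature_activation(x, encoder_row, bias)
    if act == 0.0:
        return list(x)  # feature silent: leave the activation untouched
    return [xi - act * di for xi, di in zip(x, decoder_dir)]
```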

glass tinsel Sep 18, 2023, 4:45 PM

#

wait wait wait

Captura_de_pantalla_2023-09-18_a_las_9.45.08_a.m..png

#

so when you rank them

#

are you ranking based on the final token position

bitter turtle Sep 18, 2023, 4:46 PM

#

yes, this is highly problematic and i kind of hate this experiment etc etc aeeeeeeee

glass tinsel Sep 18, 2023, 4:46 PM

#

Yes this experiment should be removed ASAP

#

it's totally misleading tbh

#

because like

glass tinsel Sep 18, 2023, 4:46 PM

#

bitter turtle yes, this is highly problematic and i kind of hate this experiment etc etc aeeee...

this causes you to basically find the dictionary feature that is closest to the LEACE or diff in means direction on the last token position

#

which obviously is going to do better

#

and then you compare against LEACE and diff in means fit on the whole sequence

bitter turtle Sep 18, 2023, 4:47 PM

#

how would you recommend addressing this

glass tinsel Sep 18, 2023, 4:47 PM

#

step one is to update the arxiv paper today removing the whole section

bitter turtle Sep 18, 2023, 4:47 PM

#

oh, ofc

glass tinsel Sep 18, 2023, 4:47 PM

#

step two: let me think a sec

bitter turtle Sep 18, 2023, 4:47 PM

#

other than that

bitter turtle Sep 18, 2023, 4:48 PM

#

glass tinsel step one is to update the arxiv paper today removing the whole section

I think it's clear that this should happen by now, but I could have really done with having this feedback earlier

glass tinsel Sep 18, 2023, 4:49 PM

#

yep, I would have given the feedback earlier if I knew the paper was going to be put on arxiv

#

in any case

#

we will do better in the future

bitter turtle Sep 18, 2023, 4:49 PM

#

communication is cursed

glass tinsel Sep 18, 2023, 4:53 PM

#

So, I think this is going to be a bit tricky in general because LEACE and diff in means are sort of designed for concept erasure out of the box, whereas with dictionary features you need to do extra optimization

bitter turtle Sep 18, 2023, 4:53 PM

#

for sure

glass tinsel Sep 18, 2023, 4:53 PM

#

but at minimum, the type of optimization you do for dictionary features

#

should be like, basically identical to the kind you do for LEACE

#

it should be the same objective

#

that is being optimized

#

either that or like

#

you need to have some argument

#

that hety

#

hey*

#

in the actual world

#

there's some reason you can't do this optimization problem with LEACE but you can with dictionary features

#

or smth

#

I expect that in any "fair fight" between LEACE and reconstruction loss based dictionary features LEACE is just going to win

#

your hope for beating LEACE is to use end to end loss

glass tinsel Sep 18, 2023, 4:57 PM

#

glass tinsel I expect that in any "fair fight" between LEACE and _reconstruction loss based_ ...

in part because dictionary features are restricting themselves to orthogonal projections

#

it might also help to like

#

take the dictionary subspace

#

say, this is the "concept subspace" to neutralize

#

then LEACE that

#

because that'll get you better surgicality

#

I still think it'll do worse than LEACE'ing away gender directly if your dictionary is based on reconstruction loss

#

but it might help with beating LEACE when you rerun the experiment with end to end loss

#

on another note, I don't actually understand why we're using an orthogonal projection in the residual stream here, it seems like you could do some galaxy brained thing to make an even smaller edit using the overcomplete basis
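
For reference, the orthogonal-projection edit under discussion can be sketched as below. This is a minimal NumPy illustration, not code from the sparse_coding repo: `project_out`, the shapes, and the random data are all assumptions made for the example.

```python
import numpy as np

def project_out(x, directions):
    """Remove the span of `directions` (k, dim) from activations `x` (n, dim)."""
    # Orthonormalize first: dictionary features are generally not orthogonal.
    q, _ = np.linalg.qr(directions.T)  # q has shape (dim, k)
    return x - (x @ q) @ q.T

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))          # hypothetical residual-stream activations
features = rng.normal(size=(2, 16))   # hypothetical dictionary features to ablate
x_ablated = project_out(x, features)
```

After the edit, `x_ablated` has zero component along every chosen feature. This removes the whole subspace regardless of how the rest of the overcomplete dictionary is arranged, which is the limitation being pointed at here.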

glass tinsel Sep 18, 2023, 5:01 PM

#

glass tinsel on another note, I don't actually understand why we're using an orthogonal proje...

this would actually be fascinating

bitter turtle Sep 18, 2023, 5:02 PM

#

glass tinsel this would actually be fascinating

literally just ease and algorithmic complexity; this motivates the IOI feature identification experiment

glass tinsel Sep 18, 2023, 5:02 PM

#

do we know of a way of doing overcomplete projections

#

if I sat down and thought about it I could probably derive something

bitter turtle Sep 18, 2023, 5:03 PM

#

glass tinsel do we know of a way of doing overcomplete projections

wdym

#

specifically

glass tinsel Sep 18, 2023, 5:03 PM

#

I'm not totally sure, but some edit that actually makes use of the overcomplete basis

pallid current Sep 18, 2023, 5:03 PM

#

i dont know what the prior should be on the dict_feature vs LEACE or diff in means but i do think there's a clear potential reason why learned features would win - because they take advantage of pretraining to find semantically meaningful features, so even with a small sample size you might be able to grab exactly the right direction, while with a sample size of 30 in a 512 dimensional space, your diff-in-means or LEACE direction will have a lot of noise

#

(when moving to a separate test set)
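
A toy simulation of the small-sample half of this argument, assuming isotropic Gaussian activations and a single planted concept direction. The function names, the shift size, and the sample counts are hypothetical; only the 30-samples-in-512-dimensions setting comes from the message above.

```python
import numpy as np

def diff_in_means(x_pos, x_neg):
    """Unit-norm difference between the two class means."""
    delta = x_pos.mean(axis=0) - x_neg.mean(axis=0)
    return delta / np.linalg.norm(delta)

def cosine_to_truth(n_per_class, dim=512, shift=2.0, seed=0):
    """Cosine between the estimated direction and the planted one."""
    rng = np.random.default_rng(seed)
    true_dir = np.zeros(dim)
    true_dir[0] = 1.0
    x_pos = rng.normal(size=(n_per_class, dim)) + shift * true_dir
    x_neg = rng.normal(size=(n_per_class, dim))
    return float(diff_in_means(x_pos, x_neg) @ true_dir)

cos_small = cosine_to_truth(n_per_class=15)     # 30 labeled samples in total
cos_large = cosine_to_truth(n_per_class=3000)
print(f"cosine with planted direction: small={cos_small:.2f}, large={cos_large:.2f}")
```

The sketch only illustrates how noisy a diff-in-means estimate can be at this sample size. It says nothing about whether a learned dictionary feature would land closer to the planted direction.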

glass tinsel Sep 18, 2023, 5:04 PM

#

pallid current i dont know what the prior should be on the dict_feature vs LEACE or diff in mea...

now this is just confusing me, is this another confounder 👀

#

you can pretrain LEACE too

bitter turtle Sep 18, 2023, 5:05 PM

#

in a deep way?

glass tinsel Sep 18, 2023, 5:05 PM

#

yes

bitter turtle Sep 18, 2023, 5:05 PM

#

how

glass tinsel Sep 18, 2023, 5:05 PM

#

hold on I have a call rn

bitter turtle Sep 18, 2023, 5:05 PM

#

ok

glass tinsel Sep 18, 2023, 6:37 PM

#

Ok so there’s a few things you can do

#

The most annoying part of fitting a LEACE eraser is estimating the covariance matrix of X because it’s O(d^2) parameters

bitter turtle Sep 18, 2023, 6:40 PM

#

I'm more concerned about how you label things accurately in the middle of a model

glass tinsel Sep 18, 2023, 6:41 PM

#

We use this thing https://arxiv.org/abs/1308.2608

arXiv.org

On the Strong Convergence of the Optimal Linear Shrinkage Estimator...

In this work we construct an optimal linear shrinkage estimator for the covariance matrix in high dimensions. The recent results from the random matrix theory allow us to find the asymptotic deterministic equivalents of the optimal shrinkage intensities and estimate them consistently. The developed distribution-free estimators obey almost surely...

bitter turtle Sep 18, 2023, 6:41 PM

#

Estimating a covariance matrix is fine mod weird instabilities and accidental negative eigenvalues, right, or is that the issue?

glass tinsel Sep 18, 2023, 6:41 PM

#

It’s not really negative eigenvalues, you can ensure it’s always psd

#

The sample covariance matrix is low rank if the sample size is less than the dimension though
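
A quick NumPy check of the low-rank point, using the 30-samples-in-512-dimensions setting mentioned earlier in the thread. The data is random and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim = 30, 512                        # fewer samples than dimensions
x = rng.normal(size=(n, dim))
sample_cov = np.cov(x, rowvar=False)    # shape (dim, dim)
rank = np.linalg.matrix_rank(sample_cov)
print(f"rank {rank} out of {dim}")
```

Centering uses up one degree of freedom, so the rank can be at most `n - 1`, and the matrix cannot be inverted without some form of regularization.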

glass tinsel Sep 18, 2023, 6:42 PM

#

glass tinsel We use this thing https://arxiv.org/abs/1308.2608

This thing has a hyperparameter in it which is the matrix you shrink toward

#

Like when you have near zero samples what do you assume the covariance matrix is

#

The “uninformative” thing to shrink toward, which is what we do, is the identity matrix times the trace of the sample covariance matrix

#

But you could shrink toward any psd matrix under weak conditions

#

Including a sample covariance matrix estimated on the Pile

#

Or whatever
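
A minimal sketch of linear shrinkage toward a chosen target, assuming a hand-picked shrinkage intensity rather than the optimal one derived in the paper linked above. `shrink`, the synthetic data, and the normalization of the identity target (trace divided by dimension, i.e. the mean eigenvalue) are assumptions for illustration and are not taken from the library's shrinkage.py.

```python
import numpy as np

def shrink(sample_cov, target, alpha):
    """Convex combination of the sample covariance and a PSD target, 0 <= alpha <= 1."""
    return (1.0 - alpha) * sample_cov + alpha * target

rng = np.random.default_rng(0)
dim = 64
scales = np.linspace(0.5, 3.0, dim)                   # hypothetical per-dimension std devs
x_small = rng.normal(size=(20, dim)) * scales         # small task-specific sample
x_large = rng.normal(size=(20_000, dim)) * scales     # stand-in for a large unlabeled corpus

sample_cov = np.cov(x_small, rowvar=False)            # rank-deficient: 20 samples, 64 dims
identity_target = np.eye(dim) * np.trace(sample_cov) / dim
pretrained_target = np.cov(x_large, rowvar=False)

alpha = 0.5  # fixed by hand here; the paper estimates the intensity from the data
cov_identity = shrink(sample_cov, identity_target, alpha)
cov_pretrained = shrink(sample_cov, pretrained_target, alpha)
```

Either target makes the estimate full rank. The "pretrained" variant simply swaps the uninformative identity for a covariance estimated on much more data.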

bitter turtle Sep 18, 2023, 6:44 PM

#

How would you go about accurately labelling activations when run on the Pile though, if you are doing an intervention at e.g. layer 12/24?

glass tinsel Sep 18, 2023, 6:45 PM

#

The nice thing about this is that you simply don’t need labels

#

This is totally unsupervised

#

It’s just for the covariance matrix of X

#

And then you estimate the cross covariance of X and Z on the labeled data
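
The split described here (covariance of X from unlabeled data, cross-covariance of X and Z from the labeled set) can be sketched with the closed-form eraser from the LEACE paper, r(x) = x - W⁺ P W (x - E[X]), where W whitens X and P projects onto the column space of W Σ_XZ. This is my reading of that formula written in plain NumPy, not the concept-erasure library's API; `leace_matrix`, `cross_cov`, and the synthetic data are hypothetical.

```python
import numpy as np

def leace_matrix(sigma_xx, sigma_xz, tol=1e-10):
    """Return A such that the erased activation is r(x) = x - A @ (x - mean)."""
    evals, evecs = np.linalg.eigh(sigma_xx)
    keep = evals > tol * evals.max()
    evals, evecs = evals[keep], evecs[:, keep]
    whiten = (evecs / np.sqrt(evals)) @ evecs.T       # W = pinv(Sigma_xx^{1/2})
    unwhiten = (evecs * np.sqrt(evals)) @ evecs.T     # W^+ = Sigma_xx^{1/2}
    u, s, _ = np.linalg.svd(whiten @ sigma_xz, full_matrices=False)
    u = u[:, s > tol * s.max()]                       # orthonormal basis of colspace(W Sigma_xz)
    return unwhiten @ (u @ u.T) @ whiten

def cross_cov(x, z):
    """Sample cross-covariance between x (n, dim) and z (n, k)."""
    return (x - x.mean(axis=0)).T @ (z - z.mean(axis=0)) / len(x)

rng = np.random.default_rng(0)
dim = 16
x_unlabeled = rng.normal(size=(5000, dim))            # no labels needed for Sigma_xx
x_labeled = rng.normal(size=(30, dim))                # small labeled set
z = (x_labeled[:, :1] + 0.5 * rng.normal(size=(30, 1)) > 0).astype(float)

sigma_xx = np.cov(x_unlabeled, rowvar=False)          # unsupervised part
sigma_xz = cross_cov(x_labeled, z)                    # supervised part
A = leace_matrix(sigma_xx, sigma_xz)
x_erased = x_labeled - (x_labeled - x_labeled.mean(axis=0)) @ A.T
```

As long as the covariance estimate is full rank, the erased labeled activations have zero sample cross-covariance with `z`, whichever dataset the covariance came from. The choice of covariance estimate changes how large the edit is, not whether the linear signal is removed on the labeled set.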

bitter turtle Sep 18, 2023, 6:45 PM

#

Ah, sick

#

That's extremely cool

glass tinsel Sep 18, 2023, 6:46 PM

#

Rn the library doesn’t let you do this but it wouldn’t be hard to add

#

Would require modifying shrinkage.py slightly, I can do that today

#

Could also add a routine for LEACing an arbitrary subspace

bitter turtle Sep 18, 2023, 6:48 PM

#

glass tinsel So, I think this is going to be a bit tricky in general because LEACE and diff i...

then I guess the other issue is working out how to resolve this

glass tinsel Sep 18, 2023, 6:49 PM

#

I want to help you guys show that end to end sparse coding is useful, if it is, which I hope it is

bitter turtle Sep 18, 2023, 6:54 PM

#

bitter turtle OTOH if you only fit on & ablate the last token position instead of flattening &...

@bronze wraith

bitter turtle Sep 18, 2023, 7:56 PM

#

glass tinsel is this confounded by the fact that dictionary features are "smaller" than PCA f...

this one

glass tinsel Sep 18, 2023, 8:46 PM

#

I created #sparse-coding if y'all want to use it

#

also if you hate the name lmk

#

I wanted it to be more general than just sparse coding

#

@keen pivot said he wanted threads

bitter turtle Sep 21, 2023, 7:50 PM

#

@pallid current @keen pivot check #behind-the-scenes

dense thorn Sep 22, 2023, 12:45 PM

#

Hey all, I’m Rob 👋. I’ve been doing some independent research on sparse coding on and off for a few months. I just stumbled across this channel a few days ago and would be really excited to get involved.

I also wanted to say, congrats! I saw the paper recently put up by this group on arXiv, hopefully there’s not too much work left. I would be curious to know if this group plans to continue researching in this area afterwards, or if there are other groups you would recommend reaching out to?

keen pivot Sep 22, 2023, 1:56 PM

#

dense thorn Hey all, I’m Rob 👋. I’ve been doing some independent research on sparse coding ...

Hey r0bk! There's definitely plenty of work left to do. What are your specific research interests?

#

(I have a list of my own future work directions in: https://www.lesswrong.com/posts/CkFBMG6A9ytkiXBDM/sparse-autoencoders-future-work)

Sparse Autoencoders: Future Work — LessWrong

Mostly my own writing, except for the 'Better Training Methods' section which was written by @Aidan Ewart. …

keen pivot Sep 23, 2023, 5:30 PM

#

https://www.lesswrong.com/posts/YJpMgi7HJuHwXTkjk/taking-features-out-of-superposition-with-sparse

Taking features out of superposition with sparse autoencoders more ...

This work was produced as part of the SERI MATS 3.0 Cohort under the supervision of Lee Sharkey. …

keen pivot Sep 23, 2023, 6:07 PM

#

@pallid current I keep getting feedback on the paper about “it’s a shame the method doesn’t work out for the last layer”. I’d be up for some human interp benchline. Probably not in time for ICLR, but could at least post it or add it to the paper if reviewers say similar things

dense thorn Sep 24, 2023, 12:29 PM

#

keen pivot Hey r0bk! There's definitely plenty of work left to do. What are your specific r...

Hey Logan! I read through your future work post, and of those research areas I’m most interested in the “Circuits across time”, “ACDC” and “Better Sparse Autoencoders” directions. Most of my focus to date falls into that last bucket, specifically on trying to identify whether the dictionary features found are monosemantic through relational metrics, and on optimising sparse autoencoder training for more monosemanticity (both have shown some early potential in toy models). That said, the other two areas have been on my mind a lot. A few quick questions from my side:

Are there any areas (aforementioned or otherwise) that would be particularly useful for the group to look into at the moment?
Reading through the chat history in this channel I’ve seen a few codebases posted. Which would be the best to play with if I wanted to get aligned with what the group has already done?

bitter turtle Sep 24, 2023, 1:34 PM

#

dense thorn Hey Logan! I took a read through your future work post and out of those research...

(I am a coauthor/contributor to the gh repo you mentioned) awesome! easiest way to run an ensemble of dicts is basic_l1_sweep.py which we recently implemented. you generate activation data & run as follows:

# generate the activation data
> python generate_test_data.py --model="EleutherAI/pythia-70m-deduped" --layers 2 --n_chunks=10
# run a basic sweep (default l1_alpha range is 10^-4 to 10^-2 at 16 log-spaced intervals)
> python basic_l1_sweep.py --dataset_dir="activation_data/layer_2" --output_dir="output_basic_test" --ratio=4
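
For orientation, a log-spaced l1_alpha grid like the default described in the comment above could be built as follows. This assumes 16 values including both endpoints; whether the script actually uses 16 values or 16 intervals (17 endpoints) is not stated here, and `l1_alphas` is a hypothetical name.

```python
# 16 values spaced evenly in log10 between 1e-4 and 1e-2 (endpoints included)
l1_alphas = [10 ** (-4 + 2 * i / 15) for i in range(16)]

for alpha in l1_alphas:
    print(f"{alpha:.2e}")
```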

#

if you've got a multi-gpu setup you can do training runs on all gpus simultaneously, check out big_sweep_experiments.py for examples of how to configure

bitter turtle Sep 24, 2023, 1:37 PM

#

dense thorn Hey Logan! I took a read through your future work post and out of those research...

I'd like to hear more about the monosemanticity metrics you are looking at (if I'm not misunderstanding, you are optimising for a monosemanticity metric right? if true that would be slightly crazy); I have a couple ideas but I haven't really found anything where the gradient signal isn't fucked

keen pivot Sep 24, 2023, 1:46 PM

#

dense thorn Hey Logan! I took a read through your future work post and out of those research...

Specifically this repo is what Aidan’s referring to: https://github.com/HoagyC/sparse_coding

GitHub

GitHub - HoagyC/sparse_coding: Using sparse coding to find distribu...

Using sparse coding to find distributed representations used by neural networks. - GitHub - HoagyC/sparse_coding: Using sparse coding to find distributed representations used by neural networks.

keen pivot Sep 24, 2023, 1:47 PM

#

dense thorn Hey Logan! I took a read through your future work post and out of those research...

Like Aidan, I’m very interested in your current ideas for relational metrics and how you optimize for monosemanticity.

Could you go into more details?

fluid kiln Oct 9, 2023, 12:23 PM

#

keen pivot Like Aidan, I’m very interested on your current ideas for relational metrics and...

+1

prime obsidian Oct 9, 2023, 7:02 PM

#

+1
