#Sparse Coding

bitter turtle
#

I'm implementing one now which ablates features based on their class-prediction ability, or conversely their damage to class-prediction when ablated

keen pivot
bitter turtle
#

yes, the activation frequency thing is basically just pointless here imo. output might be interesting.

keen pivot
#

Though I think this is just the way it is & it's fine? Like dictionaries allow you to select from multiple different features, and alternative methods don't.

bitter turtle
#

I did look at activations for ablated features in the previous run btw

keen pivot
bitter turtle
keen pivot
#

But what words they activate on

#

Yep

bitter turtle
#

still don't think that this is that useful, for the reasons you call "lame"

#

output cool however

keen pivot
#

I think the output is lame for the same reason: you're directly checking the test distribution (probably) when finding the effect on the output on e.g. Pile-10k.

bitter turtle
#

also turns out i've been incorrectly plotting activation edit amount this entire time, it should be for the other experiment

#

which is weird because it's more surgical than LEACE under that measure

#

this is for feature selection based off linear erasure amount, i.e. how badly a logistic regression model performs when trying to linearly discriminate based off the activations under the projection sending a dictionary component to nullspace
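
A minimal sketch of the measure described above, on toy data: project one direction to the nullspace, then check how well a logistic-regression probe can still separate the classes. The helper names (`project_out`, `probe_accuracy`), the hand-rolled gradient-descent probe, and the synthetic activations are all assumptions for illustration, not the code used in these experiments.

```python
import numpy as np

def project_out(acts, direction):
    """Orthogonally project activations onto the nullspace of one direction."""
    u = direction / np.linalg.norm(direction)
    return acts - np.outer(acts @ u, u)

def probe_accuracy(train_x, train_y, test_x, test_y, steps=300, lr=0.1):
    """Fit a logistic-regression probe by gradient descent; return held-out accuracy."""
    w = np.zeros(train_x.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(train_x @ w + b)))
        grad = p - train_y
        w -= lr * (train_x.T @ grad) / len(train_y)
        b -= lr * grad.mean()
    preds = (test_x @ w + b) > 0
    return float((preds == test_y).mean())

# Toy data: the class signal lives along a single (hypothetical) dictionary direction.
rng = np.random.default_rng(0)
d, n = 16, 600
labels = rng.integers(0, 2, size=n)
feature = rng.normal(size=d)
acts = rng.normal(size=(n, d)) + np.outer(2 * labels - 1, feature)
erased = project_out(acts, feature)

half = n // 2
acc_before = probe_accuracy(acts[:half], labels[:half], acts[half:], labels[half:])
acc_after = probe_accuracy(erased[:half], labels[:half], erased[half:], labels[half:])
```

Under this measure, a candidate direction scores well when `acc_after` falls towards chance while `acc_before` stays high.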

bitter turtle
keen pivot
bitter turtle
#

tried out transfer for a Very Shit dataset which is basically a rehash of the other one for pronoun instead of gender prediction

#

original dataset

#

transferred results

#

nowhere near robust but maybe promising

bitter turtle
#

@keen pivot I can dm you the dictionary indexes now

#

or in a few mins, sorry

#

Summarisation of current alg:

  • get 4 candidate directions via testing for erasure with linear binary classifiers
  • sort by perf on test dataset
  • measure perf on main and other dataset
bitter turtle
#

This is currently a Very Shit and Uncool algorithm, but I have hope that something similar would work with a more sanely designed autoencoder

bitter turtle
#

probably not that relevant anymore, but maybe useful for other people's stuff, I figured out that our previous attempts at larger/deeper autoencoders basically failed bc they had a fuckton of dead neurons at the final thresholding layer, but this can be remedied by using something like softplus instead of the ReLU
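
A small sketch of why the swap helps, assuming the usual definition of a dead unit (one that never gets a nonzero gradient on any sample). The pre-activations are made up; only the two derivative formulas are standard.

```python
import numpy as np

def relu_grad(pre):
    """Derivative of ReLU: exactly zero wherever the pre-activation is negative."""
    return (pre > 0).astype(float)

def softplus_grad(pre):
    """Derivative of softplus log(1 + exp(x)), i.e. the sigmoid: small but never zero here."""
    return 1.0 / (1.0 + np.exp(-pre))

rng = np.random.default_rng(0)
# Hypothetical pre-activations [n_samples, n_units]; the first 4 units have a large negative offset.
pre = rng.normal(size=(1000, 8))
pre[:, :4] -= 10.0

# A unit counts as dead if its gradient is zero on every sample.
dead_relu = int((relu_grad(pre).sum(axis=0) == 0).sum())
dead_softplus = int((softplus_grad(pre).sum(axis=0) == 0).sum())
```

With ReLU the offset units receive no gradient at all, so nothing can pull them back; with softplus they still receive a small one.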

keen pivot
#

@bitter turtle , do you know how we're downloading data now that the pile's down? I basically want like 2 billion tokens, but openwebtext is like 8M datapoints (want more like 100k), and I don't know how to download it or to get streaming=True to do what I intuitively want.
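
One possible way to get a fixed-size subset out of a streamed dataset, sketched under assumptions: `first_n` and `stream_subset` are hypothetical helpers, the Hugging Face `datasets` package is assumed to be installed, and the dataset is assumed to expose a "train" split. The dataset identifier is left as a parameter rather than guessed.

```python
from itertools import islice

def first_n(stream, n):
    """Take the first n records from any iterable without consuming the rest."""
    return list(islice(stream, n))

def stream_subset(dataset_name, n):
    """Stream a Hugging Face dataset and keep only the first n records.

    Records are fetched lazily as the stream is iterated, so the full
    dataset is never materialised locally.
    """
    from datasets import load_dataset  # imported lazily; only needed for this helper
    stream = load_dataset(dataset_name, split="train", streaming=True)
    return first_n(stream, n)

# Usage (not executed here): docs = stream_subset(<dataset id>, 100_000)
```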

bitter turtle
#

I am honestly Suffering, and just firming downloading openwebtext

#

there is definitely a smarter way but I haven't Suffered enough to bother to look

keen pivot
#

I could just host an openwebtext-100k, which'd be one solution

bitter turtle
#

halves for double activation size

keen pivot
bitter turtle
#

Sick

keen pivot
#

So I'm just setting off a run for training on openwebtext w/ gpt2 w/ KL & perplexity.

So maybe this will just converge to model's original perplexity and we're done as far as performance (everything else would just be converging more efficiently!)

#

Downloading the dataset took like 20 min. Chunking & tokenizing is like 1hr. I'll just check back on it tomorrow & hope there's no weird bug that happens (the wandb above is from running on pile-10k, so it works on that at least!)

bitter turtle
#

So maybe this will just converge to model's original perplexity and we're done as far as performance (everything else would just be converging more efficiently!)
Hmm, I wouldn't think so if we are trying to use these AEs to study the behaviour of a preexisting LM; I expect that it would be possible for the AE to improve perplexity by doing nontrivial computation rather than 'reconstructing activations to have the same semantic meaning' or whatever. This is fine for studying e.g. a model trained with many high-rank sparse disentanglement layers as part of its arch (like CRATE or something https://arxiv.org/abs/2306.01129), which is for sure something I want to do, but I worry it wouldn't be as useful for studying the non-sparse-expanded model @keen pivot

bitter turtle
#

Sorry for hogging GPUs @keen pivot

#

doing acdc-type-stuff with dicts is expensive asf

keen pivot
bitter turtle
keen pivot
bitter turtle
#

internals might be different

#

seems significantly less likely/problematic tho

#

as in practically irrelevantly problematic

keen pivot
#

Do you know what data the gpt2 models were trained on? Pile or openwebtext?

bitter turtle
#

not a clue sorry

keen pivot
#

Perplexity for gpt2: 28
Perplexity for dictionary: 32

This is before training on KL on openwebtext. That's pretty good! On Pile-10k it was (30, 40) for (gpt2, dict-base)

bitter turtle
#

got these done for layers 4,6..18, took an insane amount of time (hours)

#

will probably use 12 because it looks the most aesthetic/illustrates what I want to illustrate the best, will include others in appendix or smthn maybe

bitter turtle
#

@bronze wraith do these look better in B+W?

#

I'm worried that they aren't that distinguishable

#

(less bumpy one is normal dataset, more bumpy one is untrained transfer)

bronze wraith
keen pivot
# keen pivot Perplexity for gpt2: 28 Perplexity for dictionary: 32 This is before training o...

After training directly on KL & reconstruction we went from 32.6->30.6 (it converged!) with 28 being the original perplexity. Very good! lol.

I want to check which datapoints are badly represented to see if there's a trend (though, it might be because it's not a big enough dictionary to capture all the features! currently doing a 6x dictionary)

The larger perplexity difference last time was (probably) due to the Pile-10k which was a different distribution than Openwebtext
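
For reference, a sketch of the two quantities being compared in these messages: perplexity from next-token logits, and the KL divergence between the original model's output distribution and the one obtained after swapping in the dictionary reconstruction. The function names and the toy logits are assumptions; this is not the training code from the run above.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the vocabulary axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def perplexity(logits, targets):
    """exp of the mean negative log-likelihood of the target tokens."""
    logp = log_softmax(logits)
    nll = -logp[np.arange(len(targets)), targets].mean()
    return float(np.exp(nll))

def kl_divergence(orig_logits, recon_logits):
    """Mean KL(original || reconstructed) over token positions."""
    p_log = log_softmax(orig_logits)
    q_log = log_softmax(recon_logits)
    return float((np.exp(p_log) * (p_log - q_log)).sum(axis=-1).mean())

rng = np.random.default_rng(0)
n_tokens, vocab = 50, 100
orig_logits = rng.normal(size=(n_tokens, vocab))
# Stand-in for logits produced after replacing activations with their reconstruction.
recon_logits = orig_logits + 0.3 * rng.normal(size=(n_tokens, vocab))
targets = rng.integers(0, vocab, size=n_tokens)

ppl_orig = perplexity(orig_logits, targets)
ppl_recon = perplexity(recon_logits, targets)
kl = kl_divergence(orig_logits, recon_logits)
```

The KL term is zero only when the reconstructed model reproduces the original distribution exactly, which is why it can serve as an end-to-end training signal alongside reconstruction error.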

keen pivot
#

Running one on 16x ratio & checking # of dead features now for both 8x & 16x (currently 16x has 4k-5k dead, which is ~40% of the features. Definitely need to do that soft ReLU that @bitter turtle mentioned)

bitter turtle
#

mhmm would be interesting

glass tinsel
#

Remark: if you LEACE away the gender concept in the residual stream at layer i, the tuned lens output at layer i will have no better than chance gender prediction accuracy. So if the final layer output is better than chance this has to mean that later layers are recovering the gender info in their residual updates.

#

You could track this process with the tuned lens

bitter turtle
#

I expect a linear probe to be able to discern gender under the dictionary ablation, but I'm not sure what tuned lens would do.

bitter turtle
#

Between purely ReLU and softplus

#

For our single-layer models

bitter turtle
#

@pallid current would it be possible to do an ICA run for pythia-410m at layers [0,2,4,6,8,10,12,14,16,18,20,22], or would that take too long to converge?

bitter turtle
#

Trouble is no-one has really done decomp of residual stream measured like this so I am essentially making up baselines

bitter turtle
#

Maybe it would be interesting to contrast both

bitter turtle
#

putting this here for tomorrow

dusk hatch
#

Hi 👋 I was pointed to sparse coding / superposition. I'm wondering if there is a canonical reference on the models people are fitting?
For context, I read some papers and may have some outsider insights. I don't want to wade in without checking how far people are doing the same things under sparse coding / superposition.

bronze wraith
dusk hatch
#

Thanks for clarifying!

keen pivot
bitter turtle
#

@keen pivot

glass tinsel
#

So, I'm still pretty worried / unconvinced by the experiments in Section 4 of the arXiv paper tbh

#

An alternative hypothesis that explains the data is something like "the dictionary ablation is messing up the model's capabilities more overall"

#

and the mean squared error doesn't necessarily capture this. one of the nice things about LEACE is that it actually minimizes a bunch of different squared error metrics all at once

keen pivot
#

It's sadly in the appendix, though I've wanted it in the main paper.

glass tinsel
keen pivot
#

Maybe we can convince Aidan to hot-swap it in for ICLR submission.

glass tinsel
#

And I mean, maybe I'm just misunderstanding something or I'm biased toward my own method, but on priors it just seems really bizarre that the reconstruction loss-based dictionary would Pareto dominate LEACE here. I don't see how this could happen, so I'm still worried something is wrong in the experimental setup

#

if you're using an end-to-end loss, it makes a lot of sense. you're directly minimizing the effect on the model when learning the dictionary

#

but this just seems like magic

#

On another note, it seems like the end-to-end loss is strictly better across the board, and you have it working, so I'm a bit confused why you didn't exclusively use that for the paper

glass tinsel
#

I mean, do you have all the results for end to end?

#

You'll probably have a better chance to get in at ICLR with it!

#

There's nothing to lose!

#

It makes more sense, and is more in line with the tuned lens stuff we already did!

keen pivot
#

But yes! It would be much better!

glass tinsel
#

okay if I had me or someone else dedicate a day to fixing it

#

like

#

what would it take to just switch this lol

#

because I feel like this is fairly important

keen pivot
#

I expect we could code it & re-do all results in a week, which is in time for the deadline.

#

In detail, imo:

  1. slight rewrite of intro for loss function
  2. (possibly) Re-run auto-interp results (lowest priority)
  3. Rerun concept erasure (high priority)
  4. Rerun IOI (med priority)
  5. Rerun features on dictionary (med priority)
  6. Rerun Auto-circuit (in the works already, so no big deal)
  7. Extra section (or appendix cause 9 pages) showing perplexity-under-reconstruction.
#

Code:

  1. (Pretrain on reconstruction only) function that loads in a model & runs on KL-divergence

Ya, I think separating a pretraining & KL is good because of the extra compute cost. Larger models might not even be possible to train on KL given GPU constraints (or a headache if it's across clusters).

keen pivot
glass tinsel
#

So I kind of think you're implementing LEACE wrong because I would expect both LEACE and diff-in-means to achieve basically 0.5 prediction ability by the last couple layers. I see that with diff-in-means but not LEACE.

#

More generally like, I wouldn't expect LEACE and diff in means to be dramatically different

#

@bitter turtle Did you ever sanity check that setting method="orth" and affine=False on LeaceFitter gives the same result as diff in means?

#

the dict feature curve here is highly noisy. like if you compute the area under the curve it's not obvious that dict features beat diff in means ablation

#

also, LEACE doing super well at layer 2 but not at later layers also seems sus

#

in general I wouldn't really expect any of these methods to do well at all at early layers

#

because the model is going to recoup performance

glass tinsel
#

if y'all have like, a particular script or notebook you're using to generate the results I'd be interested in inspecting it

glass tinsel
scenic bolt
glass tinsel
#

can't be retracted now

#

Going forward just... if y'all are going to put an EleutherAI affiliation on an interp paper I'd like to review it okay? I think there was some miscommunication about like, the extent to which I had seen this

bitter turtle
glass tinsel
#

just trying to get to the bottom of the actual difference

bitter turtle
#

maybe relevantly I did a comparison against a randomly-initialised dictionary baseline, (also will do one against an l1_alpha=0 baseline)

bitter turtle
#

it's a bit... h

glass tinsel
bitter turtle
#

no, hotfixing it in now

glass tinsel
#

okay cool thank you

bitter turtle
#

affine does seem like the kind of thing which would explain the weirdness

glass tinsel
#

yeah I mean

#

I know you did the KL plot

glass tinsel
#

because... how is the model getting 0.6 acc when you're making the genders linearly indistinguishable at layer 22

#

with leace

#

but then not with diff in means

#

how are you numbering the layers exactly? is zero the output of the first transformer layer or the embeddings

bitter turtle
#

layer 0 is output of first layer

glass tinsel
#

so layer 23 would be the thing that gets fed into the unembedding?

bitter turtle
#

maybe important detail; we use few-shot prompting, and are only fitting LEACE to the last prompt (the one we actually want to intervene on, and have cleanish labelling for)

errant nova
#

Why?

glass tinsel
bitter turtle
#

yes

#

and dict as well

bitter turtle
glass tinsel
#

What do you mean by final prompt

bitter turtle
#

so like the task might be pronoun prediction, and we go
Bob went to the store where he bought a cat. Carol went to the store where she bought a dog. Rob went to the store where [ask it for completion]
the final prompt begins at Rob

glass tinsel
#

okay so you're not talking about separate prompt templates or anything like that

bitter turtle
#

nono

glass tinsel
bitter turtle
#

can do

bitter turtle
#

this is normal, orth+not affine, orth+affine

glass tinsel
#

I'm about to pass out bc it's 2 AM here but

glass tinsel
# bitter turtle

the mean prediction ability can't be right? the unembedding is a linear classifier and we just prove in the LEACE paper that you can't do better than chance for any convex loss anyway if the means of the classes are equal

#

and while acc isn't convex

#

it would be pretty bizarre

#

if it were that different
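
A toy illustration of the equal-means point, using a plain difference-in-means projection rather than LEACE itself (so this is a sketch of the simpler baseline, with made-up data and a hypothetical `diff_in_means_erase` helper). Once the class means coincide, any fixed linear readout gives both classes the same average score.

```python
import numpy as np

def diff_in_means_erase(x, z):
    """Project out the direction joining the two class means."""
    delta = x[z == 1].mean(axis=0) - x[z == 0].mean(axis=0)
    u = delta / np.linalg.norm(delta)
    return x - np.outer(x @ u, u)

rng = np.random.default_rng(0)
z = np.repeat([0, 1], 100)
x = rng.normal(size=(200, 8)) + np.outer(z, rng.normal(size=8))
x_erased = diff_in_means_erase(x, z)

# Gap between class means after erasure (zero up to floating point).
mean_gap = x_erased[z == 1].mean(axis=0) - x_erased[z == 0].mean(axis=0)

# Any linear readout (e.g. one unembedding row) now scores both classes equally on average.
w = rng.normal(size=8)
score_gap = (x_erased[z == 1] @ w).mean() - (x_erased[z == 0] @ w).mean()
```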

bitter turtle
#

yep position of intervention weirdness?

glass tinsel
#

ok going to sleep now

bitter turtle
#

OTOH if you only fit on & ablate the last token position instead of flattening & treating as IID & intervening on all, you have the expected behaviour @glass tinsel

#

I feel like this could explain the difference

#

I did debate only intervening on the last position for a while ages ago, because of the LEACE performance weirdness

#

but I didn't because I am viewing this as a form of deep steering; this conceptualisation of it should be stressed more in the paper, this is an oversight on my part
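
The two fitting regimes contrasted above can be sketched as follows, assuming activations of shape [batch, seq, d] and one label per prompt (shapes and names are illustrative only). The difference is just which rows end up in the matrix the eraser is fit on.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, seq, d = 8, 5, 16
acts = rng.normal(size=(batch, seq, d))   # hypothetical residual-stream activations
labels = rng.integers(0, 2, size=batch)   # one label per prompt

# (a) Flatten: treat every token position as an IID sample carrying the prompt's label.
x_all = acts.reshape(batch * seq, d)
z_all = np.repeat(labels, seq)

# (b) Last position only: one sample per prompt, taken where the completion is requested.
x_last = acts[:, -1, :]
z_last = labels
```

An eraser fit on `x_all` and one fit on `x_last` see different distributions, so comparing methods fit under different regimes is not like for like.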

fading valley
#

Sorry just realized you weren't the author of that post

pallid current
#

yeah logan wrote it, but yeah for that post dictionary 0 and dictionary 1 are just any two sets of decoder weights trained on activations for the same layer of the same model

#

often the pattern is that dict 0 is a smaller dict being compared to the next size up but there's nothing fundamental about that, we're just trying to understand whether multiple similar sparse autoencoders are learning the same direction as a proxy for that direction being a good feature
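
A sketch of one way to run that comparison, assuming the similarity measure is cosine similarity between decoder rows (the metric is not named in the messages above, so treat it as an assumption). For each direction in dict 0, it reports the best match in dict 1.

```python
import numpy as np

def max_cosine_similarity(dict0, dict1):
    """For each row of dict0, the highest cosine similarity to any row of dict1."""
    a = dict0 / np.linalg.norm(dict0, axis=1, keepdims=True)
    b = dict1 / np.linalg.norm(dict1, axis=1, keepdims=True)
    return (a @ b.T).max(axis=1)

rng = np.random.default_rng(0)
dict0 = rng.normal(size=(4, 16))  # smaller dictionary, [n_features, d_model]
# Larger dictionary that happens to contain rescaled copies of every dict0 direction.
dict1 = np.vstack([3.0 * dict0, rng.normal(size=(4, 16))])

mcs = max_cosine_similarity(dict0, dict1)
```

Values near 1 mean both dictionaries found the same direction; low values flag directions that only one of them learned.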

fading valley
#

Makes sense, thanks

bitter turtle
bitter turtle
#

this feels like more of a dataset/task issue than anything

glass tinsel
glass tinsel
#

I don’t think fitting and intervening on all residual stream positions is the way to do deep steering

#

If anything deep steering should be done in the key value cache

bitter turtle
bitter turtle
bitter turtle
glass tinsel
#

with the end to end loss

#

lol

glass tinsel
#

is this confounded by the fact that dictionary features are "smaller" than PCA features bc they form an overcomplete basis?

bitter turtle
bitter turtle
glass tinsel
#

which is likely going to cause more changes than you really want

#

More generally I think we should be trying to figure out ways to only do an erasure or concept edit when you really need to

#

Perhaps using Mahalanobis distance from the training distribution of the concept eraser/editor
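
One reading of this idea, sketched under assumptions: compute each activation's Mahalanobis distance from the distribution the eraser was fit on, and apply the edit only to rows that fall within a threshold. `gated_edit`, the threshold, and the direction of the gate (edit in-distribution rows only) are all illustrative choices, not something specified above.

```python
import numpy as np

def mahalanobis(x, mean, cov):
    """Mahalanobis distance of each row of x from a Gaussian with the given mean and covariance."""
    diff = x - mean
    solved = np.linalg.solve(cov, diff.T).T
    return np.sqrt(np.einsum("nd,nd->n", diff, solved))

def gated_edit(x, edit_fn, mean, cov, threshold):
    """Apply edit_fn only to rows within `threshold` of the fitting distribution."""
    mask = mahalanobis(x, mean, cov) < threshold
    out = x.copy()
    out[mask] = edit_fn(x[mask])
    return out, mask

d = 4
mean, cov = np.zeros(d), np.eye(d)
x = np.stack([np.full(d, 0.1), np.full(d, 100.0)])  # one in-distribution row, one far outlier
edited, mask = gated_edit(x, lambda v: np.zeros_like(v), mean, cov, threshold=3.0)
```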

bitter turtle
glass tinsel
#

wait wait wait

#

so when you rank them

#

are you ranking based on the final token position

bitter turtle
#

yes, this is highly problematic and i kind of hate this experiment etc etc aeeeeeeee

glass tinsel
#

Yes this experiment should be removed ASAP

#

it's totally misleading tbh

#

because like

glass tinsel
#

which obviously is going to do better

#

and then you compare against LEACE and diff in means fit on the whole sequence

bitter turtle
#

how would you recommend addressing this

glass tinsel
#

step one is to update the arxiv paper today removing the whole section

bitter turtle
#

oh, ofc

glass tinsel
#

step two: let me think a sec

bitter turtle
#

other than that

bitter turtle
glass tinsel
#

yep, I would have given the feedback earlier if I knew the paper was going to be put on arxiv

#

in any case

#

we will do better in the future

bitter turtle
#

communication is cursed

glass tinsel
#

So, I think this is going to be a bit tricky in general because LEACE and diff in means are sort of designed for concept erasure out of the box, whereas with dictionary features you need to do extra optimization

bitter turtle
#

for sure

glass tinsel
#

but at minimum, the type of optimization you do for dictionary features

#

should be like, basically identical to the kind you do for LEACE

#

it should be the same objective

#

that is being optimized

#

either that or like

#

you need to have some argument

#

that hey

#

in the actual world

#

there's some reason you can't do this optimization problem with LEACE but you can with dictionary features

#

or smth

#

I expect that in any "fair fight" between LEACE and reconstruction loss based dictionary features LEACE is just going to win

#

your hope for beating LEACE is to use end to end loss

glass tinsel
#

it might also help to like

#

take the dictionary subspace

#

say, this is the "concept subspace" to neutralize

#

then LEACE that

#

because that'll get you better surgicality

#

I still think it'll do worse than LEAC'ing away gender directly if your dictionary is based on reconstruction loss

#

but it might help with beating LEACE when you rerun the experiment with end to end loss

#

on another note, I don't actually understand why we're using an orthogonal projection in the residual stream here, it seems like you could do some galaxy brained thing to make an even smaller edit using the overcomplete basis

glass tinsel
bitter turtle
glass tinsel
#

do we know of a way of doing overcomplete projections

#

if I sat down and thought about it I could probably derive something

bitter turtle
#

specifically

glass tinsel
#

I'm not totally sure, but some edit that actually makes use of the overcomplete basis

pallid current
#

i dont know what the prior should be on the dict_feature vs LEACE or diff in means but i do think there's a clear potential reason why learned features would win - because they take advantage of pretraining to find semantically meaningful features, so even with a small sample size you might be able to grab exactly the right direction, while with a sample size of 30 in a 512 dimensional space, your diff-in-means or LEACE direction will have a lot of noise

#

(when moving to a separate test set)
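
A quick simulation of the sample-size point, with invented numbers: isotropic unit-variance noise, a true class gap of norm 2 along one axis, and d = 512. It measures how well the estimated diff-in-means direction lines up with the true one at 30 versus 3000 samples.

```python
import numpy as np

def direction_cosine(n_per_class, d=512, signal_norm=2.0, seed=0):
    """Cosine between the estimated diff-in-means direction and the true one."""
    rng = np.random.default_rng(seed)
    true_dir = np.zeros(d)
    true_dir[0] = 1.0
    pos = rng.normal(size=(n_per_class, d)) + 0.5 * signal_norm * true_dir
    neg = rng.normal(size=(n_per_class, d)) - 0.5 * signal_norm * true_dir
    est = pos.mean(axis=0) - neg.mean(axis=0)
    return float(est @ true_dir / np.linalg.norm(est))

cos_small = direction_cosine(15)     # 30 samples in total
cos_large = direction_cosine(1500)   # 3000 samples in total
```

In this toy setup the small-sample estimate is dominated by noise spread over the other 511 dimensions, which is the failure mode a pretrained feature direction could avoid.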

glass tinsel
#

you can pretrain LEACE too

bitter turtle
#

in a deep way?

glass tinsel
#

yes

bitter turtle
#

how

glass tinsel
#

hold on I have a call rn

bitter turtle
#

ok

glass tinsel
#

Ok so there’s a few things you can do

#

The most annoying part of fitting a LEACE eraser is estimating the covariance matrix of X because it’s O(d^2) parameters

bitter turtle
#

I'm more concerned about how you label things accurately in the middle of a model

glass tinsel
bitter turtle
#

Estimating a covariance matrix is fine mod weird instabilities and accidental negative eigenvalues, right, or is that the issue?

glass tinsel
#

It’s not really negative eigenvalues, you can ensure it’s always psd

#

The sample covariance matrix is low rank if the sample size is less than the dimension though

glass tinsel
#

Like when you have near zero samples what do you assume the covariance matrix is

#

The “uninformative” thing to shrink toward, which is what we do, is the identity matrix times the trace of the sample covariance matrix

#

But you could shrink toward any psd matrix under weak conditions

#

Including a sample covariance matrix estimated on the Pile

#

Or whatever

bitter turtle
#

How would you go about accurately labelling activations when run on the Pile though, if you are doing an intervention at e.g. layer 12/24?

glass tinsel
#

The nice thing about this is that you simply don’t need labels

#

This is totally unsupervised

#

It’s just for the covariance matrix of X

#

And then you estimate the cross covariance of X and Z on the labeled data
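
A sketch of the scheme described above, under stated assumptions: a fixed shrinkage weight `alpha` (a real estimator would normally choose this from the data), an isotropic target normalised by the dimension (trace / d), and toy Gaussian data standing in for both the unlabeled corpus and the small labeled set. None of these function names come from the concept-erasure library.

```python
import numpy as np

def isotropic_target(sample_cov):
    """Identity scaled to match the average variance (trace / d) of the sample covariance."""
    d = sample_cov.shape[0]
    return np.eye(d) * np.trace(sample_cov) / d

def shrink_covariance(sample_cov, target, alpha):
    """Linear shrinkage: convex combination of the sample covariance and a PSD target."""
    return (1.0 - alpha) * sample_cov + alpha * target

def cross_covariance(x, z):
    """Cross-covariance of activations X and labels Z, estimated on the labeled data only."""
    xc = x - x.mean(axis=0)
    zc = z - z.mean(axis=0)
    return xc.T @ zc / (len(x) - 1)

rng = np.random.default_rng(0)
d = 32
scales = np.linspace(0.5, 2.0, d)
x_unlabeled = rng.normal(size=(5000, d)) * scales   # large unlabeled corpus
x_labeled = rng.normal(size=(10, d)) * scales       # far fewer samples than dimensions
z_labeled = rng.integers(0, 2, size=(10, 1)).astype(float)

sample_cov = np.cov(x_labeled, rowvar=False)        # low rank because n < d
prior_cov = np.cov(x_unlabeled, rowvar=False)       # needs no labels at all

sigma_iso = shrink_covariance(sample_cov, isotropic_target(sample_cov), alpha=0.5)
sigma_prior = shrink_covariance(sample_cov, prior_cov, alpha=0.5)
sigma_xz = cross_covariance(x_labeled, z_labeled)
```

Either target makes the estimate full rank; the unlabeled-corpus target additionally carries real information about the activation geometry.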

bitter turtle
#

Ah, sick

#

That's extremely cool

glass tinsel
#

Rn the library doesn’t let you do this but it wouldn’t be hard to add

#

Would require modifying shrinkage.py slightly, I can do that today

#

Could also add a routine for LEACing an arbitrary subspace

bitter turtle
glass tinsel
#

I want to help you guys show that end to end sparse coding is useful, if it is, which I hope it is

glass tinsel
#

I created #sparse-coding if y'all want to use it

#

also if you hate the name lmk

#

I wanted it to be more general than just sparse coding

#

@keen pivot said he wanted threads

bitter turtle
#

@pallid current @keen pivot check #behind-the-scenes

dense thorn
#

Hey all, I’m Rob 👋. I’ve been doing some independent research on sparse coding on and off for a few months. I just stumbled across this channel a few days ago and would be really excited to get involved.

I also wanted to say, congrats! I saw the paper recently put up by this group on arXiv, hopefully there’s not too much work left. I would be curious to know if this group plans to continue researching in this area after or if there are other groups you would recommend reaching out to?

keen pivot
keen pivot
keen pivot
#

@pallid current I keep getting feedback on the paper about “it’s a shame the method doesn’t work out for the last layer”. I’d be up for some human interp baseline. Probably not in time for ICLR, but could at least post it or add it to the paper if reviewers say similar things

dense thorn
# keen pivot Hey r0bk! There's definitely plenty of work left to do. What are your specific r...

Hey Logan! I took a read through your future work post and out of those research areas I’d say I’m most interested in the “Circuits across time”, “ACDC” and “Better Sparse Autoencoders” directions. Most of my focus to date mainly falls into that last bucket, specifically on attempting to identify if dictionary features found are monosemantic through relational metrics and on optimising the sparse autoencoder training for more monosemanticity (both have shown some early potential in toy models). But with that said, the other two areas have been on my mind a lot. Maybe a few quick questions from my side:

  • Are there any areas (aforementioned or otherwise) that would be particularly useful for the group to look into at the moment?
  • Reading through the chat history in this channel I’ve seen a few codebases posted, what would be the best to play with if I wanted to get aligned with what the group has already done?
bitter turtle
# dense thorn Hey Logan! I took a read through your future work post and out of those research...

(I am a coauthor/contributor to the gh repo you mentioned) awesome! easiest way to run an ensemble of dicts is basic_l1_sweep.py which we recently implemented. you generate activation data & run as follows:

# generate the activation data
> python generate_test_data.py --model="EleutherAI/pythia-70m-deduped" --layers 2 --n_chunks=10
# run a basic sweep (default l1_alpha range is 10^-4 to 10^-2 at 16 log-spaced intervals)
> python basic_l1_sweep.py --dataset_dir="activation_data/layer_2" --output_dir="output_basic_test" --ratio=4
#

if you've got a multi-gpu setup you can do training runs on all gpus simultaneously, check out big_sweep_experiments.py for examples of how to configure

bitter turtle
keen pivot
keen pivot
prime obsidian
#

+1