#Sparse Coding


bitter turtle
#

Ok, so I think you can get a long way through the standard lens of 'neural networks as bayesian optimisers' here (at least, informally); if you assume some information-minimisation prior you might be able to get to (something-approaching) superposition downstream of that. More generally, I don't think that a formalisation of this is particularly useful (it seems highly complex, and empirics are good and we should use them) and I also don't see how 'removing superposition' is a particularly useful approach (doing so would significantly impact performance, and the model would probably get around our training guardrails somehow (see Neel's SOLU stuff ig?))

keen pivot
#

@bitter turtle I'm not getting any high-mcs features for any of the dicts. I'm comparing dict_ratio_2 w/ dict_ratio_4, and have done tied & untied across all l1 values. It didn't seem to work for the known-to-work l1 of 1e-3, so 🤷

bitter turtle
#

Ey

#

hmmmmmmmm

keen pivot
#

I could've coded something wrong. Have you been able to get any high-mcs features/graphs from this, even the toy?

bitter turtle
#

what is the non-mcs performance of the dicts like

keen pivot
bitter turtle
#

like loss on actual data, sparsity level etc and how does it compare to the other ones we did with the other trainer

#

if it's different something's wrong with the training code if it's the same something's wrong with your code or dictionaries are weird asf

keen pivot
bitter turtle
#

Bear in mind this is 8x~2GB chunks, could be a lack-of-data thing

bitter turtle
keen pivot
bitter turtle
#

Yeah

#

Well 15 I think

#

@pallid current did you do autointerp with directions from the new run code or an older run

bitter turtle
keen pivot
pallid current
bitter turtle
#

rats

#

what the hell is my thing doing the

keen pivot
bitter turtle
#

Oh, how are you scaling L1, I might not have implemented that properly, might wanna check the loss function implementations

keen pivot
bitter turtle
#

Like how exactly is the loss function implemented

#

I remember someone saying something about scaling by 1/J, and I'm wondering if I did that properly

#

@keen pivot

keen pivot
#

Ah, I see where you did that.

#

You actually don't want to do that because it causes the diagonal thing

#

So these l1-values are really low

#

I'm getting a sparsity of 600/10000

bitter turtle
#

Ah, that might explain it then

keen pivot
#

It might also explain the really low reconstruction loss, lol

bitter turtle
#

Could you look over the loss function when I reimplement it?

keen pivot
#

Yep!

bitter turtle
#

I'll just directly translate from the other one

#

Whoops paha

keen pivot
#

it looks like just:

    l_l1 = (buffers["l1_alpha"] / c.shape[0]) * torch.norm(c, 1, dim=-1).mean()

removing the c.shape for both the tied and untied
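A minimal sketch of the loss after the fix described above (dropping the division by `c.shape[0]`). The function name `sae_loss` and the arguments `x`, `x_hat`, and `l1_alpha` are illustrative assumptions, not the names in the actual training code; `c` is assumed to be the batch of latent codes with shape `(batch, dict_size)`.

```python
import torch


def sae_loss(x, x_hat, c, l1_alpha):
    """Reconstruction MSE plus an L1 penalty on the latent codes c."""
    l_reconstruction = torch.nn.functional.mse_loss(x_hat, x)
    # L1 norm of each code vector, averaged over the batch. There is no
    # extra division by c.shape[0], which was shrinking the effective penalty.
    l_l1 = l1_alpha * torch.norm(c, 1, dim=-1).mean()
    return l_reconstruction + l_l1, l_reconstruction, l_l1
```

Returning the two terms separately makes it easier to log reconstruction loss and the sparsity penalty side by side.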

bitter turtle
#

Thank god it only takes like 16m to train lol

keen pivot
#

Also tracking sparsity would be useful

bitter turtle
#

yep will do

keen pivot
#

I think it's:

x_hat.count_nonzero(axis=1).float().mean()

bitter turtle
#

How would you like this measured?

#

ah just total

#

I might do that as well as 'num Nonzero per feature over last chunk' or something

keen pivot
#

Like per token, there's how many nonzero latent activations

#

The features/token metric helps check if we've set the l1 too high (zero activations) or too low (several hundreds, so mostly identity)
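A rough sketch of the two sparsity measures discussed here: the average number of nonzero latent activations per token, and a per-feature count that can flag dead features. It assumes `c` holds the latent codes with shape `(n_tokens, dict_size)`; both function names are hypothetical.

```python
import torch


def features_per_token(c):
    """Mean number of nonzero latent activations per token."""
    return c.count_nonzero(dim=-1).float().mean().item()


def dead_feature_count(c):
    """Number of dictionary features that never fire anywhere in c."""
    fires_per_feature = c.count_nonzero(dim=0)
    return int((fires_per_feature == 0).sum().item())
```

A features/token value near zero suggests the l1 is too high, while a value in the hundreds suggests it is too low.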

bitter turtle
bitter turtle
#

weird link

#

looks much more sparse now

#

but bloody hell these reconstruction losses

#

I've been thinking that we could use a more powerful encoder (a full-blown feed-forward multi-layer net) while keeping the decoder limited to a linear map

keen pivot
#

@bitter turtle, one thing, the 8 naming scheme includes "_group_1" and the others do not. Is that intentional cause the 8 sized ones are bigger and split across two GPUs?

bitter turtle
#

yes

keen pivot
#

Overall, the low reconstruction loss from earlier was caused by the high sparsity/near-identity dictionaries

bitter turtle
#

Well, my thoughts are that we should let the net do more intelligent denoising than just cross-producting with the dict; not sure what you mean by the future MLP layers don't have tied embedding

pallid current
#

looks like tied is getting way higher recon losses than untied 🤔 so confused about what extra computation it manages by not having them match up

bitter turtle
#

yeah that's why I think we should just slap on a big net

keen pivot
#

Also, I'm all for slapping the full net on now & running it, haha

bitter turtle
#

ah, sure

bitter turtle
keen pivot
#

Could skip tied for now

pallid current
#

slapping on the full net is also equivalent to the standard dictionary learning thing where you just freely optimize the dict entries

bitter turtle
#

ye, that's what I was thinking; might also be useful to have a deterministic denoiser tho? idk

pallid current
#

still feels kinda wrong to me tho

keen pivot
bitter turtle
pallid current
# bitter turtle really? how come?

because it breaks my mental model of how the network is using the features, like in my head it kinda looks like TMOS, you have features + interference, you use the negative bias to screen the interference and then you reconstruct in the same direction

#

if its not doing that then i guess i just dont have a good picture of what's going on

bitter turtle
#

@keen pivot should be in /mnt/ssd-cluster/resid_layer_2_19_07_scaled_l1, feel free to delete the other one

bitter turtle
#

could be some additional denoising/information loss

pallid current
#

i wonder if it would help the model if we added back the bias immediately after the sparsity penalty. like at the moment it adds the bias, RELU, and then has to reconstruct but it'll be missing an amount equal to the bias, so maybe we should just add it straight back on?

bitter turtle
#

conditional on the feature being nonzero presumably
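One way to read the suggestion above, as a sketch only: apply the (negative) bias and ReLU to get the sparse code, then subtract the bias again on the features that survived, so the decoder sees the original projection for active features. The function name and the argument names `W_enc` and `b` are assumptions for illustration, and nothing in the thread confirms this was implemented.

```python
import numpy as np


def encode_with_bias_restored(x, W_enc, b):
    """Return the sparse code and a copy with the bias added back where active."""
    pre = x @ W_enc.T                  # (batch, dict_size) projections
    c = np.maximum(pre + b, 0.0)       # sparse code; the L1 penalty would act on this
    # Undo the bias only on features that are nonzero after the ReLU.
    c_restored = np.where(c > 0, c - b, 0.0)
    return c, c_restored
```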

#

yeah could do

#

pretty confident that we are solving a different problem to the model tho, the model can do things lossily/in superposition, while we are looking for perfect replications or whatever

pallid current
#

did we work out why the runs yesterday got such low recon loss btw?

bitter turtle
#

I was scaling l1 wrong

pallid current
#

oooo

#

classic

bitter turtle
#

real

#

@keen pivot are you checking the dicts now

#

cool

keen pivot
#

Just checked a random one and it looks good

bitter turtle
#

thank fuck

#

dunno how proper ML researchers do it tbh

#

the feedback latency is horrifying

#

like how can you trust your own code to train for 8 months

keen pivot
#

This over 3 different l2-biases.

bitter turtle
#

slightly cropped

keen pivot
#

This also shows what I noticed earlier, which is tied having better MCS. If Aidan is right about LLM solving noisy stuff, then we may have to ignore high-MCS entirely mostly

bitter turtle
#

hmm interesting

pallid current
keen pivot
# bitter turtle how come?

Tied embedding makes there be only 1 solution, as in one set of residual stream dims to read/write from. Untied allows many directions to read from. So maybe two different dicts learn different features if they're untied.

bitter turtle
#

not sure what you mean

keen pivot
#

But if we ignore high-MCS, and just say "if the sparsity isn't insanely high or low & we get good reconstruction loss, then maybe the features themselves are good", which we can check by hand or by auto-interp

keen pivot
bitter turtle
#

sure

#

that weirdly enough often actually works

keen pivot
#

Tied embedding makes the model read and write from the same direction

#

untied allows the model to read from multiple directions and write to only 1 direction

#

model= autoencoder

bitter turtle
#

well, I disagree there, untied allows the model to read from one different direction, not multiple. this might be an important distinction. I think allowing it to read from one different direction is weird and slightly wrong, we should allow it to read from many instead

keen pivot
#

Let me write an example

bitter turtle
#

but I understand what you're getting at

bitter turtle
keen pivot
#

Ah, maybe not.
I wanted to say something like:

Because of the ReLU & negative bias, you can have multiple ratios of residual stream that activate the feature, which are multiple directions in residual stream (though the same direction in weights since those are frozen). For example, for weights (1,1) reading in from the first two dimensions of residual stream, we have:

F_1 = ReLU(w1*r1 + w2*r2 +bias)
F_1 = ReLU(r1 + r2 - 1)
Which can be positive if r1 or r2 is large or a sum of them.

Though writing into the residual stream w/ the decode is always the same direction.
Though this is true for tied embedding as well.
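The toy feature above can be checked numerically. This small sketch uses the weights (1, 1) and bias -1 from the example; the function name and the chosen inputs are illustrative.

```python
def feature_activation(r1, r2, w1=1.0, w2=1.0, bias=-1.0):
    """F_1 = ReLU(w1*r1 + w2*r2 + bias) for a two-dimensional residual input."""
    return max(0.0, w1 * r1 + w2 * r2 + bias)


# Different ratios of (r1, r2) can all switch the feature on,
# while small inputs are screened out by the negative bias.
for r1, r2 in [(2.0, 0.0), (0.0, 2.0), (0.75, 0.75), (0.4, 0.4)]:
    print((r1, r2), feature_activation(r1, r2))
```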

#

Also another counterpoint: We may just need a better l1 value for the untied model.

#

@bitter turtle, would we be able to run w/ a larger encoder/full net soon? I can code it in, though I didn't know what you had in mind.

bitter turtle
#

yep

#

I was gonna do that rn but got distracted

#

pahah

keen pivot
#

Going rock climbing, can look at the models when I get back if they happen to be trained by then!

cosmic moon
keen pivot
#

@cosmic moon Could you give a short blurb on why the linked work is related to this project channel?

bitter turtle
#

I mean, it seems at least tangentially related to the general theme of this channel (searching for modularity in NNs), but not necessarily directly related to sparse coding particularly.

keen pivot
#

@bitter turtle, I'm running just a basic 2-layer encoder (first to 1/2 dictionary size, then to full) on cuda:0 on the old infra, just fyi. I didn't know how to easily change yours to add a second encoder param.

bitter turtle
#

yep I just did it

#

I'll do a commit with the 2 layer encoder in a bit, just need to test it

#

(I'm doing d_act -> dict_size -> dict_size fyi, don't have any reason to prefer either other than compute)
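A sketch of the two encoder shapes mentioned in this exchange: `d_act -> dict_size -> dict_size`, or a narrower hidden layer of half the dictionary size. The class name, the `hidden_size` argument, the bias-free decoder, and the per-feature normalisation of the decoder are assumptions for illustration, not the code that was committed.

```python
import torch
from torch import nn


class TwoLayerEncoderSAE(nn.Module):
    """Two-layer ReLU encoder with a purely linear decoder (the dictionary)."""

    def __init__(self, d_act, dict_size, hidden_size=None):
        super().__init__()
        # Default to d_act -> dict_size -> dict_size; pass dict_size // 2
        # for the narrower first layer.
        hidden_size = dict_size if hidden_size is None else hidden_size
        self.encoder = nn.Sequential(
            nn.Linear(d_act, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, dict_size),
            nn.ReLU(),
        )
        self.decoder = nn.Linear(dict_size, d_act, bias=False)

    def forward(self, x):
        c = self.encoder(x)  # nonnegative sparse code
        # Assumption: each dictionary direction is normalised to unit length.
        dictionary = nn.functional.normalize(self.decoder.weight, dim=0)
        x_hat = c @ dictionary.T
        return c, x_hat
```

Usage would look like `c, x_hat = TwoLayerEncoderSAE(d_act=512, dict_size=2048)(activations)`, with the usual reconstruction and L1 terms computed from `x_hat` and `c`.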

keen pivot
#

@pallid current , we ever figure out why lee is against a bias for the decoder?

bitter turtle
#

ok, will continue testing after I've eaten

keen pivot
keen pivot
#

An additional thing would be to try a tied embedding on a really big dictionary (32x) w/ a lot of data (or really until convergence)

keen pivot
#

@pallid current , I looked at the hyphen stuff and they're quite interpretable and separable

#

Slight update on the "2 layer encoder": It gets a reconstruction loss of 0.070, which is okay, but not amazing. The high-MCS is also garbage (<1%).

I'm just running the tied embedding w/ much larger dictionaries and for much more data.

keen pivot
#

For the is/are one, some are clearly separable, but like 2-3 others look similar, but I believe they're is/was that only activate in a specific distribution of text (e.g. technical, news article, etc).

I currently don't have the tools to figure that out, because I'd need to know which previous features cause these to activate.

pallid current
#

ellena reid in my seri mats group is working on applying sparse coding to audio transcription models and pointed out that it's a bit weird in our tiedSAEs to normalize the decoder weights after they've already been applied the first time

bitter turtle
#

oh what

#

yeah my bad

#

I guess this could actually make sense

#

like, plausibly the scaling could be different for optimal reconstruction

#

we should maybe look into that

#

also seeing loads of totally dead ones

#

I guess we might have issues where they just optimize for sparsity

#

yeah this looks like not immediately good

#

@pallid current latest commit should be ready for merge

keen pivot
#

For 2-layer, I got 1e-4 as a good l1, but it was still pretty bad & didn't have too great a reconstruction

bitter turtle
#

yeah it might just be insanely brittle

bitter turtle
#

If not I can idm

pallid current
#

yo yeah i'll start merging now

pallid current
#

merged the new ensemble stuff, currently rerunning some of the graphs in the post with argmax(max) instead of argmax(mean) and still tryna run autointerp on the new results lol

#

off for a bit but will prob work a bit later

keen pivot
#

I'm just trying to get reconstruction loss down by doing tied w/ smaller l1's to see if that helps, while still learning meaningful features.

#

Additionally, tomorrow I can look into the dataset of predictions that do best & worst on perplexity to see if there's a pattern.

pallid current
#

notes:

  • they define the capacity allocated to a feature i as $C_i = \frac{(W_i\cdot W_i)^2}{\sum_j (W_i\cdot W_j)^2}$ where $W_i$ is the weight vector in the embedding matrix for feature $i$
    total capacity can be no more than (but can be less than) the total embedding dimension D

  • find that across the model, capacity will be allocated at the point where the marginal value of capacity in that feature is some constant value (if the marginal value were different then you could reduce loss by reallocation). therefore you can only expect to see superposition if there are decreasing returns to capacity, which they say occurs for inputs of high sparsity or kurtosis.

  • asserts a strong relationship in general between sparsity and kurtosis which i hadn't understood before (this seems quite well studied in neuroscience eg https://iopscience.iop.org/article/10.1088/0954-898X/12/3/302, should probably look further since this is 20yrs old)

  • find that you get full capacity if you have a weight matrix which is semiorthogonal, meaning that $WW^T=\lambda I$, (as well as just going diagonal of course). can combine the approaches by having orthogonal subspaces - at small dimension this becomes TMOS' polytope model.
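The capacity definition in these notes can be sketched numerically, taking the denominator to be the sum of squared dot products $\sum_j (W_i \cdot W_j)^2$ (my reading of the definition). `W` is assumed to have one row per feature, and the function name is hypothetical.

```python
import numpy as np


def feature_capacities(W):
    """C_i = (W_i . W_i)^2 / sum_j (W_i . W_j)^2 for each row W_i of W."""
    gram_sq = (W @ W.T) ** 2
    denom = gram_sq.sum(axis=1)
    return np.divide(np.diag(gram_sq), denom,
                     out=np.zeros_like(denom), where=denom > 0)


# Three unit vectors at 120 degrees in 2D (a triangle, as in TMOS):
# each feature gets capacity 2/3, and the total equals the embedding dimension.
triangle = np.array([[1.0, 0.0],
                     [-0.5, np.sqrt(3) / 2],
                     [-0.5, -np.sqrt(3) / 2]])
print(feature_capacities(triangle), feature_capacities(triangle).sum())
```

Orthogonal features each get capacity 1, and two identical features sharing one dimension get 1/2 each, so the total stays within the embedding dimension in both cases.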

bitter turtle
#

Oh yeah v good paper, did you have any thoughts on how we could use this? Will check out neuroscience paper

#

hhhh Bristol uni doesn't provide access

keen pivot
# bitter turtle Oh yeah v good paper, did you have any thoughts on how we could use this? Will c...

I definitely need another hour or so to grok the paper, but one part I'd like to understand is kurtosis.

wiki says it's a measure of the (amount of outliers? extremity of them?).

So if we try to intentionally find directions w/ high kurtosis, then, given an activation dataset, we can find these directions by having a measure of kurtosis, then optimizing the direction, and defining loss as kurtosis?

Then repeat and add a diversity term so you find new directions.

#

I also don't understand sparsity & kurtosis being similar (the neuro paper says they're not, except for sampling kurtosis?). Like you can have many outliers & that shouldn't affect the frequency of them?

keen pivot
#

Update: I'm seeing better reconstruction losses for higher sparsity values for tied embedding (e.g. 0.5-0.4, where earlier we had 0.75). This is unsurprising, but I still need to see if there are meaningful features, we didn't learn the identity, & the actual perplexity difference.

bitter turtle
#

Tysm

bitter turtle
#

For instance, on symmetric uniform distribution with additional mass on 0, kurtosis ~ sparsity³

keen pivot
# bitter turtle Not sure what you mean, kurtosis isna measure of heaviness of tails which is lik...

From the wiki:

This number is related to the tails of the distribution, not its peak;[2] hence, the sometimes-seen characterization of kurtosis as "peakedness" is incorrect. For this measure, higher kurtosis corresponds to greater extremity of deviations (or outliers), and not the configuration of data near the mean.
Edit: I see this doesn't talk about your point. Could you explain how you think kurtosis relates to the shape of the distribution?

keen pivot
bitter turtle
bitter turtle
keen pivot
bitter turtle
#

afacit they are making a distinction between two types of sparseness

#

afaict

keen pivot
#

And sorry, I think I'm coming across as making strong claims, but I'm just confused and appreciate your help!

bitter turtle
#

oh no not at all

bitter turtle
keen pivot
#

Maybe you could automatically search for meaningful features by finding directions that match the right graph, more than the left. One measure may be kurtosis, which would be the E[normalized(x)^4], which the right graph has more than the left, right?

bitter turtle
#

the right is basically normal right? I think it has zero kurtosis

keen pivot
#

Oh, it might be a different order on my device. The residual stream one looks normal to me, and the dictionary one doesn't

bitter turtle
bitter turtle
#

basically a 'spike and slab' (i.e. sparse) distribution looks more like the red one than say a normal distribution

#

you can't strictly speaking draw it as a pdf because [][][][] but like 'things with their weight on the tails spread out over a larger area' have higher kurtosis I think

keen pivot
#

Okay, so kurtosis will be lower for rarer features, right?

bitter turtle
#

uh

#

Noooo

#

Don't think so

#

Wait one sec

#

Let me calculate this

#

actually no im really confused

#

might have done my maths wrong

keen pivot
#

Okay, but suppose we have two feature, lolololol

#

If kurtosis is E[x^4], then if you have 1% of values at e.g. 100 after normalization compared w/ 0.01% of values at 100, then the expectation takes that into account and gives different values?

bitter turtle
#

yes?

pallid current
keen pivot
# bitter turtle yes?

So then one will be a rarer feature (e.g. 0.01% compared w/ 1%) which will mean it has a lower kurtosis.

bitter turtle
#

1/sparsity

#

phee

#

phew

bitter turtle
keen pivot
bitter turtle
bitter turtle
#

can do it now

pallid current
#

goes to +inf at x = 0 or 1

keen pivot
#

& excess kurtosis is just regular kurtosis - 3, so that normal distribution is set to 0?

keen pivot
pallid current
#

also that kurtosis isnt just E[x^4], it's the standardized moment, you do E[((x-mean)/std_dev)^4]

keen pivot
#

Correct

pallid current
keen pivot
pallid current
#

ok it doesnt give intuition lol but it does show it!

keen pivot
#

Ah, I see. Gotcha

#

The only intuition I've got is that normalizing causes the effect. If you have more outliers, then the mean is shifted & std is greater, which has a large effect on the ^4 part.

bitter turtle
#

that's why you standardise it

#

similar figure for kurtosis (not excess) vs feature density (where the density (x-axis) is the probability that the feature is nonzero and uniformly distributed over [-1, 1])

#

goes to inf the rarer the feature is
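A worked version of this relationship, as a sketch: for a feature that is zero with probability 1 - p and otherwise uniform on [-1, 1], the mean is 0, E[X^2] = p/3 and E[X^4] = p/5, so the (non-excess) kurtosis is (p/5)/(p/3)^2 = 9/(5p). That equals 1.8 for a dense uniform feature and grows like 1/p as the feature gets rarer. The function names and the sample size below are illustrative.

```python
import random


def spike_and_slab_kurtosis(p):
    """Closed-form kurtosis E[X^4] / E[X^2]^2 = 9 / (5p) for density p."""
    return 9.0 / (5.0 * p)


def sample_kurtosis(samples):
    """Standardised fourth moment E[((x - mean) / std)^4] of a sample."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((s - mean) ** 2 for s in samples) / n
    fourth = sum((s - mean) ** 4 for s in samples) / n
    return fourth / var ** 2


rng = random.Random(0)
for p in (1.0, 0.1, 0.01):
    xs = [rng.uniform(-1, 1) if rng.random() < p else 0.0
          for _ in range(100_000)]
    print(p, spike_and_slab_kurtosis(p), round(sample_kurtosis(xs), 1))
```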

bitter turtle
#

unsure what 'total proportional capacity' would get us, would expect that to be fairly constantly minimal

pallid current
#

and once we do that our set up becomes basically identical to that in the paper?

bitter turtle
#

not exactly sure. In the paper there is at least some idea of the range of activations of features, and it's kind of uniform across features, while with ours we see ridiculous variance in activation

pallid current
#

is that a problem tho? to me it's just an interesting part of our findings (didnt know btw!) because if there are different scales then presumably that creates more interference for any direction which has some degree of cosine sim

#

i think the capacity paper predicts that those directions would have more of a full dimension to themselves

bitter turtle
#

Hmm, yeah

#

I think that by default the absurd activation dimensions (the ones with like 1k max activation) have a full dimension to themselves (1k/(1k + n*epsilon) is basically just 1) even without the predictions of the paper

#

Definitely can look at it tho

#

we shall see

bitter turtle
#

We could just directly measure Expected interference

#

Let $c_{k,i}$ be the value taken by feature $i$ on batch $k$. Then,
$C_i = \frac{1}{K} \sum_k \frac{c_{k,i}}{\sum_j (c_{k,j} * (W_i \cdot W_j)^2)}$
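A sketch of the activation-weighted measure as written above, assuming `c` has shape `(K, n_features)` with nonnegative activations and `W` has one unit-norm row per feature. One property of the formula as written: batches where feature i is inactive contribute 0 to the average, which pulls the value below 1 for sparse features. The `active_only` option, which averages only over batches where the feature fires, is my own addition and not something proposed in the thread.

```python
import numpy as np


def expected_interference(c, W, active_only=False):
    """C_i = mean_k c[k, i] / sum_j c[k, j] * (W_i . W_j)^2."""
    gram_sq = (W @ W.T) ** 2
    denom = c @ gram_sq  # denom[k, i] = sum_j c[k, j] * (W_i . W_j)^2
    ratio = np.divide(c, denom, out=np.zeros_like(denom), where=denom > 0)
    if not active_only:
        return ratio.mean(axis=0)
    # Assumption: average only over batches where feature i is nonzero.
    counts = (c > 0).sum(axis=0)
    return np.divide(ratio.sum(axis=0), counts,
                     out=np.zeros(c.shape[1]), where=counts > 0)
```

With orthogonal features the `active_only` value is exactly 1, and it drops as co-active features overlap more.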

bitter turtle
#

I feel like expected interference is kind of what the fractional dimensionality thing is measuring in their paper.

bitter turtle
pallid current
#

isn't this identical to the paper if you take c_i to be the size of the incoming weight instead of activation

#

and adding dot products and activations seems wrong

bitter turtle
#

Wdym 'incoming weight'

bitter turtle
pallid current
#

like i think they're basically measuring expected interference given that the input is uniform or something?? and then normalizing to express it as fractions of a dimension

bitter turtle
#

Oh yeah sure agree, that's also what this is

pallid current
bitter turtle
#

\times was too many characters

bitter turtle
pallid current
#

so like have some empirical interference measure based on the activations which we can use as a complement to the weight based ones?

#

that makes sense. im gonna go off discord for a bit cos i havent done focused work properly in a couple days, back this eve

bitter turtle
#

since our dict is normed I don't see how we can do a purely weight based one

shell mural
#

Yo, I run a mechanistic interpretability reading group on another server. The topic for next wednesday has been chosen: dictionary learning!

#

We read papers and posts related to given topic each week and occasionally invite the authors of the work we went through to join for a Q&A sorta discussion

#

I invited @keen pivot as a guest speaker, but there's others in here e.g. @pallid current (and possibly more that im missing) that are also heavily involved with the project. Wanna join too?

pallid current
#

What time?

shell mural
#

It's tentatively 1pm EST (10am PDT) by default, then we allow a fudge factor for flexibility on the guest speakers part

#

Logan said he was free at that time, not sure about you guys though

pallid current
#

OK nice yeah I can say hi at that time tho I'm sure Logan's mostly got it covered

pallid current
bitter turtle
#

Pahaha

bitter turtle
#

Nvm got it 😄

#

Would definitely be interested in coming to the Q+A, to get a feel of other people's takes on this direction

bitter turtle
#

does anyone know of a 'one-sided' kurtosis for asymmetric distributions? tempted just to use E[X^3]/variance or E[X^4]/variance

#

hmm, I wonder if I have botched expected interference, these seem low

#

should be approaching 1

pallid current
#

currently running a small auto_interp cycle (40 feats) on all 12 of the non-tied final epoch dicts from tuesday's run

bitter turtle
#

awesome

pallid current
#

some obvious differences in how often some of them just dont have enough nonzero activations. just from eyeballing it looks like there might be some differences in average score but wont really know at all till i run the graphs

#

will prob need more than 40 to have any confidence tbh

bitter turtle
#

I'd be very interested to know how the results change when you do autointerp on the decoder directions, assumed you were doing that already

pallid current
shell mural
#

Ok sweet! I'll make an announcement on the server and send hoagy an invite

#

Logan also requested I send a google calendar invite, if either of you want one of those too you can dm me your email and I'll figure that out

pallid current
#

results from autointerping 12 nontied 2-dict-ratio dicts:

bitter turtle
#

Can't remember the numbers, those look better?

pallid current
#

general takeaways:

  • l1=0.01 is a nono, dead feats everywhere,
  • l1=0.003 looks noticeably worse
  • 0.001 and 0.003 look basically the same.
  • cant see any obvious effect of the l2 bias, might be worth setting it higher in some runs just to see if that does anything
  • performance looks roughly similar to the original experiment (far left). differences are that these use tied and only a 2x ratio (4x for the original residual stream exps)
  • seemingly doing a bit better on top/top-and-random but a bit worse on random
bitter turtle
#

Ok cool! can do a more fine-grained search tomorrow if that'd be useful

pallid current
#

potentially, though there's a huge amount of stuff still to check: larger dict sizes, how it evolves through the epochs, tied ones

bitter turtle
#

yep yep

#

How should I think about perf on top Vs top-and-random Vs random?

#

do you have any thoughts on that?

pallid current
#

i think random is ultimately the true measure. like if we were able to filter out noise perfectly, and only detect cases where a particular feature was truly active, and then specify it's conditions perfectly then we could theoretically get perfect scores on random, and that's the highest bar

pallid current
keen pivot
#

But lower l1 is better reconstruction, so it’d be good to see the regression to the neuron/residual basis, and just pick the one before it drops off.

Does that make sense?

pallid current
#

i see what you mean but i dont agree.. i guess i think the thing that we're trying to do is find good dictionaries, and i think the quality of the features is probably still going up at that point, even though it'll take a bit of a hit to reconstruction_loss

#

i agree we should do some more finegrained tests tho, i'd say restrict to 5e-4 - 5e-3 in future

#

btw am suddenly getting torch.cuda.is_available() == False???

#

might have to do a big save and restart the node

pallid current
keen pivot
keen pivot
#

Okay, I've gotten a little better reconstruction loss (.075->.069) by just training on a larger batch size (256->2048); possibly explained by just training on more data, but I haven't checked.

#

Additionally, we have much lower reconstruction losses for lower l1-values; the far left one does have more MCS > 0.9, but all 3 have decent looking distributions overall.

I can look into the low-reconstruct dictionaries specifically to ensure no identities were learned & sample a few features for interp sake. We can also do auto-interp.

keen pivot
#

Yep, it looks like quite meaningful features. Here's the 2000'th highest-MCS one (MCS=0.85)

#

@pallid current how many tokens/datapoints are you using to do top-random sampling for hypothesis generation?

#

I want to communicate my surprise that GPT-4 wasn't able to understand the hyphens had numbers before them when I reviewed its hypotheses yesterday.

bitter turtle
keen pivot
bitter turtle
#

well, I wouldn't expect a transformer to be entirely describable via sparse coding anyway; NNs implement a bunch of different algos in different ways many of which don't involve sparse features

keen pivot
bitter turtle
#

Imo its better to have fewer high-confidence, really well understood features corresponding to high-importance concepts than to have some bad sparse decomposition of the entire thing

keen pivot
bitter turtle
bitter turtle
keen pivot
bitter turtle
#

well, like in the paper

keen pivot
bitter turtle
#

I think you could 'describe' it with sparse codes but they would be bad and not the best fitting description

#

which is kind of what we aim for

keen pivot
bitter turtle
#

I can get on that ig

keen pivot
#

Lol

bitter turtle
keen pivot
#

What does counterfactual testing mean here?

bitter turtle
#

Like causal scrubbing-type-things. Ablations etc.

keen pivot
keen pivot
bitter turtle
#

Oh, right. Yeah, so you can theoretically probably describe anything with arbitrarily sparse codes but you need lots of them and it would be a really complex and bad encoding so kinda similar yeh

bitter turtle
#

I don't really trust autointerp

keen pivot
#

I do like this train of thought. I can see two tasks:

  1. Verify features found by various sparsities (from 5 features/token to 500). This will eventually become the identity.
  2. Given the best dictionary from (1), we can find the greatest perplexity-diff between the original & reconstructed models. (ie run perplexity test on original, then run on reconstructed. Find the datapoints w/ the greatest differences) Those diff points may point towards functions in the model that aren't best represented by sparse codes.
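Task 2 above can be sketched as a ranking step once per-datapoint losses are available from both models. The function name and the plain-list inputs are assumptions; computing the losses themselves (original model vs. the model with reconstructed activations) is left out.

```python
def largest_loss_gaps(original_losses, reconstructed_losses, k=5):
    """Return (index, gap) for the k datapoints where reconstruction hurts most."""
    gaps = [
        (rec - orig, idx)
        for idx, (orig, rec) in enumerate(zip(original_losses, reconstructed_losses))
    ]
    gaps.sort(reverse=True)  # largest increase in loss first
    return [(idx, gap) for gap, idx in gaps[:k]]
```

The returned indices can then be mapped back to text to look for patterns in what the dictionary fails to capture.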
keen pivot
# bitter turtle I don't really trust autointerp

I like how autointerp is able to tell the difference between the basis & the dictionary, so I also expect it to find when the dictionary becomes the basis (when it learns the identity). I do agree that finer-grained measures (ie is this dict better than that one) are uncertain.

keen pivot
bitter turtle
#

yep will try two haven't done this kind of interp b4 will bug you for help in dms if i end up needing it

#

not sure polysemanticity is a meaningful enough term to do anything better than throwing gpt-4 at it tho

keen pivot
#

@bitter turtle desired wandb metrics:

  1. Features/tokens (number of non-zero activations per token on average)
  2. MMCS
  3. Full histogram of MCS

2 & 3 would need to be done every few batches to compare dictionaries learned at different sizes. This may be a headache to do syncing, which if so, maybe just at the end of training.
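A sketch of metrics 2 and 3, assuming MCS means, for each feature in one dictionary, its maximum cosine similarity with any feature in another dictionary, and MMCS means the average of those maxima. Dictionaries are assumed to be arrays with one feature direction per row; the function names and the 0.9 threshold default are illustrative.

```python
import numpy as np


def max_cosine_similarities(dict_a, dict_b):
    """For each row of dict_a, the max cosine similarity with any row of dict_b."""
    a = dict_a / np.linalg.norm(dict_a, axis=1, keepdims=True)
    b = dict_b / np.linalg.norm(dict_b, axis=1, keepdims=True)
    return (a @ b.T).max(axis=1)


def mmcs(dict_a, dict_b):
    """Mean of the per-feature max cosine similarities."""
    return float(max_cosine_similarities(dict_a, dict_b).mean())


def fraction_high_mcs(dict_a, dict_b, threshold=0.9):
    """Share of features in dict_a whose MCS exceeds the threshold."""
    return float((max_cosine_similarities(dict_a, dict_b) > threshold).mean())
```

The array returned by `max_cosine_similarities` is what would feed the full histogram, for example via `np.histogram`.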

bitter turtle
#

Could do it every chunk (~2M activations, 2k batches) without much hassle

#

Otherwise yeah

#

Not sure what you mean by 1

keen pivot
#

dict_levels is the latent dimension (ie feature activations)

#

Basically how many features activate for a given token: for the sentence " The cow", it may activate 3 features:

  1. animals
  2. words that start w/ "c"
  3. words that come after " the"
#

Thought: we could do sparse coding on the activations, except for the outlier dimensions.

keen pivot
bitter turtle
#

Np

#

Just was confused if you meant something else

bitter turtle
#

@pallid current haven't used your autointerp stuff yet tbh, is there a function I can just call to get a score out for doing MCS-to-autointerp-score correlation testing

bitter turtle
#

I'm off to bed, if you want to do autointerp on a bunch of the same dict I trained 16x iters for dict ratios 2, 4, 8 for l1=1e-2 on resid stream in multiple_iters_mcs_21_07 @pallid current, otherwise I can try and figure it out tomorrow

pallid current
#

need to refactor interpret.py to make stuff like that easier, maybe over the weekend

pallid current
bitter turtle
#

Gah

#

Which one do you think is best

#

I was just looking at the w&b logs but don't track dead neurons atm

pallid current
#

1e-3 seems a very safe bet

pallid current
bitter turtle
#

Mb forgot that existed 🤦

pallid current
#

here's my writeup from the meeting, focusing on potential tests and metrics:

  • suggests comparing the strength of the ablation effect with that found with neuron basis (i think there might have been more to this but i didn't catch it)
  • should we compare perplexity to perplexity from replacing with non-sparse coding reconstruction with equivalent reconstruction loss, to see if we're capturing more important directions of variation than we would otherwise expect?
  • similarly, do we see surprisingly low reconstruction costs if we only restrict to 1 or 2 layers downstream?
  • can we check whether we're fragile to small quantities of nonsparse data?
  • can we express reconstruction loss as a proportion of the total variance?
  • can we see some relationship between the MLP and residual stream where features that are detected by the MLP are then more visible in the residual stream than they were?
  • can we find examples where the directions we find match up to directions found by e.g. sparse probing (or maybe simple linear combos of a few directions etc)
  • can we find good feature candidates from our dictionaries by selecting for e.g. variance explained, size/frequency of activation etc?
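For the question above about expressing reconstruction loss as a proportion of total variance, one common choice is the fraction of variance unexplained. This is a sketch under that assumption, with `x` and `x_hat` as `(n_samples, d_act)` arrays and a hypothetical function name.

```python
import numpy as np


def fraction_variance_unexplained(x, x_hat):
    """Squared reconstruction error divided by total variance around the mean."""
    residual = ((x - x_hat) ** 2).sum()
    total = ((x - x.mean(axis=0)) ** 2).sum()
    return float(residual / total)
```

A value of 0 means perfect reconstruction, and a value of 1 means the dictionary does no better than always predicting the mean activation.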
bitter turtle
#

ok i've implemented some standard 'metrics' (MMCS, sparsity hists) on my fork using the shared interface, probably needs some tlc to get the plots looking nice, but i have it integrated with the ensembled training code

#

Will also refactor big run to be less cursed/easier to switch to running different experiments if that's something that'd be useful to people

bitter turtle
pallid current
pallid current
bitter turtle
#

Ok, so this is a weird thing i've noticed. Ensembling is less of a pain when you use a functional interface but it seems that people generally prefer OO interfaces for testing etc, and so I'm ending up writing a lot of weird boilerplate to convert between the two.

#

Normally PyTorch hides all this nonsense but PyTorch also doesn't cooperate well with multiprocessing so I have to be kind of bespoke.

bitter turtle
#

@pallid current are you using python 3.8 or 3.10 on the pod?

bitter turtle
#

yeh im just getting some inconsistencies between dependencies etc between our branches like you noticed

#

why did you import GPT2Tokenizer

#

or did you remove that

#

from transformer_lens

#

do you want to do another pip freeze

pallid current
pallid current
bitter turtle
#

afaict not needed, i get errors importing it but w/e

#

@keen pivot did you ever get weirdness where wandb images weren't uploading like half the time

#

oh it just takes ages for some reason nvm ill downsize them

bitter turtle
bitter turtle
pallid current
#

i think what i had in mind was like if we have 10 labels for things that an MLP neuron is doing, we would hopefully see that this feature had more of a coherent direction in the residual stream after the MLP has written back to it

#

and do it by probing for this concept using a synthetic dataset and measure the AUROC and degree of separation

bitter turtle
#

as a side point, maybe we should clarify our language for the paper about features. I know nora for instance uses 'concept' to refer to the high-level human semantically meaningful thing, and 'feature' to mean 'direction in space corresponding to a concept'

#

(i was momentarily confused by the above)

bitter turtle
#

also neuron or dictionary direction?

#

also not sure how this relates to sparse coding, no step of that seems to relate to learned features, i've probably misunderstood you (unless you mean dict direction by neuron? in which case couldn't we just determine that by applying the MLP-post-activation-space -> residual stream and checking their similarity? I guess AUROC is better though, but then I don't see what the counterfactual is)

pallid current
bitter turtle
#

ok but what's the counterfactual? do we ablate those directions in the MLP post-activation?

pallid current
#

i guess i'm not sure what role the counterfactual is playing? like in my head if we show that the direction we've found in an MLP, which seems to correspond to a human concept, also lines up with that human concept being more extractable in the model, then it's extra evidence that we're understanding what computation the model is doing in that layer

#

i suppose our test could be more powerful by comparing to a stronger baseline than no increase in concept-extractability

bitter turtle
#

ok sure

pallid current
#

yeah that would also be a v good idea

bitter turtle
#

i slightly misunderstood you i think

#

I guess like we could also do a regression of the activations of our learned features in the MLP activations and compare that regression's performance to the linear classifier on the residual stream, or something similar, otherwise im still confused as to what you have in mind

pallid current
#

ok test i have in mind is:

  • pick a concept which we think represents one of our learned features in an MLP
  • use gpt-n to create a synthetic dataset of whether that *concept is on
  • run a linear classifier on the residual stream before and after the MLP for predicting labels of the synthetic dataset
  • check whether there is a jump in performance after the MLP
  • check whether this jump goes away if we ablate the learned direction in the MLP
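
The steps above can be sketched on toy data. Everything here is an assumption for illustration: the "MLP" just writes a labelled concept along one direction, the probe is a difference of class means rather than a fitted classifier, and ablation is a projection:

```python
import numpy as np

def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())

def mean_diff_probe(x, labels):
    """Very simple linear probe: unit vector between the two class means."""
    w = x[labels == 1].mean(axis=0) - x[labels == 0].mean(axis=0)
    return w / np.linalg.norm(w)

def ablate(x, direction):
    """Project the given direction out of every row of x."""
    d = direction / np.linalg.norm(direction)
    return x - np.outer(x @ d, d)

rng = np.random.default_rng(0)
n, dim = 400, 32
labels = rng.integers(0, 2, size=n)            # stand-in for the synthetic dataset labels
feature_dir = rng.normal(size=dim)
feature_dir /= np.linalg.norm(feature_dir)     # stand-in for the learned direction

resid_pre = rng.normal(size=(n, dim))
mlp_out = 4.0 * labels[:, None] * feature_dir  # toy MLP writes the concept
resid_post = resid_pre + mlp_out
resid_post_ablated = resid_pre + ablate(mlp_out, feature_dir)

train, test = slice(0, 200), slice(200, None)

def probe_auroc(x):
    w = mean_diff_probe(x[train], labels[train])
    return auroc(x[test] @ w, labels[test])

auc_pre, auc_post, auc_ablated = map(probe_auroc, (resid_pre, resid_post, resid_post_ablated))
```

In this toy the jump in AUROC after the MLP disappears once the direction is ablated, which is the pattern the test is looking for in the real model.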
bitter turtle
#

still not entirely sure what 'checking if there is a jump in performance after the MLP' gets us, but I'll get to implementing that

#

other than the synthetic dataset generation

bitter turtle
#

also we can probably ask #1102791430549803049 or some other people about what classifiers/metrics/statistical methods are good to use if we want to enpaperify this

bitter turtle
#

Was Pierre's initialisation stuff particularly important or not really?

bitter turtle
bitter turtle
#

It might be useful to plot explained variance vs achieved sparsity maybe

pallid current
#

damn, crushing it! will be back on it on monday properly but will have a look now

pallid current
# bitter turtle Was Pierre's initialisation stuff particularly important or not really?

i've never really understood the case where convergence speed was the constraint which led him to do the initialization stuff, and anyway i understood that it mostly helped the first part of training rather than getting the last bits of performance so i think it's unlikely to be useful. i think in certain toy models it was really slow but in real models it hasn't seemed to be necessary, as we see with like good results from models trained in like 15 min. i think it'd be good at some point to run some like 100, 1000 epoch models and check whether there's increased performance tho

pallid current
#

made a few little changes to interpret and the save code and it now seems to be able to interpret using the new arch

#

tho i think that it would be best if the outputs were saved in individual folders by default, just makes the processing a little bit easier

#

will hopefully finally get to running more tests on the sweep tomorrow morn

bitter turtle
#

maybe interesting plot

#

This is for 32 L1 coefs * 3 dict ratios * 2 repetitions for each

#

On residual stream

#

This may have already been done before idk

#

but the implications feel pretty important

bitter turtle
#

same thing with pythia-160m layer 7 (wondering if there would be a meaningful difference between hopefully-quite-different parts of the model)

pallid current
#

especially surprising that there's little obvious benefit to the larger dicts

#

how long were they trained?

bitter turtle
#

yeah I feel like this is a decent summary stat for comparing different approaches; not sure what literature there currently is on this tradeoff

bitter turtle
#

one epoch

bitter turtle
#

I really want to compare to synthetic data now, and synthetic data with some noise

pallid current
# bitter turtle pile10k

is that like 2m activations?? should be a solid amount but would be interested to see if more changes anything

bitter turtle
#

yeah about that i think

#

lemmie check

#

about 1.5m

#

I mean it seems to converge pretty fast

pallid current
bitter turtle
#

yeah not sure if that's that significant

#

more testing with absurd dictionary sizes is probably called for

pallid current
#

wonder what happens as you take the size down

bitter turtle
#

hmm

#

why

#

I guess we might be able to better predict stuff if we can see

#

how the frontier changes with dict size and extrapolate?

pallid current
# bitter turtle hmm

to see if there's a clear point at which the marginal dict element stops adding value

bitter turtle
#

oh, for sure

#

I'll train some more now i've still got the data etc

pallid current
# bitter turtle Each model in own file, wdym?

yeah i'm imagining that each run (which should have its own folder as well cos it overwrites by default) gets saved as a folder which has the model in and then can also contain any additional data about that run, + autointerp

#

tho i do see why for like the l1sweep you just did it's easy to just keep them together, like its a question of whether we do more work on them as separate entities or as part of a larger collection

bitter turtle
#

I feel like it's not that much of an issue to save it as one big file, but I have moved the hardcoded config out of the sweep function (including output folder)

#

not sure it's that much of an issue, might be nice to store metrics with the models, but we can still just save it as a big list-of-tuples

#

seeing one gpu being weird, not sure what pytorch internals take up compute wise but hmm

pallid current
bitter turtle
#

could just put a filename/folder name/whatever

#

with the model

#

also, yikes yeah

#

that made me grimace

bitter turtle
#

should pbbly do tests with different underlying dataset sizes to see the relationship but it seems pretty unchanging

#

fadedness is l1

#

because it looks nice

#

kinda hard to distinguish

#

but idc

#

Probably going to implement dead neuron resuscitation tomorrow and see if that changes anything

#

(like, resurrection for the first 5 chunks or something)
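
One possible shape for dead neuron resuscitation, as a sketch only: it assumes a dictionary of shape `(n_feats, dim)`, a batch of codes of shape `(n_samples, n_feats)`, and re-initialises dead rows to random unit vectors. Other re-initialisation schemes are possible and nothing here is taken from the training code:

```python
import numpy as np

def resurrect_dead_features(dictionary, codes, rng):
    """Re-initialise rows of `dictionary` whose feature never fired in `codes`."""
    dead = (codes != 0).sum(axis=0) == 0                 # (n_feats,) boolean mask
    fresh = rng.normal(size=(int(dead.sum()), dictionary.shape[1]))
    fresh /= np.linalg.norm(fresh, axis=1, keepdims=True)  # keep new rows unit norm
    new_dict = dictionary.copy()
    new_dict[dead] = fresh
    return new_dict, dead

rng = np.random.default_rng(0)
dictionary = np.eye(4, 6)
codes = np.array([[1.0, 0.0, 0.0, 2.0],
                  [0.5, 0.0, 0.0, 0.0]])   # features 1 and 2 never fire
new_dict, dead = resurrect_dead_features(dictionary, codes, rng)
```

Doing this only for the first few chunks would mean calling it at the end of each early chunk and then leaving the dictionary alone.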

#

Hmm

#

Idk someone else theorise

pallid current
#

looking at the same data you had aidan i'm getting the sense that it probably hasnt converged at 10 epochs in terms of maximum sparsity for a given level of unexplained variance

#

or at least there's a jump from like epoch6 to 10

#

weird plot but in those lines of dots what you can see is the sparsity / unexplained variance tradeoff getting better at each epoch

#

might be because we're repeating data tho, i wanna try this setup but with fresh data

keen pivot
#

Looks like interesting stuff here, but I'm currently out of commission due to dental work today. Will hopefully be able to catch up/respond tomorrow. Will be meeting w/ Daniel M. today to talk about outlier dimensions.

bitter turtle
pallid current
keen pivot
bitter turtle
#

If it's just the models from my run that's 11 chunks cool ok 👌

pallid current
#

oh right is that one pile10k epoch?

bitter turtle
bitter turtle
pallid current
#

ok that's encouraging cos it looks like there's a fair way to go in terms of performance if we crank up the data

#

it would be brilliant if we could consistently associate a point on the sparsity/explained variance space with a level of interpretability

keen pivot
bitter turtle
#

If you want to run autointerp on all-or-some-of-those that would be Cool

bitter turtle
pallid current
keen pivot
pallid current
#

i guess if we're doing regressions over it, it doesn't matter that much if the individual measurements for a dict are noisy

keen pivot
#

Is there a way to measure unitaryness of our encoder matrix?

bitter turtle
#

I don't think we need to.

#

I don't see anything particularly special about unitary matrices vs other non-sparse dictionaries

#

Like, sure, it's probably more optimal to learn unitary matrices at lower L1 values but that's kind of just emergent. Like I don't see the causality going from unitary -> uninterpretable directions, it's more low sparsity requirements -> entangled features and low sparsity requirements -> unitary matrices.

keen pivot
bitter turtle
pallid current
#

i dont think we can meaningfully talk about unitary non-square matrices when the number of features exceeds the activation dimension

#

and we know from the fact that the sparsity reaches those v large levels that its not just a unitary matrix + a load of empty rows

bitter turtle
#

Could still plausibly be some rotation-ish-type-thing of that but I don't think it's a particularly good line of inquiry

pallid current
#

rotation of zero would be zero, but yeah i'm also not that interested (edit haha fair)

bitter turtle
#

Haha

#

Maybe we should actually look at annealing LR

pallid current
bitter turtle
pallid current
#

been working on getting the synthetic dataset for ablations done, havent had that much time today but getting there, off for a quick bday drink, will finish in the morn 🙂

bitter turtle
#

@pallid current all dict sizes have similar numbers of neurons that fire at least once every 10k samples at a given sparsity level; this probably makes comparing autointerp between different dict sizes easier, but is also probably something we definitely don't want to be happening
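
A small sketch of the count being described, assuming `codes` is an `(n_samples, n_feats)` array of dictionary activations; the function names are hypothetical:

```python
import numpy as np

def n_active_features(codes, window=10_000):
    """Number of features with at least one non-zero activation in the first `window` samples."""
    fired = (codes[:window] != 0).any(axis=0)
    return int(fired.sum())

def firing_frequencies(codes):
    """Fraction of samples on which each feature is non-zero."""
    return (codes != 0).mean(axis=0)

codes = np.array([[0.0, 1.0, 0.0],
                  [0.0, 0.3, 0.0],
                  [2.0, 0.0, 0.0],
                  [0.0, 0.1, 0.0]])

active_all = n_active_features(codes)
active_first_two = n_active_features(codes, window=2)
freqs = firing_frequencies(codes)
```

Raising `window` from 10k to 100k only changes the count if some features fire rarely enough to be missed in the smaller sample.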

#

one sec let me loglog this for readability maybe

#

uh this is slightly weird

#

I feel like my data is noisy ill up it to 100k

#

okaaaay

#

that... did not change the y scaling at all, the higher sparsity dicts are just learning the same/lower numbers of features/zeroing more features

#

I feel like we may see some changes in these plots comparing tied & norm vs not

keen pivot
#

I'm going to write some simple code to find outlier dimensions so we can ignore them, though I may need @bitter turtle 's help for integration (because I'd like to see variance explained vs sparsity).

Current thought:

  1. find outlier dimensions
  2. change mlp_width by # of outlier dims
  3. Index by outlier dimensions when running data through

For variance/sparsity: the outlier dimensions will be 100% explained but account for a permanent +1 in sparsity for every outlier dim.

If this does improve things, we'd need to compare w/ leaving out random dimensions as opposed to outlier ones
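
A minimal sketch of steps 1 and 3, assuming "outlier" means largest mean absolute activation (the actual criterion may differ) and that activations are `(n_samples, dim)`:

```python
import numpy as np

def find_outlier_dims(acts, k=2):
    """Indices of the k dimensions with the largest mean absolute activation."""
    scale = np.abs(acts).mean(axis=0)
    return np.sort(np.argsort(scale)[-k:])

def split_outliers(acts, outlier_dims):
    """Return (activations without the outlier dims, boolean keep mask)."""
    keep = np.ones(acts.shape[1], dtype=bool)
    keep[outlier_dims] = False
    return acts[:, keep], keep

rng = np.random.default_rng(0)
acts = rng.normal(size=(1000, 16))
acts[:, 3] *= 50.0       # plant two outlier dimensions in the toy data
acts[:, 11] *= 30.0

outliers = find_outlier_dims(acts, k=2)
reduced, keep = split_outliers(acts, outliers)
```

The random-dimension baseline would reuse `split_outliers` with indices drawn at random instead of from `find_outlier_dims`.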

keen pivot
#

For Pythia_1.4b, there are several dictionary features that activate a large fraction of the time. Percentages of non-zero activations:

tensor([0.5632, 0.4849, 0.4426, 0.2947, 0.2258, 0.2044, 0.1533, 0.1482, 0.1231, 0.1150])
So the first two activate half the time.

keen pivot
#

Replicating some "ablating outlier dimensions affects model performance a lot relative to other dimensions", I ablate the top-10 outlier dimensions (ie the residual stream dimensions w/ the highest activations) both on their own & cumulatively. The majority of perplexity diff is caused by ablating the first two dimensions.

Notably, ablating the first two dimensions together causes worse performance than ablating each individually, meaning they have overlapping mechanisms in the model (which mechanism, who knows).

#

What's important for dictionary learning is I can just keep the first 2 outlier dimensions (& hope those dimensions don't also do feature representation), do dictionary learning on the rest of the dimensions.

We can also see what happens if we do dictionary learning on other sets of outlier dims (e.g. just the top outlier, top-5, etc)

keen pivot
pallid current
#

script is same folder, frequency_plot_h.py

keen pivot
#

Ah, I think it's frequency_plot.py

bitter turtle
keen pivot
#

This is Pythia-70m, but need to check against normal runs

bitter turtle
#

Yeah wasn't expecting it to be too significant; could you put the normal runs when you do them?

keen pivot
bitter turtle
#

not an error, just a vmap warning, you can suppress warnings with python -W ignore filename.py, also about 15m maybe?

keen pivot
#

The "remove outlier dimensions" one is surprisingly much worse. Maybe a coding error on my part.

bitter turtle
#

Are you still considering those dimensions when calculating unexplained variance?

#

like, how/where are you ablating them?

keen pivot
#

I'm not considering them when calculating

#

I just remove those dimensions from the batch every time.

#

I'm unsure how much adding back in the outlier dimensions for unexp var. will improve, but the results look a bit dramatic

bitter turtle
#

how are you removing those dimensions?

#

like, projecting the dimension to zero?

#

Could you send the code?

bitter turtle
#

also @pallid current @keen pivot are you guys currently using sparse_coding_aidan for work? It would make sense to have your own (stable) copies of them if so

keen pivot
keen pivot
# bitter turtle Could you send the code?
sample = dataset[sample_idxs].to(torch.float32)
indices = torch.tensor([i for i in range(sample.shape[1]) if i not in outlier_dimensions])
sample = torch.index_select(sample, 1, indices)
bitter turtle
#

as of just now yes

bitter turtle
#

probably equivalent

keen pivot
#

They're near equivalent. The only difference is the shape of training across time (the left one is outliers out, and right is original)

keen pivot
bitter turtle
keen pivot
bitter turtle
#

for free?

keen pivot
# bitter turtle for free?

Because I just added the outlier features back in, so it's just adding in an extra 2 0-dimensions which affects the .mean()
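
To put a number on the `.mean()` effect: if the per-element squared error is averaged over `d` kept dimensions and `k` exactly-reconstructed dimensions are appended, the mean shrinks by a factor of `d / (d + k)`. The sizes below are illustrative, not taken from the runs:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 510, 2                                   # illustrative sizes only
sq_err = rng.normal(size=(64, d)) ** 2          # per-element squared error on the kept dims
# Outlier dims are reconstructed exactly, so they contribute zero error.
padded = np.concatenate([sq_err, np.zeros((64, k))], axis=1)

ratio = padded.mean() / sq_err.mean()           # equals d / (d + k)
expected = d / (d + k)
```

With two dimensions out of a few hundred the factor is well under one percent away from 1, so on its own it can only account for a small shift in the curve.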

bitter turtle
#

doesn't look like enough

#

my hypothesis here is that the model dedicates a couple-or-more features to learning these outliers slightly noisily, and loses performance on them. more interested in training where you train the dict on not those directions

keen pivot
#

Ah, well the performance gain isn't great. Of course you'd expect it to perform better initially because the "remove outlier" one perfectly reconstructs the outlier dimensions, while the other one is still learning to represent them.

keen pivot
bitter turtle
#

to cause the change just by adding in the 0s

#

should I copy the directory for hoagy?

keen pivot
bitter turtle
#

sure

#

just eyeballing it

pallid current
bitter turtle
#

cool

#

I moved to sparse_coding_aidan_new until @keen pivot can move then I'll move back

#

don't want to overwrite your files etc

keen pivot
#

They look very similar

#

I'm done!

bitter turtle
#

np

bitter turtle
keen pivot
bitter turtle
#

oh ok cool ill delete the old folder

pallid current
#

think i've got the code in place to do the ablation tests but currently being blocked by getting super bad outputs from the simulation which is odd

bitter turtle
#

oh cool! good luck with that that sounds awful to debug

#

doing some runs with sparse activations + symmetric noise to see if I can get anywhere close to replicating the weirdness seen here; if not I'll scale down my ambitiousness and test for fragility etc etc etc

pallid current
#

toy data?

bitter turtle
#

yes

#

promising just need to find scale hopefully but im off to bed

pallid current
bitter turtle
#

not sure wym

pallid current
#

like text of [john raised $ 6 million], activation [0, 0, 10, 0, 0], gpt4 interp: "currency symbols, esp $", gpt3.5 simulated data: [0,2,9,9,2]

#

gets a score of like 0.3

#

so the explanation is basically perfect, but the simulated data is pretty crap (actually worse than impression in this example)
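
For reference, a sketch of scoring that one snippet, assuming the score is the Pearson correlation between real and simulated activations (an assumption about how the scorer works, and the real score is computed over many more tokens than this):

```python
import numpy as np

def correlation_score(true_acts, simulated_acts):
    """Pearson correlation between real and simulated activations."""
    t = np.asarray(true_acts, dtype=float)
    s = np.asarray(simulated_acts, dtype=float)
    return float(np.corrcoef(t, s)[0, 1])

true_acts = [0, 0, 10, 0, 0]     # "john raised $ 6 million", fires on "$"
simulated = [0, 2, 9, 9, 2]      # simulated activations from the example above
score = correlation_score(true_acts, simulated)
```

The simulated spike on the token after "$" is what drags the correlation down even though the explanation itself is right.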

#

but it's not like there's a load of free params that go into the simulate function, its super simple

#

also seems like it should be trivial for 3.5

#

hahaha i got such a fright when i tried to look at the prompt as it's built internally because it looks dreadful e.g. six\tunknown\n-\tunknown\nyear\tunknown\n deal\tunknown\n but\tunknown\n Str\tunknown\nud\tu but they just add looads of unknowns into the prompt and then get the logits at every position, its super weird

#

but as far as i can tell it's creating the prompt correctly

#

maybe gpt-3.5 is just quite crap at the task??

#

annoyingly i can't switch to gpt-4 super easily, the instruction models use a different endpoint

keen pivot
keen pivot
pallid current
#

(whoops forgot to hit enter) this is for an experiment we'd like to have to verify that the model is in fact using the directions we've found to compute those features. want to check that

  • the feature is more clearly separated in the residual stream after the MLP layer with the feature than before
  • this is no longer true if you ablate our feature
#

and for this we need a dataset of 'is the feature on' which is basically what the autointerp stuff already has

bitter turtle
#

Is this the same thing as before?

bitter turtle
#

had some ideas about using residuals/skip-connections in the multi-layer autoencoder, found out that it is basically this http://yann.lecun.com/exdb/publis/pdf/gregor-icml-10.pdf, testing now, looks GREAT so far, some training instability, but I think I just need to vary the lr for the encoder + dict separately (training both with the same LR atm, probably shouldn't, or should train both for a while then freeze dict and finish training the encoder)
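
For anyone who hasn't read the linked paper, a minimal sketch of the forward pass of a LISTA-style encoder, i.e. ISTA unrolled for a fixed number of steps. Here the matrices are set to their plain ISTA values for a dictionary `D` of shape `(n_feats, dim)`; in LISTA they would be learned. This is not the architecture being trained above, just the textbook idea:

```python
import numpy as np

def soft_threshold(v, theta):
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def lista_encode(x, W_e, S, theta, n_steps):
    """Unrolled encoder: z <- soft(x @ W_e + z @ S, theta), starting from z = 0."""
    b = x @ W_e
    z = soft_threshold(b, theta)
    for _ in range(n_steps):
        z = soft_threshold(b + z @ S, theta)   # the z @ S term is the skip-like connection
    return z

def ista_init(D, l1):
    """Parameters that make lista_encode reproduce plain ISTA for dictionary D."""
    L = np.linalg.norm(D, ord=2) ** 2          # Lipschitz constant of the quadratic term
    W_e = D.T / L
    S = np.eye(D.shape[0]) - D @ D.T / L
    return W_e, S, l1 / L

def lasso_objective(x, z, D, l1):
    return 0.5 * ((x - z @ D) ** 2).sum() + l1 * np.abs(z).sum()

rng = np.random.default_rng(0)
D = rng.normal(size=(64, 16))
D /= np.linalg.norm(D, axis=1, keepdims=True)
z_true = np.zeros((8, 64))
z_true[np.arange(8), rng.integers(0, 64, size=8)] = 1.0   # one active feature per sample
x = z_true @ D

W_e, S, theta = ista_init(D, l1=0.05)
obj = [lasso_objective(x, lista_encode(x, W_e, S, theta, k), D, 0.05) for k in (0, 5, 50)]
```

With the ISTA initialisation the objective can only go down as more steps are unrolled; training `W_e`, `S` and `theta` is what lets a few steps do the work of many.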

bitter turtle
#

@pallid current are you running interpret.py? there is ~2GB on GPU0 and I presume that is you

bitter turtle
#

forgot to norm dict 😅

bitter turtle
#

this is on pythia-70m layer 2, looks much the same

#

(except for the insanely low-sparsity case)

#

lemmie get a comparison real quick

bitter turtle
#

cool, still significantly worse than normal, but at least it is actually training now, unlike before

#

I feel like this is a slight step forward

keen pivot
bitter turtle
#

well, we couldn't get it to converge at all before, and it is now

#

ideally we want to have a number of different approaches we can try out to find the best one, and this one is pretty close in perf to our best one

pallid current
bitter turtle
#

we have the reading group thingy today right? in about an hour? @pallid current @keen pivot

pallid current
#

@bitter turtle have you looked at what the sparsity / unexplained var graph looks like for non noisy toy data? would be good to be able to show the difference between that and the pythia results, especially if there's a very clear difference

bitter turtle
pallid current
#

cool nw

bitter turtle
#

can we also do a restart at some point soon, I think there's like 2GB orphaned data just chilling on GPU0

keen pivot
#

Do you suppose models across time have more superposition? It's gotta learn it sometime, so maybe we could measure it somehow w/ dicts or maybe a tool more suited for it?

#

@bitter turtle , What's the experiment w/ adding noise? (or is this the de-noising encoder?)

bitter turtle
#

i can check in ~5m wasn't last time

pallid current
#

top looks pretty empty now, i did leave a pdb open overnight so i think it was that sozzz

bitter turtle
#

yep looks good now!

#

nw

keen pivot
bitter turtle
# keen pivot What is "the weirdness seen here"?

like, we should expect dictionary learning approaches to be better/there to be a range of l1 values converging on the same solution under the assumption of the activations being well-described by a sparse basis

keen pivot
#

Although they have more nonzero activations (higher sparsity), this could just be allowing more noise.

#

Does that make sense?

bitter turtle
#

yes, for sure, but we don't have access to the ground truth so we can't compare to that

#

oh, sorry, I see what you mean

bitter turtle
keen pivot
# bitter turtle not sure about this

Agreed. This (ie higher sparsity is just more low-activating noise, not a significant difference in features found) is just one hypothesis. The MCS across different sparsities may better capture this.

bitter turtle
#

I think that we can't really say until I compare to the curve for truly-sparse synthetic data

#

if it turns out that the same curve exists for that, the metric probably can't be used for this kind of analysis

#

I do think that the metric is useful from a pragmatic perspective; in the absence of one true ground truth, more sparse decompositions are maybe intrinsically valuable lenses to view activations through

keen pivot
bitter turtle
#

yep maybe

keen pivot
#

You could have the sparsity/variance explained graph, but also track MCS across dictionaries. If models do have high-MCS w/ nearby sparsities, then that's good evidence for them converging on the same decomposition.

bitter turtle
#

good idea

keen pivot
#

Thanks!:)

#

I'm currently working on dictionaries across different layers. I could code up the MCS across layers one w/ your repo tomorrow.

#

I think I'd also want to train models on more data (like 30 chunks w/ pile?), but I think y'all had experiments showing more training data didn't really affect the sparsity/variance explained?, but that's different than MCS.

bitter turtle
#

latest results from using the more complex, multi-layer denoiser; looks almost slightly better than dictionaries? number is #chunks

#

changed initialization + switched to GELU

#

will do no-noise test w/ synthetic data + normal methods next

keen pivot
#

more complex one is on left?

bitter turtle
#

Yep

#

4x dict ratio, haven't tried larger ratios yet

keen pivot
# bitter turtle Yep

Oh, sorry, I forgot discord might change the order of images

More complex one has highest sparsity ~400 on x-axis?

bitter turtle
#

More complex one is the one with 8 plots

#

Which dicts are those?

keen pivot
#

This is just MCS for Pythia-70m residual across multiple layers

bitter turtle
#

Oh, sickkkk: what do the same layer ones look like?

keen pivot
#

Ah, it'd be a perfect match? I don't have two same-sized trained dicts of diff-initializations to compare against

bitter turtle
#

I think that's a really important baseline to explore

#

Have you tried the testbed thingy yet?

keen pivot
bitter turtle
#

The standard_metrics.py thing; using a common interface for many different dict types

keen pivot
#

I've seen that file, but currently still in my old repo since these dicts are pickles

bitter turtle
bitter turtle
keen pivot
keen pivot
bitter turtle
#

yep

#

Back on sun

keen pivot
#

Looks like Hoagy thumbs-up reacted to it, so I'm assuming he's got it handled!

pallid current
pallid current
keen pivot
keen pivot
pallid current
keen pivot
#

Would be interested to compare the two when training for 10x longer

bitter turtle
bitter turtle
#

Whoops

bitter turtle
keen pivot
#

For MCS across dicts of different sizes (as a baseline that's better, but not as good as dicts of same size/diff init). Notably layer 5 sucks. Also, layer 2 was trained differently than the others, but I don't have the hyperparams or amount of training data on hand.

pallid current
#

is layer 5 just before it goes into the unembedding matrix?

keen pivot
#

Yep

pallid current
#

that's very interesting, especially that it's not bimodal

#

someone asked me about this recently, like how confident are we in the model where the residual stream is written to and read from, while the mlp calculates updates to the residual stream, but doesn't really hold much of the information

#

i think the success of the logit/tuned lens is the main point of evidence for it but i couldnt think of much other empirical evidence

#

i think if that were true and sparse coding was working perfectly we would expect bimodality

bitter turtle
#

not sure what it would mean for the mlp activations to hold information in a way that would meaningfully invalidate that model

#

the residual stream is the data moving between layers

pallid current
#

so like, when you calculate resid4 = resid3 + mlp3 + attn3, the assumption is that the mlp does not contain most of the information in resid3, its instead calculating smaller volumes of new information. and therefore, for information to persist between layers, it must be present in resid4 in the same form that it's present in resid3

#

whereas, if mlp3 contained most of the info, there would be nothing preventing resid4 from having a very different representation to resid3

bitter turtle
#

right, volumes of information, I see.

#

I feel like the distribution of bimodality across layers is very dependent on the way the gradients flow through and I can't model that that well; I think there was a paper looking at something ~this

#

You should ask the tuned lens people

#

how many reinitialisations have you done @keen pivot

#

I've asked in #interpretability-general

keen pivot
# pallid current so like, when you calculate resid4 = resid3 + mlp3 + attn3, the assumption is th...

My brain is sliding off this. What are the two different hypotheses being compared here & the evidence for either (and bimodality of what? MCS across layers?). My attempt:
H1: residual stream is the memory of the model that is written to & read from by e.g. MLPs. Each MLP only does a small change to the residual stream, so the representation should be mostly the same across layers.
H2: Most of the information goes through MLPs, so we shouldn't expect similar representations across layers

#

Oh, and for the record, I also worked on tuned lens

bitter turtle
#

oh sick

#

oh, logan smith

#

is that you?

bitter turtle
keen pivot
#

Okay, the across dict stuff looks like I need to do ACDC ablation stuff:

  1. Find a cool feature in layer 5 (check)
  2. Ablate all features one-at-a-time in layer 4 & sort by drop in feature activation in 5 (could also ablate all features, and then restore one-at-a-time)
  3. Investigate those features found
  4. Repeat for features found in layer 4.
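
Step 2 could be sketched roughly like this. The model pieces are hypothetical stand-ins passed in as callables (`layer_fn` maps layer-4 residual activations to layer-5 ones, `encode5` is the layer-5 dictionary encoder), and the toy at the bottom only exists to make it runnable:

```python
import numpy as np

def rank_upstream_features(codes4, dict4, layer_fn, encode5, target_idx):
    """Ablate each layer-4 feature in turn and rank by the drop in one layer-5 feature."""
    def target_activation(codes):
        resid4 = codes @ dict4                       # reconstruct layer-4 activations
        return encode5(layer_fn(resid4))[:, target_idx].mean()

    baseline = target_activation(codes4)
    drops = np.zeros(codes4.shape[1])
    for i in range(codes4.shape[1]):
        ablated = codes4.copy()
        ablated[:, i] = 0.0                          # zero-ablate feature i
        drops[i] = baseline - target_activation(ablated)
    order = np.argsort(-drops)                       # biggest drop first
    return order, drops

# Toy stand-ins: identity dictionary, a linear "layer", ReLU encoder.
rng = np.random.default_rng(0)
n_feats = 6
dict4 = np.eye(n_feats)
mix = np.zeros((n_feats, n_feats))
mix[2, 0] = 1.0          # layer-5 feature 0 reads mostly from layer-4 feature 2
mix[4, 0] = 0.3          # and a little from feature 4
layer_fn = lambda resid: resid @ mix
encode5 = lambda resid: np.maximum(resid, 0.0)
codes4 = np.abs(rng.normal(size=(100, n_feats)))

order, drops = rank_upstream_features(codes4, dict4, layer_fn, encode5, target_idx=0)
```

The ablate-all-then-restore variant would start from all-zero codes and measure the gain from restoring each feature instead of the drop from removing it.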
bitter turtle
#

please do that that would be fucking sick

#

as in, would look amazing on a paper

keen pivot
#

I should be able to do in like 4-5 hours, which will be a tomorrow thing. I'm about to go watch barbie though.

bitter turtle
#

it's v good enjoy

#

@pallid current ran the test on no-noise, all l1 values (tested over the normal range here, same as the last one) converge to low-sparsity solutions (btw the sparseness measure I built for wandb is broken atm) as we predicted. falloff for lower-sparsity solutions is similar

#

this is Quite Weird.

#

like, the really low l1 ones still converge to sparse solutions, I think this is maybe just a search-not-wide-enough thing

pallid current
#

riiight so in that case the sparsity at about 7.5 is like the true amount?

bitter turtle
#

still, I think it is ~what we were expecting

pallid current
#

ok so yeah seeing super big differences in the role of noise

bitter turtle
#

phew

#

this is progress then

#

I had an idea for avoiding noise; instead of penalising l2-norm of residuals, we should penalise cross-entropy against a normal distribution with a learned covariance matrix (and also probably penalise the size of the covariance matrix)

#

this might probably turn out to be equivalent, but who knows. probably people do, but I'm not them

#

I expect this to be equiv

pallid current
#

hmmm so the only loss signal would be from our ability to fit to this learned cov matrix?

#

so basically at that point the learned cov matrix is assumed to encode all of the important information?

bitter turtle
#

yeah, this is assuming that 'that which is not sparse is ~normally distributed'

pallid current
#

i feel like that would throw away most of the important info, because it wouldn't be able to represent spikes in the distributions properly. like openai's approach is basically to take maximum distance away from the normal

#

and i think you'd still want to do that even if you had a learned cov matrix

#

tho i guess that's the diff

bitter turtle
#

oh, so, we learn a sparse dict, but replace the 'minimise residuals' thing with 'make the residuals fit a normal distribution with low variance'

pallid current
#

riiiiiight sorry i getcha

#

seems rather convoluted but i guess it could work. for me the question is, even if this is what we'd expect to see in practice, is there a reason that we think that minimizing l2 norm is the wrong thing to aim for?

#

like there are cases where fitting the normal dist is wrong, because that variation can in fact be captured

#

whereas i'm struggling to picture the case when performance would get worse by trying your best to minimize l2, even if some noise is irreducible

bitter turtle
#

yeah pretty much agree

pallid current
#

interesting idea tho, shades of VAE about it which also made me go ?? at first but works

bitter turtle
#

desmos crushes my dreams once again

#

this is literally quadratic

#

I guess this should be expected, linreg people know what they're doing
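
Spelling out the "literally quadratic" point, as a sketch under the assumption that the residual $r \in \mathbb{R}^d$ is scored against a zero-mean Gaussian with covariance $\Sigma$ (symbols introduced here, not from the code):

```latex
-\log \mathcal{N}(r;\, 0, \Sigma)
  = \tfrac{1}{2}\, r^{\top} \Sigma^{-1} r
  + \tfrac{1}{2} \log\det \Sigma
  + \tfrac{d}{2} \log 2\pi

% Isotropic case, \Sigma = \sigma^{2} I:
-\log \mathcal{N}(r;\, 0, \sigma^{2} I)
  = \frac{\lVert r \rVert_2^{2}}{2\sigma^{2}}
  + d \log \sigma
  + \tfrac{d}{2} \log 2\pi
```

For any fixed $\Sigma$ the only term that depends on the dictionary is the quadratic one, so with an isotropic covariance the objective is the usual squared l2 residual up to a scale and a constant; a learned non-isotropic $\Sigma$ just reweights directions of the residual.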

pallid current
#

linreg mafia undefeated

bitter turtle
#

for real

#

I could literally have done this in my head im dumb

bitter turtle
#

might be slightly more stable to train/converge better otherwise no ideas atm

bitter turtle
bitter turtle
#

tied vs untied complex multi-layer denoiser; seems about the same perf

pallid current
bitter turtle
#

yep

pallid current
#

btw got it working so i can run interp over a gigantic .pt of learned dicts

#

still the q of exactly what to run it over

bitter turtle
#

awesome

#

tied dicts are a bit weird on this ???

pallid current
#

?? what's the diff i thought the last graph was also tied

#

looks buggy

bitter turtle
#

I got that converging correctly this morning, it gets slightly better performance than normal at the cost of training speed

bitter turtle
#

fixed the bug

#

(wasn't telling the dict to be normed - again - when I was saving them) @pallid current

#

meant I lost first batch checkpoint

#

cool, that looks to be slightly better than untied!

bitter turtle
#

pushed current code, off for the weekend

pallid current
#

have fun!

pallid current
pallid current
#

ugh getting cuda.is_available = False again :/

#

no idea what's triggering it

pallid current
#

will restart in the morning if it's not magically fixed and have messaged curtis

#

feel free to restart if you need logan

bitter turtle
pallid current
#

oh hey, gotcha

#

btw is the dense_l1_sweep output saved somewhere, the untied ones?

bitter turtle
#

not atm

pallid current
#

ah k

bitter turtle
#

p sure you can just change the call and run it tho

#

Shouldn't be too much reconfig

pallid current
#

yeah will do once i restart the kernel

#

tho in general would be best to save exp funcs cos there's quite a lot of params just set in __main__

bitter turtle
#

sure

#

I'm just using pythia-70m layer 2 residual for everything ATM, literally just the data in activation_data

#

it's like the older training setup in that you can lie to it and just change the dataset folder and it won't check that it's the right dataset folder

bitter turtle
#

I also doubled the batch size, I trained the other ones with 1024

keen pivot
bitter turtle
#

that's not even a bad idea wth

#

That would be mildly funny if that worked

keen pivot
#

It's also sounding more like the soft-prompt literature, which began simple, but expanded to more transformer-like models to train the soft-prompts.

bitter turtle
#

I mean, you probably actually wouldn't want to do that, you need to propagate magnitudes through, and +ve activations definitely aren't centered, but it would be funny

bitter turtle
#

layernorm would kill that

#

I don't think the trade-off is good enough to spend a lot of energy looking into it though, maybe if we get stuck after training on lots of data

#

like, potentially once the dict converges with linear encoders we could freeze it and train this to do better sparse coding for that dict, and maybe iterate, but that's a while off. More interested in looking at the shittons-of-data case atm

keen pivot
bitter turtle
#

yes, we should do it soon

keen pivot
#

Got a graph of related features. The original one is related to words in parentheses, & the others appear to be similar as well. Will look into the details soon

pallid current
#

😮

keen pivot
#

I also want to cluster feature directions as well. In general, and here we could color-code features by their similarity, because some of these may just be the same direction across layers.

pallid current
#

currently running a 100-chunk sweep on the pile with the dense l1 sweep

keen pivot
#

Note: this took 6 minutes, which isn't too long, but will get longer w/ larger models/more layers

pallid current
#

how are you running this?

pallid current
#

definitely seeing diminishing returns, up to 20 epochs atm and it's still improving but barely

#

continued improvement is most noticeable at low tokens per activation_v tho so might still see a decent jump by 100 epochs

#

this is with dict_ratio = 4, so we can also see if this looks different with higher dict ratios

bitter turtle
#

what is 'low tokens per activation_v'

pallid current
#

just low sparsity (on the graph) , except that is what we'd usually call high sparsity so its confusing

bitter turtle
#

yep

pallid current
#

clearly i didnt improve the situation 😆

bitter turtle
#

call the x-axis thing 'sparsity number' maybe

pallid current
#

think i might go for 'average active features'

#

dont know why i keep calling features tokens

bitter turtle
#

yeah that was confusing

keen pivot
bitter turtle
keen pivot
bitter turtle
#

Didn't someone speed it up by like 200x using activation patching or whatever it's called recently?

#

Think they were in the UK seri mats cohort

keen pivot
#

Looks like overall, this direction in the last layer wants to up-weight an end-parenthesis, and there's two paths: end of acronyms & end of dates after an opening parenthesis

bitter turtle
#

are there established metrics for goodness-of-graph? thinking we could use something Atticus Geiger-like to measure how descriptive graphs we find using the sparse basis are compared to how descriptive graphs we find on the neuron basis

keen pivot
bitter turtle
keen pivot
bitter turtle
#

for sure want to connect to MLP as well tho

pallid current
bitter turtle
#

I think he has a metric for 'alignment of causal hypothesis to model' and so basically we

  • find a graph with adcd or whatever the acronym is
  • come up with a hypothesis for what each node in the graph represents in an abstract computational model of the circuit
  • throw the metric at it, which compares the abstract model to the model in the transformer
#

if we can find 'natural' circuits/graphs that are well described by abstract high-level causal models that would literally be fucking insane

#

I was thinking today about what kind of things we really want for a draft paper if we go that route and demonstrating the ability to find circuits in models using the sparse basis is for sure up there like number 1 priority

keen pivot
#

I guess it's the "computational" part that doesn't work, but I do think you can still do causal alignment here.

#

Especially w/ the speedup, I could quickly find really good examples to test.

#

Slightly different graph: This is for feature restoration. Basically, I set the activation to 0 & check how restoring that feature accounts for the original activation.

bitter turtle
#

Also, we should probably be using FISTA/OMP etc for doing compute stuff

#

Or like, generally some better solver than dot + bias

bitter turtle
keen pivot
# bitter turtle What do the % here represent?

Percentage recovered or ablated. This case percent recovered.

So if the original activation was 5, and we ablate everything and recover one feature, how much of the original activation do we recover?
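
To make the "percent recovered" number concrete, here is a minimal sketch of the arithmetic. The function name and the assumption that the fully ablated baseline is 0 are illustrative, not taken from the actual graph code.

```python
def percent_recovered(original: float, restored: float, ablated: float = 0.0) -> float:
    """Share of the original activation that comes back after restoring one feature.

    `ablated` is the activation with everything ablated (assumed to be 0 here).
    """
    if original == ablated:
        raise ValueError("original and ablated activations are identical")
    return 100.0 * (restored - ablated) / (original - ablated)


# Hypothetical numbers: original activation 5.0, restoring one feature brings it back to 2.0
recovered = percent_recovered(original=5.0, restored=2.0)
print(f"{recovered:.0f}% recovered")
```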

pallid current
#

ran another long run with l1 sweep, this time tied, seems very definitively no difference, to the point where i'm checking i'm not just plotting the same data twice (don't think i am) (crosses are tied, legend is n_chunks)

pallid current
#

set off a big run to analyse 50 feats from all 32 tied l1 values, about $350, no idea how long it'll take, i guess a few hours?

bitter turtle
pallid current
#

autointerp results from the sweep are odd and kinda worrying. not seeing the rise from low to mid l1 values that i expect, or at least not as robustly. then goes off the rails as we get to high l1 vals. need to adjust the approach to increase the number of features analysed to correct for the fact that most will be dead or v rarely active

#

need to run l1 = 0 runs to see if they are much worse than l1 = 1e-4

#

i'm quite worried by the fact that 1e-4 is so good

pallid current
#

have left a run ongoing while going to bed, will almost certainly hang at the end due to wandb issues, if its preventing anyone from doing a run, just kill

keen pivot
bitter turtle
#

Ah, but to measure it we need to

  • make a choice of which graph to look at
  • make a hypothesis of which algorithm the graph implements
    which seems hard to make standardised when comparing neuron basis and sparse basis
#

I mean you can probably standardise the first one fine

#

Hmm maybe it's ok

keen pivot
#

Im also making different choices to make the graph (eg ablation vs restoration), which can be compared against each other as well

pallid current
#

trend seems about right if you just look to l1=1.6e-4 but those last few ones around 1e-4, esp 1e-4 are just weird

#

about to run 0, 1e-7, 1e-6 and 1e-5

#

btw im pretty sure that we're overloading wandb when we do our final upload, leaving it to timeout indefinitely, i think for big runs for now we should just turn it off - i haven't been looking at it at least

keen pivot
#

@pallid current, would you also be able to see MCS between different l1-value dictionaries? If there are N l1-values in the graph, this could be shown as the lower-triangle of an NxN matrix.
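
A rough sketch of what that lower-triangle matrix could look like, assuming each dictionary is a NumPy array with one feature per row. `mmcs` and `mmcs_lower_triangle` are hypothetical helper names, not functions from the repo.

```python
import numpy as np


def mmcs(dict_a: np.ndarray, dict_b: np.ndarray) -> float:
    """Mean over features in dict_a of the max cosine similarity to any feature in dict_b."""
    a = dict_a / np.linalg.norm(dict_a, axis=1, keepdims=True)
    b = dict_b / np.linalg.norm(dict_b, axis=1, keepdims=True)
    return float((a @ b.T).max(axis=1).mean())


def mmcs_lower_triangle(dicts: list) -> np.ndarray:
    """N x N matrix with MMCS(dicts[i], dicts[j]) for j < i, and NaN elsewhere."""
    n = len(dicts)
    out = np.full((n, n), np.nan)
    for i in range(n):
        for j in range(i):
            out[i, j] = mmcs(dicts[i], dicts[j])
    return out


# Stand-in random dictionaries, one per l1 value (3 values, 64 features, 16 dims)
rng = np.random.default_rng(0)
dicts = [rng.standard_normal((64, 16)) for _ in range(3)]
matrix = mmcs_lower_triangle(dicts)
print(np.round(matrix, 2))
```

Note that MMCS is not symmetric in general (it averages over the features of the first dict only), so which dict goes on the rows is a choice worth stating on the plot.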

#

I said I'd get to it yesterday, but doing the causal alignment stuff atm.

#

I also want to do the outlier features, but out-of-scope for this project. Might try to pawn it off.

pallid current
#

ran the mmcs matrix, got some very simple plots running in the notebook mmcs_plots.ipynb in my workspace

#

most interesting thing i can see is that there's some kind of peak around l1=0.001 where even the highest l1 values near l1=0.01 match most closely to 0.001, i guess because those are the features they would learn, if they weren't mostly dying

#

peak mmcs in the whole matrix is just above 1e-3

#

low mmcs match best with other low mmcs though there is a noticeable hump around 2e-3

#

oh also got the results back for the super low l1 baselines. they dont outperform neuron_basis on random but do slightly for top and top-random, so there's something just by the nature of putting the data through a relu that is causing some level of screening, need to be careful about this when making claims and baselining

keen pivot
pallid current
#

heat map, tho i think it's quite hard to read

bitter turtle
#

Do you have a legend?

pallid current
keen pivot
#

There do appear to be two clusters here

hallow wyvern
#

Hey, coming here from the mech interp discord where you presented this week. Amazing stuff.

I was looking into using your techniques for interpreting the TinyStories series of models, but as the first step in doing that I'm trying to come as close to reproducing your results with your codebase as I can. Couple of assumptions I'm making on my end while trying to do that which I am not sure are correct:

  1. The canonical repo for this work is https://github.com/HoagyC/sparse_coding.git : This one seems to be ahead of all of its forks
  2. The canonical way to run this code is to run python run.py <args> : This is what it says in README.md, but I do note a lot of recent activity in the files big_sweep.py and big_sweep_experiments.py. I also see changes to interp_notebooks/feature_interp.ipynb that are more recent than the changes to run.py, and the definition of class AutoEncoder is different in each one.
  3. The canonical way to verify that the artifacts are finding real feature directions is to run interp_notebooks/feature_interp.ipynb.

Asking because I did a run with python run.py --epochs=3 --save_after_mini=True --l1_exp_low=-14 --l1_exp_high=-10 --dict_ratio_exp_low=1 --dict_ratio_exp_high=7 --layer=2 --use_residual=True --use_wandb=True --wandb_entity=<my_name>, and it did generate outputs in outputs/20230729-195836/0/auto_encoders_2.pkl, but the results of interp_notebooks/feature_interp.ipynb seemed a bit off (after I did some extremely sketchy stuff to get that notebook to run at all).

Not urgent, I'm trying out seeing what happens when I use autoencoders.tied_ae.AutoEncoder in run.py to see if that helps

pallid current
#

hey 🙂 yeah that's the right repo and those arguments look reasonable. it's in a funny state because run.py was the original code to run but we've been doing a lot of large hyperparam sweeps with multiple GPUs and Aidan rewrote the code with a very different architecture so i barely know what run.py does at this point, i'm happy to talk through for a bit if it's not really working

#

we should do a cleanup that allows a simple run soon

#

but if l1 and reconstruction loss are both falling then it looks like it should be working ok but i don't know the status of feature_interp.ipynb, @keen pivot can you help?

hallow wyvern
#

If sweep(ensemble_init_func, cfg) is the up-to-date method for running one of these experiments, I can write an ensemble_init_func 🙂

hallow wyvern
#

well that looks promising

#

https://github.com/HoagyC/sparse_coding/compare/main...JoshuaDavid:sparse_coding:main#diff-9ef165e04d52c6850bd88e31d401d1eb218baf46492f5974292eb6d32f739350 is my initial crack at an ensemble_init_func which will do the same thing the old run.py did (minus the "compute and store activations if none exist" step). Currently running via

$ python run_using_sweep.py --epochs=1 --save_after_mini=True --l1_exp_low=-12 --l1_exp_high=-10 --dict_ratio_exp_low=1 --dict_ratio_exp_high=4 --layer=2 --use_residual=True --use_wandb=True --datasets_folder=activation_data/pile-10k-EleutherAI/pythia-70m-deduped-2/ --wandb_entity=joshuadavid

ETA is ~7 more minutes (I'm running on an instance with only one GPU)

Edit: there were two chunks. ETA is actually still <t:1690696320:R>

hallow wyvern
#

Run finished without errors at least

hallow wyvern
# pallid current hey 🙂 yeah that's the right repo and those arguments look reasonable. it's in a...

Yeah, it looks like using the stuff in big_sweep.py was probably the way to go. I'm still a little bit stuck in feature_interp.ipynb, since I'm not entirely sure how to convert an autoencoders.learned_dict.UntiedSAE into an AutoEncoder (or even if that's something I should be doing). But. It looks like it did something that is approximately what was done before.

When I take the size-2048 dictionary, and plot the pairwise cosine similarity of features in that dict (excluding pairs which are the same feature, i.e. feat_0 x feat_1, feat_0 x feat_2, ... feat_0 x feat_2047, feat_1 x feat_2 ... feat_2046 x feat_2047 but none of feat_0 x feat_0), I get a nice normalish-looking distribution centered around a cosine similarity of 0. And when I take the size-1024 dict and the size-2048 dict, and do the MCS thing there, there are some features that are learned by both. Though not as many as I might hope. Graphs in question, as well as the script to generate them, attached.
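
For reference, a minimal sketch of the within-dictionary check described above (this is not the attached script). It assumes the dictionary is a NumPy array with one feature per row, and uses a random stand-in dictionary rather than a trained one.

```python
import numpy as np


def off_diagonal_cosines(dictionary: np.ndarray) -> np.ndarray:
    """Cosine similarity of every distinct pair of features (feat_i x feat_j with i < j)."""
    unit = dictionary / np.linalg.norm(dictionary, axis=1, keepdims=True)
    sims = unit @ unit.T
    # k=1 skips the diagonal, so feat_i x feat_i pairs are excluded
    rows, cols = np.triu_indices(len(unit), k=1)
    return sims[rows, cols]


# Stand-in for a size-2048 dictionary over a 512-dim residual stream
rng = np.random.default_rng(0)
fake_dict = rng.standard_normal((2048, 512))
cosines = off_diagonal_cosines(fake_dict)
counts, edges = np.histogram(cosines, bins=50, range=(-0.25, 0.25))
print(f"{len(cosines)} pairs, mean cosine {cosines.mean():.4f}")
```

A random dictionary already gives a roughly normal histogram centered at 0, so that shape on its own does not show that the features are meaningful.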

Anyway, I have a suspicion that the issue here is just that I trained on all of 2 chunks for 1 epoch. I'll try setting up a 5 epoch run on 30 chunks overnight, see if that gets better results.

# Terrible hack of --epochs=0 to get the chunks into activation_data without having to use anything else in run.py
python run.py --epochs=0 --n_chunks=30 --save_after_mini=True --l1_exp_low=-13 --l1_exp_high=-12 --dict_ratio_exp_low=1 --dict_ratio_exp_high=2 --layer=2 --use_residual=True --use_wandb=False
# And retrain overnight
python run_using_sweep.py --epochs=5 --save_after_mini=True --l1_exp_low=-12 --l1_exp_high=-10 --dict_ratio_exp_low=1 --dict_ratio_exp_high=5 --layer=2 --use_residual=True --use_wandb=True --datasets_folder=activation_data/pile-10k-EleutherAI/pythia-70m-deduped-2/ --wandb_entity=joshuadavid
keen pivot
#

I think the minimal_feature_interp on my repo should be better. One caveat: if you’ve saved it as a pickle, my code will work. If not, you’ll need to replace the pickle load with torch.load()

#

Ya the MCS histogram should look much better than that (again, more data, and check the l1’s effect on sparsity)

hallow wyvern
keen pivot
#

That would mean upping the l1 term.

#

Are the pictures from retraining overnight on more data?

#

I’d be curious to see the results for MCS across two dicts

hallow wyvern
#

That's from the runs I looked at last night, haven't looked at results from overnight runs yet

Edit: speaking clearly is hard

keen pivot
#

Ah, what size is the model?

hallow wyvern
#

One of the runs in the sweep was a dictionary of size 2048 with l1 of 1.78e-3, wandb says sparsity of ~65.

Model is pythia-70m-deduped-2

keen pivot
hallow wyvern
#

doing that now

#

hm that's not good:

keen pivot
#

Hmmm… I can look into it in more detail tomorrow. Definitely doesn’t look right.

Random vectors would be around 0.4 I think for MCS so it’s kind of weird.

hallow wyvern
# keen pivot Hmmm… I can look into it in more detail tomorrow. Definitely doesn’t look right....

I think "MCS just under 0.2" makes sense for the best-of-2048 samples of random normalized 512-dimension vectors, at least based on some quick hacking about in a repl

>>> a = np.random.rand(512, 1024) - 0.5;
>>> b = np.random.rand(512, 2048) - 0.5;
>>> a /= np.linalg.norm(a, axis=0);
>>> b /= np.linalg.norm(b, axis=0);
>>> print("\n".join([
        f'{b:.2f}-{b+0.01:.2f}: {ct}'
        for ct, b in zip(*np.histogram(
            (a.T@b).max(axis=1),
            bins=100,
            range=(0.0, 1.0)
        ))
        if ct > 0
    ]))
0.12-0.13: ###
0.13-0.14: ##############
0.14-0.15: ###############################
0.15-0.16: ##########################
0.16-0.17: ###############
0.17-0.18: ########
0.18-0.19: ##
0.19-0.20: #
#

I suspect I broke the autoencoder training code, I'll look into what's going on and come back with an update once I figure it out

weak meteor
#

Where can I see a doc summarising what has been done so far and the plan ahead?

bitter turtle
hallow wyvern
#

Oh yeah, that is correct. Though it doesn't seem to make a huge difference. Changing the first two lines to use np.random.randn and rerunning makes some difference, but not a huge one.

a = np.random.randn(512, 1024)
b = np.random.randn(512, 2048)
a /= np.linalg.norm(a, axis=0);
b /= np.linalg.norm(b, axis=0);
print("\n".join([
    f'{b:.2f}-{b+0.01:.2f}: {chr(0x2588)*(ct//8-1)+(chr(0x2588+(ct%8)) if ct%8 > 0 else "")}'
    for ct, b in zip(*np.histogram(
        (a.T@b).max(axis=1),
        bins=100,
        range=(0.0, 1.0)
    ))
    if ct > 0
]))
0.11-0.12: ▉
0.12-0.13: ███▏
0.13-0.14: █████████████████████▊
0.14-0.15: ███████████████████████████████████▍
0.15-0.16: ███████████████████████████████▊
0.16-0.17: █████████████████▊
0.17-0.18: ████████▊
0.18-0.19: █▎
0.19-0.20: ▋
0.20-0.21: ▉
0.21-0.22: ▉

Also I note that the original cosine sim graph shows that a nonzero number of features in the size-1024 dict have MCS >> 0.2 with ones in the size-2048 dict, so whatever I broke didn't cause quite entirely random features to be returned. Just very very close.

bitter turtle
#

I think you can probably find this analytically, but eh
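
One way the analytic estimate could go, as a sketch under stated assumptions: for large dimension d, the cosine between two random unit vectors is approximately N(0, 1/d), and the typical maximum of n independent standard Gaussians sits near sqrt(2 ln n) minus a standard extreme-value correction. The function name is made up for illustration.

```python
import math


def expected_max_cosine(dim: int, n_candidates: int) -> float:
    """Approximate typical max cosine between one unit vector and n random unit vectors.

    Assumes each cosine is roughly N(0, 1/dim) and independent, then uses the usual
    location estimate for the maximum of n standard Gaussians.
    """
    root = math.sqrt(2.0 * math.log(n_candidates))
    correction = (math.log(math.log(n_candidates)) + math.log(4.0 * math.pi)) / (2.0 * root)
    return (root - correction) / math.sqrt(dim)


# 512-dim vectors, best match out of 2048 random candidates
estimate = expected_max_cosine(dim=512, n_candidates=2048)
print(f"{estimate:.3f}")
```

For d = 512 and n = 2048 this lands at roughly 0.15, which is where the bulk of the REPL histograms above sits.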

keen pivot
#

Might be good to write out our current plans for the week @bitter turtle @pallid current if you want! For me:

  1. Causal Alignment - write up fuller project for this & implement.

Extra-Todos: @weak meteor

  1. Look through many examples of the causal alignment stuff (mine is the parentheses example) to find a cool one
  2. Implement early-layers-to-late-layers, cause atm only can do later layers to earlier (related to causal alignment)
  3. Look into outlier features for a few days, try to find cause, try to pass torch to someone else
  4. Use features for activation engineering (may require training on large LLAMA models, which requires switching to baukit for training cause of GPU's)
    [Note: Roko, I don't expect this to be clear TODO's. Can explain more]
bitter turtle
#

Beat me to it! Was going to do exactly this this evening. I've asked Neel Nanda what kind of metrics he would like to see for causal alignment via email, hopefully will get back soon. If he doesn't I'll say fuck it and ping him or something, anyway. Will write up plans in about an hour and a half when I get back

#

Could you also elaborate on 2) for me? Tracing the causal path forwards doesn't strike me as immediately obviously useful. I also think 3) is a bit annoying but pretty universal, have heard some people propose solutions for newer archs, but it seems better to figure out a way to have our models ignore them

keen pivot
#
  1. There's a few papers on outlier features, but nothing mentions the \n or "." that I see in the outlier dimensions (which I've verified in just the outlier dimensions is a consistent token, but only in Pythia, not gpt-2 or others) which is novel AFAIK.
#

@bitter turtle I'd like your thoughts on this. Causal alignment (CA) feels circular here (at least for our use-case). CA assumes you think these parts of the model do some algorithm, which you can verify by changing the parts; if they have the same effect on the outputs & intermediate outputs as your algorithm predicts, then good.

The circular part is how do you come up w/ the hypothesis of the circuit in the first place w/o causal interventions?

This doesn't seem like a problem though, cause we can just do hypothesis generation by causal interventions. If the resulting algorithm is "simple" (whatever that means), then our features are good. If not, then booooo.

pallid current
#

since you're going to be in town from tomorrow i think we should have a big chat then and do a list with assigned people and such

bitter turtle
# keen pivot @bitter turtle I'd like your thoughts on this. Causal alignment (CA) feel...

Ok, so my thought process was basically this, except measuring alignment to a high-level abstract causal model would give us an additional measure of correctness of high-level abstract description or whatever. Like, we can use ACDC to find a circuit A at some arbitrary noise threshold, eyeball a high-level hypothesis for it via a description of some (pruned) subcircuit B of A as a causal machine M (the description would be human-interpretable by design, something ACDC doesn't provide by default), and then we can measure the accuracy of M in predicting the activations of B, giving us a measure of the human-interpretability-ness of B. Basically, the idea is that ACDC acts as a quick pruning strategy at some arbitrary noise threshold to help get started generating hypotheses for circuits in the sparse basis.

That was the original plan, but I am now very unsure about how we can compare systematically the scores of circuits found using the sparse basis and circuits in some other basis; there is probably a large variance of scores and an absurd number of circuits with fuzzy borders between them, and a lot of room for noise in where we draw the boundary between one circuit and another, or how we prune etc. Potentially we can come up with some complexity measure of the high-level description and measure the trade-off for both bases over a number of circuits, and if sparse coding is good we should see an improvement there, but that might be a little beyond the scope of this project.

keen pivot
#

This is ACDC w/ "[percentage effect]% | [cosine similarity]"

One thing I'd also like to check is the effect of intermediate layers on others. In the other graph, there was a connection between 4_1030 & 3_1273, but not this time. I should be able to easily record this & choose not to show it if the effect is < 1% or something. This is also another set of choices to make when implementing this to compare to!

#

Also, I'm choosing to only look at the top-5 max-activating examples for a feature when I look at the differences when ablating causal features.

keen pivot
#

This is w/ top-k examples set to 10, whereas the others were 5. So some connections will be different.

bitter turtle
# keen pivot Agreed on the nebulousness of circuits. I'll just give a go this week for severa...

well, I guess my point is that the actual circuits discovered by ACDC don't really matter that much, they are more like a guide for finding circuits if we go down the measuring-causal-alignment-ness route (better term needed: how about 'abstractiblity' or something similar), so the mediocre implementation would be fine. I'd like to get robustish measures of abstractibility though, that seems worthwhile. Like the idea of using different ways of doing ACDC, seems good for finding a large variety of circuits

bitter turtle
keen pivot
#

Alright, let's stick w/ abstractibility for now, lol

keen pivot
pallid current
keen pivot
#

You can also look at my folder on the node, where there is also the autoencoders for layers 1-5 in my directory

bitter turtle
#

ok so finally found a method that consistently works a significant amount better than our standard linear encoders; it gets the same unexplained variance at about half the mean no. of features active. ran on 8 chunks compared to the 30 chunk run that hoagy did

method is basically linear dictionary as per usual but with 5-layer (could probably cut it down to 3 w/o significantly harming performance) learned ISTA-plus-momentum encoder

don't expect to use this significantly much for the circuit stuff I want to get into tomorrow, and also the sparsity-to-l1 thing is very unpredictable, but I guess it's nice to know we can do better than just linear encoding; if we end up needing sparser dictionaries we can just throw this + a lot of data at it. also note that I think sparsity-to-l1 will be a lot nicer if we pretrain with a linear encoder

pallid current
#

oh snap, super cool

#

purely linear decoder?

bitter turtle
#

yep

pallid current
#

damn that's big! potentially more in the tank still with more data maybe?

bitter turtle
#

uh yeah pretty noisy tho, i'd want to pretrain the decoder for like 1 chunk with linear encoders to get the dict right then freeze and train the encoder to that, then start training both in simul to get more consistent results, but yeah probably

#

took A Fucking While to find this but maybe useful especially for derivative stuff

pallid current
#

if you point me to some dicts that are beyond the pareto frontier i'll see how they do on autointerp

bitter turtle
pallid current
#

my intuition is that the noise reduction isn't the computationally difficult part, compared to doing good feature finding, so i expect it not to help too much but i could easily be wrong

#

also what's the role of ISTA in this setup? i thought in ISTA et al the encoder was just a set of feature weights learned fresh for each case, rather than a particular (eg 5 layer) formula?

bitter turtle
# pallid current also what's the role of ISTA in this setup? i thought in ISTA et al the encoder ...

Right, so LISTA is basically ISTA but with some parameters learned, all unrolled into a net, so each layer corresponds with one iteration. It's more computationally efficient+converges better or something, there's a bunch of lit about it. Specifically this is LISTA+a momentum update, can't tell quite what it's called I think LFISTA is a reasonable name which I think I saw in the lit somewhere but can't find it now
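
For anyone who hasn't seen it, here is a minimal sketch of plain ISTA, the thing LISTA unrolls. This is not the learned encoder from the run above: there is no momentum and nothing is learned. In LISTA, roughly speaking, the fixed step size, the dictionary-derived matrices and the threshold in each loop iteration become per-layer learned parameters. All names here are illustrative.

```python
import numpy as np


def soft_threshold(x: np.ndarray, threshold: float) -> np.ndarray:
    """Shrink every entry towards zero by `threshold` (proximal operator of the l1 norm)."""
    return np.sign(x) * np.maximum(np.abs(x) - threshold, 0.0)


def ista(x: np.ndarray, dictionary: np.ndarray, l1_coef: float, n_iters: int = 5) -> np.ndarray:
    """A few ISTA steps for min_c 0.5 * ||x - D c||^2 + l1_coef * ||c||_1.

    dictionary D has shape (d, n_features); each loop iteration corresponds to
    one "layer" of an unrolled network.
    """
    # Step size 1/L, where L is the largest eigenvalue of D^T D
    step = 1.0 / np.linalg.norm(dictionary, ord=2) ** 2
    code = np.zeros(dictionary.shape[1])
    for _ in range(n_iters):
        grad = dictionary.T @ (dictionary @ code - x)
        code = soft_threshold(code - step * grad, step * l1_coef)
    return code


# Toy problem: 32-dim input built from 3 of 128 unit-norm dictionary features
rng = np.random.default_rng(0)
D = rng.standard_normal((32, 128))
D /= np.linalg.norm(D, axis=0)
true_code = np.zeros(128)
true_code[[3, 40, 99]] = [1.5, -2.0, 1.0]
x = D @ true_code
code = ista(x, D, l1_coef=0.05, n_iters=5)
print(f"{np.count_nonzero(code)} active features after 5 iterations")
```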

bitter turtle
#

ok, slightly good signal/sanity check, we're consistently beating [take the top-k components of PCA and project to that subspace]
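
A sketch of how that PCA baseline could be computed, assuming activations are a (n_samples, d) NumPy array and that k is matched to the average number of active dictionary features (that matching is an assumption, it isn't spelled out above). The function name is hypothetical and the data is synthetic.

```python
import numpy as np


def pca_unexplained_variance(activations: np.ndarray, k: int) -> float:
    """Fraction of variance left after projecting onto the top-k principal components."""
    centered = activations - activations.mean(axis=0)
    # Rows of vt are principal directions, sorted by decreasing singular value
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    top = vt[:k]
    reconstruction = centered @ top.T @ top
    return float(((centered - reconstruction) ** 2).sum() / (centered ** 2).sum())


# Synthetic activations with a decaying variance spectrum
rng = np.random.default_rng(0)
acts = rng.standard_normal((1000, 64)) * np.linspace(3.0, 0.1, 64)
fvu = [pca_unexplained_variance(acts, k) for k in (1, 8, 32, 64)]
print([round(v, 3) for v in fvu])
```

A dictionary beats this baseline at a given k if its unexplained variance is lower with about k features active on average.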

#

the 'half mean no activations' was at 0.05, seems to be less big overall, but I think sparsity in ~100 range is reasonable anyways

bitter turtle
#

@keen pivot where is the code you are using for the circuit stuffs?

#

I might also wait until you and hoagy have your meeting not sure what to do right now atm

bitter turtle
#

Ah brill, sorry I missed that!

pallid current
#

here's the correlation between interp score and feature variance, skew and kurtosis:

bitter turtle
#

Cool, what are your takeaways on this? I'm not sure how much I trust autointerp. How many dicts are here?

pallid current
#

takeaways are: searching by high variance (and mean, and % cases active, have also checked those now) for good features is not going to work, bit disappointing because i hoped that might give signal for which feats to choose in highly overcomplete dicts (though would be worth rerunning this with much larger sizes)

#

skew and kurtosis seem to be pretty much identical, no distinct signal between them but they're a reasonable proxy for feature goodness

pallid current
#

might clean and send to openai, i think the above graph is wrong i logged the wrong variable lol but the effect stands, will update in a bit

bitter turtle
#

Yep yep

bitter turtle
keen pivot
shell mural
#

Looks extremely cool, help me interpret what I'm looking at

#

Is this a dictionary circuit

keen pivot
#

Oh, I was trying to figure out how to make it high-resolution, but you just click "open in browser" after you click on it initially

keen pivot
#

The text is my interpretation of what the feature means

shell mural
#

Insane, extremely cool

keen pivot
#

The percentage here means, given 10 activating examples of feature 4_1030, when I ablate feature 3_891, those activations go down by 36% on average.

#

The "... | 0.83" is cosine similarity to track how similar the directions are.

keen pivot
pallid current
#

running autointerp over the lista dicts and getting a v high proportion with no activations

#

assuming that it's layer2resid

keen pivot
#

Each is 2k. Can you plot like a hist of it?

#

I'm getting decent perplexities for layer2resid, so must be true

#

Layer2resid:

Perplexity for l1=1.16E-05: 51.76
Perplexity for l1=1.35E-05: 93.64
Perplexity for l1=1.56E-05: 147.65
Perplexity for l1=1.81E-05: 97.01
Layer3resid:

Perplexity for l1=1.00E-05: 117.17
Perplexity for l1=1.16E-05: 67.78
Perplexity for l1=1.35E-05: 112.00

#

Though they're surprisingly close which may mean more about the similarity of the residual stream. Still should investigate this (in case something fishy is going on/code error)

keen pivot
pallid current
keen pivot
#

Min is 43 perplexity. Base model is ~25

pallid current
#

im getting an OOM when i try to generate the histogram, dont wanna screw the big run 😟

pallid current
keen pivot
keen pivot
#
Perplexity for l1=1.16E-05: 51.76
Perplexity for l1=1.35E-05: 93.64
Perplexity for l1=1.56E-05: 147.65
Perplexity for l1=1.81E-05: 97.01
Perplexity for l1=2.10E-05: 46.16
Perplexity for l1=2.44E-05: 55.72
Perplexity for l1=2.83E-05: 43.57
Perplexity for l1=3.28E-05: 95.97
Perplexity for l1=3.81E-05: 98.02
#

Super noisy down there. Maybe they're learning the identity

pallid current
#

that's a lot of variation!

keen pivot
#

Huge

pallid current
#

hard to learn the identity through 5 mlp layers!

keen pivot
#

Imma do qualitative on the 43 one to see if it learned the identity...oh, lololol

#

Also w/ 2k features

#

I can at least look at the decoder.

#

This is previous results on a good dict for layer 3. So the LISTA results are better, but may not be if they're learning an identity like thing.

#

43 would be huge if it's real. Would just need to train more & do better ML for convergence. I also expect needs to be bigger based off previous results replied to above.

#

I also like Aidan's idea (on a call from today) of comparing PCA w/ the dict across the same number of dimensions for both perplexity diff & variance explained.

I don't think it should be better because one benefit of dictionary is we can use more dimensions than the original, but if it is, then that's a cool result.

pallid current
#

what was the lowest perplexity for a non-lista dict?

keen pivot
#

These are the non-zero activations, and I gotta say it doesn't look good!

bitter turtle
bitter turtle
# keen pivot

I think this histogram thing might be broken slightly. @pallid current what code did you use to generate the previous histogram, could you try comparing that code and this code?

pallid current
bitter turtle
#

ok aaay

pallid current
#

well at least, gpt2 is almost full and i thought that was from my jupyter server but ive cut that and it's still nearly full so unsure tbh

#

**gpu2 loool

keen pivot
#

I decided to look at the nonzero activations & see if they're meaningful.

bitter turtle
#

I think the hist thing in standard_metrics.py is broken anyway, if that's the one Logan's using

#

something is broken, somewhere

keen pivot
#

Another negative thing is that the two best perplexity dicts have near 0 MCS w/ each other

#

I'm using my own code! I still may be doing things wrong being not used to the new dicts!