Sparse Coding | EleutherAI | Page 3

bitter turtle Jul 19, 2023, 4:21 PM

#

Ok, so I think you can get a long way through the standard lens of 'neural networks as bayesian optimisers' here (at least, informally); if you assume some information-minimisation prior you might be able to get to (something-approaching) superposition downstream of that. More generally, I don't think that a formalisation of this is particularly useful (it seems highly complex, and empirics are good and we should use them) and I also don't see how 'removing superposition' is a particularly useful approach (doing so would significantly impact performance, and the model would probably get around our training guardrails somehow (see Neel's SoLU stuff ig?))

keen pivot Jul 19, 2023, 4:42 PM

#

@bitter turtle I'm not getting any high-mcs features for any of the dicts. I'm comparing dict_ratio_2 w/ dict_ratio_4, and have done tied & untied across all l1 values. It didn't seem to work for the known-to-work l1 of 1e-3, so 🤷

bitter turtle Jul 19, 2023, 4:47 PM

#

Ey

#

hmmmmmmmm

keen pivot Jul 19, 2023, 4:48 PM

#

I could've coded something wrong. Have you been able to get any high-mcs features/graphs from this, even the toy?

bitter turtle Jul 19, 2023, 4:48 PM

#

what is the non-mcs performance of the dicts like

bitter turtle Jul 19, 2023, 4:48 PM

#

keen pivot I could've coded something wrong. Have you been able to get any high-mcs features...

Haven't tested it

keen pivot Jul 19, 2023, 4:48 PM

#

bitter turtle what is the non-mcs performance of the dicts like

which metrics?

bitter turtle Jul 19, 2023, 4:49 PM

#

like loss on actual data, sparsity level etc and how does it compare to the other ones we did with the other trainer

#

if it's different, something's wrong with the training code; if it's the same, something's wrong with your code or dictionaries are weird asf

keen pivot Jul 19, 2023, 4:51 PM

#

bitter turtle like loss on actual data, sparsity level etc and how does it compare to the othe...

the wandb doesn't track sparsity. I could run the model on some data & check.

bitter turtle Jul 19, 2023, 4:51 PM

#

Bear in mind this is 8x~2GB chunks, could be a lack-of-data thing

bitter turtle Jul 19, 2023, 4:51 PM

#

keen pivot the wandb doesn't track sparsity. I could run the model on some data & check.

Yeah that's what I meant

keen pivot Jul 19, 2023, 4:51 PM

#

bitter turtle Bear in mind this is 8x~2GB chunks, could be a lack-of-data thing

So this is 14GB overall for _7?

bitter turtle Jul 19, 2023, 4:51 PM

#

Yeah

#

Well 15 I think

#

@pallid current did you do autointerp with directions from the new run code or an older run

bitter turtle Jul 19, 2023, 4:56 PM

#

keen pivot @bitter turtle I'm not getting any high-mcs features for any of the dicts...

What do the distributions look like

keen pivot Jul 19, 2023, 4:56 PM

#

pallid current Jul 19, 2023, 4:57 PM

#

bitter turtle @pallid current did you do autointerp with directions from the new run cod...

i ran autointerp on dictionaries that logan ran with the old infra about two weeks ago

bitter turtle Jul 19, 2023, 4:59 PM

#

rats

#

what the hell is my thing doing the

keen pivot Jul 19, 2023, 4:59 PM

#

bitter turtle what the hell is my thing doing the

Let me do that sparsity check

bitter turtle Jul 19, 2023, 5:00 PM

#

Oh, how are you scaling L1, I might not have implemented that properly, might wanna check the loss function implementations

keen pivot Jul 19, 2023, 5:02 PM

#

bitter turtle Oh, how are you scaling L1, I might not have implemented that properly, might wa...

What do you mean?

"dict_ratio_4_6.pt": {"l1_alpha": 0.003162277629598975, "bias_decay": 0.0},
this is a good l1_alpha term

bitter turtle Jul 19, 2023, 5:05 PM

#

Like how exactly is the loss function implemented

#

I remember someone saying something about scaling by 1/J, and I'm wondering if I did that properly

#

@keen pivot

keen pivot Jul 19, 2023, 5:06 PM

#

Ah, I see where you did that.

#

You actually don't want to do that because it causes the diagonal thing

#

So these l1-values are really low

#

I'm getting a sparsity of 600/10000

bitter turtle Jul 19, 2023, 5:09 PM

#

Ah, that might explain it then

keen pivot Jul 19, 2023, 5:09 PM

#

It might also explain the really low reconstruction loss, lol

bitter turtle Jul 19, 2023, 5:09 PM

#

Could you look over the loss function when I reimplement it?

keen pivot Jul 19, 2023, 5:10 PM

#

Yep!

bitter turtle Jul 19, 2023, 5:10 PM

#

I'll just directly translate from the other one

#

Whoops paha

keen pivot Jul 19, 2023, 5:10 PM

#

it looks like just:

    l_l1 = (buffers["l1_alpha"] / c.shape[0]) * torch.norm(c, 1, dim=-1).mean()

removing the c.shape for both the tied and untied
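
A minimal plain-Python sketch of why that extra division matters (not the project's actual loss code; `l1_penalty`, `divide_by_batch`, and the toy numbers are made up for illustration). It assumes `c` has shape (batch, n_features), so `c.shape[0]` is the batch size, and it mirrors the torch line above using lists:

```python
def l1_penalty(codes, l1_alpha, divide_by_batch=False):
    """Mean per-sample L1 norm of the codes, times l1_alpha.

    codes: list of rows, one row of feature activations per sample.
    divide_by_batch=True mimics the extra `/ c.shape[0]` factor.
    """
    per_sample_l1 = [sum(abs(v) for v in row) for row in codes]
    mean_l1 = sum(per_sample_l1) / len(codes)
    scale = l1_alpha / len(codes) if divide_by_batch else l1_alpha
    return scale * mean_l1


# Toy batch of 4 samples with 3 features each (made-up numbers).
codes = [[0.0, 1.0, 2.0], [0.5, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 0.5]]
fixed = l1_penalty(codes, l1_alpha=1e-3)
scaled = l1_penalty(codes, l1_alpha=1e-3, divide_by_batch=True)
print(fixed, scaled)  # the scaled penalty is smaller by a factor of the batch size
```

With a realistic batch size in the thousands, the effective l1 coefficient would shrink by the same factor, which would be consistent with the very dense codes reported above.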

bitter turtle Jul 19, 2023, 5:10 PM

#

Thank god it only takes like 16m to train lol

keen pivot Jul 19, 2023, 5:11 PM

#

Also tracking sparsity would be useful

bitter turtle Jul 19, 2023, 5:12 PM

#

yep will do

keen pivot Jul 19, 2023, 5:12 PM

#

I think it's:

    x_hat.count_nonzero(axis=1).float().mean()

bitter turtle Jul 19, 2023, 5:12 PM

#

How would you like this measured?

#

ah just total

#

I might do that as well as 'num Nonzero per feature over last chunk' or something

keen pivot Jul 19, 2023, 5:13 PM

#

Like per token, there's how many nonzero latent activations

#

The features/token metric helps check if we've set the l1 too high (zero activations) or too low (several hundreds, so mostly identity)
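
A rough sketch of that features/token metric in plain Python, so it runs without torch. The helper names `mean_active_features` and `dead_features`, the `eps` threshold, and the toy codes are assumptions for illustration, not anything from the training code:

```python
def mean_active_features(codes, eps=0.0):
    """Average number of nonzero latent activations per token."""
    counts = [sum(1 for v in row if abs(v) > eps) for row in codes]
    return sum(counts) / len(counts)


def dead_features(codes, eps=0.0):
    """Indices of features that never fire on this batch of tokens."""
    n_features = len(codes[0])
    return [j for j in range(n_features)
            if all(abs(row[j]) <= eps for row in codes)]


# Three tokens, four dictionary features (made-up activations).
codes = [
    [0.0, 0.3, 0.0, 0.0],
    [0.0, 0.0, 1.2, 0.0],
    [0.0, 0.7, 0.4, 0.0],
]
print(mean_active_features(codes))  # about 1.33 active features per token
print(dead_features(codes))         # [0, 3]
```

A value near zero would suggest the l1 is too high, and a value in the hundreds would suggest it is too low; the per-feature count covers the 'num nonzero per feature' idea mentioned above.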

bitter turtle Jul 19, 2023, 5:13 PM

#

bitter turtle I might do that as well as 'num Nonzero per feature over last chunk' or somethin...

Although I guess this is less useful and we can just measure this at the end

bitter turtle Jul 19, 2023, 5:37 PM

#

https://wandb.ai/sparse_coding/sparse coding/runs/hqh9d3p7

#

weird link

#

looks much more sparse now

#

but bloody hell these reconstruction losses

#

I've been thinking that we could use a more powerful encoder (a full-blown feed-forward multi-layer net) while keeping the decoder limited to a linear map

keen pivot Jul 19, 2023, 5:43 PM

#

@bitter turtle, one thing, the 8 naming scheme includes "_group_1" and the others do not. Is that intentional cause the 8 sized ones are bigger and split across two GPU's?

bitter turtle Jul 19, 2023, 5:43 PM

#

yes

keen pivot Jul 19, 2023, 5:45 PM

#

bitter turtle I've been thinking that we could use a more powerful encoder (a full-blown fee...

This could work. The benefit of the tied embedding is that the direction we're reading from is the same direction we're able to write to. But the future MLP layers don't have tied embedding, so maybe that's okay to not do tied?

#

Overall, the low reconstruction loss from earlier was caused by the high sparsity count (~600 of 10000 features active), i.e. near-identity dictionaries

bitter turtle Jul 19, 2023, 5:47 PM

#

Well, my thoughts are that we should let the net do more intelligent denoising than just cross-producting with the dict; not sure what you mean by the future MLP layers don't have tied embedding

pallid current Jul 19, 2023, 5:47 PM

#

looks like tied is getting way higher recon losses than untied 🤔 so confused about what extra computation it manages by not having them match up

bitter turtle Jul 19, 2023, 5:48 PM

#

yeah that's why I think we should just slap on a big net

keen pivot Jul 19, 2023, 5:48 PM

#

bitter turtle Well, my thoughts are that we should let the net do more intelligent denoising ...

One possible constraint for our dictionary learning is that we're learning features that the LLM is using in future layers, so we should limit ourselves to the capabilities of future layers

#

Also, I'm all for slapping the full net on now & running it, haha

bitter turtle Jul 19, 2023, 5:48 PM

#

ah, sure

bitter turtle Jul 19, 2023, 5:49 PM

#

keen pivot Also, I'm all for slapping the full net on now & running it, haha

I'll code it up but I worry it might be horribly unstable to train

keen pivot Jul 19, 2023, 5:49 PM

#

Could skip tied for now

pallid current Jul 19, 2023, 5:49 PM

#

slapping on the full net is also equivalent to the standard dictionary learning thing where you just freely optimize the dict entries

bitter turtle Jul 19, 2023, 5:49 PM

#

ye, that's what I was thinking; might also be useful to have a deterministic denoiser tho? idk

pallid current Jul 19, 2023, 5:49 PM

#

still feels kinda wrong to me tho

keen pivot Jul 19, 2023, 5:50 PM

#

pallid current still feels kinda wrong to me tho

The wrongness will show up in ablating the feature direction & logit lens. If it doesn't have a meaningful effect (like our current ones do), then it's bad on that metric.

bitter turtle Jul 19, 2023, 5:50 PM

#

pallid current still feels kinda wrong to me tho

really? how come?

pallid current Jul 19, 2023, 5:53 PM

#

bitter turtle really? how come?

because it breaks my mental model of how the network is using the features, like in my head it kinda looks like TMOS, you have features + interference, you use the negative bias to screen the interference and then you reconstruct in the same direction

#

if its not doing that then i guess i just dont have a good picture of what's going on

bitter turtle Jul 19, 2023, 5:54 PM

#

@keen pivot should be in /mnt/ssd-cluster/resid_layer_2_19_07_scaled_l1, feel free to delete the other one

bitter turtle Jul 19, 2023, 5:55 PM

#

pallid current because it breaks my mental model of how the network is using the features, like...

I feel like this relies too heavily on TMOS being a perfect model for model internals, like more might be going on than just this

#

could be some additional denoising/information loss

pallid current Jul 19, 2023, 5:56 PM

#

i wonder if it would help the model if we added back the bias immediately after the sparsity penalty. like at the moment it adds the bias, RELU, and then has to reconstruct but it'll be missing an amount equal to the bias, so maybe we should just add it straight back on?

bitter turtle Jul 19, 2023, 5:56 PM

#

conditional on the feature being nonzero presumably

#

yeah could do
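
One possible reading of that suggestion, as a plain-Python sketch for a single feature. The function names and numbers are hypothetical, and it assumes the encoder bias `b` is negative and that "adding it back" means undoing the shift only when the feature fires:

```python
def encode(x, w, b):
    """Pre-activation w.x + b followed by ReLU (b is typically negative)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return max(z, 0.0)


def encode_bias_restored(x, w, b):
    """Same as encode, but undo the bias shift whenever the feature fires."""
    c = encode(x, w, b)
    return c - b if c > 0.0 else 0.0


w, b = [1.0, 0.0], -0.5
print(encode([2.0, 0.0], w, b))                # 1.5
print(encode_bias_restored([2.0, 0.0], w, b))  # 2.0
print(encode_bias_restored([0.3, 0.0], w, b))  # 0.0
```

The bias still screens out small interference (the last case stays at zero), while an active feature is passed to the decoder at its unshifted magnitude.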

#

pretty confident that we are solving a different problem to the model tho, the model can do things lossily/in superposition, while we are looking for perfect replications or whatever

pallid current Jul 19, 2023, 5:58 PM

#

did we work out why the runs yesterday got such low recon loss btw?

bitter turtle Jul 19, 2023, 5:58 PM

#

I was scaling l1 wrong

pallid current Jul 19, 2023, 5:58 PM

#

oooo

#

classic

bitter turtle Jul 19, 2023, 5:58 PM

#

real

#

@keen pivot are you checking the dicts now

#

cool

keen pivot Jul 19, 2023, 5:59 PM

#

#

Just checked a random one and it looks good

bitter turtle Jul 19, 2023, 6:00 PM

#

thank fuck

#

dunno how proper ML researchers do it tbh

#

the feedback latency is horrifying

#

like how can you trust your own code to train for 8 months

keen pivot Jul 19, 2023, 6:04 PM

#

#

#

This over 3 different l2-biases.

bitter turtle Jul 19, 2023, 6:06 PM

#

slightly cropped

keen pivot Jul 19, 2023, 6:06 PM

#

This also shows what I noticed earlier, which is tied having better MCS. If Aidan is right about LLM solving noisy stuff, then we may have to ignore high-MCS ~~entirely~~ mostly

bitter turtle Jul 19, 2023, 6:06 PM

#

hmm interesting

bitter turtle Jul 19, 2023, 6:06 PM

#

keen pivot This also shows what I noticed earlier, which is tied having better MCS. If Aida...

how come?

pallid current Jul 19, 2023, 6:06 PM

#

keen pivot This also shows what I noticed earlier, which is tied having better MCS. If Aida...

"LLM solving noisy stuff"?

bitter turtle Jul 19, 2023, 6:07 PM

#

bitter turtle pretty confident that we are solving a different problem to the model tho, the m...

this I think

keen pivot Jul 19, 2023, 6:08 PM

#

bitter turtle how come?

Tied embedding makes there be only 1 solution, as in one set of residual stream dims to read/write from. Untied allows many directions to read from. So maybe two different dicts learn different features if they're untied.

bitter turtle Jul 19, 2023, 6:08 PM

#

not sure what you mean

keen pivot Jul 19, 2023, 6:08 PM

#

But if we ignore high-MCS, and just say "if the sparsity isn't insanely high or low & we get good reconstruction loss, then maybe the features themselves are good", which we can check by hand or by auto-interp

keen pivot Jul 19, 2023, 6:09 PM

#

bitter turtle not sure what you mean

Anything specific? I can just say it in different words

bitter turtle Jul 19, 2023, 6:09 PM

#

sure

#

that weirdly enough often actually works

keen pivot Jul 19, 2023, 6:09 PM

#

Tied embedding makes the model read and write from the same direction

#

untied allows the model to read from multiple directions and write to only 1 direction

#

model= autoencoder

bitter turtle Jul 19, 2023, 6:11 PM

#

well, I disagree there, untied allows the model to read from one different direction, not multiple. this might be an important distinction. I think allowing it to read from one different direction is weird and slightly wrong, we should allow it to read from many instead

keen pivot Jul 19, 2023, 6:11 PM

#

Let me write an example

bitter turtle Jul 19, 2023, 6:12 PM

#

but I understand what you're getting at

bitter turtle Jul 19, 2023, 6:12 PM

#

keen pivot This also shows what I noticed earlier, which is tied having better MCS. If Aida...

not sure how you get from that to this, however

keen pivot Jul 19, 2023, 6:14 PM

#

bitter turtle well, I disagree there, untied allows the model to read from one different direc...

I'm using the residual stream to define the direction, not the weights.

#

Ah, maybe not.
I wanted to say something like:

Because of the ReLU & negative bias, you can have multiple ratios of residual stream that activate the feature, which are multiple directions in residual stream (though the same direction in weights since those are frozen). For example, for weights (1,1) reading in from the first two dimensions of residual stream, we have:

F_1 = ReLU(w1*r1 + w2*r2 +bias)
F_1 = ReLU(r1 + r2 - 1)
Which can be positive if r1 or r2 is large or a sum of them.

Though writing into the residual stream w/ the decoder is always the same direction.
Though this is true for tied embedding as well.
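
The example above can be checked numerically. This is a small sketch using the same weights (1, 1) and bias -1; the input pairs are made up for illustration:

```python
def relu(z):
    return max(z, 0.0)


def feature(r1, r2, w1=1.0, w2=1.0, bias=-1.0):
    """F_1 = ReLU(w1*r1 + w2*r2 + bias) from the message above."""
    return relu(w1 * r1 + w2 * r2 + bias)


# Residual-stream inputs with different r1:r2 ratios.
for r1, r2 in [(2.0, 0.0), (0.0, 2.0), (1.0, 1.0), (0.5, 0.25)]:
    print((r1, r2), feature(r1, r2))
# The first three all give 1.0; the last gives 0.0 (screened out by the bias).
```

Three different residual-stream directions produce the same activation, while a small input along a similar direction is zeroed by the negative bias.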

#

Also another counterpoint: We may just need a better l1 value for the untied model.

#

@bitter turtle, would we be able to run w/ a larger encoder/full net soon? I can code it in, though I didn't know what you had in mind.

bitter turtle Jul 19, 2023, 6:23 PM

#

yep

#

I was gonna do that rn but got distracted

#

pahah

keen pivot Jul 19, 2023, 6:25 PM

#

Going rock climbing, can look at the models when I get back if they happen to be trained by then!

cosmic moon Jul 19, 2023, 7:04 PM

#

https://twitter.com/s_scardapane/status/1681661683297579010

#

https://docs.google.com/presentation/d/1KKjR8w4YRgS__RPrtspCFwkxR1QuxtwGjIPvIF6XPJo/edit#slide=id.g23ddfb2348c_0_0

Google Docs

SSIE_2023_Efficient

Designing modular and efficient neural networks with conditional computation SSIE 2023, July 2023 Simone Scardapane email alla fine

keen pivot Jul 19, 2023, 7:13 PM

#

@cosmic moon Could you give a short blurb on why the linked work is related to this project channel?

bitter turtle Jul 19, 2023, 7:16 PM

#

I mean, it seems at least tangentially related to the general theme of this channel (searching for modularity in NNs), but not necessarily directly related to sparse coding particularly.

keen pivot Jul 19, 2023, 7:43 PM

#

@bitter turtle, I'm running just a basic 2-layer encoder (first to 1/2 dictionary size, then to full) on cuda:0 on the old infra, just fyi. I didn't know how to easily change yours to add a second encoder param.

bitter turtle Jul 19, 2023, 7:51 PM

#

yep I just did it

#

I'll do a commit with the 2 layer encoder in a bit, just need to test it

#

(I'm doing d_act -> dict_size -> dict_size fyi, don't have any reason to prefer either other than compute)

keen pivot Jul 19, 2023, 7:53 PM

#

@pallid current , we ever figure out why lee is against a bias for the decoder?

bitter turtle Jul 19, 2023, 7:55 PM

#

ok, will continue testing after I've eaten

keen pivot Jul 19, 2023, 7:58 PM

#

bitter turtle ok, will continue testing after I've eaten

Ya, I think I'll do some quick tests to get a rough idea on the effect & l1 hyperparams, but your code will be much more efficient at training many dictionary sizes

keen pivot Jul 19, 2023, 8:16 PM

#

An additional thing would be to try a tied embedding on a really big dictionary (32x) w/ a lot of data (or really until convergence)

keen pivot Jul 19, 2023, 9:33 PM

#

@pallid current , I looked at the hyphen stuff and they're quite interpretable and separable

#

Slight update on the "2 layer encoder": It gets a reconstruction loss of 0.070, which is okay, but not amazing. The high-MCS is also garbage (<1%).

I'm just running the tied embedding w/ much larger dictionaries and for much more data.

keen pivot Jul 19, 2023, 9:58 PM

#

For the is/are one, some are clearly separable, but like 2-3 others look similar, but I believe they're is/was that only activate in a specific distribution of text (e.g. technical, news article, etc).

I currently don't have the tools to figure that out, because I'd need to know which previous features cause these to activate.

pallid current Jul 19, 2023, 10:18 PM

#

ellena reid in my seri mats group is working on applying sparse coding to audio transcription models and pointed out that it's a bit weird in our tiedSAEs to normalize the decoder weights after they've already been applied the first time

bitter turtle Jul 19, 2023, 10:30 PM

#

oh what

#

yeah my bad

#

I guess this could actually make sense

#

like, plausibly the scaling could be different for optimal reconstruction

#

we should maybe look into that

#

@keen pivot https://wandb.ai/sparse_coding/sparse coding/runs/f49wmqnt
~~model with 2-layer encoder gets some good results for certain dict sizes~~ no it doesn't there's an absurd sparsity-reconstruction tradeoff

#

also seeing loads of totally dead ones

#

I guess we might have issues where they just optimize for sparsity

#

yeah this looks like not immediately good

#

@pallid current latest commit should be ready for merge

keen pivot Jul 19, 2023, 10:47 PM

#

bitter turtle I guess we might have issues where they just optimize for sparsity

Ya wanna shoot for around 20 sparsity in general. I also saw in mine horrible MCS.

#

For 2-layer, I got 1e-4 as a good l1, but it was still pretty bad & didn't have too great a reconstruction

bitter turtle Jul 19, 2023, 10:48 PM

#

yeah it might just be insanely brittle

bitter turtle Jul 19, 2023, 11:22 PM

#

bitter turtle @pallid current latest commit should be ready for merge

If you wanna do the PR, then I can sync to your branch

#

If not I can idm

pallid current Jul 19, 2023, 11:45 PM

#

yo yeah i'll start merging now

pallid current Jul 20, 2023, 1:20 AM

#

merged the new ensemble stuff, currently rerunning some of the graphs in the post with argmax(max) instead of argmax(mean) and still tryna run autointerp on the new results lol

#

off for a bit but will prob work a bit later

keen pivot Jul 20, 2023, 1:38 AM

#

I'm just trying to get reconstruction loss down by doing tied w/ smaller l1's to see if that helps, while still learning meaningful features.

#

Additionally, tomorrow I can look into the dataset of predictions that do best & worst on perplexity to see if there's a pattern.

pallid current Jul 20, 2023, 5:54 AM

#

been reading this paper that was recommended to me yesterday, v relevant: https://arxiv.org/pdf/2210.01892.pdf

#

notes:

they define the capacity allocated to a feature i as $C_i = \frac{(W_i\cdot W_i)^2}{\sum_j(W_i\cdot W_j)^2}$ where $W_i$ is the weight vector in the embedding matrix for feature $i$
total capacity can be no more than (but can be less than) the total embedding dimension D
find that across the model, capacity will be allocated at the point where the marginal value of capacity in that feature is some constant value (if the marginal value were different then you could reduce loss by reallocation). therefore you can only expect to see superposition if there are decreasing returns to capacity, which they say occurs for inputs of high sparsity or kurtosis.
asserts a strong relationship in general between sparsity and kurtosis which i hadn't understood before (this seems quite well studied in neuroscience eg https://iopscience.iop.org/article/10.1088/0954-898X/12/3/302, should probably look further since this is 20yrs old)
find that you get full capacity if you have a weight matrix which is semiorthogonal, meaning that $WW^T=\lambda I$, (as well as just going diagonal of course). can combine the approaches by having orthogonal subspaces - at small dimension this becomes TMOS' polytope model.
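
A small sketch of the per-feature capacity from the notes above, in plain Python. It assumes each row of `W` is the embedding vector of one feature, with the denominator summing squared dot products over all features including $j = i$; the helper names and toy matrices are illustrative:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def capacities(W):
    """C_i = (W_i . W_i)^2 / sum_j (W_i . W_j)^2 for each row W_i of W."""
    caps = []
    for w_i in W:
        denom = sum(dot(w_i, w_j) ** 2 for w_j in W)
        caps.append(dot(w_i, w_i) ** 2 / denom if denom > 0 else 0.0)
    return caps


# Two orthogonal features in 2 dims: each gets a full dimension.
print(capacities([[1.0, 0.0], [0.0, 1.0]]))              # [1.0, 1.0]
# Two features sharing one direction, plus one orthogonal feature.
print(capacities([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]))  # [0.5, 0.5, 1.0]
```

In both toy cases the capacities sum to 2, the embedding dimension, matching the bound on total capacity in the notes.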

bitter turtle Jul 20, 2023, 7:19 AM

#

Oh yeah v good paper, did you have any thoughts on how we could use this? Will check out neuroscience paper

#

hhhh Bristol uni doesn't provide access

keen pivot Jul 20, 2023, 4:22 PM

#

bitter turtle Oh yeah v good paper, did you have any thoughts on how we could use this? Will c...

I definitely need another hour or so to grok the paper, but one part I'd like to understand is kurtosis.

wiki says it's a measure of the (amount of outliers? extremity of them?).

So if we try to intentionally find directions w/ high kurtosis, then, given an activation dataset, we can find these directions by having a measure of kurtosis, then optimizing the direction, and defining loss as kurtosis?

Then repeat and add a diversity term so you find new directions.

#

I also don't understand sparsity & kurtosis being similar (the neuro paper says they're not, except for sampling kurtosis?). Like you can have many outliers & that shouldn't affect the frequency of them?

keen pivot Jul 20, 2023, 4:53 PM

#

Update: I'm seeing better reconstruction losses for higher sparsity values for tied embedding (e.g. 0.5-0.4, where earlier we had 0.75). This is unsurprising, but I still need to see if there are meaningful features, we didn't learn the identity, & the actual perplexity difference.

flint prawn Jul 20, 2023, 5:01 PM

#

bitter turtle hhhh Bristol uni doesn't provide access

📎 net.12.3.255.270.pdf

bitter turtle Jul 20, 2023, 5:02 PM

#

Tysm

bitter turtle Jul 20, 2023, 5:08 PM

#

keen pivot I also don't understand sparsity & kurtosis being similar (the neuro paper says ...

Not sure what you mean, kurtosis is a measure of heaviness of tails which is like approximately what sparsity is

#

For instance, on symmetric uniform distribution with additional mass on 0, kurtosis ~ sparsity³

keen pivot Jul 20, 2023, 5:16 PM

#

bitter turtle Not sure what you mean, kurtosis is a measure of heaviness of tails which is lik...

From the wiki:

This number is related to the tails of the distribution, not its peak;[2] hence, the sometimes-seen characterization of kurtosis as "peakedness" is incorrect. For this measure, higher kurtosis corresponds to greater extremity of deviations (or outliers), and not the configuration of data near the mean.
Edit: I see this doesn't talk about your point. Could you explain how you think kurtosis relates to the shape of the distribution?

keen pivot Jul 20, 2023, 5:20 PM

#

bitter turtle Not sure what you mean, kurtosis is a measure of heaviness of tails which is lik...

Can you define sparsity here?

I want to say frequency of the feature activating, which if we operationalize "feature activating" to mean "activates more than N std's above mean" (or something), then that makes sense to say a feature is more sparse if it has a thinner tail (and vice versa)

bitter turtle Jul 20, 2023, 5:24 PM

#

keen pivot Can you define sparsity here? I want to say frequency of the feature activatin...

Frequency of feature activation, i.e. frequency that X is drawn from the uniform distribution and not 0

bitter turtle Jul 20, 2023, 5:25 PM

#

keen pivot I also don't understand sparsity & kurtosis being similar (the neuro paper says ...

Also the neuro paper doesn't say this afaict?

keen pivot Jul 20, 2023, 5:26 PM

#

bitter turtle Also the neuro paper doesn't say this afaict?

Although these ideas are related, they are not identical, and the most common measure of lifetime sparseness - the kurtosis of the lifetime response distributions of the neurons - provides no information about population sparseness.

bitter turtle Jul 20, 2023, 5:26 PM

#

afacit they are making a distinction between two types of sparseness

#

afaict

keen pivot Jul 20, 2023, 5:27 PM

#

And sorry, I think I'm coming across as making strong claims, but I'm just confused and appreciate your help!

bitter turtle Jul 20, 2023, 5:27 PM

#

oh no not at all

bitter turtle Jul 20, 2023, 5:28 PM

#

bitter turtle afacit they are making a distinction between two types of sparseness

rather than saying that 'kurtosis doesn't measure sparseness' they are saying 'kurtosis (as a measure of sparseness on this axis) doesn't measure sparseness on this other axis'

keen pivot Jul 20, 2023, 5:34 PM

#

bitter turtle Frequency of feature activation, i.e. frequency that X is drawn from the uniform...

Just to ground out an example, I've plotted the activations of the residual stream & one of the dictionary features

#

Maybe you could automatically search for meaningful features by finding directions that match the right graph, more than the left. One measure may be kurtosis, which would be the E[normalized(x)^4], which the right graph has more than the left, right?

bitter turtle Jul 20, 2023, 5:38 PM

#

the right is basically normal right? I think it has zero excess kurtosis

keen pivot Jul 20, 2023, 5:38 PM

#

Oh, it might be a different order on my device. The residual stream one looks normal to me, and the dictionary one doesn't

bitter turtle Jul 20, 2023, 5:38 PM

#

bitter turtle Jul 20, 2023, 5:39 PM

#

keen pivot Oh, it might be a different order on my device. The residual stream one looks no...

oh

#

basically a 'spike and slab' (i.e. sparse) distribution looks more like the red one than say a normal distribution

#

you can't strictly speaking draw it as a pdf because [][][][] but like 'things with their weight on the tails spread out over a larger area' have higher kurtosis I think

keen pivot Jul 20, 2023, 5:45 PM

#

Okay, so kurtosis will be lower for rarer features, right?

bitter turtle Jul 20, 2023, 5:45 PM

#

uh

#

Noooo

#

Don't think so

#

Wait one sec

#

Let me calculate this

#

actually no im really confused

#

might have done my maths wrong

keen pivot Jul 20, 2023, 5:49 PM

#

Okay, but suppose we have two feature, lolololol

#

If kurtosis is E[x^4], then if you have 1% of values at e.g. 100 after normalization compared w/ 0.01% of values at 100, then the expectation takes that into account and gives different values?

bitter turtle Jul 20, 2023, 6:06 PM

#

yes?

pallid current Jul 20, 2023, 6:08 PM

#

bitter turtle Oh yeah v good paper, did you have any thoughts on how we could use this? Will c...

i think we should definitely track the total capacity and capacity per feature as metrics. and also see whether our feature matrices are block sparse- i dont know how to test this but should be easy enough

keen pivot Jul 20, 2023, 6:12 PM

#

bitter turtle yes?

So then one will be a rarer feature (e.g. 0.01% compared w/ 1%) which will mean it has a lower kurtosis.

bitter turtle Jul 20, 2023, 6:12 PM

#

bitter turtle For instance, on symmetric uniform distribution with additional mass on 0, kurto...

mb I goofed it's 1/p I get it now

#

1/sparsity

#

phee

#

phew

bitter turtle Jul 20, 2023, 6:12 PM

#

pallid current i think we should definitely track the total capacity and capacity per feature a...

sounds great, block sparsity kind of what I was going for with the covariance stuff

keen pivot Jul 20, 2023, 6:13 PM

#

pallid current i think we should definitely track the total capacity and capacity per feature a...

I haven't read it enough to code this. Does it seem easy to integrate?

bitter turtle Jul 20, 2023, 6:13 PM

#

bitter turtle sounds great, block sparsity kind of what I was going for with the covariance st...

don't have a good cumulative metric

bitter turtle Jul 20, 2023, 6:13 PM

#

keen pivot I haven't read it enough to code this. Does it seem easy to integrate?

yep, simple calculation

#

can do it now

pallid current Jul 20, 2023, 6:15 PM

#

keen pivot So then one will be a rarer feature (e.g. 0.01% compared w/ 1%) which will mean ...

excess kurtosis of a bernoulli variable (x axis is the p param from 0-1 but i overwrote the xticks func sozzz):

#

goes to +inf at x = 0 or 1
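
For reference, the curve described above has a standard closed form: the excess kurtosis of a Bernoulli(p) variable is (1 - 6p(1-p)) / (p(1-p)). A short sketch (the function name and sample values of p are just illustrative):

```python
def bernoulli_excess_kurtosis(p):
    """Closed form (1 - 6p(1-p)) / (p(1-p)), valid for 0 < p < 1."""
    q = 1.0 - p
    return (1.0 - 6.0 * p * q) / (p * q)


for p in [0.5, 0.1, 0.01, 0.001]:
    print(p, bernoulli_excess_kurtosis(p))
# Minimum of -2 at p = 0.5; grows without bound as p approaches 0 or 1.
```

Read with p as the firing frequency of a binary feature, the rarer the feature, the larger the excess kurtosis.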

keen pivot Jul 20, 2023, 6:21 PM

#

& excess kurtosis is just regular kurtosis - 3, so that normal distribution is set to 0?

keen pivot Jul 20, 2023, 6:22 PM

#

pallid current excess kurtosis of a bernoulli variable (x axis is the p param from 0-1 but i ov...

Does this relate to "rarer features will have lower kurtosis" ?

pallid current Jul 20, 2023, 6:22 PM

#

also that kurtosis isnt just E[x^4], it's the standardized moment, you do E[((x-mean)/std_dev)^4]
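
A minimal sketch of the standardized fourth moment on raw samples, using the population (biased) estimator and plain Python. The two toy "features" are made up: one fires on 1% of tokens, the other on 10%:

```python
def kurtosis(xs):
    """Standardized fourth moment E[((x - mean) / std)^4] (population form)."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    return sum((x - mean) ** 4 for x in xs) / n / var ** 2


def excess_kurtosis(xs):
    """Kurtosis minus 3, so a normal distribution scores 0."""
    return kurtosis(xs) - 3.0


# A feature that fires (value 1) on 1% of tokens vs. one that fires on 10%.
rare = [1.0] * 1 + [0.0] * 99
common = [1.0] * 10 + [0.0] * 90
print(kurtosis(rare), kurtosis(common))  # roughly 98.0 vs 8.1
```

After standardizing, the rarer feature ends up with the much larger kurtosis, since its few active samples sit many standard deviations from the mean.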

keen pivot Jul 20, 2023, 6:23 PM

#

Correct

pallid current Jul 20, 2023, 6:23 PM

#

keen pivot Does this relate to "rarer features will have lower kurtosis" ?

i think this shows that rarer feats (when viewed correctly/separated out) will have higher kurtosis

keen pivot Jul 20, 2023, 6:28 PM

#

pallid current i think this shows that rarer feats (when viewed correctly/separated out) will h...

I just checked & you're right. I don't understand how the bernoulli variable graph gives intuition for that.

pallid current Jul 20, 2023, 6:29 PM

#

keen pivot I just checked & you're right. I don't understand how the bernoulli variable gra...

because as the feature likelihood goes down to 0, the kurtosis rises super fast

#

ok it doesnt give intuition lol but it does show it!

keen pivot Jul 20, 2023, 6:31 PM

#

Ah, I see. Gotcha

#

The only intuition I've got is that normalizing causes the effect. If you have more outliers, then the mean is shifted & std is greater, which has a large effect on the ^4 part.

bitter turtle Jul 20, 2023, 6:51 PM

#

that's why you standardise it

#

similar figure for kurtosis (not excess) vs feature density (where the density (x-axis) is the probability that the feature is nonzero and uniformly distributed over [-1, 1])

#

goes to inf the rarer the feature is
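
The same relationship can be written in closed form. A sketch, assuming the feature is 0 with probability 1 - p and Uniform[-1, 1] with probability p (so the mean is 0); the function name is illustrative:

```python
def spike_and_slab_kurtosis(p):
    """Kurtosis of X = 0 w.p. 1-p, Uniform[-1, 1] w.p. p (mean is 0)."""
    second_moment = p / 3.0  # p * E[U^2], with E[U^2] = 1/3
    fourth_moment = p / 5.0  # p * E[U^4], with E[U^4] = 1/5
    return fourth_moment / second_moment ** 2  # simplifies to 9 / (5p)


for p in [1.0, 0.5, 0.1, 0.01]:
    print(p, spike_and_slab_kurtosis(p))
```

At p = 1 this recovers the kurtosis of a plain uniform distribution (1.8), and for smaller p it scales as 1/p, which matches the 1/sparsity correction earlier in the thread.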

bitter turtle Jul 20, 2023, 7:04 PM

#

pallid current i think we should definitely track the total capacity and capacity per feature a...

I lied, we can't have exact capacity here, because we don't know feature 'magnitudes'; we can probably still expect proportional capacity to be a useful metric though (defining proportional capacity as 'capacity over normed dict' which is probably a synonym for 'amount of interference on feature i')

#

unsure what 'total proportional capacity' would get us, would expect that to be fairly constantly minimal

pallid current Jul 20, 2023, 7:11 PM

#

bitter turtle I lied, we can't have exact capacity here, because we don't know feature 'magnit...

can't we just treat our predicted feature activations as the true magnitude of the true features?

#

and once we do that our set up becomes basically identical to that in the paper?

bitter turtle Jul 20, 2023, 7:14 PM

#

not exactly sure. In the paper there is at least some idea of the range of activations of features, and it's kind of uniform across features, while with ours we see ridiculous variance in activation

pallid current Jul 20, 2023, 7:37 PM

#

is that a problem tho? to me it's just an interesting part of our findings (didnt know btw!) because if there are different scales then presumably that creates more interference for any direction which has some degree of cosine sim

#

i think the capacity paper predicts that those directions would have more of a full dimension to themselves

bitter turtle Jul 20, 2023, 8:00 PM

#

Hmm, yeah

#

I think that by default the absurd activation dimensions (the ones with like 1k max activation) have a full dimension to themselves (1k/(1k + n*epsilon) is basically just 1) even without the predictions of the paper

#

Definitely can look at it tho

#

we shall see

bitter turtle Jul 20, 2023, 8:10 PM

#

pallid current can't we just treat our predicted feature activations as the true magnitude of t...

How are we measuring this anyway, I feel like mean activation is a decent idea since then it becomes an approximation for expected interference?

#

We could just directly measure Expected interference

#

Let $c_{k,i}$ be the value taken by feature $i$ on batch $k$. Then,
$C_i = \frac{1}{K} \sum_k \frac{c_{k,i}}{\sum_j (c_{k,j} * (W_i \cdot W_j)^2)}$
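
A rough sketch of that formula in plain Python, assuming non-negative activations and a unit-norm dictionary. The function name, the `nonzero_only` switch (averaging only over batches where the feature fires), and the choice to count a zero denominator as 0 are all assumptions for illustration, not the project's code:

```python
def expected_interference(acts, dictionary, nonzero_only=False):
    """C_i = mean_k c[k][i] / sum_j c[k][j] * (W_i . W_j)^2.

    acts: K rows of n feature activations (assumed non-negative).
    dictionary: n feature directions W_i, assumed unit norm.
    Batches where the denominator is zero contribute 0 (an assumption).
    """
    n = len(dictionary)
    # Squared dot products between every pair of dictionary directions.
    gram_sq = [
        [sum(a * b for a, b in zip(dictionary[i], dictionary[j])) ** 2
         for j in range(n)]
        for i in range(n)
    ]
    totals = [0.0] * n
    counts = [0] * n
    for row in acts:
        for i in range(n):
            if nonzero_only and row[i] == 0:
                continue
            counts[i] += 1
            denom = sum(row[j] * gram_sq[i][j] for j in range(n))
            if denom > 0:
                totals[i] += row[i] / denom
    return [t / c if c else float("nan") for t, c in zip(totals, counts)]


# Toy check: two identical directions that always fire together share capacity.
print(expected_interference([[1.0, 1.0]], [[1.0, 0.0], [1.0, 0.0]]))
```

With an orthonormal dictionary every active feature scores exactly 1, so averaging over all batches just returns the firing frequency, while `nonzero_only=True` returns 1.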

sinful shuttleBOT Jul 20, 2023, 8:17 PM

#

aidan ewart

bitter turtle Jul 20, 2023, 8:17 PM

#

I feel like expected interference is kind of what the fractional dimensionality thing is measuring in their paper.

bitter turtle Jul 20, 2023, 8:23 PM

#

bitter turtle Let $c_{k,i}$ be the value taken by feature $i$ on batch $k$. Then, $C_i = \frac...

Or maybe just over the cases where it's nonzero

pallid current Jul 20, 2023, 8:24 PM

#

isn't this identical to the paper if you take c_i to be the size of the incoming weight instead of activation

#

and adding dot products and activations seems wrong

bitter turtle Jul 20, 2023, 8:25 PM

#

Wdym 'incoming weight'

bitter turtle Jul 20, 2023, 8:25 PM

#

pallid current and adding dot products and activations seems wrong

I'm multiplying?

pallid current Jul 20, 2023, 8:25 PM

#

like i think they're basically measuring expected interference given that the input is uniform or something?? and then normalizing to express it as fractions of a dimension

bitter turtle Jul 20, 2023, 8:26 PM

#

Oh yeah sure agree, that's also what this is

pallid current Jul 20, 2023, 8:26 PM

#

bitter turtle I'm multiplying?

yeah true haha

bitter turtle Jul 20, 2023, 8:27 PM

#

\times was too many characters

bitter turtle Jul 20, 2023, 8:30 PM

#

pallid current like i think they're basically measuring expected interference given that the in...

Yeah ok want to measure the 'frac when feature is nonzero' thing then probably

pallid current Jul 20, 2023, 8:38 PM

#

so like have some empirical interference measure based on the activations which we can use as a complement to the weight based ones?

#

that makes sense. im gonna go off discord for a bit cos i havent done focused work properly in a couple days, back this eve

bitter turtle Jul 20, 2023, 8:41 PM

#

since our dict is normed I don't see how we can do a purely weight based one

shell mural Jul 20, 2023, 9:10 PM

#

Yo, I run a mechanistic interpretability reading group on another server. The topic for next wednesday has been chosen: dictionary learning!

#

We read papers and posts related to given topic each week and occasionally invite the authors of the work we went through to join for a Q&A sorta discussion

#

I invited @keen pivot as a guest speaker, but there's others in here e.g. @pallid current (and possibly more that im missing) that are also heavily involved with the project. Wanna join too?

pallid current Jul 20, 2023, 9:15 PM

#

What time?

shell mural Jul 20, 2023, 9:15 PM

#

It's tentatively 1pm EST (10am PDT) by default, then we allow a fudge factor for flexibility on the guest speakers part

#

Logan said he was free at that time, not sure about you guys though

pallid current Jul 20, 2023, 9:18 PM

#

OK nice yeah I can say hi at that time tho I'm sure Logan's mostly got it covered

pallid current Jul 20, 2023, 9:19 PM

#

pallid current that makes sense. im gonna go off discord for a bit cos i havent done focused wo...

Immediately set off a slow run right after posting this lmao

bitter turtle Jul 20, 2023, 9:45 PM

#

Pahaha

bitter turtle Jul 20, 2023, 9:45 PM

#

shell mural Yo, I run a mechanistic interpretability reading group on another server. The to...

Which server?

#

Nvm got it 😄

#

Would definitely be interested in coming to the Q+A, to get a feel of other people's takes on this direction

bitter turtle Jul 20, 2023, 10:09 PM

#

does anyone know of a 'one-sided' kurtosis for asymmetric distributions? tempted just to use E[X^3]/variance or E[X^4]/variance

#

https://wandb.ai/sparse_coding/sparse coding/runs/1w4vlec6 <- run tracking expected interference and 'asymmetric skew' (E[X^3]/sd^3) (random metric I pulled out of a hat)

#

hmm, I wonder if I have botched expected interference, these seem low

#

should be approaching 1

pallid current Jul 20, 2023, 10:33 PM

#

currently running a small auto_interp cycle (40 feats) on all 12 of the non-tied final epoch dicts from tuesday's run

bitter turtle Jul 20, 2023, 10:34 PM

#

awesome

pallid current Jul 20, 2023, 10:36 PM

#

some obvious differences in how often some of them just dont have enough nonzero activations. just from eyeballing it looks like there might be some differences in average score but wont really know at all till i run the graphs

#

will prob need more than 40 to have any confidence tbh

bitter turtle Jul 20, 2023, 10:58 PM

#

I'd be very interested to know how the results change when you do autointerp on the decoder directions, assumed you were doing that already

pallid current Jul 20, 2023, 11:10 PM

#

bitter turtle I'd be very interested to know how the results change when you do autointerp on ...

yeah it just never crossed my mind, my bad, will run it once the current exp is done. i should prob also look into whether i can parallelize the gpt4 api calls, it's starting to be a big bottleneck
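
One possible shape for parallelizing the calls, sketched with a thread pool since each request mostly waits on the network. `explain_feature` is a hypothetical stand-in for a single blocking explanation request, not a function from the repo or from any API client:

```python
from concurrent.futures import ThreadPoolExecutor


def explain_feature(feature_id: int) -> str:
    """Hypothetical stand-in for one blocking explanation request."""
    return f"explanation for feature {feature_id}"


def explain_all(feature_ids, max_workers: int = 8):
    """Run the per-feature requests concurrently, keeping input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as feature_ids.
        return list(pool.map(explain_feature, feature_ids))


explanations = explain_all(range(40))
print(len(explanations))
```

Any real API will have rate limits, so `max_workers` would need tuning rather than being set as high as possible.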

shell mural Jul 21, 2023, 12:02 AM

#

Ok sweet! I'll make an announcement on the server and send hoagy an invite

#

Logan also requested I send a google calendar invite, if either of you want one of those too you can dm me your email and I'll figure that out

pallid current Jul 21, 2023, 12:17 AM

#

results from autointerping 12 nontied 2-dict-ratio dicts:

bitter turtle Jul 21, 2023, 12:20 AM

#

Can't remember the numbers, those look better?

pallid current Jul 21, 2023, 12:20 AM

#

general takeaways:

* l1=0.01 is a nono, dead feats everywhere,
* l1=0.003 looks noticeably worse
* 0.001 and 0.003 look basically the same.
* cant see any obvious effect of the l2 bias, might be worth setting it higher in some runs just to see if that does anything
* performance looks roughly similar to the original experiment (far left). differences are that these use tied and only a 2x ratio (4x for the original residual stream exps)
* seemingly doing a bit better on top/top-and-random but a bit worse on random

bitter turtle Jul 21, 2023, 12:22 AM

#

Ok cool! can do a more fine-grained search tomorrow if that'd be useful

pallid current Jul 21, 2023, 12:22 AM

#

potentially, though there's a huge amount of stuff still to check: larger dict sizes, how it evolves through the epochs, tied ones

bitter turtle Jul 21, 2023, 12:22 AM

#

yep yep

#

How should I think about perf on top Vs top-and-random Vs random?

#

do you have any thoughts on that?

pallid current Jul 21, 2023, 12:26 AM

#

i think random is ultimately the true measure. like if we were able to filter out noise perfectly, and only detect cases where a particular feature was truly active, and then specify its conditions perfectly then we could theoretically get perfect scores on random, and that's the highest bar

#

openai discuss it at: https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html#sec-algorithm-details, interesting comment that i didnt notice before but agree with is "A more principled approach which gets the best of both worlds [top-and-random vs random] might be to stick to random-only scoring, but increase the number of random-only text excerpts in combination with using importance sampling as a variance reduction strategy."

pallid current Jul 21, 2023, 1:03 AM

#

pallid current yeah it just never crossed my mind, my bad, will run it once the current exp is ...

ran this, no noticeable difference, maaaybe a tiny bit higher, will do it by default from now on

keen pivot Jul 21, 2023, 1:50 AM

#

pallid current general takeaways: * l1=0.01 is a nono, dead feats everywhere, * l1=0.003 looks...

I think we should try lower values then, like 8e-4 and lower. I expect 1e-4 to 1e-5 to be identity

#

But lower l1 is better reconstruction, so it’d be good to see the regression to the neuron/residual basis, and just pick the one before it drops off.

Does that make sense?

pallid current Jul 21, 2023, 1:59 AM

#

i see what you mean but i dont agree.. i guess i think the thing that we're trying to do is find good dictionaries, and i think the quality of the features is probably still going up at that point, even though it'll take a bit of a hit to reconstruction_loss

#

i agree we should do some more finegrained tests tho, i'd say restrict to 5e-4 - 5e-3 in future

#

btw am suddenly getting torch.cuda.is_available() == False???

#

might have to do a big save and restart the node

pallid current Jul 21, 2023, 3:18 AM

#

pallid current might have to do a big save and restart the node

still an issue, deleting a couple of old backups and restarting, backing up everything currently on the server onto /mnt/ssd-cluster, space on the ssd-cluster will be kinda tight after this, have pinged mr waifu for more

keen pivot Jul 21, 2023, 2:20 PM

#

pallid current i see what you mean but i dont agree.. i guess i think the thing that we're tryi...

I agree this is a possibility, but it’d be good to verify with auto-interp.

Those l1 values sound good!

keen pivot Jul 21, 2023, 3:20 PM

#

Okay, I've gotten a little better reconstruction loss (.075->.069) by just training on a larger batch size (256->2048); possibly explained by just training on more data, but I haven't checked.

#

Additionally, we have much lower reconstruction losses for lower l1-values; the far left one does have more MCS > 0.9, but all 3 have decent looking distributions overall.

I can look into the low-reconstruct dictionaries specifically to ensure no identities were learned & sample a few features for interp sake. We can also do auto-interp.

keen pivot Jul 21, 2023, 3:59 PM

#

Yep, it looks like quite meaningful features. Here's the 2000'th highest-MCS one (MCS=0.85)

#

#

@pallid current how many tokens/datapoints are you using to do top-random sampling for hypothesis generation?

#

I want to communicate my surprise that GPT-4 wasn't able to understand the hyphens had numbers before them when I reviewed its hypotheses yesterday.

bitter turtle Jul 21, 2023, 4:19 PM

#

pallid current i see what you mean but i dont agree.. i guess i think the thing that we're tryi...

maybe we should look at linear VAEs; we still get 'good dictionaries' but can explicitly ignore a degree of 'noise' from less important features

keen pivot Jul 21, 2023, 4:24 PM

#

bitter turtle maybe we should look at linear VAEs; we still get 'good dictionaries' but can ex...

One problem here is that perplexity does go up pretty significantly when replacing w/ our current dictionaries (e.g. 25->100, though I should also check perplexity on some other dataset than pile-10k). So there are important features (or something) that we're missing when training lower-sparsity/higher l1 models.

bitter turtle Jul 21, 2023, 4:25 PM

#

well, I wouldn't expect a transformer to be entirely describable via sparse coding anyway; NNs implement a bunch of different algos in different ways many of which don't involve sparse features

keen pivot Jul 21, 2023, 4:26 PM

#

bitter turtle well, I wouldn't expect a transformer to be entirely describable via sparse codi...

Oh, one of them may be the outlier dimensions, though the model does seem to capture those dimensions quite easily.

bitter turtle Jul 21, 2023, 4:27 PM

#

Imo it's better to have fewer high-confidence, really well understood features corresponding to high-importance concepts than to have some bad sparse decomposition of the entire thing

keen pivot Jul 21, 2023, 4:28 PM

#

bitter turtle Imo its _better_ to have fewer high-confidence, really well understood features ...

Agreed, but knowing which ones are better can be done empirically.

bitter turtle Jul 21, 2023, 4:28 PM

#

keen pivot Oh, one of them may be the outlier dimensions, though the model does seem to cap...

Not sure what you mean; I'm more thinking along the lines of models having subcircuits that are not using sparse features at all, and instead use something like the modular arithmetic circuit or something

bitter turtle Jul 21, 2023, 4:28 PM

#

keen pivot Agreed, but knowing which ones are better can be done empirically.

Ah, but I worry that using reconstruction loss is negatively impacting our learning of high-importance features

keen pivot Jul 21, 2023, 4:28 PM

#

bitter turtle Not sure what you mean; I'm more thinking along the lines of models having subci...

Ya, I did want an example actually. How would that circuit not be described by sparse codes?

bitter turtle Jul 21, 2023, 4:29 PM

#

well, like in the paper

keen pivot Jul 21, 2023, 4:29 PM

#

bitter turtle Ah, but I worry that using reconstruction loss is negatively impacting our learn...

That's an understandable concern. What would be the metric that convinces you one way or another?

bitter turtle Jul 21, 2023, 4:30 PM

#

I think you could 'describe' it with sparse codes but they would be bad and not the best fitting description

#

which is kind of what we aim for

keen pivot Jul 21, 2023, 4:30 PM

#

bitter turtle I think you _could_ 'describe' it with sparse codes but they would be bad and no...

Would that be like a piece-wise linear approximation?

bitter turtle Jul 21, 2023, 4:31 PM

#

keen pivot That's an understandable concern. What would be the metric that convinces you on...

I think we should do more counterfactual testing with high-MCS features

#

I can get on that ig

keen pivot Jul 21, 2023, 4:31 PM

#

Lol

bitter turtle Jul 21, 2023, 4:31 PM

#

keen pivot Would that be like a piece-wise linear approximation?

Not sure what you mean

keen pivot Jul 21, 2023, 4:31 PM

#

What does counterfactual testing mean here?

bitter turtle Jul 21, 2023, 4:32 PM

#

Like causal scrubbing-type-things. Ablations etc.

keen pivot Jul 21, 2023, 4:32 PM

#

bitter turtle Not sure what you mean

Like approximating x^2 w/ several piece-wise linear functions mostly describes it, but is inefficient and not exact.
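
To put numbers on the analogy, here is a small sketch (an illustration only, on [0, 1] with equal-width segments) of how the worst-case error of a piecewise-linear interpolant of x^2 shrinks. For n segments the gap peaks at each segment midpoint at 1/(4 n^2), so it never reaches zero:

```python
def max_interp_error(n_segments: int, samples_per_segment: int = 100) -> float:
    """Largest gap between x^2 and its piecewise-linear interpolant on [0, 1]."""
    h = 1.0 / n_segments
    worst = 0.0
    for s in range(n_segments):
        x0, x1 = s * h, (s + 1) * h
        for t in range(samples_per_segment + 1):
            x = x0 + (x1 - x0) * t / samples_per_segment
            # Straight line through (x0, x0^2) and (x1, x1^2).
            line = x0 ** 2 + (x1 ** 2 - x0 ** 2) * (x - x0) / (x1 - x0)
            worst = max(worst, line - x ** 2)
    return worst


for n in (1, 2, 4, 8):
    print(n, max_interp_error(n), 1 / (4 * n * n))
```

Doubling the number of pieces only cuts the error by a factor of four, which is the "mostly describes it, but inefficiently" behaviour in miniature.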

keen pivot Jul 21, 2023, 4:32 PM

#

bitter turtle Like causal scrubbing-type-things. Ablations etc.

Is it just like my post or anything else different?

bitter turtle Jul 21, 2023, 4:33 PM

#

Oh, right. Yeah, so you can theoretically probably describe anything with arbitrarily sparse codes but you need lots of them and it would be a really complex and bad encoding so kinda similar yeh

bitter turtle Jul 21, 2023, 4:34 PM

#

keen pivot Is it just like my post or anything else different?

Not sure can't really remember will check again

#

I don't really trust autointerp

keen pivot Jul 21, 2023, 4:37 PM

#

I do like this train of thought. I can see two tasks:

1. Verify features found by various sparsities (from 5 features/token to 500). This will eventually become the identity.
2. Given the best dictionary from (1), we can find the greatest perplexity-diff between the original & reconstructed models. (ie run perplexity test on original, then run on reconstructed. Find the datapoints w/ the greatest differences) Those diff points may point towards functions in the model that aren't best represented by sparse codes.
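
The second task could be sketched like this, assuming we already have a per-datapoint loss (log-perplexity) for the original model and for the model with reconstructed activations. The function name and the numbers are made up for illustration:

```python
def largest_loss_gaps(original_losses, reconstructed_losses, top_k=5):
    """Return (index, gap) pairs for the datapoints hurt most by reconstruction."""
    gaps = [
        (i, rec - orig)
        for i, (orig, rec) in enumerate(zip(original_losses, reconstructed_losses))
    ]
    # Largest increase in loss first.
    gaps.sort(key=lambda pair: pair[1], reverse=True)
    return gaps[:top_k]


# Made-up per-datapoint losses, purely for illustration.
original = [2.1, 3.0, 1.5, 2.8]
reconstructed = [2.3, 5.5, 1.6, 2.9]
print(largest_loss_gaps(original, reconstructed, top_k=2))
```

The returned indices are the datapoints to read by hand when looking for behaviour the dictionary fails to capture.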

keen pivot Jul 21, 2023, 4:39 PM

#

bitter turtle I don't really trust autointerp

I like how autointerp is able to tell the difference between the basis & the dictionary, so I also expect it to find when the dictionary becomes the basis (when it learns the identity). I do agree that finer-grained measures (ie is this dict better than that one) are uncertain.

keen pivot Jul 21, 2023, 4:39 PM

#

bitter turtle Not sure can't really remember will check again

Would be awesome to have someone else looking at features by hand & figuring out ways to maybe auto-detect polysemanticity?

bitter turtle Jul 21, 2023, 4:50 PM

#

yep will try to, haven't done this kind of interp b4, will bug you for help in dms if i end up needing it

#

not sure polysemanticity is a meaningful enough term to do anything better than throwing gpt-4 at it tho

keen pivot Jul 21, 2023, 7:35 PM

#

@bitter turtle desired wandb metrics:

1. Features/tokens (number of non-zero activations per token on average)
2. MMCS
3. Full histogram of MCS

2 & 3 would need to be done every few batches to compare dictionaries learned at different sizes. Syncing that may be a headache; if so, maybe just do it at the end of training.
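
For reference, a dependency-free sketch of metrics 2 and 3, assuming the usual definition: MCS of a feature is its maximum cosine similarity with any feature in the other dictionary, and MMCS is the mean of those. The helper names and the toy dictionaries are illustrative; a real run would vectorise this:

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def mcs(dict_a, dict_b):
    """For each feature in dict_a, its max cosine similarity with any feature in dict_b."""
    return [max(cosine(u, v) for v in dict_b) for u in dict_a]


def mmcs(dict_a, dict_b):
    """Mean of the per-feature MCS scores."""
    scores = mcs(dict_a, dict_b)
    return sum(scores) / len(scores)


def histogram(scores, n_bins=10):
    """Counts of scores in equal-width bins over [0, 1]; values outside are clamped."""
    counts = [0] * n_bins
    for s in scores:
        idx = min(max(int(s * n_bins), 0), n_bins - 1)
        counts[idx] += 1
    return counts


small = [[1.0, 0.0], [0.0, 1.0]]
large = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0], [-1.0, 0.0]]
print(mmcs(large, small), histogram(mcs(large, small)))
```

Note that MCS is not symmetric: every feature of the small dictionary can be matched in the large one without the reverse holding.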

bitter turtle Jul 21, 2023, 7:58 PM

#

Could do it every chunk (~2M activations, 2k batches) without much hassle

#

Otherwise yeah

#

Not sure what you mean by 1

keen pivot Jul 21, 2023, 8:18 PM

#

bitter turtle Not sure what you mean by 1

dict_levels.detach().count_nonzero(dim=1).float().mean().item()

#

dict_levels is the latent dimension (ie feature activations)

#

Basically how many features activate for a given token: for the sentence " The cow", it may activate 3 features:

* animals
* words that start w/ "c"
* words that come after " the"

#

Thought: we could do sparse coding on the activations, except for the outlier dimensions.

bitter turtle Jul 21, 2023, 8:35 PM

#

keen pivot ```dict_levels.detach().count_nonzero(dim=1).float().mean().item()```

I do this already

bitter turtle Jul 21, 2023, 8:36 PM

#

keen pivot Thought: we could do sparse coding on the activations, except for the outlier di...

Good idea

keen pivot Jul 21, 2023, 8:37 PM

#

bitter turtle I do this already

Oh ya, sorry I forgot it!

bitter turtle Jul 21, 2023, 8:41 PM

#

Np

#

Just was confused if you meant something else

bitter turtle Jul 21, 2023, 9:08 PM

#

@pallid current haven't used your autointerp stuff yet tbh, is there a function I can just call to get a score out for doing MCS-to-autointerp-score correlation testing

bitter turtle Jul 21, 2023, 9:34 PM

#

I'm off to bed, if you want to do autointerp on a bunch of the same dict I trained 16x iters for dict ratios 2, 4, 8 for l1=1e-2 on resid stream in multiple_iters_mcs_21_07 @pallid current, otherwise I can try and figure it out tomorrow

pallid current Jul 21, 2023, 9:56 PM

#

bitter turtle @pallid current haven't used your autointerp stuff yet tbh, is there a fun...

yo, no not like a single function, you can get like 100 feats with python interpret.py and then in ae_utils.py there's functions for getting the score data into lists etc

#

need to refactor interpret.py to make stuff like that easier, maybe over the weekend

pallid current Jul 21, 2023, 9:57 PM

#

bitter turtle I'm off to bed, if you want to do autointerp on a bunch of the same dict I train...

hmm i think l1=1e-2 might be too high to see much, i think like 90% features are mostly dead at that level

bitter turtle Jul 21, 2023, 10:04 PM

#

Gah

#

Which one do you think is best

#

I was just looking at the w&b logs but don't track dead neurons atm

pallid current Jul 21, 2023, 10:13 PM

#

1e-3 seems a very safe bet

pallid current Jul 21, 2023, 10:14 PM

#

pallid current results from autointerping 12 nontied 2-dict-ratio dicts:

here's what im basing it off

bitter turtle Jul 21, 2023, 10:18 PM

#

Mb forgot that existed 🤦

pallid current Jul 21, 2023, 11:59 PM

#

here's my writeup from the meeting, focusing on potential tests and metrics:

* suggests comparing the strength of the ablation effect with that found with neuron basis (i think there might have been more to this but i didn't catch it)
* should we compare perplexity to perplexity from replacing with non-sparse coding reconstruction with equivalent reconstruction loss, to see if we're capturing more important directions of variation than we would otherwise expect?
* similarly, do we see surprisingly low reconstruction costs if we only restrict to 1 or 2 layers downstream?
* can we check whether we're fragile to small quantities of nonsparse data?
* can we express reconstruction loss as a proportion of the total variance?
* can we see some relationship between the MLP and residual stream where features that are detected by the MLP are then more visible in the residual stream than they were?
* can we find examples where the directions we find match up to directions found by e.g. sparse probing (or maybe simple linear combos of a few directions etc)
* can we find good feature candidates from our dictionaries by selecting for e.g. variance explained, size/frequency of activation etc?

bitter turtle Jul 22, 2023, 11:04 AM

#

ok i've implemented some standard 'metrics' (MMCS, sparsity hists) on my fork using the shared interface, probably needs some tlc to get the plots looking nice, but i have it integrated with the ensembled training code

#

Will also refactor big run to be less cursed/easier to switch to running different experiments if that's something that'd be useful to people

bitter turtle Jul 22, 2023, 11:06 AM

#

pallid current here's my writeup from the meeting, focusing on potential tests and metrics: * s...

Otherwise/after that I'll get on to this list

pallid current Jul 23, 2023, 1:33 AM

#

bitter turtle ok i've implemented some standard 'metrics' (MMCS, sparsity hists) on my fork us...

cheers all looks super good. just tried to run the big_sweep and had a few errors, likely due to merge so have fixed a few lil things on my repo

pallid current Jul 23, 2023, 1:35 AM

#

bitter turtle Will also refactor big run to be less cursed/easier to switch to running differe...

yeah i think it'd be worth taking a slight hit to efficiency to make it easier to customise runs

bitter turtle Jul 23, 2023, 1:07 PM

#

Ok, so this is a weird thing i've noticed. Ensembling is less of a pain when you use a functional interface but it seems that people generally prefer OO interfaces for testing etc, and so I'm ending up writing a lot of weird boilerplate to convert between the two.

#

Normally PyTorch hides all this nonsense but PyTorch also doesn't cooperate well with multiprocessing so I have to be kind of bespoke.

bitter turtle Jul 23, 2023, 2:33 PM

#

@pallid current are you using python 3.8 or 3.10 on the pod?

pallid current Jul 23, 2023, 3:47 PM

#

bitter turtle @pallid current are you using python 3.8 or 3.10 on the pod?

3.9

bitter turtle Jul 23, 2023, 3:52 PM

#

yeh im just getting some inconsistencies between dependencies etc between our branches like you noticed

#

why did you import GPT2Tokenizer

#

or did you remove that

#

from transformer_lens

#

do you want to do another pip freeze

pallid current Jul 23, 2023, 3:56 PM

#

bitter turtle yeh im just getting some inconsistencies between dependencies etc between our br...

yeah i think i somehow made a mistake in automerging the branches, seemed like 0 conflicts but i lost a couple of changes along the way

pallid current Jul 23, 2023, 3:57 PM

#

bitter turtle why did you import `GPT2Tokenizer`

and yeah in your branch that wasn't being imported but at first i imported it from t_lens instead of transformers

bitter turtle Jul 23, 2023, 3:57 PM

#

afaict not needed, i get errors importing it but w/e

#

@keen pivot did you ever get weirdness where wandb images weren't uploading like half the time

#

oh it just takes ages for some reason nvm ill downsize them

bitter turtle Jul 23, 2023, 4:19 PM

#

anyway IMO it's pretty easy to configure for different runs now
kind of tutorial for configuration: https://github.com/Baidicoot/sparse_coding/blob/main/big_sweep_experiments.py

bitter turtle Jul 23, 2023, 5:01 PM

#

pallid current here's my writeup from the meeting, focusing on potential tests and metrics: * s...

what do you mean by point 6? Like, translating MLP directions to residual stream space and see if they match up?

pallid current Jul 23, 2023, 5:03 PM

#

i think what i had in mind was like if we have 10 labels for things that an MLP neuron is doing, we would hopefully see that this feature had more of a coherent direction in the residual stream after the MLP has written back to it

#

and do it by probing for this concept using a synthetic dataset and measure the AUROC and degree of separation

bitter turtle Jul 23, 2023, 5:05 PM

#

as a side point, maybe we should clarify our language for the paper about features. I know nora for instance uses 'concept' to refer to the high-level human semantically meaningful thing, and 'feature' to mean 'direction in space corresponding to a concept'

#

(i was momentarily confused by the above)

bitter turtle Jul 23, 2023, 5:06 PM

#

pallid current i think what i had in mind was like if we have 10 labels for things that an MLP ...

labels meaning what? natural language description?

#

also neuron or dictionary direction?

#

also not sure how this relates to sparse coding, no step of that seems to relate to learned features, i've probably misunderstood you (unless you mean dict direction by neuron? in which case couldn't we just determine that by applying the MLP-post-activation-space -> residual stream and checking their similarity? I guess AUROC is better though, but then I don't see what the counterfactual is)

pallid current Jul 23, 2023, 5:10 PM

#

bitter turtle labels meaning what? natural language description?

natural language descriptions of learned features

bitter turtle Jul 23, 2023, 5:11 PM

#

ok but what's the counterfactual? do we ablate those directions in the MLP post-activation?

pallid current Jul 23, 2023, 5:13 PM

#

i guess i'm not sure what role the counterfactual is playing? like in my head if we show that the direction we've found in an MLP, which seems to correspond to a human concept, also lines up with that human concept being more extractable in the model, then it's extra evidence that we're understanding what computation the model is doing in that layer

#

i suppose our test could be more powerful by comparing to a stronger baseline than no increase in concept-extractability

bitter turtle Jul 23, 2023, 5:14 PM

#

ok sure

pallid current Jul 23, 2023, 5:14 PM

#

bitter turtle ok but what's the counterfactual? do we ablate those directions in the MLP post-...

right i get what you mean here

#

yeah that would also be a v good idea

bitter turtle Jul 23, 2023, 5:14 PM

#

i slightly misunderstood you i think

#

I guess like we could also do a regression of the activations of our learned features in the MLP activations and compare that regression's performance to the linear classifier on the residual stream, or something similar, otherwise im still confused as to what you have in mind

pallid current Jul 23, 2023, 5:21 PM

#

ok test i have in mind is:

* pick a concept which we think represents one of our learned features in an MLP
* use gpt-n to create a synthetic dataset of whether that concept is on
* run a linear classifier on the residual stream before and after the MLP for predicting labels of the synthetic dataset
* check whether there is a jump in performance after the MLP
* check whether this jump goes away if we ablate the learned direction in the MLP
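
The steps above can be sketched end to end on toy data. Everything here is an assumption made for illustration: the "MLP" is a stand-in that writes a fixed direction when the concept is on, the probe is a simple difference-of-class-means classifier rather than a trained linear classifier, and the data is random rather than gpt-generated:

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def ablate(x, direction):
    """Remove the component of x along a unit-norm direction."""
    coeff = dot(x, direction)
    return [a - coeff * d for a, d in zip(x, direction)]


def class_mean(rows):
    return [sum(r[i] for r in rows) / len(rows) for i in range(len(rows[0]))]


def mean_difference_probe(xs, labels):
    """Toy linear probe: difference between the two class means."""
    pos = class_mean([x for x, y in zip(xs, labels) if y == 1])
    neg = class_mean([x for x, y in zip(xs, labels) if y == 0])
    return [p - q for p, q in zip(pos, neg)]


def auroc(scores, labels):
    """Probability that a random positive scores above a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def probe_auroc(xs, labels):
    """Fit the probe on the first half, report AUROC on the second half."""
    half = len(xs) // 2
    w = mean_difference_probe(xs[:half], labels[:half])
    return auroc([dot(w, x) for x in xs[half:]], labels[half:])


rng = random.Random(0)
dim, n = 8, 400
direction = [1.0] + [0.0] * (dim - 1)  # stand-in for a learned dictionary direction
labels = [rng.randint(0, 1) for _ in range(n)]
resid_pre = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
# Toy "MLP": writes the concept along `direction` when the label is on.
mlp_out = [[3.0 * y * d for d in direction] for y in labels]
resid_post = [[a + b for a, b in zip(x, m)] for x, m in zip(resid_pre, mlp_out)]
resid_ablated = [
    [a + b for a, b in zip(x, ablate(m, direction))]
    for x, m in zip(resid_pre, mlp_out)
]

print("before MLP:", probe_auroc(resid_pre, labels))
print("after MLP:", probe_auroc(resid_post, labels))
print("after MLP, direction ablated:", probe_auroc(resid_ablated, labels))
```

In this toy the jump disappears entirely under ablation because the MLP writes nothing else; on a real model the interesting quantity is how much of the jump the learned direction accounts for.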

bitter turtle Jul 23, 2023, 6:04 PM

#

still not entirely sure what 'checking if there is a jump in performance after the MLP' gets us, but I'll get to implementing that

#

other than the synthetic dataset generation

bitter turtle Jul 23, 2023, 6:46 PM

#

also we can probably ask #1102791430549803049 or some other people about what classifiers/metrics/statistical methods are good to use if we want to enpaperify this

bitter turtle Jul 23, 2023, 8:45 PM

#

Was Pierre's initialisation stuff particularly important or not really?

bitter turtle Jul 23, 2023, 9:11 PM

#

pallid current ok test i have in mind is: * pick a concept which we think represents one of ou...

got this done other than datagen which is the actually hard part

bitter turtle Jul 24, 2023, 1:36 AM

#

It might be useful to plot explained variance vs achieved sparsity maybe
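
A minimal sketch of the explained-variance side of that plot, assuming it is defined as one minus squared reconstruction error over total variance around the mean activation (this also expresses reconstruction loss as a proportion of the total variance). The function name is illustrative:

```python
def fraction_variance_unexplained(activations, reconstructions):
    """Squared reconstruction error divided by total variance around the mean activation."""
    n, dim = len(activations), len(activations[0])
    mean = [sum(x[i] for x in activations) / n for i in range(dim)]
    total = sum((x[i] - mean[i]) ** 2 for x in activations for i in range(dim))
    error = sum(
        (x[i] - r[i]) ** 2
        for x, r in zip(activations, reconstructions)
        for i in range(dim)
    )
    return error / total


acts = [[0.0, 0.0], [2.0, 2.0]]
mean_only = [[1.0, 1.0], [1.0, 1.0]]  # always predicting the mean explains nothing
print(1.0 - fraction_variance_unexplained(acts, acts))
print(1.0 - fraction_variance_unexplained(acts, mean_only))
```

Plotting this against features/token for each dictionary would give one point per (l1, dict ratio) setting.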

pallid current Jul 24, 2023, 2:24 AM

#

damn, crushing it! will be back on it on monday properly but will have a look now

pallid current Jul 24, 2023, 2:27 AM

#

bitter turtle Was Pierre's initialisation stuff particularly important or not really?

i've never really understood the case where convergence speed was the constraint which led him to do the initialization stuff, and anyway i understood that it mostly helped the first part of training rather than getting the last bits of performance so i think it's unlikely to be useful. i think in certain toy models it was really slow but in real models it hasn't seemed to be necessary, as we see with like good results from models trained in like 15 min. i think it'd be good at some point to run some like 100, 1000 epoch models and check whether there's increased performance tho

pallid current Jul 24, 2023, 5:54 AM

#

made a few little changes to interpret and the save code and it now seems to be able to interpret using the new arch

#

tho i think that it would be best if the outputs were saved in individual folders by default, just makes the processing a little bit easier

#

will hopefully finally get to running more tests on the sweep tomorrow morn

bitter turtle Jul 24, 2023, 7:08 AM

#

pallid current tho i think that it would be best if the outputs were saved in individual folder...

Each model in own file, wdym?

bitter turtle Jul 24, 2023, 11:58 AM

#

maybe interesting plot

#

This is for 32 L1 coefs * 3 dict ratios * 2 repetitions for each

#

On residual stream

#

This may have already been done before idk

#

but the implications feel pretty important

bitter turtle Jul 24, 2023, 2:06 PM

#

same thing with pythia-160m layer 7 (wondering if there would be a meaningful difference between hopefully-quite-different parts of the model)

pallid current Jul 24, 2023, 4:09 PM

#

bitter turtle This is for 32 L1 coefs * 3 dict ratios * 2 repetitions for each

oh damn that's really interesting

#

especially surprising that there's little obvious benefit to the larger dicts

#

how long were they trained?

bitter turtle Jul 24, 2023, 4:10 PM

#

yeah I feel like this is a decent summary stat for comparing different approaches; not sure what literature there currently is on this tradeoff

bitter turtle Jul 24, 2023, 4:10 PM

#

pallid current how long were they trained?

pile10k

#

one epoch

bitter turtle Jul 24, 2023, 4:11 PM

#

pallid current especially surprising that there's little obvious benefit to the larger dicts

for sure

#

I really want to compare to synthetic data now, and synthetic data with some noise

pallid current Jul 24, 2023, 4:12 PM

#

bitter turtle pile10k

is that like 2m activations?? should be a solid amount but would be interested to see if more changes anything

bitter turtle Jul 24, 2023, 4:13 PM

#

yeah about that i think

#

lemmie check

#

about 1.5m

#

I mean it seems to converge pretty fast

pallid current Jul 24, 2023, 4:23 PM

#

bitter turtle same thing with pythia-160m layer 7 (wondering if there would be a meaningful di...

interesting that this one seems to have some benefit to the 8x ratio whereas the first shows literally none

bitter turtle Jul 24, 2023, 4:27 PM

#

yeah not sure if that's that significant

#

more testing with absurd dictionary sizes is probbaly called fro

#

for

pallid current Jul 24, 2023, 4:27 PM

#

wonder what happens as you take the size down

bitter turtle Jul 24, 2023, 4:28 PM

#

hmm

#

why

#

I guess we might be able to better predict stuff if we can see

#

how the frontier changes with dict size and extrapolate?

pallid current Jul 24, 2023, 4:30 PM

#

bitter turtle hmm

to see if there's a clear point at which the marginal dict element stops adding value

bitter turtle Jul 24, 2023, 4:30 PM

#

oh, for sure

#

I'll train some more now i've still got the data etc

pallid current Jul 24, 2023, 4:51 PM

#

bitter turtle Each model in own file, wdym?

yeah i'm imagining that each run (which should have its own folder as well cos it overwrites by default) gets saved as a folder which has the model in and then can also contain any additional data about that run, + autointerp

#

tho i do see why for like the l1sweep you just did it's easy to just keep them together, like it's a question of whether we do more work on them as separate entities or as part of a larger collection

bitter turtle Jul 24, 2023, 4:53 PM

#

I feel like it's not that much of an issue to save it as one big file, but I have moved the hardcoded config out of the sweep function (including output folder)

#

not sure it's that much of an issue, might be nice to store metrics with the models, but we can still just save it as a big list-of-tuples

#

seeing one gpu being weird, not sure what pytorch internals take up compute wise but hmm

WhatsApp_Image_2023-07-24_at_14.52.39.jpeg

pallid current Jul 24, 2023, 4:58 PM

#

bitter turtle not sure it's that much of an issue, might be nice to store metrics with the mod...

true for most metrics. i don't think it's a good file structure for autointerp which creates a massive dataframe for each dict and then a txt file for each neuron. obv the outputs could be put into one file structure but i wouldn't want to have a single file for the dataframes

bitter turtle Jul 24, 2023, 4:58 PM

#

could just put a filename/folder name/whatever

#

with the model

#

also, yikes yeah

#

that made me grimace

bitter turtle Jul 24, 2023, 5:23 PM

#

should pbbly do tests with different underlying dataset sizes to see the relationship but it seems pretty unchanging

#

fadedness is l1

#

because it looks nice

#

kinda hard to distinguish

#

but idc

#

Probably going to implement dead neuron resuscitation tomorrow and see if that changes anything

#

(like, resurrection for the first 5 chunks or something)
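
One way dead-neuron resurrection could look, as a sketch only: atoms that never fired are re-initialised to randomly chosen, normalised data points. This re-initialisation rule and all names here are assumptions, not what ended up in the repo:

```python
import torch

def resurrect_dead_atoms(dictionary, fire_counts, data_batch, generator=None):
    """Replace atoms that never fired with normalised random data points."""
    dead = fire_counts == 0
    n_dead = int(dead.sum())
    if n_dead == 0:
        return dictionary, 0
    # Pick one random data point per dead atom and scale it to unit norm.
    idx = torch.randint(0, data_batch.shape[0], (n_dead,), generator=generator)
    new_atoms = data_batch[idx]
    new_atoms = new_atoms / new_atoms.norm(dim=1, keepdim=True).clamp_min(1e-8)
    dictionary = dictionary.clone()
    dictionary[dead] = new_atoms
    return dictionary, n_dead

torch.manual_seed(0)
dictionary = torch.randn(8, 4)
fire_counts = torch.tensor([3, 0, 5, 0, 1, 2, 0, 9])  # atoms 1, 3, 6 never fired
batch = torch.randn(32, 4)
new_dict, n_resurrected = resurrect_dead_atoms(dictionary, fire_counts, batch)
```

Calling this only during the first few chunks would match the "first 5 chunks" idea.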

#

Hmm

#

Idk someone else theorise

pallid current Jul 24, 2023, 6:15 PM

#

bitter turtle should pbbly do tests with different underlying dataset sizes to see the relatio...

which model/layer?

pallid current Jul 24, 2023, 6:45 PM

#

looking at the same data you had aidan i'm getting the sense that it probably hasnt converged at 10 epochs in terms of maximum sparsity for a given level of unexplained variance

#

or at least there's a jump from like epoch6 to 10

#

weird plot but in those lines of dots what you can see is the sparsity / unexplained variance tradeoff getting better at each epoch

#

might be because we're repeating data tho, i wanna try this setup but with fresh data

keen pivot Jul 24, 2023, 7:14 PM

#

Looks like interesting stuff here, but I'm currently out of commission due to dental work today. Will hopefully be able to catch up/respond tomorrow. Will be meeting w/ Daniel M. today to talk about outlier dimensions.

bitter turtle Jul 24, 2023, 8:43 PM

#

bitter turtle should pbbly do tests with different underlying dataset sizes to see the relatio...

Pythia 160m layer 7

bitter turtle Jul 24, 2023, 8:46 PM

#

pallid current looking at the same data you had aidan i'm getting the sense that it probably ha...

10 epochs or 10 chunks? - but cool! We should start using the actual pile instead of 10k.

pallid current Jul 24, 2023, 8:48 PM

#

bitter turtle 10 epochs or 10 chunks? - but cool! We should start using the actual pile instea...

10 (actually 11) epochs over the pile10k i think, this is just repackaging the data from your runs

keen pivot Jul 24, 2023, 8:49 PM

#

bitter turtle should pbbly do tests with different underlying dataset sizes to see the relatio...

Oh, I like this metric. Only problem not included is the dictionary learning the identity (or unitary matrix) not showing up. But maybe that requires more data. I still need to do my part of showing a dictionary learning it at what sparsity & data amount

bitter turtle Jul 24, 2023, 8:50 PM

#

pallid current 10 (actually 11) epochs over the pile10k i think, this is just repackaging the ...

Wait, there are 11 chunks in that data, still not sure what you mean

#

If it's just the models from my run that's 11 chunks cool ok 👌

pallid current Jul 24, 2023, 8:51 PM

#

oh right is that one pile10k epoch?

bitter turtle Jul 24, 2023, 8:51 PM

#

keen pivot Oh, I like this metric. Only problem not included is the dictionary learning the...

Not sure what you mean; any unitary matrix is pretty nonsparse I think

bitter turtle Jul 24, 2023, 8:51 PM

#

pallid current oh right is that one pile10k epoch?

Yep

pallid current Jul 24, 2023, 8:52 PM

#

ok that's encouraging cos it looks like there's a fair way to go in terms of performance if we crank up the data

#

it would be brilliant if we could consistently associate a point on the sparsity/explained variance space with a level of interpretability

keen pivot Jul 24, 2023, 8:55 PM

#

bitter turtle Not sure what you mean; any unitary matrix is pretty nonsparse I think

I have past examples of above threshold-MCS across multiple l1 values and it’s like a U-shape. I interpret the top-left part of the U to a good disentangling of features (what we want) and the top-right to be the identity

bitter turtle Jul 24, 2023, 8:55 PM

#

pallid current it would be brilliant if we could consistently associate a point on the sparsity...

Also for sure; linear regression time

#

If you want to run autointerp on all-or-some-of-those that would be Cool

bitter turtle Jul 24, 2023, 8:57 PM

#

keen pivot I have past examples of above threshold-MCS across multiple l1 values and it’s l...

Ah, but unitary matrixes show up on the high-sparsity-number tail on this plot

pallid current Jul 24, 2023, 8:59 PM

#

bitter turtle If you want to run autointerp on all-or-some-of-those that would be Cool

i plan to but obv there's a load of them so i need to be thoughtful about how to distribute interpretation

keen pivot Jul 24, 2023, 8:59 PM

#

bitter turtle Ah, but unitary matrixes show up on the high-sparsity-number tail on this plot

Yes I believe this. And I expect interp to negatively correlate with unitary-ness.

pallid current Jul 24, 2023, 9:00 PM

#

i guess if we're doing regressions over it, it doesn't matter that much if the individual measurements for a dict are noisy

keen pivot Jul 24, 2023, 9:00 PM

#

Is there a way to measure unitaryness of our encoder matrix?

bitter turtle Jul 24, 2023, 9:03 PM

#

I don't think we need to.

#

I don't see anything particularly special about unitary matrixes vs other non-sparse dictionaries

#

Like, sure, it's probably more optimal to learn unitary matrices at lower L1 values but that's kind of just emergent. Like I don't see the causality going from unitary -> uninterpretable directions, it's more low sparsity requirements -> entangled features and low sparsity requirements -> unitary matrices.
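
For reference, the "unitaryness" question could be made concrete with something like the sketch below: the Frobenius distance between a Gram matrix and the identity. This is an assumed metric, not something used in the repo, and for an overcomplete dictionary the rows cannot all be orthonormal, so it only checks the smaller side of the matrix:

```python
import torch

def orthonormality_gap(w):
    """Frobenius distance between w's Gram matrix (smaller side) and the identity."""
    n, d = w.shape
    gram = w.T @ w if n >= d else w @ w.T
    eye = torch.eye(gram.shape[0], dtype=w.dtype, device=w.device)
    return torch.linalg.norm(gram - eye).item()

torch.manual_seed(0)
q, _ = torch.linalg.qr(torch.randn(16, 16))   # an orthogonal matrix
gap_orthogonal = orthonormality_gap(q)        # close to zero
gap_random = orthonormality_gap(torch.randn(16, 16))  # far from zero
```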

keen pivot Jul 24, 2023, 9:26 PM

#

bitter turtle Like, sure, it's probably more optimal to learn unitary matrices at lower L1 val...

More like I see the model diverge to a different solution at higher sparsity in a more abrupt way, and I think we can measure that. Maybe that’s unitary, but it’s good to check that abruptness

bitter turtle Jul 24, 2023, 9:28 PM

#

keen pivot More like I see the model diverge to a different solution at higher sparsity in ...

oh, that's just as the sparsity approaches the number of dictionary atoms so it changes behaviour; it probably is learning a kind-of-unitary matrix then, or something similar

pallid current Jul 24, 2023, 9:31 PM

#

i dont think we can meaningfully talk about unitary non-matrices when the number of features exceeds the activation dimension

#

and we know from the fact that the sparsity reaches those v large levels that its not just a unitary matrix + a load of empty rows

bitter turtle Jul 24, 2023, 9:32 PM

#

Could still plausibly be some rotation-ish-type-thing of that but I don't think it's a particularly good line of inquiry

pallid current Jul 24, 2023, 9:33 PM

#

rotation of zero would be zero, but yeah i'm also not that interested (edit haha fair)

bitter turtle Jul 24, 2023, 9:33 PM

#

Haha

#

Maybe we should actually look at annealing LR

pallid current Jul 24, 2023, 9:45 PM

#

bitter turtle Maybe we should actually look at annealing LR

how're you getting that from recent results?

bitter turtle Jul 24, 2023, 9:46 PM

#

pallid current weird plot but in those lines of dots what you can see is the sparsity / unexpla...

Thinking about how we can get max performance out of this; I think it's probably not that important

pallid current Jul 25, 2023, 3:30 AM

#

been working on getting the synthetic dataset for ablations done, havent had that much time today but getting there, off for a quick bday drink, will finish in the morn 🙂

bitter turtle Jul 25, 2023, 10:27 AM

#

@pallid current all dict sizes have similar numbers of neurons that fire at least once every 10k samples at a given sparsity level; this probably makes comparing autointerp between different dict sizes easier, but is also probably something we definitely don't want to be happening
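
A sketch of the count being described, assuming `codes` holds the dictionary activations for a batch of samples (10k in the message above). The helper name is hypothetical:

```python
import torch

def count_active_features(codes):
    """Count dictionary features that are nonzero on at least one sample."""
    return int((codes != 0).any(dim=0).sum())

# Toy check: 3 features, the last one never fires.
codes = torch.tensor([[0.0, 1.2, 0.0],
                      [0.5, 0.0, 0.0],
                      [0.0, 0.3, 0.0]])
n_active = count_active_features(codes)
```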

#

one sec let me loglog this for readability maybe

#

uh this is slightly weird

#

I feel like my data is noisy ill up it to 100k

#

okaaaay

#

that... did not change the y scaling at all, the higher sparsity dicts are just learning the same/lower numbers of features/zeroing more features

#

I feel like we may see some changes in these plots comparing tied & norm vs not

keen pivot Jul 25, 2023, 3:05 PM

#

I'm going to write some simple code to find outlier dimensions so we can ignore them, though I may need @bitter turtle 's help for integration (because I'd like to see variance explained vs sparsity).

Current thought:

1. find outlier dimensions
2. change mlp_width by # of outlier dims
3. Index by outlier dimensions when running data through

For variance/sparsity: the outlier dimensions will be 100% explained but account for a permanent +1 in sparsity for every outlier dim.

If this does improve things, we'd need to compare w/ leaving out random dimensions as opposed to outlier ones
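
The plan above could be sketched roughly as follows. Ranking dimensions by mean absolute activation is an assumption about what "outlier" means here, and both helper names are hypothetical:

```python
import torch

def find_outlier_dims(acts, n_outliers=2):
    """Indices of the dimensions with the largest mean absolute activation."""
    scores = acts.abs().mean(dim=0)
    return torch.topk(scores, k=n_outliers).indices.sort().values

def drop_dims(acts, dims):
    """Return activations with the given dimensions removed."""
    keep = torch.ones(acts.shape[1], dtype=torch.bool)
    keep[dims] = False
    return acts[:, keep]

torch.manual_seed(0)
acts = torch.randn(100, 8)
acts[:, 3] += 50.0   # plant two artificial outlier dimensions
acts[:, 6] -= 40.0
dims = find_outlier_dims(acts, n_outliers=2)
reduced = drop_dims(acts, dims)   # the dict would then train on width 8 - 2
```

For the random-dimension comparison, `dims` can be swapped for a random index tensor of the same length.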

keen pivot Jul 25, 2023, 3:44 PM

#

For Pythia_1.4b, there are several dictionary features that activate a large fraction of the time. Percentages of non-zero activations:

tensor([0.5632, 0.4849, 0.4426, 0.2947, 0.2258, 0.2044, 0.1533, 0.1482, 0.1231, 0.1150])
So the first two activate half the time.
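
A sketch of how such a list of percentages could be computed, assuming `codes` is a (samples x features) tensor of dictionary activations. The helper name is hypothetical:

```python
import torch

def top_activation_frequencies(codes, k=10):
    """Fraction of samples on which each feature is nonzero, top-k largest."""
    freq = (codes != 0).float().mean(dim=0)
    return torch.topk(freq, k=min(k, freq.numel())).values

codes = torch.tensor([[1.0, 0.0, 0.0],
                      [2.0, 0.5, 0.0],
                      [0.3, 0.0, 0.0],
                      [0.7, 0.1, 0.0]])
top = top_activation_frequencies(codes, k=2)
```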

keen pivot Jul 25, 2023, 5:44 PM

#

Replicating some of the "ablating outlier dimensions affects model performance a lot relative to other dimensions" results, I ablate the top-10 outlier dimensions (ie the residual stream dimensions w/ the highest activations) both on their own & cumulatively. The majority of perplexity diff is caused by ablating the first two dimensions.

Notably, ablating the first two dimensions together causes worse performance than ablating each individually, meaning they have overlapping mechanisms in the model (which mechanism, who knows).

#

What's important for dictionary learning is I can just keep the first 2 outlier dimensions (& hope those dimensions don't also do feature representation), do dictionary learning on the rest of the dimensions.

We can also see what happens if we do dictionary learning on other sets of outlier dims (e.g. just the top outlier, top-5, etc)

keen pivot Jul 25, 2023, 5:50 PM

#

pallid current weird plot but in those lines of dots what you can see is the sparsity / unexpla...

@pallid current, do you have this graph/sweep in your repo?

pallid current Jul 25, 2023, 5:52 PM

#

keen pivot @pallid current, do you have this graph/sweep in your repo?

it's in sparse_coding_aidan/hoagy_outs, they're the ones with _t on the end

#

script is same folder, frequency_plot_h.py

keen pivot Jul 25, 2023, 5:53 PM

#

pallid current it's in `sparse_coding_aidan/hoagy_outs`, they're the ones with `_t` on the end

Sorry, I meant the code to run it. I want to train dictionaries on the residual stream except for 2 outlier dimensions. I want that graph to compare against baseline.

#

Ah, I think it's frequency_plot.py

pallid current Jul 25, 2023, 5:54 PM

#

pallid current script is same folder, `frequency_plot_h.py`

^

bitter turtle Jul 25, 2023, 7:21 PM

#

keen pivot What's important for dictionary learning is I can just keep the first 2 outlier ...

@keen pivot I plan to look into ways of working around nonsparse data more generally, interested to see what you find with this!

keen pivot Jul 25, 2023, 7:52 PM

#

bitter turtle @keen pivot I plan to look into ways of working around nonsparse data ...

Nothing too great atm! Looks about normal at first glance.

#

This is Pythia-70m, but need to check against normal runs

bitter turtle Jul 25, 2023, 7:56 PM

#

Yeah wasn't expecting it to be too significant; could you put the normal runs when you do them?

keen pivot Jul 25, 2023, 8:43 PM

#

bitter turtle Yeah wasn't expecting it to be too significant; could you put the normal runs wh...

You ever get a:

UserWarning: There is a performance drop because we have not yet implemented the batching rule for aten::addcmul_.
error?
& how long does this run take?

bitter turtle Jul 25, 2023, 8:44 PM

#

not an error, just a vmap warning, you can suppress warnings with python -W ignore filename.py, also about 15m maybe?

keen pivot Jul 25, 2023, 9:09 PM

#

bitter turtle Yeah wasn't expecting it to be too significant; could you put the normal runs wh...

#

The "remove outlier dimensions" one is surprisingly much worse. Maybe a coding error on my part.

bitter turtle Jul 25, 2023, 9:23 PM

#

Are you still considering those dimensions when calculating unexplained variance?

#

like, how/where are you ablating them?

keen pivot Jul 25, 2023, 9:23 PM

#

I'm not considering them when calculating

#

I just remove those dimensions from the batch every time.

#

I'm unsure how much adding back in the outlier dimensions for unexp var. will improve, but the results look a bit dramatic

bitter turtle Jul 25, 2023, 9:26 PM

#

how are you removing those dimensions?

#

like, projecting the dimension to zero?

#

Could you send the code?

bitter turtle Jul 25, 2023, 9:43 PM

#

also @pallid current @keen pivot are you guys currently using sparse_coding_aidan for work? It would make sense to have your own (stable) copies of them if so

keen pivot Jul 25, 2023, 9:44 PM

#

bitter turtle also @pallid current @keen pivot are you guys currently using `s...

Agreed! Is this pushed anywhere?

keen pivot Jul 25, 2023, 9:45 PM

#

bitter turtle Could you send the code?

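```
outlier_dimensions = [111, 156]
sample = dataset[sample_idxs].to(torch.float32)
# keep every column except the outlier dimensions (index tensor on the same device as the data)
indices = torch.tensor([i for i in range(sample.shape[1]) if i not in outlier_dimensions], device=sample.device)
sample = torch.index_select(sample, 1, indices)
```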

bitter turtle Jul 25, 2023, 9:45 PM

#

as of just now yes

#

https://github.com/Baidicoot/sparse_coding

GitHub - Baidicoot/sparse_coding: Work on sparse coding, replicating and extending the sparse coding approach to taking transformer features out of superposition.

bitter turtle Jul 25, 2023, 9:47 PM

#

keen pivot ```outlier_dimensions = [111, 156] sample = dataset[sample_idxs].to(torc...

uhh
are you editing the models as well?
would be easier/make more sense just to zero them

#

probably equivalent

keen pivot Jul 25, 2023, 9:56 PM

#

#

They're near equivalent. The only difference is the shape of training across time (the left one is outliers out, and right is original)

keen pivot Jul 25, 2023, 9:58 PM

#

bitter turtle uhh are you editing the models as well? would be easier/make more sense just to ...

Yep, I edited the model as well.

bitter turtle Jul 25, 2023, 10:00 PM

#

keen pivot

oh, slight improvement though!

keen pivot Jul 25, 2023, 10:04 PM

#

bitter turtle oh, slight improvement though!

Ya, may be entirely explained by just getting the outlier features for free.

bitter turtle Jul 25, 2023, 10:06 PM

#

for free?

keen pivot Jul 25, 2023, 10:07 PM

#

bitter turtle for free?

Because I just added the outlier features back in, so it's just adding in an extra 2 0-dimensions which affects the .mean()
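
The size of that `.mean()` effect can be checked with a quick worked example. This is a sketch on random stand-in errors, assuming a residual width of 512 (Pythia-70m) and 2 outlier dimensions that contribute exactly zero error:

```python
import torch

torch.manual_seed(0)
d_model, n_outliers = 512, 2  # assumed: Pythia-70m residual width, 2 outlier dims

# Stand-in squared errors on the non-outlier dimensions.
err_kept = torch.rand(1000, d_model - n_outliers)
# Same errors with the outlier dimensions added back as exact zeros.
err_full = torch.cat([err_kept, torch.zeros(1000, n_outliers)], dim=1)

# Averaging over 512 dims instead of 510 scales the mean by 510/512.
ratio = (err_full.mean() / err_kept.mean()).item()
```

Under these assumptions the extra zeros shrink the mean by a factor of 510/512, i.e. by less than half a percent.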

bitter turtle Jul 25, 2023, 10:08 PM

#

doesn't look like enough

#

my hypothesis here is that the model dedicates a couple-or-more features to learning these outliers slightly noisily, and loses performance on them. more interested in training where you train the dict on not those directions

keen pivot Jul 25, 2023, 10:11 PM

#

Ah, well the performance gain isn't great. Of course you'd expect it to perform better initially because the "remove outlier" one perfectly reconstructs the outlier dimensions, while the other one is still learning to represent them.

keen pivot Jul 25, 2023, 10:11 PM

#

bitter turtle doesn't look like enough

Is this what you meant by enough? or enough on what metric?

bitter turtle Jul 25, 2023, 10:34 PM

#

to cause the change just by adding in the 0s

#

should I copy the directory for hoagy?

keen pivot Jul 25, 2023, 10:42 PM

#

bitter turtle to cause the change just by adding in the 0s

Can we verify this? so I can run the normal model & 0-out those dimensions in the output for both the datapoint & reconstruction & see if they're the same

bitter turtle Jul 25, 2023, 10:43 PM

#

sure

#

just eyeballing it

pallid current Jul 25, 2023, 10:43 PM

#

bitter turtle should I copy the directory for hoagy?

yo, i'll copy across the file i worked on in your directory, otherwise im working in sparse_coding_hoagy

bitter turtle Jul 25, 2023, 10:44 PM

#

cool

#

I moved to sparse_coding_aidan_new until @keen pivot can move then I'll move back

#

don't want to overwrite your files etc

keen pivot Jul 25, 2023, 10:46 PM

#

#

They look very similar

#

I'm done!

keen pivot Jul 25, 2023, 10:46 PM

#

bitter turtle I moved to `sparse_coding_aidan_new` until @keen pivot can move then I...

Thanks & sorry!

bitter turtle Jul 25, 2023, 10:52 PM

#

np

bitter turtle Jul 25, 2023, 10:52 PM

#

keen pivot I'm done!

done with moving or?

keen pivot Jul 25, 2023, 10:58 PM

#

bitter turtle done with moving or?

Yep! Or really I don't need anything from what I've run

bitter turtle Jul 25, 2023, 11:01 PM

#

oh ok cool ill delete the old folder

pallid current Jul 26, 2023, 12:19 AM

#

think i've got the code in place to do the ablation tests but currently being blocked by getting super bad outputs from the simulation which is odd

bitter turtle Jul 26, 2023, 12:20 AM

#

oh cool! good luck with that that sounds awful to debug

#

doing some runs with sparse activations + symmetric noise to see if I can get anywhere close to replicating the weirdness seen here; if not I'll scale down my ambitiousness and test for fragility etc etc etc
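
A sketch of what "sparse activations + symmetric noise" toy data could look like. The generator, its defaults, and the use of Gaussian noise are all assumptions for illustration, not the repo's actual toy-data code:

```python
import torch

def make_toy_data(n_samples, n_features, dim, p_active=0.05, noise_std=0.0, seed=0):
    """Sparse nonnegative combinations of random unit features plus Gaussian noise."""
    g = torch.Generator().manual_seed(seed)
    # Ground-truth feature directions, normalised to unit length.
    features = torch.randn(n_features, dim, generator=g)
    features = features / features.norm(dim=1, keepdim=True)
    # Each feature is active independently with probability p_active.
    mask = torch.rand(n_samples, n_features, generator=g) < p_active
    coeffs = torch.rand(n_samples, n_features, generator=g) * mask
    # Zero-mean (symmetric) noise; noise_std=0.0 gives the no-noise case.
    noise = noise_std * torch.randn(n_samples, dim, generator=g)
    return coeffs @ features + noise, coeffs, features

clean, coeffs, features = make_toy_data(200, 32, 8, noise_std=0.0)
noisy, _, _ = make_toy_data(200, 32, 8, noise_std=0.1)
```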

pallid current Jul 26, 2023, 12:27 AM

#

toy data?

bitter turtle Jul 26, 2023, 12:52 AM

#

yes

#

promising just need to find scale hopefully but im off to bed

pallid current Jul 26, 2023, 12:53 AM

#

pallid current think i've got the code in place to do the ablation tests but current being bloc...

getting the weirder result that the bad simulated data is the same as the kind of responses we get from the standard interpret.py runs, which successfully pick out correct explanations

bitter turtle Jul 26, 2023, 12:54 AM

#

not sure wym

pallid current Jul 26, 2023, 12:54 AM

#

like text of [john raised $ 6 million], activation [0, 0, 10, 0, 0], gpt4 interp: "currency symbols, esp $", gpt3.5 simulated data: [0,2,9,9,2]

#

gets a score of like 0.3

#

so the explanation is basically perfect, but the simulated data is pretty crap (actually worse than impression in this example)
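
To see how hard a correlation-style score punishes that kind of smearing, here is the illustrative example from above run through a plain Pearson correlation. That the real score is a correlation between true and simulated activations is an assumption; the scoring inside interpret.py isn't shown in this thread:

```python
import torch

real = torch.tensor([0.0, 0.0, 10.0, 0.0, 0.0])       # true activations
simulated = torch.tensor([0.0, 2.0, 9.0, 9.0, 2.0])   # simulated activations

# Pearson correlation between the two activation sequences.
score = torch.corrcoef(torch.stack([real, simulated]))[0, 1].item()
```

Even with the peak token predicted almost exactly, the spurious 9 on the next token pulls this example down to roughly 0.6.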

#

but it's not like there's a load of free params that go into the simulate function, its super simple

#

also seems like it should be trivial for 3.5

#

hahaha i got such a fright when i tried to look at the prompt as it's built internally because it looks dreadful e.g. six\tunknown\n-\tunknown\nyear\tunknown\n deal\tunknown\n but\tunknown\n Str\tunknown\nud\tu but they just add loads of unknowns into the prompt and then get the logits at every position, it's super weird

#

but as far as i can tell it's creating the prompt correctly

#

maybe gpt-3.5 is just quite crap at the task??

#

annoyingly i can't switch to gpt-4 super easily, the instruction models use a different endpoint

keen pivot Jul 26, 2023, 1:37 AM

#

pallid current especially surprising that there's little obvious benefit to the larger dicts

Larger dicts do require more training to show a larger benefit, but the biggest drop in reconstruction loss is definitely caused by L1.

keen pivot Jul 26, 2023, 1:41 AM

#

pallid current getting the weirder result that the bad simulated data is the same as the kind o...

What’s the simulation data supposed to be doing? Or the broader context of the experiment?

pallid current Jul 26, 2023, 6:10 AM

#

(whoops forgot to hit enter) this is for an experiment we'd like to have to verify that the model is in fact using the directions we've found to compute those features. want to check that

1. the feature is more clearly separated in the residual stream after the MLP layer with the feature than before
2. this is no longer true if you ablate our feature

#

and for this we need a dataset of 'is the feature on' which is basically what the autointerp stuff already has
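
The "ablate our feature" step could be sketched as projecting the learned direction out of the activations. This is one possible form of ablation, assumed here for illustration; the helper name is hypothetical:

```python
import torch

def ablate_direction(acts, direction):
    """Remove the component of each activation along a feature direction."""
    unit = direction / direction.norm()
    return acts - (acts @ unit).unsqueeze(-1) * unit

torch.manual_seed(0)
acts = torch.randn(64, 16)
direction = torch.randn(16)
ablated = ablate_direction(acts, direction)

# After ablation nothing is left along the feature direction.
residual_along = (ablated @ (direction / direction.norm())).abs().max().item()
```

The 'is the feature on' labels would then be used to compare separability before and after this projection.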

bitter turtle Jul 26, 2023, 8:27 AM

#

Is this the same thing as before?

bitter turtle Jul 26, 2023, 10:08 AM

#

had some ideas about using residuals/skip-connections in the multi-layer autoencoder, found out that it is basically this http://yann.lecun.com/exdb/publis/pdf/gregor-icml-10.pdf, testing now, looks GREAT so far, some training instability, but I think I just need to vary the lr for the encoder + dict separately (training both with the same LR atm, probably shouldn't, or should train both for a while then freeze dict and finish training the encoder)
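
For readers who haven't seen the linked paper, a minimal sketch of a LISTA-style encoder (learned ISTA, Gregor & LeCun 2010) follows. It is not the architecture from the runs above; the class name, step count and threshold initialisation are assumptions:

```python
import torch
import torch.nn as nn

class LISTAEncoder(nn.Module):
    """Unrolled ISTA: z <- shrink(W_e x + S z), with W_e, S and theta learned."""

    def __init__(self, input_dim, n_atoms, n_steps=3):
        super().__init__()
        self.W_e = nn.Linear(input_dim, n_atoms, bias=False)  # input filter
        self.S = nn.Linear(n_atoms, n_atoms, bias=False)      # code-to-code feedback
        self.theta = nn.Parameter(torch.full((n_atoms,), 0.1))  # per-atom thresholds
        self.n_steps = n_steps

    def shrink(self, v):
        # Soft-thresholding, which is what produces exact zeros in the code.
        return torch.sign(v) * torch.relu(v.abs() - self.theta)

    def forward(self, x):
        b = self.W_e(x)
        z = self.shrink(b)
        for _ in range(self.n_steps):
            z = self.shrink(b + self.S(z))
        return z

torch.manual_seed(0)
encoder = LISTAEncoder(input_dim=16, n_atoms=64)
z = encoder(torch.randn(5, 16))
z.sum().backward()
```

Giving the encoder and the dictionary separate learning rates, as suggested above, would be done with separate optimizer parameter groups.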

bitter turtle Jul 26, 2023, 11:06 AM

#

@pallid current are you running interpret.py? there is ~2GB on GPU0 and I presume that is you

bitter turtle Jul 26, 2023, 12:13 PM

#

bitter turtle had some ideas about using residuals/skip-connections in the multi-layer autoenc...

okay this has like impossibly good performance maybe? running benchmarks soon

#

tf

#

forgot to norm dict 😅

bitter turtle Jul 26, 2023, 1:05 PM

#

this is on pythia-70m layer 2, looks much the same

#

(except for the insanely low-sparsity case)

#

lemmie get a comparison real quick

bitter turtle Jul 26, 2023, 1:25 PM

#

cool, still significantly worse than normal, but at least it is actually training now, unlike before

#

I feel like this is a slight step forward

keen pivot Jul 26, 2023, 2:27 PM

#

bitter turtle I feel like this is a slight step forward

Could you elaborate?

bitter turtle Jul 26, 2023, 2:27 PM

#

well, we couldn't get it to converge at all before, and it is now

#

ideally we want to have a number of different approaches we can try out to find the best one, and this one is pretty close in perf to our best one

pallid current Jul 26, 2023, 3:58 PM

#

bitter turtle @pallid current are you running `interpret.py`? there is ~2GB on GPU0 and ...

errr might have left a pdb or python interpreter on, sorry

bitter turtle Jul 26, 2023, 4:03 PM

#

we have the reading group thingy today right? in about an hour? @pallid current @keen pivot

pallid current Jul 26, 2023, 4:56 PM

#

@bitter turtle have you looked at what the sparsity / unexplained var graph looks like for non noisy toy data? would be good to be able to show the difference between that and the pythia results, especially if there's a very clear difference

bitter turtle Jul 26, 2023, 4:59 PM

#

bitter turtle cool, still significantly worse than normal, but at least it is actually trainin...

yep, was planning to get around to it, got distracted by this

pallid current Jul 26, 2023, 4:59 PM

#

cool nw

bitter turtle Jul 26, 2023, 5:41 PM

#

can we also do a restart at some point soon, I think there's like 2GB orphaned data just chilling on GPU0

keen pivot Jul 26, 2023, 6:16 PM

#

Do you suppose models across time have more superposition? It's gotta learn it sometime, so maybe we could measure it somehow w/ dicts or maybe a tool more suited for it?

#

@bitter turtle , What's the experiment w/ adding noise? (or is this the de-noising encoder?)

keen pivot Jul 26, 2023, 6:17 PM

#

bitter turtle can we also do a restart at some point soon, I think there's like 2GB orphaned d...

This looks settled now, right?

bitter turtle Jul 26, 2023, 6:17 PM

#

i can check in ~5m wasn't last time

bitter turtle Jul 26, 2023, 6:17 PM

#

bitter turtle promising just need to find scale hopefully but im off to bed

@keen pivot

pallid current Jul 26, 2023, 6:26 PM

#

top looks pretty empty now, i did leave a pdb open overnight so i think it was that sozzz

bitter turtle Jul 26, 2023, 6:27 PM

#

yep looks good now!

#

nw

keen pivot Jul 26, 2023, 6:59 PM

#

bitter turtle doing some runs with sparse activations + symmetric noise to see if I can get an...

What is "the weirdness seen here"?

bitter turtle Jul 26, 2023, 7:12 PM

#

keen pivot What is "the weirdness seen here"?

like, we should expect dictionary learning approaches to be better/there to be a range of l1 values converging on the same solution under the assumption of the activations being well-described by a sparse basis

keen pivot Jul 26, 2023, 7:13 PM

#

bitter turtle like, we should expect dictionary learning approaches to be better/there to be a...

Doesn't MCS capture the "same solution" more accurately?

#

Although they have more nonzero activations (higher sparsity), this could just be allowing more noise.

#

Does that make sense?

bitter turtle Jul 26, 2023, 7:15 PM

#

yes, for sure, but we don't have access to the ground truth so we can't compare to that

#

oh, sorry, I see what you mean

bitter turtle Jul 26, 2023, 7:16 PM

#

keen pivot Although they have more nonzero activations (higher sparsity), this could just b...

not sure about this

bitter turtle Jul 26, 2023, 7:16 PM

#

keen pivot Doesn't MCS capture the "same solution" more accurately?

agree with this

keen pivot Jul 26, 2023, 7:16 PM

#

bitter turtle not sure about this

Agreed. This (ie higher sparsity is just more low-activating noise, not a significant difference in features found) is just one hypothesis. The MCS across different sparsities may better capture this.

bitter turtle Jul 26, 2023, 7:18 PM

#

keen pivot Doesn't MCS capture the "same solution" more accurately?

I still think it's weird. Also, plausibly, there are a bunch of possible sparse decompositions which are equally powerful in terms of describing the data, especially under the assumption of noise

#

I think that we can't really say until I compare to the curve for truly-sparse synthetic data

#

if it turns out that the same curve exists for that, the metric probably can't be used for this kind of analysis

#

I do think that the metric is useful from a pragmatic perspective; in the absence of one true ground truth, more sparse decompositions are maybe intrinsically valuable lenses to view activations through

keen pivot Jul 26, 2023, 7:19 PM

#

bitter turtle I still think it's weird. Also, plausibly, there are a bunch of possible sparse ...

So you're saying two dictionaries could have similar sparse decompositions & reconstruction loss, but have low-MCS relative to each other?

bitter turtle Jul 26, 2023, 7:21 PM

#

yep maybe

keen pivot Jul 26, 2023, 7:22 PM

#

You could have the sparsity/variance explained graph, but also track MCS across dictionaries. If models do have high-MCS w/ nearby sparsities, then that's good evidence for them converging on the same decomposition.
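
A sketch of MCS between two dictionaries, assuming it means: for each atom in one dict, the maximum cosine similarity to any atom in the other, averaged over atoms. The function name and the (atoms x dim) layout are assumptions:

```python
import torch

def mean_max_cosine_similarity(dict_a, dict_b):
    """For each atom in dict_a, max cosine similarity to any atom in dict_b."""
    a = dict_a / dict_a.norm(dim=1, keepdim=True)
    b = dict_b / dict_b.norm(dim=1, keepdim=True)
    cos = a @ b.T                          # (atoms_a, atoms_b) cosine similarities
    max_per_atom = cos.max(dim=1).values
    return max_per_atom.mean().item(), max_per_atom

torch.manual_seed(0)
d = torch.randn(32, 16)
shuffled = d[torch.randperm(32)]           # same atoms, different order
mcs_same, _ = mean_max_cosine_similarity(d, shuffled)
mcs_random, per_atom = mean_max_cosine_similarity(d, torch.randn(32, 16))
```

The per-atom values are what would go into an MCS histogram; tracking the mean across neighbouring L1 values gives the convergence check suggested above.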

bitter turtle Jul 26, 2023, 7:23 PM

#

good idea

keen pivot Jul 26, 2023, 7:23 PM

#

Thanks!:)

#

I'm currently working on dictionaries across different layers. I could code up the MCS across layers one w/ your repo tomorrow.

#

I think I'd also want to train models on more data (like 30 chunks w/ pile?), but I think y'all had experiments showing more training data didn't really affect the sparsity/variance explained?, but that's different than MCS.

bitter turtle Jul 26, 2023, 7:26 PM

#

latest results from using the more complex, multi-layer denoiser; looks almost slightly better than dictionaries? number is #chunks

#

changed initialization + switched to GELU

#

will do no-noise test w/ synthetic data + normal methods next

keen pivot Jul 26, 2023, 7:27 PM

#

more complex one is on left?

bitter turtle Jul 26, 2023, 7:27 PM

#

Yep

#

4x dict ratio, haven't tried larger ratios yet

keen pivot Jul 26, 2023, 7:28 PM

#

bitter turtle Yep

Oh, sorry, I forgot discord might change the order of images

More complex one has highest sparsity ~400 on x-axis?

bitter turtle Jul 26, 2023, 7:36 PM

#

More complex one is the one with 8 plots

#

Which dicts are those?

keen pivot Jul 26, 2023, 7:39 PM

#

This is just MCS for Pythia-70m residual across multiple layers

bitter turtle Jul 26, 2023, 7:39 PM

#

Oh, sickkkk: what do the same layer ones look like?

keen pivot Jul 26, 2023, 7:40 PM

#

Ah, it'd be a perfect match? I don't have two same-sized trained dicts of diff-initializations to compare against

bitter turtle Jul 26, 2023, 7:40 PM

#

I think that's a really important baseline to explore

#

Have you tried the testbed thingy yet?

keen pivot Jul 26, 2023, 7:40 PM

#

bitter turtle I think that's a really important baseline to explore

So true

keen pivot Jul 26, 2023, 7:41 PM

#

bitter turtle Have you tried the testbed thingy yet?

What is that?

bitter turtle Jul 26, 2023, 7:41 PM

#

The standard_metrics.py thing; using a common interface for many different dict types

keen pivot Jul 26, 2023, 7:42 PM

#

bitter turtle The `standard_metrics.py` thing; using a common interface for many different dic...

Nope!

#

I've seen that file, but currently still in my old repo since these dicts are pickles

bitter turtle Jul 26, 2023, 7:43 PM

#

bitter turtle will do no-noise test w/ synthetic data + normal methods next

Oh mb I forgot I'm away this weekend from tomorrow, could one of you two look at doing this if you want it soon? Shouldn't be too hard to edit my big_sweep_experiments.py, I'll push it in a min

bitter turtle Jul 26, 2023, 7:44 PM

#

keen pivot I've seen that file, but currently still in my old repo since these dicts are pi...

If you can, get your training code to use the new one maybe it'd be nice to have common comparisons; I've got most of the way to MCS hists in there so far but haven't yet

keen pivot Jul 26, 2023, 7:55 PM

#

bitter turtle If you can, get your training code to use the new one maybe it'd be nice to have...

Like comparing the MCS hists are the same, and others for the sake of making sure our code is correct?

keen pivot Jul 26, 2023, 7:55 PM

#

bitter turtle Oh mb I forgot I'm away this weekend from tomorrow, could one of you two look at...

Wait, it's like Wednesday. Are you leaving tomorrow?

bitter turtle Jul 26, 2023, 7:57 PM

#

yep

#

Back on sun

keen pivot Jul 26, 2023, 7:59 PM

#

Looks like Hoagy thumbs-up reacted to it, so I'm assuming he's got it handled!

pallid current Jul 26, 2023, 8:01 PM

#

bitter turtle latest results from using the more complex, multi-layer denoiser; looks almost s...

huh that's super interesting, so at 7 chunks, even the highest level of l1 is using 100 feats per token?

pallid current Jul 26, 2023, 8:02 PM

#

keen pivot Looks like Hoagy thumbs-up reacted to it, so I'm assuming he's got it handled!

yeah if its just running existing code but with noise param set to 0, seems chill

keen pivot Jul 26, 2023, 8:04 PM

#

pallid current huh that's super interesting, so at 7 chunks, even the highest level of l1 is us...

How are you getting l1-level from this? L1 is encoded as opacity of color, right?

keen pivot Jul 26, 2023, 8:04 PM

#

pallid current yeah if its just running existing code but with noise param set to 0, seems chil...

Great, thanks for taking care of this!:)

pallid current Jul 26, 2023, 8:10 PM

#

keen pivot How are you getting l1-level from this? L1 is encoded as opacity of color, right...

yeah just tracking the lines to the point of highest L1 /lowest opacity,

keen pivot Jul 26, 2023, 8:18 PM

#

pallid current yeah just tracking the lines to the point of highest L1 /lowest opacity,

Oh, ya. I see it now. Ya that's weird

#

Would be interested to compare the two when training for 10x longer

bitter turtle Jul 26, 2023, 8:26 PM

#

pallid current yeah if its just running existing code but with noise param set to 0, seems chil...

You might need to change stuff around actually; I can set it up, you can run + debug ig

bitter turtle Jul 26, 2023, 8:26 PM

#

pallid current huh that's super interesting, so at 7 chunks, even the highest level of l1 is us...

I changed the L1 range; all others got entirely dead

#

Whoops

bitter turtle Jul 26, 2023, 8:27 PM

#

bitter turtle I changed the L1 range all others got entirely dead

This bit is weird tho

keen pivot Jul 26, 2023, 8:39 PM

#

For MCS across dicts of different sizes (as a baseline that's better, but not as good as dicts of same size/diff init). Notably, layer 5 sucks. Also, layer 2 was trained differently than the others, but I don't have the hyperparams or amount of training data on hand.

pallid current Jul 26, 2023, 8:47 PM

#

is layer 5 just before it goes into the unembedding matrix?

keen pivot Jul 26, 2023, 8:47 PM

#

Yep

pallid current Jul 26, 2023, 8:48 PM

#

that's very interesting, especially that it's not bimodal

#

someone asked me about this recently, like how confident are we in the model where the residual stream is written to and read from, while the mlp calculates updates to the residual stream, but doesn't really hold much of the information

#

i think the success of the logit/tuned lens is the main point of evidence for it but i couldn't think of much other empirical evidence

#

i think if that were true and sparse coding was working perfectly we would expect bimodality

bitter turtle Jul 26, 2023, 8:51 PM

#

not sure what it would mean for the mlp activations to hold information in a way that would meaningfully invalidate that model

#

the residual stream is the data moving between layers

bitter turtle Jul 26, 2023, 8:52 PM

#

pallid current i think if that were true and sparse coding was working perfectly we would expec...

Throughout the model?

pallid current Jul 26, 2023, 8:54 PM

#

so like, when you calculate resid4 = resid3 + mlp3 + attn3, the assumption is that the mlp does not contain most of the information in resid3, its instead calculating smaller volumes of new information. and therefore, for information to persist between layers, it must be present in resid4 in the same form that it's present in resid3

#

whereas, if mlp3 contained most of the info, there would be nothing preventing resid4 from having a very different representation to resid3

bitter turtle Jul 26, 2023, 8:55 PM

#

right, volumes of information, I see.

#

I feel like the distribution of bimodality across layers is very dependent on the way the gradients flow through and I can't model that that well; I think there was a paper looking at something ~this

#

You should ask the tuned lens people

#

how many reinitialisations have you done @keen pivot

#

I've asked in #interpretability-general

keen pivot Jul 26, 2023, 9:12 PM

#

pallid current so like, when you calculate resid4 = resid3 + mlp3 + attn3, the assumption is th...

My brain is sliding off this. What are the two different hypotheses being compared here & the evidence for either (and bimodality of what? MCS across layers?). My attempt:
H1: residual stream is the memory of the model that is written to & read from by e.g. MLPs. Each MLP only does a small change to the residual stream, so the representation should be mostly the same across layers.
H2: Most of the information goes through MLPs, so we shouldn't expect similar representations across layers
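A toy way to see what H1 and H2 would predict, sketched in plain Python. The combined `mlp3 + attn3` output is modelled as random noise of a chosen scale, which is purely an assumption for illustration and says nothing about real activations; `next_resid`, the two scales and the 512-dim size are made up here.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def next_resid(resid, update_scale, rng):
    # resid_{l+1} = resid_l + (mlp_l + attn_l); the block output is
    # stood in for by Gaussian noise with the given scale (assumption).
    return [r + update_scale * rng.gauss(0.0, 1.0) for r in resid]


rng = random.Random(0)
resid3 = [rng.gauss(0.0, 1.0) for _ in range(512)]

# H1: each block writes a small update, so resid4 stays close to resid3.
sim_small_update = cosine(resid3, next_resid(resid3, 0.1, rng))
# H2: the block output dominates, so resid4 can look very different.
sim_large_update = cosine(resid3, next_resid(resid3, 3.0, rng))

print(f"small update: {sim_small_update:.3f}, large update: {sim_large_update:.3f}")
```

Under H1 you'd expect features learned on adjacent layers to mostly match up (high MCS across layers); under H2 there's no particular reason they would.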

#

Oh, and for the record, I also worked on tuned lens

bitter turtle Jul 26, 2023, 9:14 PM

#

oh sick

#

oh, logan smith

#

is that you?

bitter turtle Jul 26, 2023, 9:15 PM

#

keen pivot Oh, and for the record, I also worked on tuned lens

totally did not realise, sorry 😅

keen pivot Jul 26, 2023, 9:17 PM

#

bitter turtle totally did not realise, sorry 😅

It's what I get for going by different names, haha

#

Okay, the across dict stuff looks like I need to do ACDC ablation stuff:

1. Find a cool feature in layer 5 (check)
2. Ablate all features one-at-a-time in layer 4 & sort by drop in feature activation in 5 (could also ablate all features, and then restore one-at-a-time)
3. Investigate those features found
4. Repeat for features found in layer 4.
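The ablation step above can be sketched on a toy stand-in. `run_downstream` is a hypothetical callable meaning "run the rest of the model from these layer-4 feature activations and read off the one layer-5 feature"; here it is faked with a ReLU of a weighted sum, and the weights and activations are invented for the example.

```python
def layer5_feature(layer4_acts, weights):
    # Toy stand-in for the real forward pass (assumption, not the real model).
    return max(0.0, sum(w * a for w, a in zip(weights, layer4_acts)))


def rank_by_ablation(layer4_acts, run_downstream):
    """Zero each layer-4 feature in turn and sort by drop in the layer-5 feature."""
    baseline = run_downstream(layer4_acts)
    drops = []
    for i in range(len(layer4_acts)):
        ablated = list(layer4_acts)
        ablated[i] = 0.0  # ablate one feature at a time
        drops.append((baseline - run_downstream(ablated), i))
    drops.sort(reverse=True)  # biggest drop first
    return baseline, [(i, drop) for drop, i in drops]


weights = [0.0, 2.0, -1.0, 0.5]
acts = [1.0, 1.5, 0.5, 2.0]
baseline, ranking = rank_by_ablation(acts, lambda a: layer5_feature(a, weights))
for index, drop in ranking:
    print(f"layer-4 feature {index}: drop {drop:+.2f}")
```

A negative drop means the layer-5 feature went up when that layer-4 feature was removed, i.e. it was suppressing it.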

bitter turtle Jul 26, 2023, 9:29 PM

#

please do that that would be fucking sick

#

as in, would look amazing on a paper

keen pivot Jul 26, 2023, 9:30 PM

#

I should be able to do in like 4-5 hours, which will be a tomorrow thing. I'm about to go watch barbie though.

bitter turtle Jul 26, 2023, 9:30 PM

#

it's v good enjoy

#

@pallid current ran the test on no-noise, all l1 values (tested over the normal range here, same as the last one) converge to low-sparsity solutions (btw the sparseness measure I built for wandb is broken atm) as we predicted. falloff for lower-sparsity solutions is similar

#

this is Quite Weird.

#

like, the really low l1 ones still converge to sparse solutions, I think this is maybe just a search-not-wide-enough thing

pallid current Jul 26, 2023, 9:44 PM

#

riiight so in that case the sparsity at about 7.5 is like the true amount?

bitter turtle Jul 26, 2023, 9:44 PM

#

still, I think it is ~what we were expecting

bitter turtle Jul 26, 2023, 9:44 PM

#

pallid current riiight so in that case the sparsity at about 7.5 is like the true amount?

10 but yeah

pallid current Jul 26, 2023, 9:44 PM

#

ok so yeah seeing super big differences in the role of noise

bitter turtle Jul 26, 2023, 9:44 PM

#

phew

#

this is progress then

#

I had an idea for avoiding noise; instead of penalising l2-norm of residuals, we should penalise cross-entropy against a normal distribution with a learned covariance matrix (and also probably penalise the size of the covariance matrix)

#

this might probably turn out to be equivalent, but who knows. probably people do, but I'm not them

#

I expect this to be equiv

pallid current Jul 26, 2023, 9:49 PM

#

hmmm so the only loss signal would be from our ability to fit to this learned cov matrix?

#

so basically at that point the learned cov matrix is assumed to encode all of the important information?

bitter turtle Jul 26, 2023, 9:50 PM

#

yeah, this is assuming that 'that which is not sparse is ~normally distributed'

pallid current Jul 26, 2023, 9:50 PM

#

i feel like that would throw away most of the important info, because it wouldn't be able to represent spikes in the distributions properly. like openai's approach is basically to take maximum distance away from the normal

#

and i think you'd still want to do that even if you had a learned cov matrix

#

tho i guess that's the diff

bitter turtle Jul 26, 2023, 9:51 PM

#

oh, so, we learn a sparse dict, but replace the 'minimise residuals' thing with 'make the residuals fit a normal distribution with low variance'

pallid current Jul 26, 2023, 9:52 PM

#

riiiiiight sorry i getcha

#

seems rather convoluted but i guess it could work. for me the question is, even if this is what we'd expect to see in practice, is there a reason that we think that minimizing l2 norm is the wrong thing to aim for?

#

like there are cases where fitting the normal dist is wrong, because that variation can in fact be captured

#

whereas i'm struggling to picture the case when performance would get worse by trying your best to minimize l2, even if some noise is irreducible

bitter turtle Jul 26, 2023, 9:56 PM

#

pallid current seems rather convoluted but it guess it could work. for me the question is, even...

I'm skeptical because like linreg works with minimising squares, and that works with normal noise, so why shouldn't this.

#

yeah pretty much agree

pallid current Jul 26, 2023, 9:57 PM

#

interesting idea tho, shades of VAE about it which also made me go ?? at first but works

bitter turtle Jul 26, 2023, 9:59 PM

#

desmos crushes my dreams once again

#

this is literally quadratic

#

I guess this should be expected, linreg people know what they're doing
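For reference, a short derivation of why the Gaussian loss comes out quadratic. The desmos plot isn't in the log, so this is a reconstruction of the standard argument rather than what was actually plotted; $r = x - \hat{x}$ is the residual, $d$ its dimension and $N$ the batch size (notation introduced here).

```latex
% Negative log-likelihood of a residual r under N(0, \Sigma):
-\log p(r) = \tfrac{1}{2}\, r^\top \Sigma^{-1} r
           + \tfrac{1}{2} \log\det\Sigma
           + \tfrac{d}{2}\log 2\pi

% Isotropic case \Sigma = \sigma^2 I:
-\log p(r) = \frac{\lVert r \rVert_2^2}{2\sigma^2} + d\log\sigma + \text{const}

% Optimising \sigma over a batch r_1, \dots, r_N:
\hat\sigma^2 = \frac{1}{Nd}\sum_{i=1}^{N} \lVert r_i \rVert_2^2
\quad\Longrightarrow\quad
\sum_{i=1}^{N} -\log p(r_i) = \frac{Nd}{2}\bigl(1 + \log\hat\sigma^2\bigr) + \text{const}
```

For fixed $\sigma$ this is just a rescaled squared l2 norm, and with a learned $\sigma$ the loss is a monotone function of the mean squared residual, so the minimiser over the dictionary is the same. A full learned covariance turns it into a Mahalanobis distance, which is still quadratic in $r$ but reweights directions.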

pallid current Jul 26, 2023, 10:01 PM

#

linreg mafia undefeated

bitter turtle Jul 26, 2023, 10:01 PM

#

for real

#

I could literally have done this in my head im dumb

bitter turtle Jul 26, 2023, 10:04 PM

#

pallid current interesting idea tho, shades of VAE about it which also made me go ?? at first b...

might be something in the VAE thing maybe

#

might be slightly more stable to train/converge better otherwise no ideas atm

bitter turtle Jul 26, 2023, 10:08 PM

#

bitter turtle might be something in the VAE thing _maybe_

actually probably not

bitter turtle Jul 26, 2023, 10:40 PM

#

tied vs untied complex multi-layer denoiser; seems about the same perf

pallid current Jul 26, 2023, 11:00 PM

#

bitter turtle tied vs untied complex multi-layer denoiser; seems about the same perf

seems like that kind of pattern is quite robust.

bitter turtle Jul 26, 2023, 11:02 PM

#

yep

pallid current Jul 26, 2023, 11:03 PM

#

btw got it working so i can run interp over a gigantic .pt of learned dicts

#

still the q of exactly what to run it over

bitter turtle Jul 26, 2023, 11:03 PM

#

awesome

#

tied dicts are a bit weird on this ???

pallid current Jul 26, 2023, 11:04 PM

#

?? what's the diff i thought the last graph was also tied

#

looks buggy

bitter turtle Jul 26, 2023, 11:05 PM

#

pallid current ?? what's the diff i thought the last graph was also tied

the last one was with a multi-layer encoder

#

I got that converging correctly this morning, it gets slightly better performance than normal at the cost of training speed

bitter turtle Jul 26, 2023, 11:06 PM

#

bitter turtle the last one was with a multi-layer encoder

specifically, a 3-layer one with residual skip connections; turns out residual connections was all you needed to get it to converge

#

fixed the bug

#

(wasn't telling the dict to be normed - again - when I was saving them) @pallid current

#

meant I lost first batch checkpoint

#

cool, that looks to be slightly better than untied!

bitter turtle Jul 26, 2023, 11:32 PM

#

pushed current code, off for the weekend

pallid current Jul 27, 2023, 12:54 AM

#

have fun!

pallid current Jul 27, 2023, 12:55 AM

#

bitter turtle the last one was with a multi-layer encoder

oh i misunderstood the graphs above, but then how can a multilayer encoder be tied?

pallid current Jul 27, 2023, 5:43 AM

#

ugh getting cuda.is_available = False again :/

#

no idea what's triggering it

pallid current Jul 27, 2023, 6:41 AM

#

will restart in the morning if it's not magically fixed and have messaged curtis

#

feel free to restart if you need logan

bitter turtle Jul 27, 2023, 6:45 AM

#

pallid current oh i misunderstood the graphs above, but then how can a multilayer encoder be ti...

So, it converges only with skip connections, was just messing around with different configurations and it seems to converge best (utterly untested) when it's just embedding linear transformation -> denoising layers with skip connections -> bias + ReLU rather than embedding linear transformation -> denoising -> another linear map -> bias + ReLU, so I slapped W_T as the first linear map
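A rough sketch of the encoder shape described above: tied first linear map, then denoising layers with skip connections, then bias + ReLU. What actually goes inside each denoising layer isn't stated in the chat, so the "ReLU of a square linear map, added back onto the input" below is an assumption, and `SkipDenoisingEncoder` is a made-up name. Plain Python lists are used so it runs without torch.

```python
import random


def matvec(matrix, vec):
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def relu(vec):
    return [max(0.0, v) for v in vec]


class SkipDenoisingEncoder:
    def __init__(self, dictionary, n_layers=3, init_scale=0.0, seed=0):
        rng = random.Random(seed)
        n_feats = len(dictionary)
        # Rows are feature directions; the same matrix encodes and decodes (tied).
        self.dictionary = dictionary
        # Square "denoising" layers acting in feature space (assumed form).
        self.layers = [
            [[init_scale * rng.gauss(0.0, 1.0) for _ in range(n_feats)]
             for _ in range(n_feats)]
            for _ in range(n_layers)
        ]
        self.bias = [0.0] * n_feats

    def encode(self, x):
        h = matvec(self.dictionary, x)  # tied first linear map (W^T x)
        for weights in self.layers:
            update = relu(matvec(weights, h))
            h = [a + b for a, b in zip(h, update)]  # residual skip connection
        return relu([a + b for a, b in zip(h, self.bias)])  # bias + ReLU

    def decode(self, code):
        dim = len(self.dictionary[0])
        return [sum(c * row[j] for c, row in zip(code, self.dictionary))
                for j in range(dim)]


dictionary = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
x = [2.0, -1.0]
enc = SkipDenoisingEncoder(dictionary, n_layers=3, init_scale=0.0)
code = enc.encode(x)
plain = relu(matvec(dictionary, x))  # ordinary tied linear encoder
print(code, plain)
```

With the denoising layers initialised to zero this reduces exactly to the ordinary tied linear encoder, which is one way to read the "inherits some of the convergence properties of linear encoders" intuition.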

pallid current Jul 27, 2023, 6:48 AM

#

oh hey, gotcha

#

btw is the dense_l1_sweep output saved somewhere, the untied ones?

bitter turtle Jul 27, 2023, 6:48 AM

#

not atm

pallid current Jul 27, 2023, 6:49 AM

#

ah k

bitter turtle Jul 27, 2023, 6:49 AM

#

p sure you can just change the call and run it tho

#

Shouldn't be too much reconfig

pallid current Jul 27, 2023, 6:49 AM

#

yeah will do once i restart the kernel

#

tho in general would be best to save exp funcs cos there's quite a lot of params just set in __main__

bitter turtle Jul 27, 2023, 6:55 AM

#

sure

#

I'm just using pythia-70m layer 2 residual for everything ATM, literally just the data in activation_data

#

it's like the older training setup in that you can lie to it and just change the dataset folder and it won't check that it's the right dataset folder

bitter turtle Jul 27, 2023, 11:48 AM

#

I also doubled the batch size, I trained the other ones with 1024

keen pivot Jul 27, 2023, 2:02 PM

#

bitter turtle specifically, a 3-layer one with residual skip connections; turns out residual c...

Also try a Layer Norm? Basically a Transformer w/o attention

bitter turtle Jul 27, 2023, 3:37 PM

#

that's not even a bad idea wth

#

That would be mildly funny if that worked

keen pivot Jul 27, 2023, 3:39 PM

#

It's also sounding more like the soft-prompt literature, which began simple, but expanded to more transformer-like models to train the soft-prompts.

bitter turtle Jul 27, 2023, 3:40 PM

#

I mean, you probably actually wouldn't want to do that, you need to propagate magnitudes through, and +ve activations definitely aren't centered, but it would be funny

bitter turtle Jul 27, 2023, 3:42 PM

#

bitter turtle So, it converges only with skip connections, was just messing around with differ...

Like, I think this is mildly principled in that it probably inherits some of the convergence properties of linear encoders but is slightly more powerful

#

layernorm would kill that

#

I don't think the trade-off is good enough to spend a lot of energy looking into it though, maybe if we get stuck after training on lots of data

#

like, potentially once the dict converges with linear encoders we could freeze it and train this to do better sparse coding for that dict, and maybe iterate, but that's a while off. More interested in looking at the shittons-of-data case atm

keen pivot Jul 27, 2023, 4:55 PM

#

bitter turtle like, potentially once the dict converges with linear encoders we could freeze i...

Is that just bottlenecked on loading in Pile & adding more chunks to train through?

bitter turtle Jul 27, 2023, 7:40 PM

#

yes, we should do it soon

keen pivot Jul 27, 2023, 9:56 PM

#

Got a graph of related features. The original one is related to words in parentheses, & the others appear to be similar as well. Will look into the details soon

pallid current Jul 27, 2023, 9:57 PM

#

😮

keen pivot Jul 27, 2023, 9:57 PM

#

I also want to cluster feature directions, both in general and here, where we could color-code features by their similarity, because some of these may just be the same direction across layers.

pallid current Jul 27, 2023, 9:58 PM

#

bitter turtle like, potentially once the dict converges with linear encoders we could freeze i...

not sure i understand this but the codebase is already set up to do larger runs

#

currently running a 100-chunk sweep on the pile with the dense l1 sweep

keen pivot Jul 27, 2023, 10:00 PM

#

Note: this took 6 minutes, which isn't too long, but will get longer w/ larger models/more layers

pallid current Jul 27, 2023, 10:00 PM

#

how are you running this?

bitter turtle Jul 27, 2023, 10:05 PM

#

pallid current currently running a 100-chunk sweep on the pile with the dense l1 sweep

Sick can't wait

pallid current Jul 27, 2023, 10:16 PM

#

definitely seeing diminishing returns, up to 20 epochs atm and it's still improving but barely

#

#

continued improvement is most noticeable at low tokens per activation_v tho so might still see a decent jump by 100 epochs

#

this is with dict_ratio = 4, so we can also see if this looks different with higher dict ratios

bitter turtle Jul 27, 2023, 10:19 PM

#

what is 'low tokens per activation_v'

pallid current Jul 27, 2023, 10:19 PM

#

just low sparsity (on the graph) , except that is what we'd usually call high sparsity so its confusing

bitter turtle Jul 27, 2023, 10:20 PM

#

yep

pallid current Jul 27, 2023, 10:20 PM

#

clearly i didnt improve the situation 😆

bitter turtle Jul 27, 2023, 10:20 PM

#

call the x-axis thing 'sparsity number' maybe

pallid current Jul 27, 2023, 10:21 PM

#

think i might go for 'average active features'

#

dont know why i keep calling features tokens

bitter turtle Jul 27, 2023, 10:22 PM

#

yeah that was confusing

keen pivot Jul 27, 2023, 10:23 PM

#

pallid current dont know why i keep calling features tokens

I like average features/tokens

bitter turtle Jul 27, 2023, 10:24 PM

#

pallid current continued improvement is most noticeable at low tokens per activation_v tho so m...

Yeah the tails seem to be converging to the center which is maybe good and promising; actually no, the right tail is converging to 600 or so features, that's awesome, we should check the mcs of that

keen pivot Jul 27, 2023, 10:24 PM

#

pallid current how are you running this?

On GPU. But you have to make (features × layers) causal interventions, so can't really batch that. Could run different paths on diff GPUs

bitter turtle Jul 27, 2023, 10:25 PM

#

Didn't someone speed it up by like 200x using activation patching or whatever it's called recently?

#

Think they were in the UK seri mats cohort

keen pivot Jul 27, 2023, 10:28 PM

#

keen pivot Got a graph of related features. The original one is words in parentheses relate...

The left is acronyms & right-two paths are dates related

#

Looks like overall, this direction in the last layer wants to up-weight an end-parenthesis, and there are two paths: end of acronyms & end of dates after an opening parenthesis

bitter turtle Jul 27, 2023, 10:30 PM

#

are there established metrics for goodness-of-graph? thinking we could use something Atticus Geiger-like to measure how descriptive graphs we find using the sparse basis are compared to how descriptive graphs we find on the neuron basis

keen pivot Jul 27, 2023, 10:30 PM

#

bitter turtle Didn't someone speed it up by like 200x using activation patching or whatever it...

Oh ya, probably! I'd integrate that. That'd make the feedback loops much better

bitter turtle Jul 27, 2023, 10:30 PM

#

bitter turtle are there established metrics for goodness-of-graph? thinking we could use somet...

Like icl this definitely can be a paper if it turns out that these graphs are computationally meaningful

keen pivot Jul 27, 2023, 10:31 PM

#

bitter turtle Like icl this definitely can be a paper if it turns out that these graphs are co...

This is residual stream. It'd be great to connect it to MLP's

bitter turtle Jul 27, 2023, 10:31 PM

#

keen pivot This is residual stream. It'd be great to connect it to MLP's

Oh, I know, Geiger's work can still be applied here

#

for sure want to connect to MLP as well tho

pallid current Jul 27, 2023, 10:32 PM

#

bitter turtle Oh, I know, giegers work can still be applied here

does he have an explicit goodness of graph metric?

bitter turtle Jul 27, 2023, 10:34 PM

#

I think he has a metric for 'alignment of causal hypothesis to model' and so basically we

1. find a graph with ACDC or whatever the acronym is
2. come up with a hypothesis for what each node in the graph represents in an abstract computational model of the circuit
3. throw the metric at it, which compares the abstract model to the model in the transformer

#

if we can find 'natural' circuits/graphs that are well described by abstract high-level causal models that would literally be fucking insane

#

I was thinking today about what kind of things we really want for a draft paper if we go that route and demonstrating the ability to find circuits in models using the sparse basis is for sure up there like number 1 priority

keen pivot Jul 27, 2023, 10:47 PM

#

I guess it's the "computational" part that doesn't work, but I do think you can still do causal alignment here.

#

Especially w/ the speedup, I could quickly find really good examples to test.

#

Slightly different graph: This is for feature restoration. Basically, I set the activation to 0 & check how restoring that feature accounts for the original activation.

bitter turtle Jul 27, 2023, 10:49 PM

#

keen pivot I guess it's the "computational" part that doesn't work, but I do think you can ...

What do you mean; we can still find computational graphs using purely residual stream data. Geiger's stuff is pretty implementation-agnostic, it just checks values of 'variables'/intermediate points in computation/values at nodes of the compute graph, it doesn't care how the computation is actually implemented between nodes much (IMO this is fine and also a good thing)

#

Also, we should probably be using FISTA/OMP etc for doing compute stuff

#

Or like, generally some better solver than dot + bias
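For a sense of what "a better solver than dot + bias" looks like, here is a minimal sketch of plain ISTA (the simpler, non-accelerated relative of FISTA) for minimising `0.5 * ||x - sum_i c_i D_i||^2 + l1 * ||c||_1` over the code `c` with the dictionary held fixed. The function name, the step size and the toy identity dictionary are all assumptions for the example; the step needs to be at most 1 / (largest eigenvalue of D Dᵀ) to converge.

```python
import math


def ista(x, dictionary, l1, step, n_iters=100):
    """Iterative shrinkage-thresholding for the lasso code of x under a fixed dictionary.

    dictionary: list of feature directions (rows), each the same length as x.
    """
    code = [0.0] * len(dictionary)
    for _ in range(n_iters):
        recon = [sum(c * row[j] for c, row in zip(code, dictionary))
                 for j in range(len(x))]
        resid = [a - b for a, b in zip(x, recon)]
        # Gradient step on the reconstruction term.
        stepped = [c + step * sum(r * w for r, w in zip(resid, row))
                   for c, row in zip(code, dictionary)]
        # Soft-thresholding handles the l1 term.
        code = [math.copysign(max(abs(s) - step * l1, 0.0), s) for s in stepped]
    return code


identity_dict = [[1.0, 0.0], [0.0, 1.0]]
sparse_code = ista([3.0, -0.5], identity_dict, l1=1.0, step=1.0)
print(sparse_code)
```

With an orthonormal dictionary this just soft-thresholds the projections; the interesting case is an overcomplete dictionary, where the iterations let correlated features explain each other away instead of all firing at once.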

bitter turtle Jul 27, 2023, 11:21 PM

#

keen pivot Slightly different graph: This is for feature restoration. Basically, I set the ...

What do the % here represent?

keen pivot Jul 28, 2023, 1:43 AM

#

bitter turtle What do the % here represent?

Percentage recovered or ablated. This case percent recovered.

So if the original activation was 5, and we ablate everything and recover one feature, how much of the original activation do we recover?
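One way to write down the "percent recovered" number, as a sketch: zero every upstream feature, put back one at a time, and report the downstream activation as a percentage of the original. `run_downstream` is a hypothetical stand-in for the rest of the model, and dividing by the original activation (rather than subtracting an all-ablated baseline first) is an assumption about the convention used in the graph.

```python
def restoration_scores(upstream_acts, run_downstream):
    """Percent of the original downstream activation recovered by each feature alone."""
    original = run_downstream(upstream_acts)
    scores = []
    for i, act in enumerate(upstream_acts):
        only_i = [0.0] * len(upstream_acts)  # ablate everything...
        only_i[i] = act                      # ...then restore one feature
        scores.append(100.0 * run_downstream(only_i) / original)
    return scores


# Toy downstream feature: ReLU of a weighted sum, with an original activation of 5.
toy_weights = [1.0, 0.5]
toy_downstream = lambda acts: max(0.0, sum(w * a for w, a in zip(toy_weights, acts)))
recovered = restoration_scores([2.0, 6.0], toy_downstream)
print(recovered)
```

The percentages only add up to 100 when the downstream computation is linear in the features, so in a real model they can sum to more or less than that.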

pallid current Jul 28, 2023, 2:35 AM

#

ran another long run with l1 sweep, this time tied, seems very definitively no difference, to the point where i'm checking i'm not just plotting the same data twice (don't think i am) (crosses are tied, legend is n_chunks)

pallid current Jul 28, 2023, 2:53 AM

#

set off a big run to analyse 50 feats from all 32 tied l1 values, about $350, no idea how long it'll take, i guess a few hours?

bitter turtle Jul 28, 2023, 7:10 AM

#

bitter turtle I think he has a metric for 'alignment of causal hypothesis to model' and so bas...

This might be a good objective measure, not sure about how many comparisons there are/what baselines we could use, maybe we should ask Neel or someone about relative measures

pallid current Jul 28, 2023, 7:48 AM

#

autointerp results from the sweep are odd and kinda worrying. not seeing the rise from low to mid l1 values that i expect, or at least not as robustly. then goes off the rails as we get to high l1 vals. need to adjust the approach to increase the number of features analysed to correct for the fact that most will be dead or v rarely active

#

need to run l1 = 0 runs to see if they are much worse than l1 = 1e-4

#

i'm quite worried by the fact that 1e-4 is so good

#

pallid current Jul 28, 2023, 8:13 AM

#

have left a run ongoing while going to bed, will almost certainly hang at the end due to wandb issues, if its preventing anyone from doing a run, just kill

keen pivot Jul 28, 2023, 4:10 PM

#

bitter turtle This might be a good _objective_ measure, not sure about how many comparisons th...

I agree this is a good measure. I don’t think we really need a baseline beyond “it doesn’t work in neuron basis”

bitter turtle Jul 28, 2023, 4:14 PM

#

Ah, but to measure it we need to

- make a choice of which graph to look at
- make a hypothesis of which algorithm the graph implements
which seems hard to make standardised when comparing neuron basis and sparse basis

#

I mean you can probably standardise the first one fine

#

Hmm maybe it's ok

keen pivot Jul 28, 2023, 4:16 PM

#

pallid current i'm quite worried by the fact that 1e-4 is so good

What’s the neuron basis score here?

And I predict the identity to be learned around 1e-5, which is usually like 500-600 sparsity, though I’m confused about the graph. Is 1e-4 corresponding to a sparsity of 800?

keen pivot Jul 28, 2023, 4:16 PM

#

bitter turtle Ah, but to measure it we need to - make a choice of which graph to look at - mak...

Ya. I think if we just get this to work in a real LM, then we’re golden

#

Im also making different choices to make the graph (eg ablation vs restoration), which can be compared against each other as well

pallid current Jul 28, 2023, 6:22 PM

#

keen pivot What’s the neuron basis score here? And I predict the identity to be learned a...

#

trend seems about right if you just look to l1=1.6e-4 but those last few ones around 1e-4, esp 1e-4 are just weird

#

about to run 0, 1e-7, 1e-6 and 1e-5

#

btw im pretty sure that we're overloading wandb when we do our final upload, leaving it to timeout indefinitely, i think for big runs for now we should just turn it off - i haven't been looking at it at least

keen pivot Jul 28, 2023, 6:29 PM

#

@pallid current, would you also be able to see MCS between different l1-value dictionaries? If there are N l1-values in the graph, this could be shown as the lower-triangle of an NxN matrix.
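A minimal sketch of that lower-triangle matrix in plain Python. MMCS here is taken to mean "for each feature in one dict, the max cosine similarity to any feature in the other dict, averaged over features"; that measure is not symmetric, and which direction goes in the lower triangle is a choice made for this example. The three tiny dictionaries are invented.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def mmcs(dict_a, dict_b):
    """Mean over features of dict_a of the max cosine similarity into dict_b."""
    return sum(max(cosine(a, b) for b in dict_b) for a in dict_a) / len(dict_a)


def mmcs_lower_triangle(dicts):
    """Entry [i][j] (j < i) is mmcs(dicts[i], dicts[j]); everything else is None."""
    n = len(dicts)
    return [[mmcs(dicts[i], dicts[j]) if j < i else None for j in range(n)]
            for i in range(n)]


# One dictionary per l1 value (toy data).
dicts_by_l1 = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
    [[1.0, 1.0], [0.0, 1.0]],
]
mmcs_matrix = mmcs_lower_triangle(dicts_by_l1)
for row in mmcs_matrix:
    print(["-" if v is None else f"{v:.3f}" for v in row])
```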

#

I said I'd get to it yesterday, but doing the causal alignment stuff atm.

#

I also want to do the outlier features, but out-of-scope for this project. Might try to pawn it off.

pallid current Jul 28, 2023, 7:01 PM

#

keen pivot @pallid current, would you also be able to see MCS between different l1-va...

yeah good shout will do in a sec

pallid current Jul 28, 2023, 9:18 PM

#

ran the mmcs matrix, got some very simple plots running in the notebook mmcs_plots.ipynb in my workspace

#

most interesting thing i can see is that there's some kind of peak around l1=0.001 where even the highest l1 values near l1=0.01 match most closely to 0.001, i guess because those are the features they would learn, if they weren't mostly dying

#

e.g.:

#

peak mmcs in the whole matrix is just above 1e-3

#

low mmcs match best with other low mmcs though there is a noticeable hump around 2e-3

#

#

oh also got the results back for the super low l1 baselines. they dont outperform neuron_basis on random but do slightly for top and top-random, so there's something just by the nature of putting the data through a relu that is causing some level of screening, need to be careful about this when making claims and baselining

#

keen pivot Jul 28, 2023, 10:07 PM

#

pallid current ran the mmcs matrix, got some very simple plots running in the notebook `mmcs_pl...

Are you able to save this as a matrix for all l1 values & plot a heatmap of the matrix?

pallid current Jul 28, 2023, 10:19 PM

#

heat map, tho i think it's quite hard to read

bitter turtle Jul 28, 2023, 10:23 PM

#

Do you have a legend?

pallid current Jul 28, 2023, 10:27 PM

#

keen pivot Jul 29, 2023, 12:45 AM

#

pallid current

Thanks for doing this!

#

There do appear to be two clusters here

hallow wyvern Jul 30, 2023, 12:58 AM

#

Hey, coming here from the mech interp discord where you presented this week. Amazing stuff.

I was looking into using your techniques for interpreting the TinyStories series of models, but as the first step in doing that I'm trying to come as close to reproducing your results with your codebase as I can. Couple of assumptions I'm making on my end while trying to do that which I am not sure are correct:

- The canonical repo for this work is https://github.com/HoagyC/sparse_coding.git : This one seems to be ahead of all of its forks
- The canonical way to run this code is to run python run.py <args> : This is what it says in README.md, but I do note a lot of recent activity in the files big_sweep.py and big_sweep_experiments.py. I also see changes to interp_notebooks/feature_interp.ipynb that are more recent than the changes to run.py, and the definition of class AutoEncoder is different in each one.
- The canonical way to verify that the artifacts are finding real feature directions is to run interp_notebooks/feature_interp.ipynb.

Asking because I did a run with python run.py --epochs=3 --save_after_mini=True --l1_exp_low=-14 --l1_exp_high=-10 --dict_ratio_exp_low=1 --dict_ratio_exp_high=7 --layer=2 --use_residual=True --use_wandb=True --wandb_entity=<my_name>, and it did generate outputs in outputs/20230729-195836/0/auto_encoders_2.pkl, but the results of interp_notebooks/feature_interp.ipynb seemed a bit off (after I did some extremely sketchy stuff to get that notebook to run at all).

Not urgent, I'm trying out seeing what happens when I use autoencoders.tied_ae.AutoEncoder in run.py to see if that helps

pallid current Jul 30, 2023, 4:20 AM

#

hey 🙂 yeah that's the right repo and those arguments look reasonable. it's in a funny state because run.py was the original code to run but we've been doing a lot of large hyperparam sweeps with multiple GPUs and Aidan rewrote the code with a very different architecture so i barely know what run.py does at this point, i'm happy to talk through for a bit if it's not really working

#

we should do a cleanup that allows a simple run soon

#

but if l1 and reconstruction loss are both falling then it looks like it should be working ok but i don't know the status of feature_interp.ipynb, @keen pivot can you help?

hallow wyvern Jul 30, 2023, 5:05 AM

#

If sweep(ensemble_init_func, cfg) is the up-to-date method for running one of these experiments, I can write an ensemble_init_func 🙂

hallow wyvern Jul 30, 2023, 5:23 AM

#

well that looks promising

#

https://github.com/HoagyC/sparse_coding/compare/main...JoshuaDavid:sparse_coding:main#diff-9ef165e04d52c6850bd88e31d401d1eb218baf46492f5974292eb6d32f739350 is my initial crack at an ensemble_init_func which will do the same thing the old run.py did (minus the "compute and store activations if none exist" step). Currently running via

$ python run_using_sweep.py --epochs=1 --save_after_mini=True --l1_exp_low=-12 --l1_exp_high=-10 --dict_ratio_exp_low=1 --dict_ratio_exp_high=4 --layer=2 --use_residual=True --use_wandb=True --datasets_folder=activation_data/pile-10k-EleutherAI/pythia-70m-deduped-2/ --wandb_entity=joshuadavid

ETA is ~7 more minutes (I'm running on an instance with only one GPU)

Edit: there were two chunks. ETA is actually around 05:52 UTC

hallow wyvern Jul 30, 2023, 5:52 AM

#

Run finished without errors at least

hallow wyvern Jul 30, 2023, 7:16 AM

#

pallid current hey 🙂 yeah that's the right repo and those arguments look reasonable. it's in a...

Yeah, it looks like using the stuff in big_sweep.py was probably the way to go. I'm still a little bit stuck in feature_interp.ipynb, since I'm not entirely sure how to convert an autoencoders.learned_dict.UntiedSAE into an AutoEncoder (or even if that's something I should be doing). But. It looks like it did something that is approximately what was done before.

When I take the size-2048 dictionary, and plot the pairwise cosine similarity of features in that dict (excluding pairs which are the same feature, i.e. feat_0 x feat_1, feat_0 x feat_2, ... feat_0 x feat_2047, feat_1 x feat_2 ... feat_2046 x feat_2047 but none of feat_0 x feat_0), I get a nice normalish-looking distribution centered around a cosine similarity of 0. And when I take the size-1024 dict and the size-2048 dict, and do the MCS thing there, there are some features that are learned by both. Though not as many as I might hope. Graphs in question, as well as the script to generate them, attached.
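For anyone following along without the attached script, the two checks described above look roughly like this in plain Python. This is not the attached `debug_sweep_outputs.py`; the function names are made up, and the random "dictionary" is only there to show the expected shape of the baseline (off-diagonal similarities centred on 0).

```python
import itertools
import math
import random


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def within_dict_similarities(dictionary):
    """Cosine similarity for every unordered pair of distinct features (no feat_i x feat_i)."""
    return [cosine(a, b) for a, b in itertools.combinations(dictionary, 2)]


def mcs_per_feature(small_dict, large_dict):
    """For each feature in the smaller dict, its best match in the larger dict."""
    return [max(cosine(f, g) for g in large_dict) for f in small_dict]


rng = random.Random(0)
random_dict = [[rng.gauss(0.0, 1.0) for _ in range(64)] for _ in range(32)]

sims = within_dict_similarities(random_dict)
mean_sim = sum(sims) / len(sims)

# Features that appear in both dicts show up as MCS close to 1.
shared = mcs_per_feature(random_dict[:4], random_dict)
print(len(sims), round(mean_sim, 3), [round(s, 3) for s in shared])
```

Histogramming `mcs_per_feature` across two trained dicts is where you'd hope to see a second mode near 1 for the features both dicts learned.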

Anyway, I have a suspicion that the issue here is just that I trained on all of 2 chunks for 1 epoch. I'll try setting up a 5 epoch run on 30 chunks overnight, see if that gets better results.

# Terrible hack of --epochs=0 to get the chunks into activation_data without having to use anything else in run.py
python run.py --epochs=0 --n_chunks=30 --save_after_mini=True --l1_exp_low=-13 --l1_exp_high=-12 --dict_ratio_exp_low=1 --dict_ratio_exp_high=2 --layer=2 --use_residual=True --use_wandb=False
# And retrain overnight
python run_using_sweep.py --epochs=5 --save_after_mini=True --l1_exp_low=-12 --l1_exp_high=-10 --dict_ratio_exp_low=1 --dict_ratio_exp_high=5 --layer=2 --use_residual=True --use_wandb=True --datasets_folder=activation_data/pile-10k-EleutherAI/pythia-70m-deduped-2/ --wandb_entity=joshuadavid

📎 debug_sweep_outputs.py

keen pivot Jul 30, 2023, 2:10 PM

#

hallow wyvern Yeah, it looks like using the stuff in `big_sweep.py` was probably the way to go...

Training on more data should help. But I’m curious what sparsity you’re getting in wandb, i.e. how many features are active per token.
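For reference, a minimal sketch of that sparsity number, assuming `codes` is a (tokens x features) array of dictionary activations and that "active" means nonzero; the function name is hypothetical and the metric logged to wandb may be computed differently.

```python
import numpy as np

def mean_active_features(codes, eps=0.0):
    # Count features with |activation| > eps for each token, then average over tokens.
    active_per_token = (np.abs(codes) > eps).sum(axis=1)
    return float(active_per_token.mean())

# Toy example: 2 tokens, 4 features.
codes = np.array([
    [0.0, 1.2, 0.0, 0.3],  # 2 active features
    [0.0, 0.0, 0.0, 0.7],  # 1 active feature
])
sparsity = mean_active_features(codes)
```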

#

I think the minimal_feature_interp on my repo should be better. One caveat: if you’ve saved it as a pickle, my code will work. If not, you’ll need to replace the pickle load with torch.load()

#

Ya the MCS histogram should look much better than that (again, more data, and check the l1’s effect on sparsity)

hallow wyvern Jul 30, 2023, 5:47 PM

#

keen pivot Training on more data should help. But I’m curious what sparsity you’re getting ...

Looks like sparsity shows as 200 in wandb for l1=1e-3, dict_size=2048? If I'm interpreting that right

keen pivot Jul 30, 2023, 5:50 PM

#

hallow wyvern Looks like sparsity shows as 200 in wandb for l1=1e-3, dict_size=2048? If I'm in...

Yep, that looks reasonable. I think a better sparsity is around 20-50, but that’s still up for debate.

#

That would mean upping the l1 term.

#

Are the pictures from retraining overnight on more data?

#

I’d be curious to see the results for MCS across two dicts

hallow wyvern Jul 30, 2023, 5:55 PM

#

That's from the runs I looked at last night, haven't looked at results from overnight runs yet

Edit: speaking clearly is hard

keen pivot Jul 30, 2023, 5:58 PM

#

Ah, what size is the model?

hallow wyvern Jul 30, 2023, 6:13 PM

#

One of the runs in the sweep was a dictionary of size 2048 with l1 of 1.78e-3, wandb says sparsity of ~65.

Model is pythia-70m-deduped-2

keen pivot Jul 30, 2023, 6:24 PM

#

hallow wyvern One of the runs in the sweep was a dictionary of size 2048 with l1 of 1.78e-3, w...

Have you checked the MCS of this one?

hallow wyvern Jul 30, 2023, 6:25 PM

#

doing that now

#

hm that's not good:

keen pivot Jul 30, 2023, 6:47 PM

#

Hmmm… I can look into it in more detail tomorrow. Definitely doesn’t look right.

Random vectors would be around 0.4 I think for MCS so it’s kind of weird.

hallow wyvern Jul 30, 2023, 7:08 PM

#

keen pivot Hmmm… I can look into it in more detail tomorrow. Definitely doesn’t look right....

I think "MCS just under 0.2" makes sense for the best-of-2048 samples of random normalized 512-dimension vectors, at least based on some quick hacking about in a repl

>>> a = np.random.rand(512, 1024) - 0.5;
>>> b = np.random.rand(512, 2048) - 0.5;
>>> a /= np.linalg.norm(a, axis=0);
>>> b /= np.linalg.norm(b, axis=0);
>>> print("\n".join([
        f'{b:.2f}-{b+0.01:.2f}: {ct}'
        for ct, b in zip(*np.histogram(
            (a.T@b).max(axis=1),
            bins=100,
            range=(0.0, 1.0)
        ))
        if ct > 0
    ]))
0.12-0.13: ###
0.13-0.14: ##############
0.14-0.15: ###############################
0.15-0.16: ##########################
0.16-0.17: ###############
0.17-0.18: ########
0.18-0.19: ##
0.19-0.20: #

#

I suspect I broke the autoencoder training code, I'll look into what's going on and come back with an update once I figure it out

weak meteor Jul 30, 2023, 10:49 PM

#

Where can I see a doc summarising what has been done so far and the plan ahead?

bitter turtle Jul 31, 2023, 8:46 AM

#

Hoagy did a write-up on lesswrong: https://www.lesswrong.com/posts/ursraZGcpfMjCXtnn/autointerpretation-finds-sparse-coding-beats-alternatives
We unfortunately don't really have a hugely concrete plan ahead atm, but we're working towards a draft paper at some point

AutoInterpretation Finds Sparse Coding Beats Alternatives — LessWro...

Produced as part of the SERI ML Alignment Theory Scholars Program - Summer 2023 Cohort …

bitter turtle Jul 31, 2023, 8:54 AM

#

hallow wyvern I think "MCS just under 0.2" makes sense for the best-of-2048 samples of random ...

Just a minor thing: these vectors aren't uniformly distributed about the n-dimensional sphere, you'll have slightly higher density in directions like (1,1,1,...), and slightly lower density in directions like (1,0,0,...), so your distribution might be a little off. To sample from the sphere you should normalise Gaussian-distributed points (i.e. randn).

hallow wyvern Jul 31, 2023, 4:51 PM

#

Oh yeah, that is correct. Though it doesn't seem to make a huge difference. Changing the first two lines to use np.random.randn and rerunning makes some difference, but not a huge one.

a = np.random.randn(512, 1024)
b = np.random.randn(512, 2048)
a /= np.linalg.norm(a, axis=0);
b /= np.linalg.norm(b, axis=0);
print("\n".join([
    f'{b:.2f}-{b+0.01:.2f}: {chr(0x2588)*(ct//8-1)+(chr(0x2588+(ct%8)) if ct%8 > 0 else "")}'
    for ct, b in zip(*np.histogram(
        (a.T@b).max(axis=1),
        bins=100,
        range=(0.0, 1.0)
    ))
    if ct > 0
]))
0.11-0.12: ▉
0.12-0.13: ███▏
0.13-0.14: █████████████████████▊
0.14-0.15: ███████████████████████████████████▍
0.15-0.16: ███████████████████████████████▊
0.16-0.17: █████████████████▊
0.17-0.18: ████████▊
0.18-0.19: █▎
0.19-0.20: ▋
0.20-0.21: ▉
0.21-0.22: ▉

Also I note that the original cosine sim graph shows that a nonzero number of features in the size-1024 dict have MCS >> 0.2 with ones in the size-2048 dict, so whatever I broke didn't cause quite entirely random features to be returned. Just very very close.

bitter turtle Jul 31, 2023, 5:11 PM

#

I think you can probably find this analytically, but eh
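A rough sketch of one analytic estimate, under two stated approximations: the cosine between two random unit vectors in d dimensions is treated as Gaussian with variance 1/d, and the expected maximum of n standard normals uses the usual asymptotic expansion. The function name is hypothetical.

```python
import math

def approx_random_mcs(dim, n_candidates):
    # Approximate expected max of n_candidates standard normals:
    # sqrt(2 ln n) - (ln ln n + ln 4*pi) / (2 sqrt(2 ln n)).
    root = math.sqrt(2.0 * math.log(n_candidates))
    correction = (math.log(math.log(n_candidates)) + math.log(4.0 * math.pi)) / (2.0 * root)
    # Cosines of random unit vectors have standard deviation ~ 1/sqrt(dim).
    return (root - correction) / math.sqrt(dim)

estimate = approx_random_mcs(dim=512, n_candidates=2048)
print(f"approx. random-baseline MCS: {estimate:.3f}")
```

For 512 dimensions and 2048 candidates this lands at roughly 0.15, in the same range as the REPL histograms above.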

keen pivot Jul 31, 2023, 5:35 PM

#

Might be good to write out our current plans for the week @bitter turtle @pallid current if you want! For me:

Causal Alignment - write up fuller project for this & implement.

Extra-Todos: @weak meteor

Look through many examples of the causal alignment stuff (mine is the parentheses example) to find a cool one
Implement early-layers-to-late-layers, cause atm I can only do later layers to earlier (related to causal alignment)
Look into outlier features for a few days, try to find the cause, try to pass the torch to someone else
Use features for activation engineering (may require training on large LLAMA models, which requires switching to baukit for training cause of GPUs)
[Note: Roko, I don't expect these to be clear TODOs. Can explain more]

bitter turtle Jul 31, 2023, 5:37 PM

#

Beat me to it! Was going to do exactly this this evening. I've asked Neel Nanda what kind of metrics he would like to see for causal alignment via email, hopefully will get back soon. If he doesn't I'll say fuck it and ping him or something, anyway. Will write up plans in about an hour and a half when I get back

#

Could you also elaborate on 2) for me? Tracing the causal path forwards doesn't strike me as immediately obviously useful. I also think 3) is a bit annoying but pretty universal, have heard some people propose solutions for newer archs, but it seems better to figure out a way to have our models ignore them

keen pivot Jul 31, 2023, 5:49 PM

#

bitter turtle Could you also elaborate on 2) for me? Tracing the causal path forwards doesn't ...

If a mid-layer detects Dates, we could find many further layers that make use of this feature. We would still be able to make causal alignment statements here.

#

There's a few papers on outlier features, but nothing mentions the \n or "." that I see in the outlier dimensions (which I've verified in just the outlier dimensions is a consistent token, but only in Pythia, not gpt-2 or others) which is novel AFAIK.

#

@bitter turtle I'd like your thoughts on this. Causal alignment (CA) feels circular here (at least for our use-case). CA assumes you think these parts of the model do some algorithm, which you can verify by changing the parts; if they have the same effect on the outputs & intermediate outputs as your algorithm predicts, then good.

The circular part is how do you come up w/ the hypothesis of the circuit in the first place w/o causal interventions?

This doesn't seem like a problem though, cause we can just do hypothesis generation by causal interventions. If the resulting algorithm is "simple" (whatever that means), then our features are good. If not, then booooo.

pallid current Jul 31, 2023, 6:01 PM

#

keen pivot Might be good to write out our current plans for the week @bitter turtle ...

yeah agree we should be writing up more. i sent to RB that paper planning doc and i've asked robert to start drafting a paper skeleton so that we have a really clear picture of where the holes are in the research

#

since you're going to be in town from tomorrow i think we should have a big chat then and do a list with assigned people and such

bitter turtle Jul 31, 2023, 7:48 PM

#

keen pivot @bitter turtle I'd like your thoughts on this. Causal alignment (CA) feel...

Ok, so my thought process was basically this, except measuring alignment to a high-level abstract causal model would give us an additional measure of correctness of high-level abstract description or whatever. Like, we can use ACDC to find a circuit A at some arbitrary noise threshold, eyeball a high-level hypothesis for it via a description of some (pruned) subcircuit B of A as a causal machine M (the description would be human-interpretable by design, something ACDC doesn't provide by default), and then we can measure the accuracy of M in predicting the activations of B, giving us a measure of the human-interpretability-ness of B. Basically, the idea is that ACDC acts as a quick pruning strategy at some arbitrary noise threshold to help get started generating hypotheses for circuits in the sparse basis.

That was the original plan, but I am now very unsure about how we can systematically compare the scores of circuits found using the sparse basis and circuits in some other basis; there is probably a large variance of scores and an absurd number of circuits with fuzzy borders between them, and a lot of room for noise in where we draw the boundary between one circuit and another, or how we prune etc. Potentially we can come up with some complexity measure of the high-level description and measure the trade-off for both bases over a number of circuits, and if sparse coding is good we should see an improvement there, but that might be a little beyond the scope of this project.

keen pivot Jul 31, 2023, 7:52 PM

#

bitter turtle Ok, so my thought process was basically this, except measuring alignment to a hi...

Agreed on the nebulousness of circuits. I'll just give a go this week for several different types of circuits (e.g. the closing parenthesis one) and see what heuristics & results I come up w/. I'm overall fine if our paper has a mediocre implementation of ACDC & causal alignment on top of our dictionary learning. Like pretty solid overall, haha

keen pivot Jul 31, 2023, 7:53 PM

#

bitter turtle Ok, so my thought process was basically this, except measuring alignment to a hi...

I do think we can compare different ways of doing ACDC (for example, I'm ablating, which I could compare w/ restoring, which I could compare w/ Shapley values of the top-5 ablated features).

#

#

This is ACDC w/ "[percentage effect]% | [cosine similarity]"

One thing I'd also like to check is the effect of intermediate layers on others. In the other graph, there was a connection between 4_1030 & 3_1273, but not this time. I should be able to easily record this & choose not to show it if the effect is < 1% or something. This is also another set of choices to make when implementing this, to compare against!

#

Also, I'm choosing to only look at the top-5 max-activating examples for a feature when I look at the differences when ablating causal features.

keen pivot Jul 31, 2023, 8:09 PM

#

keen pivot This is ACDC w/ "[percentage effect]% | [cosine similarity]" One thing I'd also...

Ah, I fixed the issue. I've set it to always display at least 3 children for each node, but only recursively pursue (ie ablate children of) the most important ones across all children.

#

#

This is w/ top-k examples set to 10, whereas the others were 5. So some connections will be different.

bitter turtle Jul 31, 2023, 8:25 PM

#

keen pivot Agreed on the nebulousness of circuits. I'll just give a go this week for severa...

well, I guess my point is that the actual circuits discovered by ACDC don't really matter that much, they are more like a guide for finding circuits if we go down the measuring-causal-alignment-ness route (better term needed: how about 'abstractibility' or something similar), so the mediocre implementation would be fine. I'd like to get robustish measures of abstractibility though, that seems worthwhile. Like the idea of using different ways of doing ACDC, seems good for finding a large variety of circuits

bitter turtle Jul 31, 2023, 8:25 PM

#

keen pivot This is w/ top-k examples set to 10, whereas the others were 5. So some connecti...

yeah exactly; ACDC is kind of arbitrary, so we can just manually find subcircuits to do interpretation and measure the abstractibility of

keen pivot Jul 31, 2023, 8:33 PM

#

Alright, let's stick w/ abstractibility for now, lol

keen pivot Jul 31, 2023, 9:17 PM

#

https://docs.google.com/document/d/1XOXQba0dQvOEuFdk6_RKrEoqHmbeGKDWi7nQeXoOuUg/edit?usp=sharing

I've got a few different settings here for graphs. These are K = 5, 10, 20 (for how many datapoints we consider) & max vs halfway-to-max (for which K datapoints we select; halfway uses 0.5*max as a lower bound).

Google Docs

Notes: ACDC effect of K

K = 5 (5 min) K = 10 (5 min+) K = 20 K = 5 (Median-to-Max) K = 10 (Median-to-Max) K = 20 (Median-to-Max)

pallid current Jul 31, 2023, 9:38 PM

#

keen pivot https://docs.google.com/document/d/1XOXQba0dQvOEuFdk6_RKrEoqHmbeGKDWi7nQeXoOuUg/...

where's the code you're running this with?

keen pivot Jul 31, 2023, 9:57 PM

#

pallid current where's the code you're running this with?

https://github.com/loganriggs/sparse_coding/blob/main/dictionaries_across_layers.ipynb


#

You can also look at my folder on the node, where there are also the autoencoders for layers 1-5 in my directory

bitter turtle Aug 1, 2023, 12:35 AM

#

ok so finally found a method that consistently works a significant amount better than our standard linear encoders; it gets the same unexplained variance at about half the mean number of features active. ran on 8 chunks compared to the 30 chunk run that hoagy did

method is basically linear dictionary as per usual but with 5-layer (could probably cut it down to 3 w/o significantly harming performance) learned ISTA-plus-momentum encoder

don't expect to use this significantly much for the circuit stuff I want to get into tomorrow, and also the sparsity-to-l1 thing is very unpredictable, but I guess it's nice to know we can do better than just linear encoding; if we end up needing sparser dictionaries we can just throw this + a lot of data at it. also note that I think sparsity-to-l1 will be a lot nicer if we pretrain with a linear encoder

pallid current Aug 1, 2023, 12:36 AM

#

oh snap, super cool

#

purely linear decoder?

bitter turtle Aug 1, 2023, 12:36 AM

#

yep

pallid current Aug 1, 2023, 12:39 AM

#

damn that's big! potentially more in the tank still with more data maybe?

bitter turtle Aug 1, 2023, 12:40 AM

#

uh yeah pretty noisy tho, i'd want to pretrain the decoder for like 1 chunk with linear encoders to get the dict right then freeze and train the encoder to that, then start training both in simul to get more consistent results, but yeah probably

#

took A Fucking While to find this but maybe useful especially for derivative stuff

pallid current Aug 1, 2023, 1:00 AM

#

bitter turtle uh yeah pretty noisy tho, i'd want to pretrain the decoder for like 1 chunk with...

hmm interesting wonder if that's necessary

#

if you point me to some dicts that are beyond the pareto frontier i'll see how they do on autointerp

bitter turtle Aug 1, 2023, 1:28 AM

#

pallid current if you point me to some dicts thats are beyond the pareto frontier i'll see how ...

sparse_coding_aidan_new/output_4_rd/ or something? Don't know which ones are best, I'd look at the sparsity-80-odd ones achieving about 0.05 variance explained? Or the random one in the 40s

bitter turtle Aug 1, 2023, 1:28 AM

#

pallid current hmm interesting wonder if that's necessary

Well, I think it'd improve consistency, but I guess we could just big run + multiple initializations + take good

pallid current Aug 1, 2023, 1:49 AM

#

bitter turtle Well, I think it'd improve consistency, but I guess we could just big run + mult...

yeah fair. i think the benefit of this approach depends on whether we think after 1 epoch of linear training the dict is like 'basically right', and you just finetune the encoding strat, vs wanting it to learn something substantially different

#

my intuition is that the noise reduction isn't the computationally difficult part, compared to doing good feature finding, so i expect it not to help too much but i could easily be wrong

#

also what's the role of ISTA in this setup? i thought in ISTA et al the encoder was just a set of feature weights learned fresh for each case, rather than a particular (eg 5 layer) formula?

bitter turtle Aug 1, 2023, 6:27 AM

#

pallid current also what's the role of ISTA in this setup? i thought in ISTA et al the encoder ...

Right, so LISTA is basically ISTA but with some parameters learned, all unrolled into a net, so each layer corresponds with one iteration. It's more computationally efficient+converges better or something, there's a bunch of lit about it. Specifically this is LISTA+a momentum update, can't tell quite what it's called I think LFISTA is a reasonable name which I think I saw in the lit somewhere but can't find it now
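A minimal sketch of the unrolling idea, not the encoder actually trained here: each "layer" below is one ISTA iteration with a FISTA-style momentum (look-ahead) step, with the step size and threshold fixed from the dictionary. In a learned (LISTA-like) version those matrices and thresholds would become per-layer trainable parameters. All names are hypothetical, and the dictionary is assumed to have one feature per row.

```python
import numpy as np

def soft_threshold(z, thresh):
    # Proximal operator of the l1 penalty: shrink towards zero by thresh.
    return np.sign(z) * np.maximum(np.abs(z) - thresh, 0.0)

def unrolled_fista_encode(x, dictionary, l1, n_layers=5):
    # Approximately minimise 0.5 * ||x - code @ dictionary||^2 + l1 * ||code||_1.
    # x: (d,), dictionary: (n_feats, d).
    step = 1.0 / np.linalg.norm(dictionary, ord=2) ** 2  # 1 / Lipschitz constant
    code = np.zeros(dictionary.shape[0])
    lookahead = code.copy()
    t = 1.0
    for _ in range(n_layers):  # one loop pass = one unrolled layer
        residual = x - lookahead @ dictionary
        new_code = soft_threshold(lookahead + step * (dictionary @ residual), step * l1)
        new_t = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        # Momentum: extrapolate along the direction of the last update.
        lookahead = new_code + ((t - 1.0) / new_t) * (new_code - code)
        code, t = new_code, new_t
    return code

# Sanity check: with an orthonormal dictionary the answer is plain soft-thresholding.
x = np.array([1.5, -0.2, 0.0, -2.0])
code = unrolled_fista_encode(x, np.eye(4), l1=0.5)
```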

bitter turtle Aug 1, 2023, 8:25 AM

#

ok, slightly good signal/sanity check, we're consistently beating [take the top-k components of PCA and project to that subspace]
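One way to read that PCA baseline, as a sketch under assumptions: activations are a (samples x dims) array, and the comparison metric is the fraction of variance left unexplained after projecting onto the top-k principal components. The function name is hypothetical and the actual comparison code may differ.

```python
import numpy as np

def pca_unexplained_variance(activations, k):
    # Centre the data, then find principal directions via SVD.
    centered = activations - activations.mean(axis=0)
    _, _, components = np.linalg.svd(centered, full_matrices=False)
    top = components[:k]  # (k, dims), orthonormal rows
    # Project onto the top-k subspace and map back to the original space.
    reconstruction = centered @ top.T @ top
    return float(((centered - reconstruction) ** 2).sum() / (centered ** 2).sum())

rng = np.random.default_rng(0)
acts = rng.standard_normal((200, 16)) * np.linspace(3.0, 0.1, 16)
fvu_4 = pca_unexplained_variance(acts, 4)
fvu_8 = pca_unexplained_variance(acts, 8)
fvu_16 = pca_unexplained_variance(acts, 16)
```

A dictionary "beats" this baseline if, at the same number of active dimensions, it leaves less variance unexplained.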

#

the 'half the mean number of activations' result was at 0.05, the gap seems to be smaller overall, but I think sparsity in the ~100 range is reasonable anyways

keen pivot Aug 1, 2023, 11:43 AM

#

bitter turtle ok, slightly good signal/sanity check, we're consistently beating [take the top-...

Oh, nice sanity check!

bitter turtle Aug 1, 2023, 12:49 PM

#

@keen pivot where is the code you are using for the circuit stuffs?

#

I might also wait until you and hoagy have your meeting, not sure what to do atm

keen pivot Aug 1, 2023, 4:30 PM

#

bitter turtle @keen pivot where is the code you are using for the circuit stuffs?

I linked it above to Hoagy

keen pivot Aug 1, 2023, 4:30 PM

#

keen pivot https://github.com/loganriggs/sparse_coding/blob/main/dictionaries_across_layers...

@bitter turtle

bitter turtle Aug 1, 2023, 5:15 PM

#

Ah brill, sorry I missed that!

pallid current Aug 1, 2023, 6:59 PM

#

here's the correlation between interp score and feature variance, skew and kurtosis:

bitter turtle Aug 1, 2023, 7:25 PM

#

Cool, what are your takeaways on this? I'm not sure how much I trust autointerp. How many dicts are here?

pallid current Aug 1, 2023, 7:57 PM

#

takeaways are: searching by high variance (and mean, and % cases active, have also checked those now) for good features is not going to work, bit disappointing because i hoped that might give signal for which feats to choose in highly overcomplete dicts (though would be worth rerunning this with much larger sizes)

#

skew and kurtosis seem to be pretty much identical, no distinct signal between them but they're a reasonable proxy for feature goodness

pallid current Aug 1, 2023, 8:15 PM

#

might clean and send to openai, i think the above graph is wrong i logged the wrong variable lol but the effect stands, will update in a bit

bitter turtle Aug 1, 2023, 8:17 PM

#

Yep yep

bitter turtle Aug 1, 2023, 8:21 PM

#

pallid current takeaways are: searching by high variance (and mean, and % cases active, have al...

I'm not sure autointerp will necessarily be correlated with 'useability for circuits' like maybe it will but also it's weird and fucky? I'd select by some combination of sparsity and proportion variance explained maybe. What did you find for n dead neurons and how was it correlated (at all, a little bit, literally just noise?)

keen pivot Aug 2, 2023, 9:49 PM

#

shell mural Aug 2, 2023, 9:50 PM

#

Looks extremely cool, help me interpret what I'm looking at

#

Is this a dictionary circuit

keen pivot Aug 2, 2023, 9:50 PM

#

Oh, I was trying to figure out how to make it high-resolution, but you just click "open in browser" after you click on it initially

keen pivot Aug 2, 2023, 9:51 PM

#

shell mural Looks extremely cool, help me interpret what I'm looking at

The top row is the dict for the layer 5 residual. The rest are previous layers and are prefixed with e.g. "4_..." for layer 4

#

The text is my interpretation of what the feature means

#

shell mural Aug 2, 2023, 9:52 PM

#

Insane, extremely cool

keen pivot Aug 2, 2023, 9:52 PM

#

The percentage here means, given 10 activating examples of feature 4_1030, when I ablate feature 3_891, those activations go down by 36% on average.

#

The "... | 0.83" is cosine similarity to track how similar the directions are.
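A minimal sketch of how the two numbers on each edge could be computed, assuming the percentage is the mean relative drop over the selected examples (rather than a ratio of means) and that the cosine similarity is taken between the two features' dictionary directions. Function names and the toy numbers are hypothetical.

```python
import numpy as np

def percent_drop(baseline, ablated):
    # Mean relative decrease of the downstream feature's activation, in percent.
    baseline = np.asarray(baseline, dtype=float)
    ablated = np.asarray(ablated, dtype=float)
    return float(100.0 * np.mean((baseline - ablated) / baseline))

def cosine(u, v):
    # Cosine similarity between two feature directions.
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy numbers: downstream activations before and after ablating an upstream feature.
effect = percent_drop([2.0, 4.0], [1.0, 3.0])
similarity = cosine([1.0, 0.0], [1.0, 1.0])
print(f"{effect:.0f}% | {similarity:.2f}")
```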

keen pivot Aug 2, 2023, 9:54 PM

#

shell mural Insane, extremely cool

Agreed, I'm very excited. Still more work to do to more rigorously show it, but it's quite crazy that they fit so well so far!

pallid current Aug 2, 2023, 11:15 PM

#

running autointerp over the lista dicts and getting a v high proportion with no activations

#

assuming that it's layer2resid

keen pivot Aug 2, 2023, 11:15 PM

#

Each is 2k. Can you plot like a hist of it?

#

I'm getting decent perplexities for layer2resid, so must be true

#

Layer2resid:

```
Perplexity for l1=1.00E-05: 83.06
Perplexity for l1=1.16E-05: 51.76
Perplexity for l1=1.35E-05: 93.64
Perplexity for l1=1.56E-05: 147.65
Perplexity for l1=1.81E-05: 97.01
```

Layer3resid:

```
Perplexity for l1=1.00E-05: 117.17
Perplexity for l1=1.16E-05: 67.78
Perplexity for l1=1.35E-05: 112.00
```

#

Though they're surprisingly close which may mean more about the similarity of the residual stream. Still should investigate this (in case something fishy is going on/code error)

keen pivot Aug 2, 2023, 11:40 PM

#

pallid current Aug 2, 2023, 11:40 PM

#

keen pivot Layer2resid: ```Perplexity for l1=1.00E-05: 83.06 Perplexity for l1=1.16E-05: 51...

how do these compare to previous perplexities?

keen pivot Aug 2, 2023, 11:40 PM

#

Min is 43 perplexity. Base model is ~25

pallid current Aug 2, 2023, 11:41 PM

#

im getting an OOM when i try to generate the histogram, dont wanna screw the big run 😟

pallid current Aug 2, 2023, 11:41 PM

#

keen pivot Min is 43 perplexity. Base model is ~25

min being like zero l1 alpha?

keen pivot Aug 2, 2023, 11:42 PM

#

pallid current how do these compare to previous perplexities?

I just did layer 3 activations instead of layer 2

keen pivot Aug 2, 2023, 11:42 PM

#

pallid current min being like zero l1 alpha?

Like 3e-5

#

```
Perplexity for l1=1.16E-05: 51.76
Perplexity for l1=1.35E-05: 93.64
Perplexity for l1=1.56E-05: 147.65
Perplexity for l1=1.81E-05: 97.01
Perplexity for l1=2.10E-05: 46.16
Perplexity for l1=2.44E-05: 55.72
Perplexity for l1=2.83E-05: 43.57
Perplexity for l1=3.28E-05: 95.97
Perplexity for l1=3.81E-05: 98.02
```

#

Super noisy down there. Maybe they're learning the identity

pallid current Aug 2, 2023, 11:43 PM

#

that's a lot of variation!

keen pivot Aug 2, 2023, 11:43 PM

#

Huge

pallid current Aug 2, 2023, 11:43 PM

#

hard to learn the identity through 5 mlp layers!

keen pivot Aug 2, 2023, 11:43 PM

#

Imma do qualitative on the 43 one to see if it learned the identity...oh, lololol

#

Also w/ 2k features

#

I can at least look at the decoder.

#

This is previous results on a good dict for layer 3. So the LISTA results are better, but may not be if they're learning an identity like thing.

#

43 would be huge if it's real. Would just need to train more & do better ML for convergence. I also expect it needs to be bigger, based off the previous results replied to above.

#

I also like Aidan's idea (on a call from today) of comparing PCA w/ the dict across the same number of dimensions for both perplexity diff & variance explained.

I don't think it should be better because one benefit of dictionary is we can use more dimensions than the original, but if it is, then that's a cool result.

pallid current Aug 3, 2023, 12:05 AM

#

what was the lowest perplexity for a non-lista dict?

keen pivot Aug 3, 2023, 12:15 AM

#

#

These are the non-zero activations, and I gotta say it doesn't look good!

bitter turtle Aug 3, 2023, 12:33 AM

#

pallid current running autointerp over the lista dicts and getting a v high proportion with no ...

Weird. Also weird that it has better sparsity-to-variance explained than other ones but also a bunch of zero directions. I'm very confused

bitter turtle Aug 3, 2023, 12:34 AM

#

keen pivot

I think this histogram thing might be broken slightly. @pallid current what code did you use to generate the previous histogram, could you try comparing that code and this code?

pallid current Aug 3, 2023, 12:35 AM

#

bitter turtle I think this histogram thing might be broken slightly. @pallid current wha...

i can but the code for that uses too much memory somehow

bitter turtle Aug 3, 2023, 12:35 AM

#

ok aaay

pallid current Aug 3, 2023, 12:35 AM

#

well at least, gpt2 is almost full and i thought that was from my jupyter server but ive cut that and it's still nearly full so unsure tbh

#

**gpu2 loool

keen pivot Aug 3, 2023, 12:35 AM

#

I decided to look at the nonzero activations & see if they're meaningful.

bitter turtle Aug 3, 2023, 12:36 AM

#

I think the hist thing in standard_metrics.py is broken anyway, if that's the one Logan's using

#

something is broken, somewhere

keen pivot Aug 3, 2023, 12:36 AM

#

Another negative thing is that the two best perplexity dicts have near 0 MCS w/ each other

#

I'm using my own code! I still may be doing things wrong being not used to the new dicts!

#Sparse Coding