Ok, so I think you can get a long way through the standard lens of 'neural networks as Bayesian optimisers' here (at least, informally); if you assume some information-minimisation prior you might be able to get to (something-approaching) superposition downstream of that. More generally, I don't think that a formalisation of this is particularly useful (it seems highly complex, and empirics are good and we should use them) and I also don't see how 'removing superposition' is a particularly useful approach (doing so would significantly impact performance, and the model would probably get around our training guardrails somehow (see Neel's SoLU stuff ig?))
#Sparse Coding
@bitter turtle I'm not getting any high-mcs features for any of the dicts. I'm comparing dict_ratio_2 w/ dict_ratio_4, and have done tied & untied across all l1 values. It didn't seem to work for the known-to-work l1 of 1e-3, so 🤷
I could've coded something wrong. Have you been able to get any high-mcs features/graphs from this, even the toy?
what is the non-mcs performance of the dicts like
Haven't tested it
which metrics?
like loss on actual data, sparsity level etc and how does it compare to the other ones we did with the other trainer
if it's different something's wrong with the training code if it's the same something's wrong with your code or dictionaries are weird asf
the wandb doesn't track sparsity. I could run the model on some data & check.
Bear in mind this is 8x~2GB chunks, could be a lack-of-data thing
Yeah that's what I meant
So this is 14GB overall for _7?
Yeah
Well 15 I think
@pallid current did you do autointerp with directions from the new run code or an older run
What do the distributions look like
i ran autointerp on dictionaries that logan ran with the old infra about two weeks ago
Let me do that sparsity check
Oh, how are you scaling L1, I might not have implemented that properly, might wanna check the loss function implementations
What do you mean?
"dict_ratio_4_6.pt": {"l1_alpha": 0.003162277629598975, "bias_decay": 0.0},
this is a good l1_alpha term
Like how exactly is the loss function implemented
I remember someone saying something about scaling by 1/J, and I'm wondering if I did that properly
@keen pivot
Ah, I see where you did that.
You actually don't want to do that because it causes the diagonal thing
So these l1-values are really low
I'm getting a sparsity of 600/10000
Ah, that might explain it then
It might also explain the really low reconstruction loss, lol
Could you look over the loss function when I reimplement it?
Yep!
it looks like just:
l_l1 = (buffers["l1_alpha"] / c.shape[0]) * torch.norm(c, 1, dim=-1).mean()
removing the c.shape for both the tied and untied
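A rough NumPy sketch of the two variants being compared here, not the actual training code. It assumes `c` is a (batch, n_features) array of latent activations; `l1_penalty` and `divide_by_shape0` are hypothetical names standing in for the torch line above.

```python
import numpy as np

def l1_penalty(c, l1_alpha, divide_by_shape0=False):
    """Mean per-row L1 norm of the codes, times l1_alpha.

    c: (batch, n_features) array of latent activations (assumed layout).
    divide_by_shape0=True mimics the extra `/ c.shape[0]` factor from the line above.
    """
    scale = l1_alpha / c.shape[0] if divide_by_shape0 else l1_alpha
    return scale * np.abs(c).sum(axis=-1).mean()

c = np.ones((4, 8))                        # 4 rows, 8 features, all active at 1.0
scaled = l1_penalty(c, 1e-3, divide_by_shape0=True)
unscaled = l1_penalty(c, 1e-3)
```

With the extra division the effective l1 coefficient shrinks by a factor of `c.shape[0]`, which would fit the "these l1-values are really low" observation.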
Thank god it only takes like 16m to train lol
Also tracking sparsity would be useful
yep will do
I think it's:
x_hat.count_nonzero(axis=1).float().mean()
How would you like this measured?
ah just total
I might do that as well as 'num Nonzero per feature over last chunk' or something
Like per token, there's how many nonzero latent activations
The features/token metric helps check if we've set the l1 too high (zero activations) or too low (several hundreds, so mostly identity)
Although I guess this is less useful and we can just measure this at the end
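A small NumPy sketch of the two sparsity metrics mentioned above (features per token, and nonzero count per feature). The function names are made up for illustration, and `codes` is assumed to be the (n_tokens, n_features) matrix of latent activations rather than the reconstruction.

```python
import numpy as np

def features_per_token(codes):
    """Average number of nonzero latent activations per token."""
    return float(np.count_nonzero(codes, axis=1).mean())

def activation_counts_per_feature(codes):
    """How many tokens each feature fired on; zeros here flag dead features."""
    return np.count_nonzero(codes, axis=0)

codes = np.array([
    [0.0, 1.2, 0.0, 0.3],
    [0.0, 0.0, 0.0, 0.0],
    [0.5, 0.7, 0.0, 0.0],
])
avg_active = features_per_token(codes)             # (2 + 0 + 2) / 3
per_feature = activation_counts_per_feature(codes)
```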
weird link
looks much more sparse now
but bloody hell these reconstruction losses
I've been thinking that we could use a more powerful encoder (a full-blown feed-forward multi-layer net) while keeping the decoder limited to a linear map
@bitter turtle, one thing, the 8 naming scheme includes "_group_1" and the others do not. Is that intentional cause the 8 sized ones are bigger and split across two GPUs?
yes
This could work. The benefit of the tied embedding is that the direction we're reading from is the same direction we're able to write to. But the future MLP layers don't have tied embedding, so maybe that's okay to not do tied?
Overall, the low reconstruction loss from earlier was caused by the l1 being scaled too low, which gave dense (~600/10000 active features), near-identity dictionaries
Well, my thoughts are that we should let the net do more intelligent denoising than just cross-producting with the dict; not sure what you mean by the future MLP layers don't have tied embedding
looks like tied is getting way higher recon losses than untied 🤔 so confused about what extra computation it manages by not having them match up
yeah that's why I think we should just slap on a big net
One possible constraint for our dictionary learning is that we're learning features that the LLM is using in future layers, so we should limit ourselves to the capabilities of future layers
Also, I'm all for slapping the full net on now & running it, haha
ah, sure
I'll code it up but I worry it might be horribly unstable to train
Could skip tied for now
slapping on the full net is also equivalent to the standard dictionary learning thing where you just freely optimize the dict entries
ye, that's what I was thinking; might also be useful to have a deterministic denoiser tho? idk
still feels kinda wrong to me tho
The wrongness will show up in ablating the feature direction & logit lens. If it doesn't have a meaningful effect (like our current ones do), then it's bad on that metric.
really? how come?
because it breaks my mental model of how the network is using the features, like in my head it kinda looks like TMOS, you have features + interference, you use the negative bias to screen the interference and then you reconstruct in the same direction
if its not doing that then i guess i just dont have a good picture of what's going on
@keen pivot should be in /mnt/ssd-cluster/resid_layer_2_19_07_scaled_l1, feel free to delete the other one
I feel like this relies too heavily on TMOS being a perfect model for model internals, like more might be going on than just this
could be some additional denoising/information loss
i wonder if it would help the model if we added back the bias immediately after the sparsity penalty. like at the moment it adds the bias, RELU, and then has to reconstruct but it'll be missing an amount equal to the bias, so maybe we should just add it straight back on?
conditional on the feature being nonzero presumably
yeah could do
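One way the "add the bias back, conditional on the feature being nonzero" idea could look, as a NumPy sketch. This is an assumption about what was meant, not tested training code; `encode_with_bias_restore` is a hypothetical name.

```python
import numpy as np

def encode_with_bias_restore(x, W_enc, b):
    """Return (c, c_restored).

    c          = ReLU(x @ W_enc.T + b): what the sparsity penalty would see.
    c_restored = c - b where c > 0, else 0: undoes the bias shift for active features only.
    """
    c = np.maximum(x @ W_enc.T + b, 0.0)
    c_restored = np.where(c > 0, c - b, 0.0)
    return c, c_restored

W_enc = np.eye(2)                  # toy encoder: each feature reads one input dim
b = np.array([-0.5, -0.5])         # negative bias screens small activations
x = np.array([[2.0, 0.3]])
c, c_restored = encode_with_bias_restore(x, W_enc, b)
```

In this toy case the first feature is penalised at 1.5 but decoded at its pre-bias value 2.0, while the second stays screened out at 0.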
pretty confident that we are solving a different problem to the model tho, the model can do things lossily/in superposition, while we are looking for perfect replications or whatever
did we work out why the runs yesterday got such low recon loss btw?
I was scaling l1 wrong
thank fuck
dunno how proper ML researchers do it tbh
the feedback latency is horrifying
like how can you trust your own code to train for 8 months
slightly cropped
This also shows what I noticed earlier, which is tied having better MCS. If Aidan is right about LLM solving noisy stuff, then we may have to mostly ignore high-MCS
hmm interesting
how come?
"LLM solving noisy stuff"?
this I think
Tied embedding makes there be only 1 solution, as in one set of residual stream dims to read/write from. Untied allows many directions to read from. So maybe two different dicts learn different features if they're untied.
not sure what you mean
But if we ignore high-MCS, and just say "if the sparsity isn't insanely high or low & we get good reconstruction loss, then maybe the features themselves are good", which we can check by hand or by auto-interp
Anything specific? I can just say it in different words
Tied embedding makes the model read and write from the same direction
untied allows the model to read from multiple directions and write to only 1 direction
model= autoencoder
well, I disagree there, untied allows the model to read from one different direction, not multiple. this might be an important distinction. I think allowing it to read from one different direction is weird and slightly wrong, we should allow it to read from many instead
Let me write an example
but I understand what you're getting at
not sure how you get from that to this, however
I'm using the residual stream to define the direction, not the weights.
Ah, maybe not.
I wanted to say something like:
Because of the ReLU & negative bias, you can have multiple ratios of residual stream that activate the feature, which are multiple directions in residual stream (though the same direction in weights since those are frozen). For example, for weights (1,1) reading in from the first two dimensions of residual stream, we have:
F_1 = ReLU(w1*r1 + w2*r2 +bias)
F_1 = ReLU(r1 + r2 - 1)
Which can be positive if r1 or r2 is large or a sum of them. Though writing into the residual stream w/ the decoder is always the same direction.
Though this is true for tied embedding as well.
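The F_1 example above, written out as plain Python with the same weights (1, 1) and bias -1. The specific input values are made up for illustration.

```python
def relu(z):
    return max(z, 0.0)

def f1(r1, r2, w1=1.0, w2=1.0, bias=-1.0):
    """F_1 = ReLU(w1*r1 + w2*r2 + bias) from the message above."""
    return relu(w1 * r1 + w2 * r2 + bias)

# Three different residual-stream directions, same feature activation.
inputs = [(3.0, 0.0), (0.0, 3.0), (1.5, 1.5)]
activations = [f1(r1, r2) for r1, r2 in inputs]

# Below threshold: the negative bias screens it out.
screened = f1(0.4, 0.4)
```

All three inputs give the same activation even though they point in different residual directions, and the small input is zeroed by the bias.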
Also another counterpoint: We may just need a better l1 value for the untied model.
@bitter turtle, would we be able to run w/ a larger encoder/full net soon? I can code it in, though I didn't know what you had in mind.
Going rock climbing, can look at the models when I get back if they happen to be trained by then!
@cosmic moon Could you give a short blurb on why the linked work is related to this project channel?
I mean, it seems at least tangentially related to the general theme of this channel (searching for modularity in NNs), but not necessarily directly related to sparse coding particularly.
@bitter turtle, I'm running just a basic 2-layer encoder (first to 1/2 dictionary size, then to full) on cuda:0 on the old infra, just fyi. I didn't know how to easily change yours to add a second encoder param.
yep I just did it
I'll do a commit with the 2 layer encoder in a bit, just need to test it
(I'm doing d_act -> dict_size -> dict_size fyi, don't have any reason to prefer either other than compute)
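For reference, a forward-pass-only NumPy sketch of the "MLP encoder, linear decoder" shape being discussed, using the d_act -> dict_size -> dict_size layout. The ReLU on both layers, the unit-norm dictionary rows, and all names here are assumptions, not the code from either branch.

```python
import numpy as np

def two_layer_encoder_forward(x, W1, b1, W2, b2, D):
    """MLP encoder, linear decoder.

    x: (batch, d_act); W1: (d_act, hidden); W2: (hidden, dict_size); D: (dict_size, d_act).
    """
    h = np.maximum(x @ W1 + b1, 0.0)           # hidden layer
    c = np.maximum(h @ W2 + b2, 0.0)           # nonnegative feature activations
    D_normed = D / np.linalg.norm(D, axis=1, keepdims=True)  # unit-norm dictionary rows
    x_hat = c @ D_normed                       # decoder stays a single linear map
    return c, x_hat

rng = np.random.default_rng(0)
d_act, dict_size = 16, 64
hidden = dict_size                             # d_act -> dict_size -> dict_size variant
x = rng.normal(size=(8, d_act))
W1 = rng.normal(size=(d_act, hidden)) * 0.1
W2 = rng.normal(size=(hidden, dict_size)) * 0.1
D = rng.normal(size=(dict_size, d_act))
c, x_hat = two_layer_encoder_forward(x, W1, np.zeros(hidden), W2, np.zeros(dict_size), D)
```

Setting `hidden = dict_size // 2` instead would give the other variant (half dictionary size, then full).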
@pallid current , we ever figure out why lee is against a bias for the decoder?
ok, will continue testing after I've eaten
Ya, I think I'll do some quick tests to get a rough idea on the effect & l1 hyperparams, but your code will be much more efficient at training many dictionary sizes
An additional thing would be to try a tied embedding on a really big dictionary (32x) w/ a lot of data (or really until convergence)
@pallid current , I looked at the hyphen stuff and they're quite interpretable and separable
Slight update on the "2 layer encoder": It gets a reconstruction loss of 0.070, which is okay, but not amazing. The high-MCS is also garbage (<1%).
I'm just running the tied embedding w/ much larger dictionaries and for much more data.
For the is/are one, some are clearly separable, but like 2-3 others look similar, but I believe they're is/was that only activate in a specific distribution of text (e.g. technical, news article, etc).
I currently don't have the tools to figure that out, because I'd need to know which previous features cause these to activate.
ellena reid in my seri mats group is working on applying sparse coding to audio transcription models and pointed out that it's a bit weird in our tiedSAEs to normalize the decoder weights after they've already been applied the first time
oh what
yeah my bad
I guess this could actually make sense
like, plausibly the scaling could be different for optimal reconstruction
we should maybe look into that
@keen pivot https://wandb.ai/sparse_coding/sparse coding/runs/f49wmqnt
model with 2-layer encoder gets some good results for certain dict sizes. no it doesn't, there's an absurd sparsity-reconstruction tradeoff
also seeing loads of totally dead ones
I guess we might have issues where they just optimize for sparsity
yeah this looks like not immediately good
@pallid current latest commit should be ready for merge
Ya wanna shoot for around 20 sparsity in general. I also saw in mine horrible MCS.
For 2-layer, I got 1e-4 as a good l1, but it was still pretty bad & didn't have too great a reconstruction
yeah it might just be insanely brittle
If you wanna do the PR, then I can sync to your branch
If not I can idm
yo yeah i'll start merging now
merged the new ensemble stuff, currently rerunning some of the graphs in the post with argmax(max) instead of argmax(mean) and still tryna run autointerp on the new results lol
off for a bit but will prob work a bit later
I'm just trying to get reconstruction loss down by doing tied w/ smaller l1's to see if that helps, while still learning meaningful features.
Additionally, tomorrow I can look into the dataset of predictions that do best & worst on perplexity to see if there's a pattern.
been reading this paper that was recommended to me yesterday, v relevant: https://arxiv.org/pdf/2210.01892.pdf
notes:
- they define the capacity allocated to a feature i as $C_i = \frac{(W_i\cdot W_i)^2}{\sum_j(W_i\cdot W_j)^2}$ where $W_i$ is the weight vector in the embedding matrix for feature $i$. total capacity can be no more than (but can be less than) the total embedding dimension D
- find that across the model, capacity will be allocated at the point where the marginal value of capacity in that feature is some constant value (if the marginal value were different then you could reduce loss by reallocation). therefore you can only expect to see superposition if there are decreasing returns to capacity, which they say occurs for inputs of high sparsity or kurtosis.
- asserts a strong relationship in general between sparsity and kurtosis which i hadn't understood before (this seems quite well studied in neuroscience eg https://iopscience.iop.org/article/10.1088/0954-898X/12/3/302, should probably look further since this is 20yrs old)
- find that you get full capacity if you have a weight matrix which is semiorthogonal, meaning that $WW^T=\lambda I$, (as well as just going diagonal of course). can combine the approaches by having orthogonal subspaces - at small dimension this becomes TMOS' polytope model.
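A quick NumPy sketch of the capacity metric from the notes above, assuming the denominator is the sum of squared dot products, $\sum_j (W_i\cdot W_j)^2$, and that features are the (nonzero) rows of `W`. The function name and the toy matrices are illustrative only.

```python
import numpy as np

def capacities(W):
    """C_i = (W_i . W_i)^2 / sum_j (W_i . W_j)^2, with W_i the i-th row of W."""
    G = W @ W.T                        # Gram matrix of pairwise dot products
    sq = G ** 2
    return np.diag(sq) / sq.sum(axis=1)

# Two orthogonal features: each gets a full dimension.
W_orth = np.array([[1.0, 0.0], [0.0, 1.0]])

# Three unit vectors 120 degrees apart in 2D (a triangle arrangement).
angles = np.array([0.0, 2 * np.pi / 3, 4 * np.pi / 3])
W_tri = np.stack([np.cos(angles), np.sin(angles)], axis=1)

cap_orth = capacities(W_orth)
cap_tri = capacities(W_tri)
```

In the triangle case each feature gets capacity 2/3, and the three sum to the embedding dimension 2, so the total-capacity bound is tight there.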
Oh yeah v good paper, did you have any thoughts on how we could use this? Will check out neuroscience paper
hhhh Bristol uni doesn't provide access
I definitely need another hour or so to grok the paper, but one part I'd like to understand is kurtosis.
wiki says it's a measure of the (amount of outliers? extremity of them?).
So if we try to intentionally find directions w/ high kurtosis, then, given an activation dataset, we can find these directions by having a measure of kurtosis, then optimizing the direction, and defining loss as kurtosis?
Then repeat and add a diversity term so you find new directions.
I also don't understand sparsity & kurtosis being similar (the neuro paper says they're not, except for sampling kurtosis?). Like you can have many outliers & that shouldn't affect the frequency of them?
Update: I'm seeing better reconstruction losses for higher sparsity values for tied embedding (e.g. 0.5-0.4, where earlier we had 0.75). This is unsurprising, but I still need to see if there are meaningful features, we didn't learn the identity, & the actual perplexity difference.
Tysm
Not sure what you mean, kurtosis is a measure of heaviness of tails which is like approximately what sparsity is
For instance, on symmetric uniform distribution with additional mass on 0, kurtosis ~ sparsity³
From the wiki:
This number is related to the tails of the distribution, not its peak;[2] hence, the sometimes-seen characterization of kurtosis as "peakedness" is incorrect. For this measure, higher kurtosis corresponds to greater extremity of deviations (or outliers), and not the configuration of data near the mean.
Edit: I see this doesn't talk about your point. Could you explain how you think kurtosis relates to the shape of the distribution?
Can you define sparsity here?
I want to say frequency of the feature activating, which if we operationalize "feature activating" to mean "activates more than N std's above mean" (or something), then that makes sense to say a feature is more sparse if it has a thinner tail (and vice versa)
Frequency of feature activation, i.e. frequency that X is drawn from the uniform distribution and not 0
Also the neuro paper doesn't say this afaict?
Although these ideas are related, they are not identical, and the most common measure of lifetime sparseness - the kurtosis of the lifetime response distributions of the neurons - provides no information about population sparseness.
And sorry, I think I'm coming across as making strong claims, but I'm just confused and appreciate your help!
oh no not at all
rather than saying that 'kurtosis doesn't measure sparseness' they are saying 'kurtosis (as a measure of sparseness on this axis) doesn't measure sparseness on this other axis'
Just to ground out an example, I've plotted the activations of the residual stream & one of the dictionary features
Maybe you could automatically search for meaningful features by finding directions that match the right graph, more than the left. One measure may be kurtosis, which would be the E[normalized(x)^4], which the right graph has more than the left, right?
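A stdlib-only sketch of the E[normalized(x)^4] measure, applied to two toy samples (made up here, not the plotted activations): one where every sample is "active" and one that fires 1% of the time.

```python
import statistics

def kurtosis(xs):
    """Standardized fourth moment E[((x - mean) / std)^4], using the population std."""
    mean = statistics.fmean(xs)
    std = statistics.pstdev(xs)
    return statistics.fmean(((x - mean) / std) ** 4 for x in xs)

dense = [-1.0, 1.0] * 50               # every sample "active": two-point distribution
sparse = [0.0] * 99 + [1.0]            # fires on 1% of samples
k_dense = kurtosis(dense)
k_sparse = kurtosis(sparse)
```

The sparse sample comes out with a far larger value than the dense one, which is the direction the search idea would rely on.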
the right is basically normal right? I think it has zero kurtosis
Oh, it might be a different order on my device. The residual stream one looks normal to me, and the dictionary one doesn't
oh
basically a 'spike and slab' (i.e. sparse) distribution looks more like the red one than say a normal distribution
you can't strictly speaking draw it as a pdf because [][][][] but like 'things with their weight on the tails spread out over a larger area' have higher kurtosis I think
Okay, so kurtosis will be lower for rarer features, right?
uh
Noooo
Don't think so
Wait one sec
Let me calculate this
actually no im really confused
might have done my maths wrong
Okay, but suppose we have two feature, lolololol
If kurtosis is E[x^4], then if you have 1% of values at e.g. 100 after normalization compared w/ 0.01% of values at 100, then the expectation takes that into account and gives different values?
yes?
i think we should definitely track the total capacity and capacity per feature as metrics. and also see whether our feature matrices are block sparse- i dont know how to test this but should be easy enough
So then one will be a rarer feature (e.g. 0.01% compared w/ 1%) which will mean it has a lower kurtosis.
mb I goofed it's 1/p I get it now
1/sparsity
phee
phew
sounds great, block sparsity kind of what I was going for with the covariance stuff
I haven't read it enough to code this. Does it seem easy to integrate?
don't have a good cumulative metric
yep, simple calculation
can do it now
excess kurtosis of a bernoulli variable (x axis is the p param from 0-1 but i overwrote the xticks func sozzz):
goes to +inf at x = 0 or 1
& excess kurtosis is just regular kurtosis - 3, so that normal distribution is set to 0?
Does this relate to "rarer features will have lower kurtosis" ?
also that kurtosis isnt just E[x^4], it's the standardized moment, you do E[((x-mean)/std_dev)^4]
Correct
i think this shows that rarer feats (when viewed correctly/separated out) will have higher kurtosis
I just checked & you're right. I don't understand how the bernoulli variable graph gives intuition for that.
because as the feature likelihood goes down to 0, the kurtosis rises super fast
ok it doesnt give intuition lol but it does show it!
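The Bernoulli curve can be reproduced from the closed form for excess kurtosis, $(1 - 6p(1-p)) / (p(1-p))$. The function name and the chosen values of p are just for illustration.

```python
def bernoulli_excess_kurtosis(p):
    """Excess kurtosis of a Bernoulli(p) variable: (1 - 6p(1-p)) / (p(1-p))."""
    q = 1.0 - p
    return (1.0 - 6.0 * p * q) / (p * q)

values = {p: bernoulli_excess_kurtosis(p) for p in (0.5, 0.1, 0.01, 0.001)}
```

The minimum is -2 at p = 0.5, and the value blows up roughly like 1/p as p goes to 0 (and symmetrically as p goes to 1).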
Ah, I see. Gotcha
The only intuition I've got is that normalizing causes the effect. If you have more outliers, then the mean is shifted & std is greater, which has a large effect on the ^4 part.
that's why you standardise it
similar figure for kurtosis (not excess) vs feature density (where the density (x-axis) is the probability that the feature is nonzero and uniformly distributed over [-1, 1])
goes to inf the rarer the feature is
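For the density version, the kurtosis has a simple closed form under the setup described (X is 0 with probability 1 - p, and uniform on [-1, 1] with probability p). A short sketch with a hypothetical function name:

```python
def spike_and_slab_kurtosis(p):
    """Kurtosis of X = 0 w.p. (1 - p), Uniform[-1, 1] w.p. p.

    Mean is 0, E[X^2] = p/3 and E[X^4] = p/5, so kurtosis = (p/5) / (p/3)^2 = 9 / (5p).
    """
    second_moment = p / 3.0
    fourth_moment = p / 5.0
    return fourth_moment / second_moment ** 2

k_full = spike_and_slab_kurtosis(1.0)      # plain uniform
k_rare = spike_and_slab_kurtosis(0.01)     # fires 1% of the time
```

So kurtosis scales as 1/p in this toy setup, matching the "it's 1/p" correction above.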
I lied, we can't have exact capacity here, because we don't know feature 'magnitudes'; we can probably still expect proportional capacity to be a useful metric though (defining proportional capacity as 'capacity over normed dict' which is probably a synonym for 'amount of interference on feature i')
unsure what 'total proportional capacity' would get us, would expect that to be fairly constantly minimal
can't we just treat our predicted feature activations as the true magnitude of the true features?
and once we do that our set up becomes basically identical to that in the paper?
not exactly sure. In the paper there is at least some idea of the range of activations of features, and it's kind of uniform across features, while with ours we see ridiculous variance in activation
is that a problem tho? to me it's just an interesting part of our findings (didnt know btw!) because if there are different scales then presumably that creates more interference for any direction which has some degree of cosine sim
i think the capacity paper predicts that those directions would have more of a full dimension to themselves
Hmm, yeah
I think that by default the absurd activation dimensions (the ones with like 1k max activation) have a full dimension to themselves (1k/(1k + n*epsilon) is basically just 1) even without the predictions of the paper
Definitely can look at it tho
we shall see
How are we measuring this anyway, I feel like mean activation is a decent idea since then it becomes an approximation for expected interference?
We could just directly measure Expected interference
Let $c_{k,i}$ be the value taken by feature $i$ on batch $k$. Then,
$C_i = \frac{1}{K} \sum_k \frac{c_{k,i}}{\sum_j (c_{k,j} * (W_i \cdot W_j)^2)}$
I feel like expected interference is kind of what the fractional dimensionality thing is measuring in their paper.
Or maybe just over the cases where it's nonzero
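A NumPy sketch of the expected-interference formula above, with an option for the "only over the cases where it's nonzero" variant. It assumes nonnegative activations `c` of shape (K, n_features) and unit-norm dictionary rows `W`; the function name, the flag, and the handling of zero denominators are assumptions.

```python
import numpy as np

def expected_interference(c, W, only_when_active=False):
    """C_i = mean_k [ c_{k,i} / sum_j c_{k,j} (W_i . W_j)^2 ].

    c: (K, n_features) nonnegative float activations; W: (n_features, d) dictionary rows.
    Samples where the denominator is 0 contribute 0.
    """
    G2 = (W @ W.T) ** 2                        # squared pairwise dot products
    denom = c @ G2.T                           # denom[k, i] = sum_j c[k, j] * G2[i, j]
    ratio = np.divide(c, denom, out=np.zeros_like(c), where=denom > 0)
    if only_when_active:
        active = c > 0
        counts = np.maximum(active.sum(axis=0), 1)   # avoid 0/0 for dead features
        return ratio.sum(axis=0) / counts
    return ratio.mean(axis=0)

s = 2 ** -0.5
W = np.array([[1.0, 0.0], [0.0, 1.0], [s, s]])
c = np.array([[1.0, 0.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 0.0, 0.0]])
ei_all = expected_interference(c, W)
ei_active = expected_interference(c, W, only_when_active=True)
```

Averaging over all samples drags the value toward 0 for rare features, whereas the active-only version stays near 1 when a feature fires without overlapping neighbours.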
isn't this identical to the paper if you take c_i to be the size of the incoming weight instead of activation
and adding dot products and activations seems wrong
Wdym 'incoming weight'
I'm multiplying?
like i think they're basically measuring expected interference given that the input is uniform or something?? and then normalizing to express it as fractions of a dimensions
Oh yeah sure agree, that's also what this is
yeah true haha
\times was too many characters
Yeah ok want to measure the 'frac when feature is nonzero' thing then probably
so like have some empirical interference measure based on the activations which we can use as a complement to the weight based ones?
that makes sense. im gonna go off discord for a bit cos i havent done focused work properly in a couple days, back this eve
since our dict is normed I don't see how we can do a purely weight based one
Yo, I run a mechanistic interpretability reading group on another server. The topic for next wednesday has been chosen: dictionary learning!
We read papers and posts related to given topic each week and occasionally invite the authors of the work we went through to join for a Q&A sorta discussion
I invited @keen pivot as a guest speaker, but there's others in here e.g. @pallid current (and possibly more that im missing) that are also heavily involved with the project. Wanna join too?
What time?
It's tentatively 1pm EST (10am PDT) by default, then we allow a fudge factor for flexibility on the guest speakers part
Logan said he was free at that time, not sure about you guys though
OK nice yeah I can say hi at that time tho I'm sure Logan's mostly got it covered
Immediately set off a slow run right after posting this lmao
Pahaha
Which server?
Nvm got it 😄
Would definitely be interested in coming to the Q+A, to get a feel of other people's takes on this direction
does anyone know of a 'one-sided' kurtosis for asymmetric distributions? tempted just to use E[X^3]/variance or E[X^4]/variance
https://wandb.ai/sparse_coding/sparse coding/runs/1w4vlec6 <- run tracking expected interference and 'asymmetric skew' (E[X^3]/sd^3) (random metric I pulled out of a hat)
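A literal stdlib sketch of the 'asymmetric skew' metric as written, E[X^3]/sd^3, using the raw (uncentered) third moment. Whether the tracked metric centers X first is not stated, so that part is an assumption, and the sample data is made up.

```python
import statistics

def asymmetric_skew(xs):
    """E[X^3] / sd^3: raw third moment over the population sd cubed."""
    third_moment = statistics.fmean(x ** 3 for x in xs)
    sd = statistics.pstdev(xs)
    return third_moment / sd ** 3

symmetric = [-2.0, -1.0, 1.0, 2.0]
one_sided = [0.0, 0.0, 0.0, 4.0]           # ReLU-like: mostly zero, occasionally large
skew_sym = asymmetric_skew(symmetric)
skew_one = asymmetric_skew(one_sided)
```

The symmetric sample scores 0 and the one-sided sample scores clearly positive, which is roughly what a 'one-sided' tail measure should do.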
hmm, I wonder if I have botched expected interference, these seem low
should be approaching 1
currently running a small auto_interp cycle (40 feats) on all 12 of the non-tied final epoch dicts from tuesday's run
awesome
some obvious differences in how often some of them just dont have enough nonzero activations. just from eyeballing it looks like there might be some differences in average score but wont really know at all till i run the graphs
will prob need more than 40 to have any confidence tbh
I'd be very interested to know how the results change when you do autointerp on the decoder directions, assumed you were doing that already
yeah it just never crossed my mind, my bad, will run it once the current exp is done. i should prob also look into whether i can parallelize the gpt4 api calls, it's starting to be a big bottleneck
Ok sweet! I'll make an announcement on the server and send hoagy an invite
Logan also requested I send a google calendar invite, if either of you want one of those too you can dm me your email and I'll figure that out
results from autointerping 12 nontied 2-dict-ratio dicts:
Can't remember the numbers, those look better?
general takeaways:
- l1=0.01 is a nono, dead feats everywhere,
- l1=0.003 looks noticeably worse
- 0.001 and 0.003 look basically the same.
- cant see any obvious effect of the l2 bias, might be worth setting it higher in some runs just to see if that does anything
- performance looks roughly similar to the original experiment (far left). differences are that these use tied and only a 2x ratio (4x for the original residual stream exps)
- seemingly doing a bit better on top/top-and-random but a bit worse on random
Ok cool! can do a more fine-grained search tomorrow if that'd be useful
potentially, though there's a huge amount of stuff still to check: larger dict sizes, how it evolves through the epochs, tied ones
yep yep
How should I think about perf on top Vs top-and-random Vs random?
do you have any thoughts on that?
i think random is ultimately the true measure. like if we were able to filter out noise perfectly, and only detect cases where a particular feature was truly active, and then specify its conditions perfectly then we could theoretically get perfect scores on random, and that's the highest bar
openai discuss it at: https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html#sec-algorithm-details, interesting comment that i didnt notice before but agree with is "A more principled approach which gets the best of both worlds [top-and-random vs random] might be to stick to random-only scoring, but increase the number of random-only text excerpts in combination with using importance sampling as a variance reduction strategy."
ran this, no noticeable difference, maaaybe a tiny bit higher, will do it by default from now on
I think we should try lower values then, like 8e-4 and lower. I expect 1e-4 to 1e-5 to be identity
But lower l1 is better reconstruction, so it’d be good to see the regression to the neuron/residual basis, and just pick the one before it drops off.
Does that make sense?
i see what you mean but i dont agree.. i guess i think the thing that we're trying to do is find good dictionaries, and i think the quality of the features is probably still going up at that point, even though it'll take a bit of a hit to reconstruction_loss
i agree we should do some more finegrained tests tho, i'd say restrict to 5e-4 - 5e-3 in future
btw am suddenly getting torch.cuda.is_available() == False???
might have to do a big save and restart the node
still an issue, deleting a couple of old backups and restarting, backing up everything currently on the server onto /mnt/ssd-cluster, space on the ssd-cluster will be kinda tight after this, have pinged mr waifu for more
I agree this is a possibility, but it’d be good to verify with auto-interp.
Those l1 values sound good!
Okay, I've gotten a little better reconstruction loss (.075->.069) by just training on a larger batch size (256->2048); possibly explained by just training on more data, but I haven't checked.
Additionally, we have much lower reconstruction losses for lower l1-values; the far left one does have more MCS > 0.9, but all 3 have decent looking distributions overall.
I can look into the low-reconstruct dictionaries specifically to ensure no identities were learned & sample a few features for interp sake. We can also do auto-interp.
Yep, it looks like quite meaningful features. Here's the 2000'th highest-MCS one (MCS=0.85)
@pallid current how many tokens/datapoints are you using to do top-random sampling for hypothesis generation?
I want to communicate my surprise that GPT-4 wasn't able to understand the hyphens had numbers before them when I reviewed its hypotheses yesterday.
maybe we should look at linear VAEs; we still get 'good dictionaries' but can explicitly ignore a degree of 'noise' from less important features
One problem here is that perplexity does go up pretty significantly when replacing w/ our current dictionaries (e.g. 25->100, though I should also check perplexity on some other dataset than pile-10k). So there are important features (or something) that we're missing when training lower-sparsity/higher l1 models.
well, I wouldn't expect a transformer to be entirely describable via sparse coding anyway; NNs implement a bunch of different algos in different ways many of which don't involve sparse features
Oh, one of them may be the outlier dimensions, though the model does seem to capture those dimensions quite easily.
Imo its better to have fewer high-confidence, really well understood features corresponding to high-importance concepts than to have some bad sparse decomposition of the entire thing
Agreed, but knowing which ones are better can be done empirically.
Not sure what you mean; I'm more thinking along the lines of models having subcircuits that are not using sparse features at all, and instead use something like the modular arithmetic circuit or something
Ah, but I worry that using reconstruction loss is negatively impacting our learning of high-importance features
Ya, I did want an example actually. How would that circuit not be described by sparse codes?
well, like in the paper
That's an understandable concern. What would be the metric that convinces you one way or another?
I think you could 'describe' it with sparse codes but they would be bad and not the best fitting description
which is kind of what we aim for
Would that be like a piece-wise linear approximation?
I think we should do more counterfactual testing with high-MCS features
I can get on that ig
Lol
Not sure what you mean
What does counterfactual testing mean here?
Like causal scrubbing-type-things. Ablations etc.
Like approximating x^2 w/ several piece-wise linear functions mostly describes it, but is inefficient and not exact.
Is it just like my post or anything else different?
Oh, right. Yeah, so you can theoretically probably describe anything with arbitrarily sparse codes but you need lots of them and it would be a really complex and bad encoding so kinda similar yeh
Not sure can't really remember will check again
I don't really trust autointerp
I do like this train of thought. I can see two tasks:
- Verify features found by various sparsities (from 5 features/token to 500). This will eventually become the identity.
- Given the best dictionary from (1), we can find the greatest perplexity-diff between the original & reconstructed models. (ie run perplexity test on original, then run on reconstructed. Find the datapoints w/ the greatest differences) Those diff points may point towards functions in the model that aren't best represented by sparse codes.
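The second task could be sketched like this, assuming we already have a per-datapoint mean token loss for the original model and for the model with the dictionary reconstruction swapped in. The function name and the numbers are hypothetical, purely for illustration.

```python
def largest_loss_gaps(original_losses, reconstructed_losses, top_k=3):
    """Indices of datapoints where swapping in the dictionary reconstruction hurts most.

    Both inputs are per-datapoint mean token losses (log-perplexities), same order.
    Returns (index, gap) pairs sorted by gap, largest first.
    """
    gaps = [r - o for o, r in zip(original_losses, reconstructed_losses)]
    ranked = sorted(enumerate(gaps), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]

# Made-up numbers purely for illustration.
orig = [3.1, 2.9, 3.4, 3.0]
recon = [3.3, 4.5, 3.5, 3.9]
worst = largest_loss_gaps(orig, recon, top_k=2)
```

The returned indices are the datapoints worth reading by hand for patterns the sparse codes might be missing.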
I like how autointerp is able to tell the difference between the basis & the dictionary, so I also expect it to find when the dictionary becomes the basis (when it learns the identity). I do agree that finer-grained measures (ie is this dict better than that one) are uncertain.
Would be awesome to have someone else looking at features by hand & figuring out ways to maybe auto-detect polysemanticity?
yep will try to, haven't done this kind of interp b4, will bug you for help in dms if i end up needing it
not sure polysemanticity is a meaningful enough term to do anything better than throwing gpt-4 at it tho
@bitter turtle desired wandb metrics:
- Features/tokens (number of non-zero activations per token on average)
- MMCS
- Full histogram of MCS
2 & 3 would need to be done every few batches to compare dictionaries learned at different sizes. This may be a headache to do syncing, which if so, maybe just at the end of training.
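A minimal sketch of metrics 2 and 3, assuming MCS means "for each feature in one dictionary, the max cosine similarity to any feature in the other" and MMCS is the mean of those values. It is written in NumPy rather than the training code's PyTorch, and the names `mcs`, `mmcs`, `small` and `large` are illustrative only:

```python
import numpy as np

def mcs(dict_a, dict_b):
    """For each feature (row) of dict_a, cosine similarity to its best match in dict_b."""
    a = dict_a / np.linalg.norm(dict_a, axis=1, keepdims=True)
    b = dict_b / np.linalg.norm(dict_b, axis=1, keepdims=True)
    # Clip so floating-point error cannot push values outside [-1, 1].
    return np.clip(a @ b.T, -1.0, 1.0).max(axis=1)

def mmcs(dict_a, dict_b):
    """Mean of the per-feature max cosine similarities."""
    return float(mcs(dict_a, dict_b).mean())

# Toy dictionaries: the larger one contains the first 4 features of the smaller one.
rng = np.random.default_rng(0)
small = rng.normal(size=(8, 16))
large = np.concatenate([small[:4], rng.normal(size=(12, 16))])

scores = mcs(small, large)
hist, edges = np.histogram(scores, bins=10, range=(-1.0, 1.0))  # "full histogram of MCS"
```

The shared features come out with MCS of about 1 and the rest land lower, which is the separation the histogram is meant to show.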
Could do it every chunk (~2M activations, 2k batches) without much hassle
Otherwise yeah
Not sure what you mean by 1
dict_levels.detach().count_nonzero(dim=1).float().mean().item()
dict_levels is the latent dimension (ie feature activations)
Basically how many features activate for a given token: for the sentence " The cow", it may activate 3 features:
- animals
- words that start w/ "c"
- words that come after " the"
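For reference, a NumPy version of the same features/token count on a tiny hand-made example. The array `codes` is made up for illustration, with one row per token and one column per dictionary feature:

```python
import numpy as np

def features_per_token(codes):
    """codes: (n_tokens, n_features) feature activations. Mean count of non-zero features per token."""
    return float(np.count_nonzero(codes, axis=1).mean())

codes = np.array([
    [0.0, 1.2, 0.0, 0.4],  # 2 active features
    [0.0, 0.0, 0.0, 0.0],  # 0 active features
    [0.3, 0.9, 2.0, 0.0],  # 3 active features
])
avg_active = features_per_token(codes)  # (2 + 0 + 3) / 3
```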
Thought: we could do sparse coding on the activations, except for the outlier dimensions.
I do this already
Good idea
Oh ya, sorry I forgot it!
@pallid current haven't used your autointerp stuff yet tbh, is there a function I can just call to get a score out for doing MCS-to-autointerp-score correlation testing
I'm off to bed, if you want to do autointerp on a bunch of the same dict I trained 16x iters for dict ratios 2, 4, 8 for l1=1e-2 on resid stream in multiple_iters_mcs_21_07 @pallid current, otherwise I can try and figure it out tomorrow
yo, no not like a single function, you can get like 100 feats with python interpret.py and then in ae_utils.py there's functions for getting the score data into lists etc
need to refactor interpret.py to make stuff like that easier, maybe over the weekend
hmm i think l1=1e-2 might be too high to see much, i think like 90% features are mostly dead at that level
Gah
Which one do you think is best
I was just looking at the w&b logs but don't track dead neurons atm
1e-3 seems a very safe bet
here's what im basing it off
Mb forgot that existed 🤦
here's my writeup from the meeting, focusing on potential tests and metrics:
- suggests comparing the strength of the ablation effect with that found with neuron basis (i think there might have been more to this but i didn't catch it)
- should we compare perplexity to perplexity from replacing with non-sparse coding reconstruction with equivalent reconstruction loss, to see if we're capturing more important directions of variation than we would otherwise expect?
- similarly, do we see surprisingly low reconstruction costs if we only restrict to 1 or 2 layers downstream?
- can we check whether we're fragile to small quantities of nonsparse data?
- can we express reconstruction loss as a proportion of the total variance?
- can we see some relationship between the MLP and residual stream where features that are detected by the MLP are then more visible in the residual stream than they were?
- can we find examples where the directions we find match up to directions found by e.g. sparse probing (or maybe simple linear combos of a few directions etc)
- can we find good feature candidates from our dictionaries by selecting for e.g. variance explained, size/frequency of activation etc?
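On the "reconstruction loss as a proportion of the total variance" bullet, one possible definition is the fraction of variance unexplained: squared reconstruction error divided by squared deviation from the per-dimension mean. This is a sketch of that definition, not necessarily the one used in the plots, and `fraction_variance_unexplained` is a made-up name:

```python
import numpy as np

def fraction_variance_unexplained(x, x_hat):
    """x, x_hat: (n_samples, d) activations and their reconstructions."""
    residual = ((x - x_hat) ** 2).sum()
    total = ((x - x.mean(axis=0, keepdims=True)) ** 2).sum()
    return float(residual / total)

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 4))

fvu_perfect = fraction_variance_unexplained(x, x)  # perfect reconstruction
fvu_mean_only = fraction_variance_unexplained(
    x, np.broadcast_to(x.mean(axis=0), x.shape)    # always predicting the mean
)
```

Under this definition 0 means perfect reconstruction and 1 means no better than predicting the mean activation, so runs with different activation scales sit on a comparable axis.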
ok i've implemented some standard 'metrics' (MMCS, sparsity hists) on my fork using the shared interface, probably needs some tlc to get the plots looking nice, but i have it integrated with the ensembled training code
Will also refactor big run to be less cursed/easier to switch to running different experiments if that's something that'd be useful to people
Otherwise/after that I'll get on to this list
cheers all looks super good. just tried to run the big_sweep and had a few errors, likely due to merge so have fixed a few lil things on my repo
yeah i think it'd be worth taking a slight hit to efficiency to make it easier to customise runs
Ok, so this is a weird thing i've noticed. Ensembling is less of a pain when you use a functional interface but it seems that people generally prefer OO interfaces for testing etc, and so I'm ending up writing a lot of weird boilerplate to convert between the two.
Normally PyTorch hides all this nonsense but PyTorch also doesn't cooperate well with multiprocessing so I have to be kind of bespoke.
@pallid current are you using python 3.8 or 3.10 on the pod?
3.9
yeh im just getting some inconsistencies between dependencies etc between our branches like you noticed
why did you import GPT2Tokenizer
or did you remove that
from transformer_lens
do you want to do another pip freeze
yeah i think i somehow made a mistake in automerging the branches, seemed like 0 conflicts but i lost a couple of changes along the way
and yeah in your branch that wasn't being imported but at first i imported it from t_lens instead of transformers
afaict not needed, i get errors importing it but w/e
@keen pivot did you ever get weirdness where wandb images weren't uploading like half the time
oh it just takes ages for some reason nvm ill downsize them
anyway IMO it's pretty easy to configure for different runs now
kind of tutorial for configuration: https://github.com/Baidicoot/sparse_coding/blob/main/big_sweep_experiments.py
what do you mean by point 6? Like, translating MLP directions to residual stream space and see if they match up?
i think what i had in mind was like if we have 10 labels for things that an MLP neuron is doing, we would hopefully see that this feature had more of a coherent direction in the residual stream after the MLP has written back to it
and do it by probing for this concept using a synthetic dataset and measure the AUROC and degree of separation
as a side point, maybe we should clarify our language for the paper about features. I know nora for instance uses 'concept' to refer to the high-level human semantically meaningful thing, and 'feature' to mean 'direction in space corresponding to a concept'
(i was momentarily confused by the above)
labels meaning what? natural language description?
also neuron or dictionary direction?
also not sure how this relates to sparse coding, no step of that seems to relate to learned features, i've probably misunderstood you (unless you mean dict direction by neuron? in which case couldn't we just determine that by applying the MLP-post-activation-space -> residual stream and checking their similarity? I guess AUROC is better though, but then I don't see what the counterfactual is)
natural language descriptions of learned features
ok but what's the counterfactual? do we ablate those directions in the MLP post-activation?
i guess i'm not sure what role the counterfactual is playing? like in my head if we show that the direction we've found in an MLP, which seems to correspond to a human concept, also lines up with that human concept being more extractable in the model, then it's extra evidence that we're understanding what computation the model is doing in that layer
i suppose our test could be more powerful by comparing to a stronger baseline than no increase in concept-extractability
ok sure
right i get what you mean here
yeah that would also be a v good idea
i slightly misunderstood you i think
I guess like we could also do a regression of the activations of our learned features in the MLP activations and compare that regression's performance to the linear classifier on the residual stream, or something similar, otherwise im still confused as to what you have in mind
ok test i have in mind is:
- pick a concept which we think represents one of our learned features in an MLP
- use gpt-n to create a synthetic dataset of whether that concept is on
- run a linear classifier on the residual stream before and after the MLP for predicting labels of the synthetic dataset
- check whether there is a jump in performance after the MLP
- check whether this jump goes away if we ablate the learned direction in the MLP
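A toy sketch of the test above on purely synthetic arrays. Everything here is a stand-in: the "probe" is just the difference of class means rather than a trained classifier, the ablation projects the direction out of a fake MLP output, and names like `resid_pre`, `mlp_out` and `feature_dir` are hypothetical:

```python
import numpy as np

def auroc(scores, labels):
    """P(random positive scores above random negative); ties count half."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

def fit_probe(acts, labels):
    """Difference-of-class-means direction: a very simple stand-in linear probe."""
    return acts[labels == 1].mean(axis=0) - acts[labels == 0].mean(axis=0)

def ablate_direction(acts, direction):
    """Project the given direction out of every activation vector."""
    unit = direction / np.linalg.norm(direction)
    return acts - np.outer(acts @ unit, unit)

def probe_auroc(acts, labels, n_train):
    """Fit the probe on the first n_train rows, score it on the rest."""
    probe = fit_probe(acts[:n_train], labels[:n_train])
    return auroc(acts[n_train:] @ probe, labels[n_train:])

# Synthetic stand-ins for real activations (illustration only).
rng = np.random.default_rng(0)
n, d = 400, 32
labels = rng.integers(0, 2, size=n)       # "is the concept on?"
feature_dir = rng.normal(size=d)          # stand-in for a learned dictionary direction
feature_dir /= np.linalg.norm(feature_dir)
resid_pre = rng.normal(size=(n, d))       # residual stream before the MLP
mlp_out = 4.0 * labels[:, None] * feature_dir + 0.1 * rng.normal(size=(n, d))

resid_post = resid_pre + mlp_out
resid_post_ablated = resid_pre + ablate_direction(mlp_out, feature_dir)

auc_pre = probe_auroc(resid_pre, labels, n_train=200)
auc_post = probe_auroc(resid_post, labels, n_train=200)
auc_ablated = probe_auroc(resid_post_ablated, labels, n_train=200)
```

In this toy the concept is only written along `feature_dir`, so the jump from `auc_pre` to `auc_post` disappears in `auc_ablated`. On a real model the interesting question is how much of the jump survives the ablation.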
still not entirely sure what 'checking if there is a jump in performance after the MLP' gets us, but I'll get to implementing that
other than the synthetic dataset generation
also we can probably ask #1102791430549803049 or some other people about what classifiers/metrics/statistical methods are good to use if we want to enpaperify this
Was Pierre's initialisation stuff particularly important or not really?
got this done other than datagen which is the actually hard part
It might be useful to plot explained variance vs achieved sparsity maybe
damn, crushing it! will be back on it on monday properly but will have a look now
i've never really understood the case where convergence speed was the constraint which led him to do the initialization stuff, and anyway i understood that it mostly helped the first part of training rather than getting the last bits of performance so i think it's unlikely to be useful. i think in certain toy models it was really slow but in real models it hasn't seemed to be necessary as we see with like good results from models trained in like 15 min. i think it'd be good at some point to run some like 100, 1000 epoch models and check whether there's increased performance tho
made a few little changes to interpret and the save code and it now seems to be able to interpret using the new arch
tho i think that it would be best if the outputs were saved in individual folders by default, just makes the processing a little bit easier
will hopefully finally get to running more tests on the sweep tomorrow morn
Each model in own file, wdym?
maybe interesting plot
This is for 32 L1 coefs * 3 dict ratios * 2 repetitions for each
On residual stream
This may have already been done before idk
but the implications feel pretty important
same thing with pythia-160m layer 7 (wondering if there would be a meaningful difference between hopefully-quite-different parts of the model)
oh damn that's really interesting
especially surprising that there's little obvious benefit to the larger dicts
how long were they trained?
yeah I feel like this is a decent summary stat for comparing different approaches; not sure what literature there currently is on this tradeoff
for sure
I really want to compare to synthetic data now, and synthetic data with some noise
is that like 2m activations?? should be a solid amount but would be interested to see if more changes anything
yeah about that i think
lemmie check
about 1.5m
I mean it seems to converge pretty fast
interesting that this one seems to have some benefit to the 8x ratio whereas the first shows literally none
yeah not sure if that's that significant
more testing with absurd dictionary sizes is probably called for
wonder what happens as you take the size down
hmm
why
I guess we might be able to better predict stuff if we can see
how the frontier changes with dict size and extrapolate?
to see if there's a clear point at which the marginal dict element stops adding value
yeah i'm imagining that each run (which should have its own folder as well cos it overwrites by default) gets saved as a folder which has the model in and then can also contain any additional data about that run, + autointerp
tho i do see why for like the l1sweep you just did it's easy to just keep them together, like its a question of whether we do more work on them as separate entities or as part of a larger collection
I feel like it's not that much of an issue to save it as one big file, but I have moved the hardcoded config out of the sweep function (including output folder)
not sure it's that much of an issue, might be nice to store metrics with the models, but we can still just save it as a big list-of-tuples
seeing one gpu being weird, not sure what pytorch internals take up compute wise but hmm
true for most metrics. i don't think it's a good file structure for autointerp which creates a massive dataframe for each dict and then a txt file for each neuron. obv the outputs could be put into one file structure but i wouldn't want to have a single file for the dataframes
could just put a filename/folder name/whatever
with the model
also, yikes yeah
that made me grimace
should pbbly do tests with different underlying dataset sizes to see the relationship but it seems pretty unchanging
fadedness is l1
because it looks nice
kinda hard to distinguish
but idc
Probably going to implement dead neuron resuscitation tomorrow and see if that changes anything
(like, resurrection for the first 5 chunks or something)
Hmm
Idk someone else theorise
which model/layer?
looking at the same data you had aidan i'm getting the sense that it probably hasnt converged at 10 epochs in terms of maximum sparsity for a given level of unexplained variance
or at least there's a jump from like epoch6 to 10
weird plot but in those lines of dots what you can see is the sparsity / unexplained variance tradeoff getting better at each epoch
might be because we're repeating data tho, i wanna try this setup but with fresh data
Looks like interesting stuff here, but I'm currently out of commission due to dental work today. Will hopefully be able to catch up/respond tomorrow. Will be meeting w/ Daniel M. today to talk about outlier dimensions.
Pythia 160m layer 7
10 epochs or 10 chunks? - but cool! We should start using the actual pile instead of 10k.
10 (actually 11) epochs over the pile10k i think, this is just repackaging the data from your runs
Oh, I like this metric. Only problem not included is the dictionary learning the identity (or unitary matrix) not showing up. But maybe that requires more data. I still need to do my part of showing a dictionary learning it at what sparsity & data amount
Wait, there are 11 chunks in that data, still not sure what you mean
If it's just the models from my run that's 11 chunks cool ok 👌
oh right is that one pile10k epoch?
Not sure what you mean; any unitary matrix is pretty nonsparse I think
Yep
ok that's encouraging cos it looks like there's a fair way to go in terms of performance if we crank up the data
it would be brilliant if we could consistently associate a point on the sparsity/explained variance space with a level of interpretability
I have past examples of above threshold-MCS across multiple l1 values and it’s like a U-shape. I interpret the top-left part of the U as a good disentangling of features (what we want) and the top-right as the identity
Also for sure; linear regression time
If you want to run autointerp on all-or-some-of-those that would be Cool
Ah, but unitary matrices show up on the high-sparsity-number tail on this plot
i plan to but obv there's a load of them so i need to be thoughtful about how to distribute interpretation
Yes I believe this. And I expect interp to negatively correlate with unitary-ness.
i guess if we're doing regressions over it, it doesn't matter that much if the individual measurements for a dict are noisy
Is there a way to measure unitaryness of our encoder matrix?
I don't think we need to.
I don't see anything particularly special about unitary matrices vs other non-sparse dictionaries
Like, sure, it's probably more optimal to learn unitary matrices at lower L1 values but that's kind of just emergent. Like I don't see the causality going from unitary -> uninterpretable directions, it's more low sparsity requirements -> entangled features and low sparsity requirements -> unitary matrices.
More like I see the model diverge to a different solution at higher sparsity in a more abrupt way, and I think we can measure that. Maybe that’s unitary, but it’s good to check that abruptness
oh, that's just as the sparsity approaches the number of dictionary atoms so it changes behaviour; it probably is learning a kind-of-unitary matrix then, or something similar
i dont think we can meaningfully talk about unitary matrices when the dictionary is non-square, i.e. when the number of features exceeds the activation dimension
and we know from the fact that the sparsity reaches those v large levels that its not just a unitary matrix + a load of empty rows
Could still plausibly be some rotation-ish-type-thing of that but I don't think it's a particularly good line of inquiry
rotation of zero would be zero, but yeah i'm also not that interested (edit haha fair)
how're you getting that from recent results?
Thinking about how we can get max performance out of this; I think it's probably not that important
been working on getting the synthetic dataset for ablations done, havent had that much time today but getting there, off for a quick bday drink, will finish in the morn 🙂
@pallid current all dict sizes have similar numbers of neurons that fire at least once every 10k samples at a given sparsity level; this probably makes comparing autointerp between different dict sizes easier, but is also probably something we definitely don't want to be happening
one sec let me loglog this for readability maybe
uh this is slightly weird
I feel like my data is noisy ill up it to 100k
okaaaay
that... did not change the y scaling at all, the higher sparsity dicts are just learning the same/lower numbers of features/zeroing more features
I feel like we may see some changes in these plots comparing tied & norm vs not
I'm going to write some simple code to find outlier dimensions so we can ignore them, though I may need @bitter turtle 's help for integration (because I'd like to see variance explained vs sparsity).
Current thought:
- find outlier dimensions
- change mlp_width by # of outlier dims
- Index by outlier dimensions when running data through
For variance/sparsity: the outlier dimensions will be 100% explained but account for a permanent +1 in sparsity for every outlier dim.
If this does improve things, we'd need to compare w/ leaving out random dimensions as opposed to outlier ones
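A rough sketch of the first and third steps plus the random-dimension control, assuming "outlier" means the dimensions with the largest mean absolute activation. The helper names and the toy activations are made up:

```python
import numpy as np

def find_outlier_dims(acts, k):
    """Indices of the k dimensions with the largest mean absolute activation."""
    magnitude = np.abs(acts).mean(axis=0)
    return np.sort(np.argsort(magnitude)[-k:])

def drop_dims(acts, dims):
    """Remove the given columns; also return the indices that were kept."""
    keep = np.setdiff1d(np.arange(acts.shape[1]), dims)
    return acts[:, keep], keep

# Toy activations with two planted outlier dimensions (3 and 11).
rng = np.random.default_rng(0)
acts = rng.normal(size=(100, 16))
acts[:, 3] *= 50.0
acts[:, 11] += 30.0

outliers = find_outlier_dims(acts, k=2)
reduced, kept = drop_dims(acts, outliers)  # train the dictionary on `reduced`

# Control: drop the same number of randomly chosen dimensions instead.
random_dims = np.sort(rng.choice(acts.shape[1], size=2, replace=False))
control, _ = drop_dims(acts, random_dims)
```

The dictionary's input width would shrink by `len(outliers)` in both the real and the control run, which keeps the comparison fair.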
For Pythia_1.4b, there are several dictionary features that activate a large fraction of the time. Percentages of non-zero activations:
tensor([0.5632, 0.4849, 0.4426, 0.2947, 0.2258, 0.2044, 0.1533, 0.1482, 0.1231, 0.1150])
So the first two activate half the time.
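For anyone wanting to reproduce this kind of number, a small sketch of per-feature firing frequency, which also gives a dead-feature fraction for free. The `codes` array is a made-up example with one row per token:

```python
import numpy as np

def firing_frequency(codes):
    """codes: (n_tokens, n_features). Fraction of tokens on which each feature is non-zero."""
    return (codes != 0).mean(axis=0)

codes = np.array([
    [1.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],
    [0.0, 0.0, 3.0],
    [4.0, 0.0, 0.0],
])
freq = firing_frequency(codes)              # feature 0 fires on 3/4 tokens, feature 2 on 1/4
top = np.sort(freq)[::-1]                   # most frequently active features first
dead_fraction = float((freq == 0).mean())   # features that never fire on this sample
```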
Replicating some "ablating outlier dimensions affects model performance a lot relative to other dimensions", I ablate the top-10 outlier dimensions (ie the residual stream dimensions w/ the highest activations) both on their own & cumulatively. The majority of perplexity diff is caused by ablating the first two dimensions.
Notably, ablating the first two dimensions together causes worse performance than ablating each individually, meaning they have overlapping mechanisms in the model (which mechanism, who knows).
What's important for dictionary learning is I can just keep the first 2 outlier dimensions (& hope those dimensions don't also do feature representation), do dictionary learning on the rest of the dimensions.
We can also see what happens if we do dictionary learning on other sets of outlier dims (e.g. just the top outlier, top-5, etc)
@pallid current, do you have this graph/sweep in your repo?
it's in sparse_coding_aidan/hoagy_outs, they're the ones with _t on the end
script is same folder, frequency_plot_h.py
Sorry, I meant the code to run it. I want to train dictionaries on the residual stream except for 2 outlier dimensions. I want that graph to compare against baseline.
Ah, I think it's frequency_plot.py
^
@keen pivot I plan to look into ways of working around nonsparse data more generally, interested to see what you find with this!
Nothing too great atm! Looks about normal at first glance.
This is Pythia-70m, but need to check against normal runs
Yeah wasn't expecting it to be too significant; could you put the normal runs when you do them?
You ever get a:
UserWarning: There is a performance drop because we have not yet implemented the batching rule for aten::addcmul_.
error?
& how long does this run take?
not an error, just a vmap warning, you can suppress warnings with python -W ignore filename.py, also about 15m maybe?
The "remove outlier dimensions" one is surprisingly much worse. Maybe a coding error on my part.
Are you still considering those dimensions when calculating unexplained variance?
like, how/where are you ablating them?
I'm not considering them when calculating
I just remove those dimensions from the batch every time.
I'm unsure how much adding back in the outlier dimensions for unexp var. will improve, but the results look a bit dramatic
how are you removing those dimensions?
like, projecting the dimension to zero?
Could you send the code?
also @pallid current @keen pivot are you guys currently using sparse_coding_aidan for work? It would make sense to have your own (stable) copies of them if so
Agreed! Is this pushed anywhere?
sample = dataset[sample_idxs].to(torch.float32)
indices = torch.tensor([i for i in range(sample.shape[1]) if i not in outlier_dimensions])
sample = torch.index_select(sample, 1, indices)
as of just now yes
uhh
are you editing the models as well?
would be easier/make more sense just to zero them
probably equivalent
They're near equivalent. The only difference is the shape of training across time (the left one is outliers out, and right is original)
Yep, I edited the model as well.
oh, slight improvement though!
Ya, may be entirely explained by just getting the outlier features for free.
for free?
Because I just added the outlier features back in, so it's just adding in an extra 2 0-dimensions which affects the .mean()
doesn't look like enough
my hypothesis here is that the model dedicates a couple-or-more features to learning these outliers slightly noisily, and loses performance on them. more interested in training where you train the dict on not those directions
Ah, well the performance gain isn't great. Of course you'd expect it to perform better initially because the "remove outlier" one perfectly reconstructs the outlier dimensions, while the other one is still learning to represent them.
Is this what you meant by enough? or enough on what metric?
to cause the change just by adding in the 0s
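Back-of-envelope version of the "just adding in the 0s" effect, assuming the metric is a plain mean of squared error over dimensions and a residual width of 512 (the Pythia-70m width; treat both as assumptions):

```python
d_model = 512        # assumption: residual stream width for Pythia-70m
n_outliers = 2       # dimensions added back in with exactly zero error
mse_on_kept = 1.0    # arbitrary units: mean squared error over the 510 trained dims

# Padding with zero-error dimensions only changes the denominator of the mean.
mse_padded = mse_on_kept * (d_model - n_outliers) / d_model
relative_drop = 1.0 - mse_padded / mse_on_kept  # equals n_outliers / d_model
```

Under those assumptions the zero padding alone accounts for a drop of 2/512, i.e. under half a percent. If the metric is instead normalised by total variance, the high-magnitude outlier dimensions would also inflate the denominator, so the effect could be larger than this.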
should I copy the directory for hoagy?
Can we verify this? so I can run the normal model & 0-out those dimensions in the output for both the datapoint & reconstruction & see if they're the same
yo, i'll copy across the file i worked on in your directory, otherwise im working in sparse_coding_hoagy
cool
I moved to sparse_coding_aidan_new until @keen pivot can move then I'll move back
don't want to overwrite your files etc
Thanks & sorry!
np
done with moving or?
Yep! Or really I don't need anything from what I've run
oh ok cool ill delete the old folder
think i've got the code in place to do the ablation tests but currently being blocked by getting super bad outputs from the simulation which is odd
oh cool! good luck with that that sounds awful to debug
doing some runs with sparse activations + symmetric noise to see if I can get anywhere close to replicating the weirdness seen here; if not I'll scale down my ambitiousness and test for fragility etc etc etc
toy data?
getting the weirder result that the bad simulated data is the same as the kind of responses we get from the standard interpret.py runs, which successfully pick out correct explanations
not sure wym
like text of [john raised $ 6 million], activation [0, 0, 10, 0, 0], gpt4 interp: "currency symbols, esp $", gpt3.5 simulated data: [0,2,9,9,2]
gets a score of like 0.3
so the explanation is basically perfect, but the simulated data is pretty crap (actually worse than impression in this example)
but it's not like there's a load of free params that go into the simulate function, its super simple
also seems like it should be trivial for 3.5
hahaha i got such a fright when i tried to look at the prompt as it's built internally because it looks dreadful e.g. six\tunknown\n-\tunknown\nyear\tunknown\n deal\tunknown\n but\tunknown\n Str\tunknown\nud\tu but they just add loads of unknowns into the prompt and then get the logits at every position, its super weird
but as far as i can tell it's creating the prompt correctly
maybe gpt-3.5 is just quite crap at the task??
annoyingly i can't switch to gpt-4 super easily, the instruction models use a different endpoint
Larger dicts do require more training to show a larger benefit, but the biggest drop in reconstruction loss is definitely caused by L1.
What’s the simulation data supposed to be doing? Or the broader context of the experiment?
(whoops forgot to hit enter) this is for an experiment we'd like to have to verify that the model is in fact using the directions we've found to compute those features. want to check that
- the feature is more clearly separated in the residual stream after the MLP layer with the feature than before
- this is no longer true if you ablate our feature
and for this we need a dataset of 'is the feature on' which is basically what the autointerp stuff already has
Is this the same thing as before?
had some ideas about using residuals/skip-connections in the multi-layer autoencoder, found out that it is basically this http://yann.lecun.com/exdb/publis/pdf/gregor-icml-10.pdf, testing now, looks GREAT so far, some training instability, but I think I just need to vary the lr for the encoder + dict separately (training both with the same LR atm, probably shouldn't, or should train both for a while then freeze dict and finish training the encoder)
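For context, the linked paper (Gregor & LeCun, LISTA) learns the weights of an unrolled ISTA loop. Below is a sketch of the plain, un-learned ISTA iteration that it unrolls, not the multi-layer encoder being trained here. The function names and the toy dictionary are illustrative:

```python
import numpy as np

def soft_threshold(v, thresh):
    """Shrink every entry towards zero by thresh (the proximal step for the L1 penalty)."""
    return np.sign(v) * np.maximum(np.abs(v) - thresh, 0.0)

def ista(x, dictionary, l1, n_steps=200):
    """Minimise 0.5 * ||x - z @ dictionary||^2 + l1 * ||z||_1 over the code z.

    dictionary: (n_features, d), x: (d,). Returns z of shape (n_features,).
    """
    # Step size 1/L, where L is the largest eigenvalue of D D^T.
    step = 1.0 / np.linalg.norm(dictionary @ dictionary.T, 2)
    z = np.zeros(dictionary.shape[0])
    for _ in range(n_steps):
        residual = x - z @ dictionary
        # Gradient step on the reconstruction term, then shrinkage.
        z = soft_threshold(z + step * (dictionary @ residual), step * l1)
    return z

def objective(x, dictionary, z, l1):
    return 0.5 * np.sum((x - z @ dictionary) ** 2) + l1 * np.abs(z).sum()

# Toy overcomplete dictionary (20 atoms in 8 dims) and a 3-sparse ground-truth code.
rng = np.random.default_rng(0)
D = rng.normal(size=(20, 8))
D /= np.linalg.norm(D, axis=1, keepdims=True)
z_true = np.zeros(20)
z_true[[2, 7, 15]] = 2.0
x = z_true @ D

z = ista(x, D, l1=0.05)
```

Each loop iteration adds a correction to the previous code, which is where the resemblance to residual/skip connections across encoder layers comes from. LISTA replaces the fixed `step * dictionary` maps with learned matrices and uses only a few iterations.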
@pallid current are you running interpret.py? there is ~2GB on GPU0 and I presume that is you
okay this has like impossibly good performance maybe? running benchmarks soon
tf
forgot to norm dict 😅
this is on pythia-70m layer 2, looks much the same
(except for the insanely low-sparsity case)
lemmie get a comparison real quick
cool, still significantly worse than normal, but at least it is actually training now, unlike before
I feel like this is a slight step forward
Could you elaborate?
well, we couldn't get it to converge at all before, and it is now
ideally we want to have a number of different approaches we can try out to find the best one, and this one is pretty close in perf to our best one
errr might have left a pdb or python interpreter on, sorry
we have the reading group thingy today right? in about an hour? @pallid current @keen pivot
@bitter turtle have you looked at what the sparsity / unexplained var graph looks like for non noisy toy data? would be good to be able to show the difference between that and the pythia results, especially if there's a very clear difference
yep, was planning to get around to it, got distracted by this
cool nw
can we also do a restart at some point soon, I think there's like 2GB orphaned data just chilling on GPU0
Do you suppose models across time have more superposition? It's gotta learn it sometime, so maybe we could measure it somehow w/ dicts or maybe a tool more suited for it?
@bitter turtle , What's the experiment w/ adding noise? (or is this the de-noising encoder?)
This looks settled now, right?
i can check in ~5m wasn't last time
@keen pivot
top looks pretty empty now, i did leave a pdb open overnight so i think it was that sozzz
What is "the weirdness seen here"?
like, we should expect dictionary learning approaches to be better/there to be a range of l1 values converging on the same solution under the assumption of the activations being well-described by a sparse basis
Doesn't MCS capture the "same solution" more accurately?
Although they have more nonzero activations (higher sparsity), this could just be allowing more noise.
Does that make sense?
yes, for sure, but we don't have access to the ground truth so we can't compare to that
oh, sorry, I see what you mean
not sure about this
agree with this
Agreed. This (ie higher sparsity is just more low-activating noise, not a significant difference in features found) is just one hypothesis. The MCS across different sparsities may better capture this.
I still think it's weird. Also, plausibly, there are a bunch of possible sparse decompositions which are equally powerful in terms of describing the data, especially under the assumption of noise
I think that we can't really say until I compare to the curve for truly-sparse synthetic data
if it turns out that the same curve exists for that, the metric probably can't be used for this kind of analysis
I do think that the metric is useful from a pragmatic perspective; in the absence of one true ground truth, more sparse decompositions are maybe intrinsically valuable lenses to view activations through
So you're saying two dictionaries could have similar sparse decompositions & reconstruction loss, but have low-MCS relative to each other?
yep maybe
You could have the sparsity/variance explained graph, but also track MCS across dictionaries. If models do have high-MCS w/ nearby sparsities, then that's good evidence for them converging on the same decomposition.
good idea
Thanks!:)
I'm currently working on dictionaries across different layers. I could code up the MCS across layers one w/ your repo tomorrow.
I think I'd also want to train models on more data (like 30 chunks w/ pile?), but I think y'all had experiments showing more training data didn't really affect the sparsity/variance explained?, but that's different than MCS.
latest results from using the more complex, multi-layer denoiser; looks almost slightly better than dictionaries? number is #chunks
changed initialization + switched to GELU
will do no-noise test w/ synthetic data + normal methods next
more complex one is on left?
Oh, sorry, I forgot discord might change the order of images
More complex one has highest sparsity ~400 on x-axis?
This is just MCS for Pythia-70m residual across multiple layers
Oh, sickkkk: what do the same layer ones look like?
Ah, it'd be a perfect match? I don't have two same-sized trained dicts of diff-initializations to compare against
I think that's a really important baseline to explore
Have you tried the testbed thingy yet?
So true
What is that?
The standard_metrics.py thing; using a common interface for many different dict types
Nope!
I've seen that file, but currently still in my old repo since these dicts are pickles
Oh mb I forgot I'm away this weekend from tomorrow, could one of you two look at doing this if you want it soon? Shouldn't be too hard to edit my big_sweep_experiments.py, I'll push it in a min
If you can, get your training code to use the new one maybe it'd be nice to have common comparisons; I've got most of the way to MCS hists in there so far but haven't yet
Like comparing the MCS hists are the same, and others for the sake of making sure our code is correct?
Wait, it's like Wednesday. Are you leaving tomorrow?
Looks like Hoagy thumbs-up reacted to it, so I'm assuming he's got it handled!
huh that's super interesting, so at 7 chunks, even the highest level of l1 is using 100 feats per token?
yeah if its just running existing code but with noise param set to 0, seems chill
How are you getting l1-level from this? L1 is encoded as opacity of color, right?
Great, thanks for taking care of this!:)
yeah just tracking the lines to the point of highest L1 /lowest opacity,
Oh, ya. I see it now. Ya that's weird
Would be interested to compare the two when training for 10x longer
You might need to change stuff around actually I can set it up you can run + debug ig
I changed the L1 range all others got entirely dead
Whoops
This bit is weird tho
For MCS across dicts of different sizes (as a baseline that's better, but not as good as dicts of same size/diff init). Notably layer 5 sucks. Also, layer 2 was trained differently than the others, but I don't have the hyperparams or amount of training data on hand.
is layer 5 just before it goes into the unembedding matrix?
Yep
that's very interesting, especially that it's not bimodal
someone asked me about this recently, like how confident are we in the model where the residual stream is written to and read from, while the mlp calculates updates to the residual stream, but doesn't really hold much of the information
i think the success of the logit/tuned lens is the main point of evidence for it but i couldnt think of much other empirical evidence
i think if that were true and sparse coding was working perfectly we would expect bimodality
not sure what it would mean for the mlp activations to hold information in a way that would meaningfully invalidate that model
the residual stream is the data moving between layers
Throughout the model?
so like, when you calculate resid4 = resid3 + mlp3 + attn3, the assumption is that the mlp does not contain most of the information in resid3, its instead calculating smaller volumes of new information. and therefore, for information to persist between layers, it must be present in resid4 in the same form that it's present in resid3
whereas, if mlp3 contained most of the info, there would be nothing preventing resid4 from having a very different representation to resid3
right, volumes of information, I see.
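A toy numeric version of the resid4 = resid3 + mlp3 + attn3 argument, as a sketch only: the vectors are random stand-ins (not real Pythia activations) and the 0.1 / 10.0 scales are made-up assumptions. It just shows that a small update leaves the representation nearly unchanged, while a dominant update does not.

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

d_model = 512
resid3 = rng.standard_normal(d_model)  # hypothetical layer-3 residual

# H1-style case: mlp3 + attn3 is a small update relative to resid3.
small_update = 0.1 * rng.standard_normal(d_model)
resid4_small = resid3 + small_update

# H2-style case: the update carries most of the information.
large_update = 10.0 * rng.standard_normal(d_model)
resid4_large = resid3 + large_update

sim_small = cosine(resid3, resid4_small)
sim_large = cosine(resid3, resid4_large)
print(f"small update: {sim_small:.3f}, large update: {sim_large:.3f}")
```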
I feel like the distribution of bimodality across layers is very dependent on the way the gradients flow through and I can't model that that well; I think there was a paper looking at something ~this
You should ask the tuned lens people
how many reinitialisations have you done @keen pivot
I've asked in #interpretability-general
My brain is sliding off this. What are the two different hypotheses being compared here & the evidence for either (and bimodality of what? MCS across layers?). My attempt:
H1: residual stream is the memory of the model that is written to & read from by e.g. MLPs. Each MLP only does a small change to the residual stream, so the representation should be mostly the same across layers.
H2: Most of the information goes through MLPs, so we shouldn't expect similar representations across layers
Oh, and for the record, I also worked on tuned lens
totally did not realise, sorry 😅
It's what I get for going by different names, haha
Okay, the across dict stuff looks like I need to do ACDC ablation stuff:
- Find a cool feature in layer 5 (check)
- Ablate all features one-at-a-time in layer 4 & sort by drop in feature activation in 5 (could also ablate all features, and then restore one-at-a-time)
- Investigate those features found
- Repeat for features found in layer 4.
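Rough sketch of the one-at-a-time ablation step from the list above. Everything here is a hypothetical toy: `dict_l4`, `layer_map` and `enc_l5` are random stand-ins for the layer-4 dictionary, the model between layers 4 and 5, and the layer-5 encoder, not anything from the actual codebase.

```python
import numpy as np

rng = np.random.default_rng(0)
n_feats_l4, n_feats_l5, d_model = 8, 4, 16

# Hypothetical toy stand-ins (assumptions, not the real model or dicts).
dict_l4 = rng.standard_normal((n_feats_l4, d_model))
layer_map = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
enc_l5 = rng.standard_normal((d_model, n_feats_l5))

def layer5_feature_acts(codes_l4):
    """Decode layer-4 feature codes, apply a toy residual update, encode at layer 5."""
    resid4 = codes_l4 @ dict_l4
    resid5 = resid4 + resid4 @ layer_map
    return np.maximum(resid5 @ enc_l5, 0.0)

def rank_by_ablation(codes_l4, target_feat):
    """Zero each layer-4 feature in turn; sort by drop in the layer-5 target feature."""
    baseline = layer5_feature_acts(codes_l4)[target_feat]
    drops = np.zeros(len(codes_l4))
    for i in range(len(codes_l4)):
        ablated = codes_l4.copy()
        ablated[i] = 0.0
        drops[i] = baseline - layer5_feature_acts(ablated)[target_feat]
    order = np.argsort(-drops)  # largest drop first
    return order, drops

codes = np.maximum(rng.standard_normal(n_feats_l4), 0.0)
order, drops = rank_by_ablation(codes, target_feat=0)
print(order, drops[order])
```

The "ablate all, then restore one-at-a-time" variant would swap the loop body to start from zeros and set a single entry back to its original value.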
I should be able to do in like 4-5 hours, which will be a tomorrow thing. I'm about to go watch barbie though.
it's v good enjoy
@pallid current ran the test on no-noise, all l1 values (tested over the normal range here, same as the last one) converge to low-sparsity solutions (btw the sparseness measure I built for wandb is broken atm) as we predicted. falloff for lower-sparsity solutions is similar
this is Quite Weird.
like, the really low l1 ones still converge to sparse solutions, I think this is maybe just a search-not-wide-enough thing
riiight so in that case the sparsity at about 7.5 is like the true amount?
still, I think it is ~what we were expecting
10 but yeah
ok so yeah seeing super big differences in the role of noise
phew
this is progress then
I had an idea for avoiding noise; instead of penalising l2-norm of residuals, we should penalise cross-entropy against a normal distribution with a learned covariance matrix (and also probably penalise the size of the covariance matrix)
this might probably turn out to be equivalent, but who knows. probably people do, but I'm not them
I expect this to be equiv
hmmm so the only loss signal would be from our ability to fit to this learned cov matrix?
so basically at that point the learned cov matrix is assumed to encode all of the important information?
yeah, this is assuming that 'that which is not sparse is ~normally distributed'
i feel like that would throw away most of the important info, because it wouldn't be able to represent spikes in the distributions properly. like openai's approach is basically to take maximum distance away from the normal
and i think you'd still want to do that even if you had a learned cov matrix
tho i guess that's the diff
oh, so, we learn a sparse dict, but replace the 'minimise residuals' thing with 'make the residuals fit a normal distribution with low variance'
riiiiiight sorry i getcha
seems rather convoluted but i guess it could work. for me the question is, even if this is what we'd expect to see in practice, is there a reason that we think that minimizing l2 norm is the wrong thing to aim for?
like there are cases where fitting the normal dist is wrong, because that variation can in fact be captured
whereas i'm struggling to picture the case when performance would get worse by trying your best to minimize l2, even if some noise is irreducible
I'm skeptical because like linreg works with minimising squares, and that works with normal noise, so why shouldn't this.
yeah pretty much agree
interesting idea tho, shades of VAE about it which also made me go ?? at first but works
desmos crushes my dreams once again
this is literally quadratic
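Quick numeric check of the "literally quadratic" point, as a sketch under a simplifying assumption: the variance is fixed and isotropic (sigma is a made-up value). In that case the Gaussian negative log-likelihood of the residual differs from the squared L2 norm only by a scale and a constant. A learned full covariance would add a log-determinant term, which this doesn't cover.

```python
import numpy as np

def gaussian_nll(residual, sigma):
    """Negative log-likelihood of `residual` under N(0, sigma^2 I)."""
    d = residual.size
    return (0.5 * np.sum(residual ** 2) / sigma ** 2
            + d * np.log(sigma)
            + 0.5 * d * np.log(2 * np.pi))

rng = np.random.default_rng(0)
r1 = rng.standard_normal(16)
r2 = rng.standard_normal(16)
sigma = 0.7  # assumed fixed noise scale

# Difference in NLL between two residuals ...
nll_gap = gaussian_nll(r1, sigma) - gaussian_nll(r2, sigma)
# ... equals the difference in squared L2 norms, rescaled by 1 / (2 sigma^2).
l2_gap = (np.sum(r1 ** 2) - np.sum(r2 ** 2)) / (2 * sigma ** 2)
print(nll_gap, l2_gap)
```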
I guess this should be expected, linreg people know what they're doing
linreg mafia undefeated
might be something in the VAE thing maybe
might be slightly more stable to train/converge better otherwise no ideas atm
actually probably not
tied vs untied complex multi-layer denoiser; seems about the same perf
seems like that kind of pattern is quite robust.
yep
btw got it working so i can run interp over a gigantic .pt of learned dicts
still the q of exactly what to run it over
the last one was with a multi-layer encoder
I got that converging correctly this morning, it gets slightly better performance than normal at the cost of training speed
specifically, a 3-layer one with residual skip connections; turns out residual connections was all you needed to get it to converge
fixed the bug
(wasn't telling the dict to be normed - again - when I was saving them) @pallid current
meant I lost first batch checkpoint
cool, that looks to be slightly better than untied!
pushed current code, off for the weekend
have fun!
oh i misunderstood the graphs above, but then how can a multilayer encoder be tied?
will restart in the morning if it's not magically fixed and have messaged curtis
feel free to restart if you need logan
So, it converges only with skip connections, was just messing around with different configurations and it seems to converge best (utterly untested) when it's just embedding linear transformation -> denoising layers with skip connections -> bias + ReLU rather than embedding linear transformation -> denoising -> another linear map -> bias + ReLU, so I slapped W_T as the first linear map
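Shape-level sketch of the encoder described above: tied first map (W_T), then denoising layers with skip connections, then bias + ReLU. The inner layer form (a ReLU'd linear map added back onto the input), the sizes, and all names are assumptions for illustration, not the actual implementation in the repo.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def encode(x, W, denoise_layers, bias):
    """x: (batch, d_model); W: (n_feats, d_model) dictionary, tied as the first map."""
    h = x @ W.T                    # embedding linear transformation using W^T
    for A, b in denoise_layers:    # assumed form of each denoising layer
        h = h + relu(h @ A + b)    # residual skip connection
    return relu(h + bias)          # final bias + ReLU gives the sparse code

def decode(codes, W):
    """Linear dictionary decoder."""
    return codes @ W

rng = np.random.default_rng(0)
d_model, n_feats, batch = 8, 16, 3
W = rng.standard_normal((n_feats, d_model))
W /= np.linalg.norm(W, axis=1, keepdims=True)  # keep dictionary rows normed
layers = [(0.1 * rng.standard_normal((n_feats, n_feats)), np.zeros(n_feats))
          for _ in range(3)]
bias = -0.5 * np.ones(n_feats)

x = rng.standard_normal((batch, d_model))
codes = encode(x, W, layers, bias)
recon = decode(codes, W)
print(codes.shape, recon.shape)
```

With the denoising weights zeroed out this collapses to the usual tied linear encoder, relu(x @ W.T + bias), which is one way to read "inherits some of the convergence properties of linear encoders".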
oh hey, gotcha
btw is the dense_l1_sweep output saved somewhere, the untied ones?
not atm
ah k
yeah will do once i restart the kernel
tho in general would be best to save exp funcs cos there's quite a lot of params just set in __main__
sure
I'm just using pythia-70m layer 2 residual for everything ATM, literally just the data in activation_data
it's like the older training setup in that you can lie to it and just change the dataset folder and it won't check that it's the right dataset folder
I also doubled the batch size, I trained the other ones with 1024
Also try a Layer Norm? Basically a Transformer w/o attention
It's also sounding more like the soft-prompt literature, which began simple, but expanded to more transformer-like models to train the soft-prompts.
I mean, you probably actually wouldn't want to do that, you need to propagate magnitudes through, and +ve activations definitely aren't centered, but it would be funny
Like, I think this is mildly principled in that it probably inherits some of the convergence properties of linear encoders but is slightly more powerful
layernorm would kill that
I don't think the trade-off is good enough to spend a lot of energy looking into it though, maybe if we get stuck after training on lots of data
like, potentially once the dict converges with linear encoders we could freeze it and train this to do better sparse coding for that dict, and maybe iterate, but that's a while off. More interested in looking at the shittons-of-data case atm
Is that just bottlenecked on loading in Pile & adding more chunks to train through?
yes, we should do it soon
Got a graph of related features. The original one is words in parentheses related, & the others appear to be similar as well. Will look into the details soon
😮
I also want to cluster feature directions as well. In general, and here we could color-code features by their similarity, because some of these may just be the same direction across layers.
not sure i understand this but the codebase is already set up to do larger runs
currently running a 100-chunk sweep on the pile with the dense l1 sweep
Note: this took 6 minutes, which isn't too long, but will get longer w/ larger models/more layers
how are you running this?
Sick can't wait
definitely seeing diminishing returns, up to 20 epochs atm and it's still improving but barely
continued improvement is most noticeable at low tokens per activation_v tho so might still see a decent jump by 100 epochs
this is with dict_ratio = 4, so we can also see if this looks different with higher dict ratios
what is 'low tokens per activation_v'
just low sparsity (on the graph) , except that is what we'd usually call high sparsity so its confusing
yep
clearly i didnt improve the situation 😆
call the x-axis thing 'sparsity number' maybe
think i might go for 'average active features'
dont know why i keep calling features tokens
yeah that was confusing
I like average features/tokens
Yeah the tails seem to be converging to the center which is maybe good and promising; actually no, the right tail is converging to 600 or so features, that's awesome, we should check the mcs of that
On GPU. But you have to make features by layers amount of causal interventions, so can't really batch that. Could run different paths on diff GPUs
Didn't someone speed it up by like 200x using activation patching or whatever it's called recently?
Think they were in the UK seri mats cohort
The left is acronyms & right-two paths are dates related
Looks like overall, this direction in the last layer wants to up-weight an end-parenthesis, and there's two paths: end of acronyms & end of dates after an opening parenthesis
are there established metrics for goodness-of-graph? thinking we could use something Atticus Geiger-like to measure how descriptive graphs we find using the sparse basis are compared to how descriptive graphs we find on the neuron basis
Oh ya, probably! I'd integrate that. That'd make the feedback loops much better
Like icl this definitely can be a paper if it turns out that these graphs are computationally meaningful
This is residual stream. It'd be great to connect it to MLP's
Oh, I know, Geiger's work can still be applied here
for sure want to connect to MLP as well tho
does he have an explicit goodness of graph metric?
I think he has a metric for 'alignment of causal hypothesis to model' and so basically we
- find a graph with adcd or whatever the acronym is
- come up with a hypothesis for what each node in the graph represents in an abstract computational model of the circuit
- throw the metric at it, which compares the abstract model to the model in the transformer
if we can find 'natural' circuits/graphs that are well described by abstract high-level causal models that would literally be fucking insane
I was thinking today about what kind of things we really want for a draft paper if we go that route and demonstrating the ability to find circuits in models using the sparse basis is for sure up there like number 1 priority
I guess it's the "computational" part that doesn't work, but I do think you can still do causal alignment here.
Especially w/ the speedup, I could quickly find really good examples to test.
Slightly different graph: This is for feature restoration. Basically, I set the activation to 0 & check how restoring that feature accounts for the original activation.
What do you mean; we can still find computational graphs using purely residual stream data. Geiger's stuff is pretty implementation-agnostic, it just checks values of 'variables'/intermediate points in computation/values at nodes of the compute graph, it doesn't care how the computation is actually implemented between nodes much (IMO this is fine and also a good thing)
Also, we should probably be using FISTA/OMP etc for doing compute stuff
Or like, generally some better solver than dot + bias
What do the % here represent?
Percentage recovered or ablated. This case percent recovered.
So if the original activation was 5, and we ablate everything and recover one feature, how much of the original activation do we recover?
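The percent-recovered number spelled out as a tiny function. The name and signature are made up for illustration, and the example values (ablating everything takes the activation to 0, restoring one feature brings it back to 2) are hypothetical.

```python
def percent_recovered(original_act, ablated_act, restored_act):
    """Share of the ablation-induced drop that restoring one feature wins back, in %."""
    drop = original_act - ablated_act
    if drop == 0:
        return 0.0  # nothing was lost, so there is nothing to recover
    return 100.0 * (restored_act - ablated_act) / drop

# Original activation 5; ablate everything -> 0; restore one feature -> 2.
print(percent_recovered(5.0, 0.0, 2.0))  # 40.0
```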
ran another long run with l1 sweep, this time tied, seems very definitively no difference, to the point where i'm checking i'm not just plotting the same data twice (don't think i am) (crosses are tied, legend is n_chunks)
set off a big run to analyse 50 feats from all 32 tied l1 values, about $350, no idea how long it'll take, i guess a few hours?
This might be a good objective measure, not sure about how many comparisons there are/what baselines we could use, maybe we should ask Neel or someone about relative measures
autointerp results from the sweep are odd and kinda worrying. not seeing the rise from low to mid l1 values that i expect, or at least not as robustly. then goes off the rails as we get to high l1 vals. need to adjust the approach to increase the number of features analysed to correct for the fact that most will be dead or v rarely active
need to run l1 = 0 runs to see if they are much worse than l1 = 1e-4
i'm quite worried by the fact that 1e-4 is so good
have left a run ongoing while going to bed, will almost certainly hang at the end due to wandb issues, if its preventing anyone from doing a run, just kill
I agree this is a good measure. I don’t think we really need a baseline beyond “it doesn’t work in neuron basis”
Ah, but to measure it we need to
- make a choice of which graph to look at
- make a hypothesis of which algorithm the graph implements
which seems hard to make standardised when comparing neuron basis and sparse basis
I mean you can probably standardise the first one fine
Hmm maybe it's ok
What’s the neuron basis score here?
And I predict the identity to be learned around 1e-5, which is usually like 500-600 sparsity, though I’m confused about the graph. Is 1e-4 corresponding to a sparsity of 800?
Ya. I think if we just get this to work in a real LM, then we’re golden
Im also making different choices to make the graph (eg ablation vs restoration), which can be compared against each other as well
trend seems about right if you just look to l1=1.6e-4 but those last few ones around 1e-4, esp 1e-4 are just weird
about to run 0, 1e-7, 1e-6 and 1e-5
btw im pretty sure that we're overloading wandb when we do our final upload, leaving it to timeout indefinitely, i think for big runs for now we should just turn it off - i haven't been looking at it at least
@pallid current, would you also be able to see MCS between different l1-value dictionaries? If there are N l1-values in the graph, this could be shown as the lower-triangle of an NxN matrix.
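A minimal sketch of the lower-triangle MCS matrix being asked for. It assumes each dictionary is an array of shape (n_feats, d_model) and uses random dictionaries as stand-ins; `mmcs` and `mmcs_lower_triangle` are illustrative names, not functions from the repo.

```python
import numpy as np

def mmcs(dict_a, dict_b):
    """Mean over features in dict_a of the max cosine similarity to any feature in dict_b."""
    a = dict_a / np.linalg.norm(dict_a, axis=1, keepdims=True)
    b = dict_b / np.linalg.norm(dict_b, axis=1, keepdims=True)
    return float((a @ b.T).max(axis=1).mean())

def mmcs_lower_triangle(dicts):
    """NxN matrix with entry (i, j), j < i, set to mmcs(dicts[i], dicts[j]); NaN elsewhere."""
    n = len(dicts)
    out = np.full((n, n), np.nan)
    for i in range(n):
        for j in range(i):
            out[i, j] = mmcs(dicts[i], dicts[j])
    return out

rng = np.random.default_rng(0)
# Stand-ins for dictionaries trained at three different l1 values.
dicts = [rng.standard_normal((32, 16)) for _ in range(3)]
matrix = mmcs_lower_triangle(dicts)
print(np.round(matrix, 3))
```

Note that MMCS defined this way is not symmetric (it averages over the first dict's features), so the full matrix could carry extra info beyond the lower triangle.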
I said I'd get to it yesterday, but doing the causal alignment stuff atm.
I also want to do the outlier features, but out-of-scope for this project. Might try to pawn it off.
yeah good shout will do in a sec
ran the mmcs matrix, got some very simple plots running in the notebook mmcs_plots.ipynb in my workspace
most interesting thing i can see is that there's some kind of peak around l1=0.001 where even the highest l1 values near l1=0.01 match most closely to 0.001, i guess because those are the features they would learn, if they weren't mostly dying
e.g.:
peak mmcs in the whole matrix is just above 1e-3
low mmcs match best with other low mmcs though there is a noticeable hump around 2e-3
oh also got the results back for the super low l1 baselines. they dont outperform neuron_basis on random but do slightly for top and top-random, so there's something just by the nature of putting the data through a relu that is causing some level of screening, need to be careful about this when making claims and baselining
Are you able to save this as a matrix for all l1 values & plot a heatmap of the matrix?
heat map, tho i think it's quite hard to read
Do you have a legend?
Hey, coming here from the mech interp discord where you presented this week. Amazing stuff.
I was looking into using your techniques for interpreting the TinyStories series of models, but as the first step in doing that I'm trying to come as close to reproducing your results with your codebase as I can. Couple of assumptions I'm making on my end while trying to do that which I am not sure are correct:
- The canonical repo for this work is https://github.com/HoagyC/sparse_coding.git : This one seems to be ahead of all of its forks
- The canonical way to run this code is to run python run.py <args>: This is what it says in README.md, but I do note a lot of recent activity in the files big_sweep.py and big_sweep_experiments.py. I also see changes to interp_notebooks/feature_interp.ipynb that are more recent than the changes to run.py, and the definition of class AutoEncoder is different in each one.
- The canonical way to verify that the artifacts are finding real feature directions is to run interp_notebooks/feature_interp.ipynb.
Asking because I did a run with python run.py --epochs=3 --save_after_mini=True --l1_exp_low=-14 --l1_exp_high=-10 --dict_ratio_exp_low=1 --dict_ratio_exp_high=7 --layer=2 --use_residual=True --use_wandb=True --wandb_entity=<my_name>, and it did generate outputs in outputs/20230729-195836/0/auto_encoders_2.pkl, but the results of interp_notebooks/feature_interp.ipynb seemed a bit off (after I did some extremely sketchy stuff to get that notebook to run at all).
Not urgent, I'm trying out seeing what happens when I use autoencoders.tied_ae.AutoEncoder in run.py to see if that helps
hey 🙂 yeah that's the right repo and those arguments look reasonable. it's in a funny state because run.py was the original code to run but we've been doing a lot of large hyperparam sweeps with multiple GPUs and Aidan rewrote the code with a very different architecture so i barely know what run.py does at this point, i'm happy to talk through for a bit if it's not really working
we should do a cleanup that allows a simple run soon
but if l1 and reconstruction loss are both falling then it looks like it should be working ok but i don't know the status of feature_interp.ipynb, @keen pivot can you help?
If sweep(ensemble_init_func, cfg) is the up-to-date method for running one of these experiments, I can write an ensemble_init_func 🙂
well that looks promising
https://github.com/HoagyC/sparse_coding/compare/main...JoshuaDavid:sparse_coding:main#diff-9ef165e04d52c6850bd88e31d401d1eb218baf46492f5974292eb6d32f739350 is my initial crack at an ensemble_init_func which will do the same thing the old run.py did (minus the "compute and store activations if none exist" step). Currently running via
$ python run_using_sweep.py --epochs=1 --save_after_mini=True --l1_exp_low=-12 --l1_exp_high=-10 --dict_ratio_exp_low=1 --dict_ratio_exp_high=4 --layer=2 --use_residual=True --use_wandb=True --datasets_folder=activation_data/pile-10k-EleutherAI/pythia-70m-deduped-2/ --wandb_entity=joshuadavid
ETA is ~7 more minutes (I'm running on an instance with only one GPU)
Edit: there were two chunks. ETA is actually still <t:1690696320:R>
Run finished without errors at least
Yeah, it looks like using the stuff in big_sweep.py was probably the way to go. I'm still a little bit stuck in feature_interp.ipynb, since I'm not entirely sure how to convert an autoencoders.learned_dict.UntiedSAE into an AutoEncoder (or even if that's something I should be doing). But. It looks like it did something that is approximately what was done before.
When I take the size-2048 dictionary, and plot the pairwise cosine similarity of features in that dict (excluding pairs which are the same feature, i.e. feat_0 x feat_1, feat_0 x feat_2, ... feat_0 x feat_2047, feat_1 x feat_2 ... feat_2046 x feat_2047 but none of feat_0 x feat_0), I get a nice normalish-looking distribution centered around a cosine similarity of 0. And when I take the size-1024 dict and the size-2048 dict, and do the MCS thing there, there are some features that are learned by both. Though not as many as I might hope. Graphs in question, as well as the script to generate them, attached.
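For reference, a sketch of the within-dict pairwise cosine similarity described above, excluding self-pairs. This is not the attached script; it uses a random stand-in dictionary and assumes features are stored as rows.

```python
import numpy as np

def pairwise_cosine_excluding_self(feats):
    """Cosine similarity for every unordered pair of distinct rows in `feats`."""
    normed = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sims = normed @ normed.T
    i, j = np.triu_indices(len(feats), k=1)  # k=1 skips the diagonal (feat_i x feat_i)
    return sims[i, j]

rng = np.random.default_rng(0)
feats = rng.standard_normal((200, 64))  # stand-in for a learned dictionary
sims = pairwise_cosine_excluding_self(feats)
print(sims.shape, sims.mean(), sims.std())
```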
Anyway, I have a suspicion that the issue here is just that I trained on all of 2 chunks for 1 epoch. I'll try setting up a 5 epoch run on 30 chunks overnight, see if that gets better results.
# Terrible hack of --epochs=0 to get the chunks into activation_data without having to use anything else in run.py
python run.py --epochs=0 --n_chunks=30 --save_after_mini=True --l1_exp_low=-13 --l1_exp_high=-12 --dict_ratio_exp_low=1 --dict_ratio_exp_high=2 --layer=2 --use_residual=True --use_wandb=False
# And retrain overnight
python run_using_sweep.py --epochs=5 --save_after_mini=True --l1_exp_low=-12 --l1_exp_high=-10 --dict_ratio_exp_low=1 --dict_ratio_exp_high=5 --layer=2 --use_residual=True --use_wandb=True --datasets_folder=activation_data/pile-10k-EleutherAI/pythia-70m-deduped-2/ --wandb_entity=joshuadavid
Training on more data should help. But I’m curious what sparsity you’re getting in wandb? Which means how many features/token you’re getting.
I think the minimal_feature_interp on my repo should be better. One caveat: if you’ve saved it as a pickle, my code will work. If not, you’ll need to replace the pickle load with torch.load()
Ya the MCS histogram should look much better than that (again, more data, and check the l1’s effect on sparsity)
Looks like sparsity shows as 200 in wandb for l1=1e-3, dict_size=2048? If I'm interpreting that right
Yep, that looks reasonable. I think a better sparsity is around 20-50, but that’s still up for debate.
That would mean upping the l1 term.
Are the pictures from retraining overnight on more data?
I’d be curious to see the results for MCS across two dicts
That's from the runs I looked at last night, haven't looked at results from overnight runs yet
Edit: speaking clearly is hard
Ah, what size is the model?
One of the runs in the sweep was a dictionary of size 2048 with l1 of 1.78e-3, wandb says sparsity of ~65.
Model is pythia-70m-deduped-2
Have you checked the MCS of this one?
Hmmm… I can look into it in more detail tomorrow. Definitely doesn’t look right.
Random vectors would be around 0.4 I think for MCS so it’s kind of weird.
I think "MCS just under 0.2" makes sense for the best-of-2048 samples of random normalized 512-dimension vectors, at least based on some quick hacking about in a repl
>>> a = np.random.rand(512, 1024) - 0.5;
>>> b = np.random.rand(512, 2048) - 0.5;
>>> a /= np.linalg.norm(a, axis=0);
>>> b /= np.linalg.norm(b, axis=0);
>>> print("\n".join([
f'{b:.2f}-{b+0.01:.2f}: {ct}'
for ct, b in zip(*np.histogram(
(a.T@b).max(axis=1),
bins=100,
range=(0.0, 1.0)
))
if ct > 0
]))
0.12-0.13: ###
0.13-0.14: ##############
0.14-0.15: ###############################
0.15-0.16: ##########################
0.16-0.17: ###############
0.17-0.18: ########
0.18-0.19: ##
0.19-0.20: #
I suspect I broke the autoencoder training code, I'll look into what's going on and come back with an update once I figure it out
Where can I see a doc summarising what has been done so far and the plan ahead?
Hoagy did a write-up on lesswrong: https://www.lesswrong.com/posts/ursraZGcpfMjCXtnn/autointerpretation-finds-sparse-coding-beats-alternatives
We unfortunately don't really have a hugely concrete plan ahead atm, but we're working towards a draft paper at some point
Just a minor thing: these vectors aren't uniformly distributed about the n-dimensional sphere, you'll have slightly higher density in directions like (1,1,1,...), and slightly lower density in directions like (1,0,0,...), so your distribution might be a little off. To sample from the sphere you should normalise Gaussian-distributed points (i.e. randn).
Oh yeah, that is correct. Though it doesn't seem to make a huge difference. Changing the first two lines to use np.random.randn and rerunning does make a difference, but not a huge one.
a = np.random.randn(512, 1024)
b = np.random.randn(512, 2048)
a /= np.linalg.norm(a, axis=0);
b /= np.linalg.norm(b, axis=0);
print("\n".join([
f'{b:.2f}-{b+0.01:.2f}: {chr(0x2588)*(ct//8-1)+(chr(0x2588+(ct%8)) if ct%8 > 0 else "")}'
for ct, b in zip(*np.histogram(
(a.T@b).max(axis=1),
bins=100,
range=(0.0, 1.0)
))
if ct > 0
]))
0.11-0.12: ▉
0.12-0.13: ███▏
0.13-0.14: █████████████████████▊
0.14-0.15: ███████████████████████████████████▍
0.15-0.16: ███████████████████████████████▊
0.16-0.17: █████████████████▊
0.17-0.18: ████████▊
0.18-0.19: █▎
0.19-0.20: ▋
0.20-0.21: ▉
0.21-0.22: ▉
Also I note that the original cosine sim graph shows that a nonzero number of features in the size-1024 dict have MCS >> 0.2 with ones in the size-2048 dict, so whatever I broke didn't cause quite entirely random features to be returned. Just very very close.
I think you can probably find this analytically, but eh
Might be good to write out our current plans for the week @bitter turtle @pallid current if you want! For me:
- Causal Alignment - write up fuller project for this & implement.
Extra-Todos: @weak meteor
- Look through many examples of the causal alignment stuff (mine is the parentheses example) to find a cool one
- Implement early-layers-to-late-layers, cause atm only can do later layers to earlier (related to causal alignment)
- Look into outlier features for a few days, try to find cause, try to pass the torch to someone else
- Use features for activation engineering (may require training on large LLAMA models, which requires switching to baukit for training cause of GPU's)
[Note: Roko, I don't expect this to be clear TODO's. Can explain more]
Beat me to it! Was going to do exactly this this evening. I've asked Neel Nanda what kind of metrics he would like to see for causal alignment via email, hopefully will get back soon. If he doesn't I'll say fuck it and ping him or something, anyway. Will write up plans in about an hour and a half when I get back
Could you also elaborate on 2) for me? Tracing the causal path forwards doesn't strike me as immediately obviously useful. I also think 3) is a bit annoying but pretty universal, have heard some people propose solutions for newer archs, but it seems better to figure out a way to have our models ignore them
- If a mid-layer detects Dates, we could find many further layers that make use of this feature. We would still be able to make causal alignment statements here.
- There's a few papers on outlier features, but nothing mentions the \n or "." that I see in the outlier dimensions (which I've verified in just the outlier dimensions is a consistent token, but only in Pythia, not gpt-2 or others) which is novel AFAIK.
@bitter turtle I'd like your thoughts on this. Causal alignment (CA) feels circular here (at least for our use-case). CA assumes you think these parts of the model do some algorithm, which you can verify by changing the parts; if they have the same effect on the outputs & intermediate outputs as your algorithm predicts, then good.
The circular part is how do you come up w/ the hypothesis of the circuit in the first place w/o causal interventions?
This doesn't seem like a problem though, cause we can just do hypothesis generation by causal interventions. If the resulting algorithm is "simple" (whatever that means), then our features are good. If not, then booooo.
yeah agree we should be writing up more. i sent to RB that paper planning doc and i've asked robert to start drafting a paper skeleton so that we have a really clear picture of where the holes are in the research
since you're going to be in town from tomorrow i think we should have a big chat then and do a list with assigned people and such
Ok, so my thought process was basically this, except measuring alignment to a high-level abstract causal model would give us an additional measure of correctness of high-level abstract description or whatever. Like, we can use ACDC to find a circuit A at some arbitrary noise threshold, eyeball a high-level hypothesis for it via a description of some (pruned) subcircuit B of A as a causal machine M (the description would be human-interpretable by design, something ACDC doesn't provide by default), and then we can measure the accuracy of M in predicting the activations of B, giving us a measure of the human-interpretability-ness of B. Basically, the idea is that ACDC acts as a quick pruning strategy at some arbitrary noise threshold to help get started generating hypotheses for circuits in the sparse basis.
That was the original plan, but I am now very unsure about how we can compare systematically the scores of circuits found using the sparse basis and circuits in some other basis; there is probably a large variance of scores and an absurd number of circuits with fuzzy borders between them, and a lot of room for noise in where we draw the boundary between one circuit and another, or how we prune etc. Potentially we can come up with some complexity measure of the high-level description and measure the trade-off for both bases over a number of circuits, and if sparse coding is good we should see an improvement there, but that might be a little beyond the scope of this project.
Agreed on the nebulousness of circuits. I'll just give a go this week for several different types of circuits (e.g. the closing parenthesis one) and see what heuristics & results I come up w/. I'm overall fine if our paper has a mediocre implementation of ACDC & causal alignment on top of our dictionary learning. Like pretty solid overall, haha
I do think we can compare different ways of doing ACDC (for example, I'm ablating, which I could compare w/ restoring, which I could compare w/ Shapley values of the top-5 ablated features).
This is ACDC w/ "[percentage effect]% | [cosine similarity]"
One thing I'd also like to check is the effect of intermediate layers on others. In the other graph, there was a connection between 4_1030 & 3_1273, but not this time. I should be able to easily record this & choose not to show it if the effect is < 1% or something. This is also another set of choices to make when implementing this to compare to!
Also, I'm choosing to only look at the top-5 max-activating examples for a feature when I look at the differences when ablating causal features.
Ah, I fixed the issue. I've set it to always display at least 3 children for each node, but only recursively pursue (ie ablate children of) the most important ones across all children.
This is w/ top-k examples set to 10, whereas the others were 5. So some connections will be different.
well, I guess my point is that the actual circuits discovered by ACDC don't really matter that much, they are more like a guide for finding circuits if we go down the measuring-causal-alignment-ness route (better term needed: how about 'abstractiblity' or something similar), so the mediocre implementation would be fine. I'd like to get robustish measures of abstractibility though, that seems worthwhile. Like the idea of using different ways of doing ACDC, seems good for finding a large variety of circuits
yeah exactly; ACDC is kinda arbitrary, so we can just manually find subcircuits to do interpretation and measure the abstractibility of
Alright, let's stick w/ abstractibility for now, lol
https://docs.google.com/document/d/1XOXQba0dQvOEuFdk6_RKrEoqHmbeGKDWi7nQeXoOuUg/edit?usp=sharing
I've got a few different settings here for graphs. These are K-5,10,20 (for how many datapoints we consider) & max vs halfway-to-max (for which k-datapoints we select. halfway is 0.5*max as a lower-bound).
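One possible reading of the two selection modes, as a sketch only. "max" takes the top-k activating datapoints; for "halfway" I'm assuming it means: keep datapoints with activation >= 0.5*max and take the k closest to that lower bound. That second interpretation, plus the function name and the example activations, are assumptions.

```python
import numpy as np

def select_examples(acts, k, mode="max"):
    """Return indices of k datapoints for a feature, given its activations `acts`."""
    acts = np.asarray(acts, dtype=float)
    order = np.argsort(-acts)  # indices sorted by activation, highest first
    if mode == "max":
        return order[:k]
    if mode == "halfway":
        threshold = 0.5 * acts.max()                 # 0.5*max as a lower bound
        eligible = order[acts[order] >= threshold]   # still sorted highest first
        return eligible[-k:]                         # assumed: the k nearest the bound
    raise ValueError(f"unknown mode: {mode}")

acts = [1.0, 10.0, 6.0, 5.0, 4.9, 8.0]  # made-up activations over 6 datapoints
top = select_examples(acts, k=2, mode="max")
half = select_examples(acts, k=2, mode="halfway")
print(top, half)
```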
where's the code you're running this with?
You can also look at my folder on the node, where there is also the autoencoders for layers 1-5 in my directory
ok so finally found a method that consistently works a significant amount better than our standard linear encoders; it gets the same unexplained variance at about half the mean number of features active. ran on 8 chunks compared to the 30 chunk run that hoagy did
method is basically linear dictionary as per usual but with 5-layer (could probably cut it down to 3 w/o significantly harming performance) learned ISTA-plus-momentum encoder
don't expect to use this significantly much for the circuit stuff I want to get into tomorrow, and also the sparsity-to-l1 thing is very unpredictable, but I guess it's nice to know we can do better than just linear encoding; if we end up needing sparser dictionaries we can just throw this + a lot of data at it. also note that I think sparsity-to-l1 will be a lot nicer if we pretrain with a linear encoder
yep
damn that's big! potentially more in the tank still with more data maybe?
uh yeah pretty noisy tho, i'd want to pretrain the decoder for like 1 chunk with linear encoders to get the dict right then freeze and train the encoder to that, then start training both in simul to get more consistent results, but yeah probably
took A Fucking While to find this but maybe useful especially for derivative stuff
hmm interesting wonder if that's necessary
if you point me to some dicts that are beyond the pareto frontier i'll see how they do on autointerp
sparse_coding_aidan_new/output_4_rd/ or something? Don't know which ones are best, I'd look at the sparsity-80-odd ones achieving about 0.05 variance explained? Or the random one in the 40s
Well, I think it'd improve consistency, but I guess we could just big run + multiple initializations + take good
yeah fair. i think the benefit of this approach depends on whether we think after 1 epoch of linear training the dict is like 'basically right', and you just finetune the encoding strat, vs wanting it to learn something substantially different
my intuition is that the noise reduction isn't the computationally difficult part, compared to doing good feature finding, so i expect it not to help too much but i could easily be wrong
also what's the role of ISTA in this setup? i thought in ISTA et al the encoder was just a set of feature weights learned fresh for each case, rather than a particular (eg 5 layer) formula?
Right, so LISTA is basically ISTA but with some parameters learned, all unrolled into a net, so each layer corresponds to one iteration. It's more computationally efficient + converges better or something, there's a bunch of lit about it. Specifically this is LISTA + a momentum update; can't tell quite what it's called, I think LFISTA is a reasonable name, which I think I saw in the lit somewhere but can't find now
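A rough sketch of what one such unrolled encoder computes, in NumPy and with nothing learned: plain ISTA with a FISTA-style momentum term, run for a fixed number of "layers". In LISTA the per-layer matrices and thresholds would be learned parameters; here they are derived directly from the dictionary `D`. The names `fista_encode`, `l1_alpha` and the array shapes are illustrative assumptions, not the encoder actually trained in this thread.

```python
import numpy as np


def soft_threshold(z, thresh):
    """Shrink each entry towards zero by thresh; entries within thresh become 0."""
    return np.sign(z) * np.maximum(np.abs(z) - thresh, 0.0)


def fista_encode(x, D, l1_alpha, n_layers=5):
    """Approximately minimise 0.5*||x - c @ D||^2 + l1_alpha*||c||_1 over c.

    x: (batch, d) activations, D: (n_feats, d) dictionary.
    Returns codes c of shape (batch, n_feats).
    """
    # Step size 1/L, where L is the squared spectral norm of D.
    step = 1.0 / (np.linalg.norm(D, ord=2) ** 2)
    c = np.zeros((x.shape[0], D.shape[0]))
    y, t = c.copy(), 1.0
    for _ in range(n_layers):  # one loop iteration plays the role of one layer
        grad = (y @ D - x) @ D.T
        c_next = soft_threshold(y - step * grad, step * l1_alpha)
        # Momentum update (the "F" in FISTA).
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = c_next + ((t - 1.0) / t_next) * (c_next - c)
        c, t = c_next, t_next
    return c
```

The decoder stays linear in this picture (`c @ D`); only the encoding step is iterative.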
ok, slightly good signal/sanity check, we're consistently beating [take the top-k components of PCA and project to that subspace]
the 'half the mean no. of activations' result was at 0.05, seems to be less big overall, but I think sparsity in ~100 range is reasonable anyways
Oh, nice sanity check!
@keen pivot where is the code you are using for the circuit stuffs?
I might also wait until you and hoagy have your meeting not sure what to do right now atm
I linked it above to Hoagy
@bitter turtle
Ah brill, sorry I missed that!
here's the correlation between interp score and feature variance, skew and kurtosis:
Cool, what are your takeaways on this? I'm not sure how much I trust autointerp. How many dicts are here?
takeaways are: searching by high variance (and mean, and % cases active, have also checked those now) for good features is not going to work, bit disappointing because i hoped that might give signal for which feats to choose in highly overcomplete dicts (though would be worth rerunning this with much larger sizes)
skew and kurtosis seem to be pretty much identical, no distinct signal between them but they're a reasonable proxy for feature goodness
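In case it helps anyone reproduce this kind of check, a small sketch of per-feature variance, skew and excess kurtosis plus a Pearson correlation against interp scores. `feature_moments` and `pearson` are hypothetical helpers, and the `(n_samples, n_feats)` activation layout is an assumption.

```python
import numpy as np


def feature_moments(acts):
    """acts: (n_samples, n_feats). Returns per-feature variance, skew, excess kurtosis."""
    acts = np.asarray(acts, dtype=float)
    centered = acts - acts.mean(axis=0)
    var = (centered ** 2).mean(axis=0)
    std = np.sqrt(var) + 1e-12  # guard against dead (constant) features
    skew = ((centered / std) ** 3).mean(axis=0)
    kurt = ((centered / std) ** 4).mean(axis=0) - 3.0
    return var, skew, kurt


def pearson(a, b):
    """Pearson correlation between two equal-length 1-D sequences."""
    return float(np.corrcoef(a, b)[0, 1])
```

Usage would be something like `pearson(interp_scores, skew)`, with `interp_scores` being one autointerp score per feature. A sparse, mostly-zero feature has large positive skew and kurtosis, which may be part of why the two track each other so closely.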
might clean and send to openai, i think the above graph is wrong i logged the wrong variable lol but the effect stands, will update in a bit
Yep yep
I'm not sure autointerp will necessarily be correlated with 'useability for circuits' like maybe it will but also it's weird and fucky? I'd select by some combination of sparsity and proportion variance explained maybe. What did you find for n dead neurons and how was it correlated (at all, a little bit, literally just noise?)
Looks extremely cool, help me interpret what I'm looking at
Is this a dictionary circuit
Oh, I was trying to figure out how to make it high-resolution, but you just click "open in browser" after you click on it initially
The top row is dict for layer 5 residual. The rest are previous layers and are prepended by "4_..." for layer 4
The text is my interpretation of what the feature means
Insane, extremely cool
The percentage here means, given 10 activating examples of feature 4_1030, when I ablate feature 3_891, those activations go down by 36% on average.
The "... | 0.83" is cosine similarity to track how similar the directions are.
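The two numbers on each edge can be sketched as follows. This assumes the percentage is the mean of per-example fractional drops (it could equally be the drop in the mean), and both function names are made up for illustration.

```python
import math


def mean_activation_drop(before, after):
    """Average fractional drop in a downstream feature's activation.

    before/after: the downstream feature's activation on each example,
    without and with the upstream feature ablated.
    """
    drops = [(b - a) / b for b, a in zip(before, after) if b != 0]
    return sum(drops) / len(drops)


def cosine_similarity(u, v):
    """Cosine of the angle between two feature directions."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)
```

A return value of 0.36 from `mean_activation_drop` corresponds to the "36%" label on an edge.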
Agreed, I'm very excited. Still more work to do to more rigorously show it, but it's quite crazy that they fit so well so far!
running autointerp over the lista dicts and getting a v high proportion with no activations
assuming that it's layer2resid
Each is 2k. Can you plot like a hist of it?
I'm getting decent perplexities for layer2resid, so must be true
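For context on the numbers that follow: perplexity here is presumably the usual exp of the mean per-token negative log-likelihood, measured with the layer's activations swapped for the dictionary reconstruction (that swap is an assumption about the setup, and is not shown). A minimal sketch of the metric itself, with a hypothetical `perplexity` helper:

```python
import math


def perplexity(token_log_probs):
    """exp(mean negative log-likelihood), using natural-log token probabilities."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)


# A model that assigns probability 0.25 to every correct token has perplexity 4.
example = perplexity([math.log(0.25)] * 4)
```

Because of the exponent, small changes in loss show up as large swings in perplexity, so noisy-looking values are not surprising on their own.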
Layer2resid:
Perplexity for l1=1.16E-05: 51.76
Perplexity for l1=1.35E-05: 93.64
Perplexity for l1=1.56E-05: 147.65
Perplexity for l1=1.81E-05: 97.01
Layer3resid:
Perplexity for l1=1.00E-05: 117.17
Perplexity for l1=1.16E-05: 67.78
Perplexity for l1=1.35E-05: 112.00
Though they're surprisingly close which may mean more about the similarity of the residual stream. Still should investigate this (in case something fishy is going on/code error)
how do these compare to previous perplexities?
Min is 43 perplexity. Base model is ~25
im getting an OOM when i try to generate the histogram, dont wanna screw the big run 😟
min being like zero l1 alpha?
I just did layer 3 activations instead of layer 2
Like 3e-5
Perplexity for l1=1.16E-05: 51.76
Perplexity for l1=1.35E-05: 93.64
Perplexity for l1=1.56E-05: 147.65
Perplexity for l1=1.81E-05: 97.01
Perplexity for l1=2.10E-05: 46.16
Perplexity for l1=2.44E-05: 55.72
Perplexity for l1=2.83E-05: 43.57
Perplexity for l1=3.28E-05: 95.97
Perplexity for l1=3.81E-05: 98.02
Super noisy down there. Maybe they're learning the identity
that's a lot of variation!
Huge
hard to learn the identity through 5 mlp layers!
Imma do qualitative on the 43 one to see if it learned the identity...oh, lololol
Also w/ 2k features
I can at least look at the decoder.
This is previous results on a good dict for layer 3. So the LISTA results are better, but may not be if they're learning an identity-like thing.
43 would be huge if it's real. Would just need to train more & do better ML for convergence. I also expect it needs to be bigger based off previous results replied to above.
I also like Aidan's idea (on a call from today) of comparing PCA w/ the dict across the same number of dimensions for both perplexity diff & variance explained.
I don't think it should be better because one benefit of dictionary is we can use more dimensions than the original, but if it is, then that's a cool result.
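A sketch of the PCA side of that comparison: project onto the top-k principal components, map back, and report the fraction of variance left unexplained. The same `fraction_variance_unexplained` could then be applied to a dictionary's reconstruction with k active features. Function names are hypothetical, and matching on "k components vs k active features" is one assumed way of equalising dimensions.

```python
import numpy as np


def pca_topk_reconstruct(x, k):
    """Project centred activations x (n, d) onto the top-k PCs and map back."""
    mean = x.mean(axis=0, keepdims=True)
    xc = x - mean
    # Rows of vt are principal directions, ordered by singular value.
    _, _, vt = np.linalg.svd(xc, full_matrices=False)
    components = vt[:k]  # (k, d)
    return (xc @ components.T) @ components + mean


def fraction_variance_unexplained(x, x_hat):
    """Residual sum of squares divided by total variance around the mean."""
    resid = ((x - x_hat) ** 2).sum()
    total = ((x - x.mean(axis=0, keepdims=True)) ** 2).sum()
    return float(resid / total)
```

The perplexity half of the comparison would need the reconstruction spliced back into the model, which is not sketched here.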
what was the lowest perplexity for a non-lista dict?
Weird. Also weird that it has better sparsity-to-variance explained than other ones but also a bunch of zero directions. I'm very confused
I think this histogram thing might be broken slightly. @pallid current what code did you use to generate the previous histogram, could you try comparing that code and this code?
i can but the code for that uses too much memory somehow
ok aaay
well at least, gpt2 is almost full and i thought that was from my jupyter server but ive cut that and it's still nearly full so unsure tbh
**gpu2 loool
I decided to look at the nonzero activations & see if they're meaningful.
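A cheap way to cross-check the "lots of zero directions" observation without the memory-hungry histogram code: compute, per feature, the fraction of samples on which it fires, then count the features that never fire. This is only a sketch; `activation_frequency` and `dead_fraction` are hypothetical names and the `(n_samples, n_feats)` code layout is assumed.

```python
import numpy as np


def activation_frequency(codes, eps=0.0):
    """codes: (n_samples, n_feats). Fraction of samples where each feature is active."""
    codes = np.asarray(codes, dtype=float)
    return (np.abs(codes) > eps).mean(axis=0)


def dead_fraction(codes, eps=0.0):
    """Fraction of features that are never active on the given samples."""
    return float((activation_frequency(codes, eps) == 0).mean())


codes = np.array([[0.0, 1.0, 0.0],
                  [0.0, 2.0, 0.0],
                  [0.0, 0.0, 3.0]])
freq = activation_frequency(codes)
# Histogram of per-feature firing rates; only needs a length-n_feats vector.
counts, edges = np.histogram(freq, bins=4, range=(0.0, 1.0))
```

Running this in batches and summing the per-feature active counts would keep memory flat, since only a length-`n_feats` vector has to persist between batches.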