What is learned, when? | EleutherAI | Page 2

marsh sand Mar 28, 2024, 4:07 AM

#

I would say the model still converges after step 1000 ? (or 1430)

charred bluff Mar 28, 2024, 4:08 AM

#

the entire embedding space is rotating

#

that is why cos sim doesn't show it

gaunt plaza Mar 28, 2024, 4:08 AM

#

marsh sand Mar 28, 2024, 4:09 AM

#

charred bluff that is why cos sim doesn't show it

should I try calculating the euclidean distance from its original initialization?

charred bluff Mar 28, 2024, 4:09 AM

#

i guess the grad norm jump just after 100 can just be a normal oddity of training

charred bluff Mar 28, 2024, 4:09 AM

#

marsh sand should I try calculating the euclidean distance from its original initialization...

i am trying to think what makes sense to measure

#

if the entire embedding space, every vector in it, were say rotated by some specific matrix, what quantity would remain the same?

#

oh, i guess norm could also be changing, either norm or angle should be different

marsh sand Mar 28, 2024, 4:11 AM

#

the norm should be dependent on the rotation matrix (I guess?)

charred bluff Mar 28, 2024, 4:12 AM

#

i think the relationships between vectors would remain the same if they all experienced the same rotation

gaunt plaza Mar 28, 2024, 4:12 AM

#

If we dot product randomly sampled pairs

#

bonkies

charred bluff Mar 28, 2024, 4:13 AM

#

gaunt plaza If we dot product randomly sampled pairs

maybe normalize them

#

but yeah

#

random sample bc there's 50,000 of them and doing them all sounds like a pain

marsh sand Mar 28, 2024, 4:14 AM

#

I think I'm messed up but

#

since cosine similarity is also the dot product divided by their norms

#

shouldn't it be invariant to rotation?

charred bluff Mar 28, 2024, 4:15 AM

#

it should be invariant if you rotate both equally

marsh sand Mar 28, 2024, 4:15 AM

#

aha

charred bluff Mar 28, 2024, 4:15 AM

#

if they are rotated differently it should not

#

if eg you choose one reference token preferably one of the ones that trains

#

and you take cos with every other one, i think that is invariant to "all embeddings get rotated over time" and then running that across time might be cleaner than cos similarity

#

that ... might give us a cleaner indication of if they effectively stop training early

#

in fact, the sheer uniformity with which embeddings converge towards their final representation sort of suggests this? they do follow very clean trajectories
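The invariance argument in the messages above can be checked on toy data. This is a minimal sketch, not anything from the project's notebooks: the matrix sizes, the helper name `cos_sim_matrix`, and the random orthogonal matrix `Q` (which may include a reflection as well as a rotation) are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def cos_sim_matrix(E):
    """All-pairs cosine similarity for the rows of E."""
    unit = E / np.linalg.norm(E, axis=1, keepdims=True)
    return unit @ unit.T

# Toy "embedding matrix": 100 tokens, 16 dimensions (sizes are arbitrary).
E = rng.normal(size=(100, 16))

# Random orthogonal matrix from a QR decomposition, standing in for
# "the whole space got rotated between two checkpoints".
Q, _ = np.linalg.qr(rng.normal(size=(16, 16)))
E_rot = E @ Q

# Token-vs-token similarities are unchanged by the shared rotation...
pairwise_change = np.abs(cos_sim_matrix(E) - cos_sim_matrix(E_rot)).max()

# ...and so are the norms...
norm_change = np.abs(
    np.linalg.norm(E, axis=1) - np.linalg.norm(E_rot, axis=1)
).max()

# ...but each token's similarity to its own earlier self is not.
self_sim = np.sum(E * E_rot, axis=1) / (
    np.linalg.norm(E, axis=1) * np.linalg.norm(E_rot, axis=1)
)

print(f"max change in pairwise cos sim: {pairwise_change:.2e}")
print(f"max change in norm: {norm_change:.2e}")
print(f"mean cos sim of token vs rotated self: {self_sim.mean():.3f}")
```

A shared rotation leaves both the token-to-token similarities and the norms alone, so a "current vs. final" cosine plot can keep moving even when nothing about the relationships between tokens has changed.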

charred bluff Mar 28, 2024, 4:18 AM

#

charred bluff if eg you choose one reference token preferably one of the ones that trains

it is probably smarter to random sample like clyde suggested

#

like, the same sample for each timestep and enough pairs that you guarantee each embedding is included twice

#

if you pick one token and its weird it messes up the entire result

#

this will give you a scalar for each sample and you can track how much it varies

#

it might similarly make sense to track ratios of pairs of norms since we are throwing out that info when we take cos sim
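One way to get "the same sample for each timestep" with every embedding included twice is to draw a single random ring over the token indices. The sketch below is only an illustration of that idea: `sample_pairs`, `pair_stats`, and the toy checkpoints are hypothetical names and data, not the project's code.

```python
import numpy as np

def sample_pairs(n_tokens, seed=0):
    """Random ring over the indices: every index appears in exactly two pairs."""
    perm = np.random.default_rng(seed).permutation(n_tokens)
    return np.stack([perm, np.roll(perm, -1)], axis=1)

def pair_stats(E, pairs):
    """Cosine similarity and norm ratio for each sampled pair of rows."""
    a, b = E[pairs[:, 0]], E[pairs[:, 1]]
    na, nb = np.linalg.norm(a, axis=1), np.linalg.norm(b, axis=1)
    cos = np.sum(a * b, axis=1) / (na * nb)
    return cos, na / nb

# Toy stand-in for embedding checkpoints (real ones would be loaded from disk).
rng = np.random.default_rng(1)
checkpoints = [rng.normal(size=(1000, 32)) for _ in range(3)]

pairs = sample_pairs(1000)  # the same pairs are reused at every timestep
history = np.stack([pair_stats(E, pairs)[0] for E in checkpoints])

print(history.shape)               # (timesteps, pairs)
print(history.std(axis=0).mean())  # how much each pair's cos sim varies
```

Fixing the seed keeps the pairs identical across checkpoints, so each column of `history` is one scalar tracked over time; the second return value of `pair_stats` gives the norm ratios for the same pairs.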

marsh sand Mar 28, 2024, 4:21 AM

#

so we could get a pairwise similarity between a good bunch of tokens

#

and see how they change over time?

charred bluff Mar 28, 2024, 4:22 AM

#

marsh sand and see how they change over time?

yes, possibly also ratio of norm but i no longer recall if we have a ln right after embedding for pythia

#

ratio is less revealing if so

marsh sand Mar 28, 2024, 4:22 AM

#

hmm sounds a bit like a graph?

#

with each token as a node

#

and the similarity as edges?

charred bluff Mar 28, 2024, 4:23 AM

#

yes

#

i am dredging for whether this suggests a clever divide-and-conquer way to calculate the all-to-all, but cosine similarity is not transitive

charred bluff Mar 28, 2024, 4:54 AM

#

.... I do not see a clever way to avoid the all-to-all if we want to calculate them all

marsh sand Mar 28, 2024, 4:56 AM

#

agree I guess we will have to calculate the entire pairwise similarity (or whatever)

charred bluff Mar 28, 2024, 5:01 AM

#

yeah i think just subsample

#

it is some billions if you calculate all of them

#

oh, wait, only 400 million

#

still way high to do 150-or-so times
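For a sense of scale, here is the count worked out under the rough figures quoted in the thread (about 50,000 embeddings and about 150 checkpoints; both are approximations, not exact values for any particular model).

```python
vocab = 50_000     # approximate figure quoted in the thread
checkpoints = 150  # "150-or-so"

full_matrix = vocab * vocab              # every ordered pair, including self-pairs
unique_pairs = vocab * (vocab - 1) // 2  # unordered pairs, no self-pairs

print(f"{full_matrix:,} entries in the full matrix")
print(f"{unique_pairs:,} unique pairs")
print(f"{unique_pairs * checkpoints:,} pair evaluations over all checkpoints")
```

Under these assumptions the all-to-all comes out in the low billions per checkpoint rather than the hundreds of millions, which is the same order of magnitude as the first estimate.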

marsh sand Mar 28, 2024, 5:04 AM

#

how should we choose the subsample? will random sampling be good enough?

charred bluff Mar 28, 2024, 5:18 AM

#

i think random is fine

#

any oddities will come out in the wash give or take

#

if there's an easy way to exclude the tokens that don't train that would be better

marsh sand Mar 28, 2024, 5:56 AM

#

i get it

#

my ARR review ends in a few days so I can try comparing the pairwise cos-sim of a subsample after that

#

or someone could just go ahead

gaunt plaza Mar 28, 2024, 6:00 AM

#

I’ll give it a shot (I will fail miserably)

slender spoke Mar 28, 2024, 2:51 PM

#

I've done cosine similarities on entire embedding matrices before, you just have to do it intelligently. You don't want to do it individually by item, you want to do it to the whole matrix at the same time. I believe I used the following code, and it took about two minutes to run when I did this last.
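The code referred to in the message above is not reproduced in this log. As a stand-in, here is a minimal NumPy sketch of the whole-matrix approach it describes: normalize every row once, then take a single matrix product. The function name and the toy sizes are assumptions; the same two steps carry over to torch tensors.

```python
import numpy as np

def all_pairs_cos_sim(E, eps=1e-12):
    """Cosine similarity between every pair of rows, in one matrix product."""
    norms = np.linalg.norm(E, axis=1, keepdims=True)
    unit = E / np.maximum(norms, eps)  # guard against all-zero rows
    return unit @ unit.T

# Toy matrix; a real embedding table would be (vocab_size, hidden_dim).
E = np.random.default_rng(0).normal(size=(2000, 64)).astype(np.float32)
S = all_pairs_cos_sim(E)

print(S.shape, float(S.min()), float(S.max()))
```

The result is a dense `(n, n)` matrix, so memory rather than compute tends to be the limit: at float32 a matrix over roughly 50,000 rows needs on the order of 10 GB.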

charred bluff Mar 28, 2024, 3:23 PM

#

slender spoke I've done cosine similarities on entire embedding matrices before, you just have...

oh lol, yeah i was for some reason thinking of sklearn where this would take forever

#

torch probably does it in a reasonable span of time

charred bluff Mar 28, 2024, 5:09 PM

#

iff the model is actually rotating and not adding information to the embedding layer we should expect the all-to-all cosine similarity to stop changing on a good set of indices
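A hedged sketch of what "stops changing" could look like as a number: the mean absolute change in all-to-all cosine similarity between two checkpoints, restricted to a chosen set of indices. `relation_drift` is a hypothetical helper and the two toy "next checkpoints" (a pure rotation versus added noise) are made up for contrast.

```python
import numpy as np

def cos_sim_matrix(E):
    """All-pairs cosine similarity for the rows of E."""
    unit = E / np.linalg.norm(E, axis=1, keepdims=True)
    return unit @ unit.T

def relation_drift(E_prev, E_next, idx):
    """Mean absolute change in all-to-all cos sim among the rows in idx."""
    delta = cos_sim_matrix(E_next[idx]) - cos_sim_matrix(E_prev[idx])
    return np.abs(delta).mean()

rng = np.random.default_rng(0)
E0 = rng.normal(size=(500, 32))
Q, _ = np.linalg.qr(rng.normal(size=(32, 32)))  # shared orthogonal transform
idx = rng.choice(500, size=200, replace=False)  # a "good set of indices"

drift_rotation = relation_drift(E0, E0 @ Q, idx)
drift_learning = relation_drift(E0, E0 + 0.5 * rng.normal(size=E0.shape), idx)

print(f"drift under pure rotation: {drift_rotation:.2e}")
print(f"drift under per-token changes: {drift_learning:.2e}")
```

Near-zero drift is what a shared rotation (or uniform rescaling) would produce, though on its own it would not rule out changes that happen to leave cosine similarities intact.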

gaunt plaza Mar 29, 2024, 4:12 PM

#

harold

#

i'll more or less be able to run it on my gpu once my unrelated training project is complete. I am however, unable to download 100GB worth of embeddings. i'll write what's basically a proof of concept wrapped in a loop that i'll verify through downloading './pythia-12b-weights/embed_only_0-29000.pkl' and base it off analyze-all-embeddings.ipynb, someone else will have to figure out if what i wrote is workable or garbage, or workable garbage (most likely).

gaunt plaza Mar 30, 2024, 12:05 PM

#

Guys. Good news. I am Gyges’ 69th follower on Twitter berk

gaunt plaza Mar 30, 2024, 6:24 PM

#

garbage here: https://github.com/Clyde013/pythia-embedding-analysis/blob/main/notebooks/embed_pairwise_cos_sims.ipynb
kernel kept dying on me, but the code works. someone else go ahead and rack up their internet and electricity bills on my behalf, because i'm a jobless NEET who doesn't pay rent.
I can make a PR if you want (once i wake up).
also threw in some pain in the ass visualisation code at the end. it should work out of the box no alterations necessary, except changing the ylim. and ~~maybe~~ definitely fixing bugs i did not catch.
ok it is 2:30am time to eep.

GitHub

pythia-embedding-analysis/notebooks/embed_pairwise_cos_sims.ipynb a...

Contribute to Clyde013/pythia-embedding-analysis development by creating an account on GitHub.

charred bluff Mar 30, 2024, 6:46 PM

#

gaunt plaza garbage here: https://github.com/Clyde013/pythia-embedding-analysis/blob/main/no...

thank you

gaunt plaza Mar 31, 2024, 4:46 AM

#

gaunt plaza garbage here: https://github.com/Clyde013/pythia-embedding-analysis/blob/main/no...

Once I’m free later in the day, I’ll work on improving the plot visualisation to generate multiple plots for multiple indices to make it more efficient. Unless someone already beat me to it.

#

Oh and I’ll also include the norm ratios I didn’t graph that out

#

Anything else I might have missed?

charred bluff Mar 31, 2024, 7:00 AM

#

gaunt plaza Anything else I might have missed?

lgtm, i would spot check the norm to make sure output is reasonable but that's probably a me thing b/c my pytorch isn't perfect, i would also not worry too much about visualizing and save to .json because we can always viz at a later step

gaunt plaza Mar 31, 2024, 7:05 AM

#

charred bluff lgtm, i would spot check the norm to make sure output is reasonable but that's p...

Yeah I pkl all the results so we can just read from saved files. I know you’re probably running some other tokenizer experiment to prove shmucks on the internet wrong and that clearly takes priority. So no rush. Maybe I’ll be able to clear enough space on my hard disk to download all the embedding checkpoints. @marsh sand is also busy, but I think you already have copies of the embedding files locally? If you could take 5 mins to just run the notebook cells they should work out of the box (just uncomment the correct file names).

charred bluff Mar 31, 2024, 7:06 AM

#

gaunt plaza Yeah I pkl all the results so we can just read from saved files. I know you’re p...

other experiment has been pending for 6mo, it isn't a big deal, if you have a clean script that works i can just run it on my machine because i have all the checkpoints

#

or a notebook, equivalently

gaunt plaza Mar 31, 2024, 7:06 AM

#

gaunt plaza garbage here: https://github.com/Clyde013/pythia-embedding-analysis/blob/main/no...

Should work ^

charred bluff Mar 31, 2024, 7:06 AM

#

kk, will try to run through it

#

i will have to figure out ... infra nonsense

#

like, tiny details with file paths etc

gaunt plaza Mar 31, 2024, 7:06 AM

#

Remember to uncomment the other checkpoint file names in the files list berk

gaunt plaza Mar 31, 2024, 7:07 AM

#

charred bluff i will have to figure out ... infra nonsense

That’s always the boring part

#

sadge

charred bluff Mar 31, 2024, 7:07 AM

#

the boring part consumes, perceptually, 90% of the time

#

the interesting part consumes zero time even though it objectively consumes time

#

the other 10% is woolgathering

gaunt plaza Mar 31, 2024, 7:07 AM

#

Unfortunately it is so. Thinking of ideas and debating them is intellectually stimulating and fun. Too bad we actually have to verify them blobsad

charred bluff Mar 31, 2024, 7:11 AM

#

charred bluff other experiment has been pending for 6mo, it isn't a big deal, if you have a cl...

(the one way it's time critical is that i have access to a good machine with a blank check right now and don't usually have that lmao)

gaunt plaza Mar 31, 2024, 7:12 AM

#

You can rent my singular 4090 thinkies

charred bluff Mar 31, 2024, 7:14 AM

#

this is somewhat larger than that

#

so i can run a medium-big pythia on it <_<

marsh sand Mar 31, 2024, 7:38 AM

#

gaunt plaza Yeah I pkl all the results so we can just read from saved files. I know you’re p...

Im supposed to be busy but... no reviewers are responding to my review so…

#

I can just try running jupyter file

#

Is segygs working on it or should i go ahead?

gaunt plaza Mar 31, 2024, 7:40 AM

#

sus no organisations gonna sponsor you (the one and only gyges) for your medium big Pythia project?

gaunt plaza Mar 31, 2024, 7:41 AM

#

marsh sand Im supposed to be busy but... no reviewers are responding to my review so…

sadge I’m sure it’s because none of the reviewers are able to find a single flaw in your perfect paper bro worried_partying

#

You got this man

marsh sand Mar 31, 2024, 8:00 AM

#

Haha thanks ill just go ahead and run it, wont take long

gaunt plaza Mar 31, 2024, 8:02 AM

#

marsh sand Haha thanks ill just go ahead and run it, wont take long

Don’t sweat it bro. Mamba got rejected. Not copium at all to say review process is broken harold

marsh sand Mar 31, 2024, 10:57 AM

#

Im redownloading the embeddings, and probably would be able to share the results by tomorrow

gaunt plaza Mar 31, 2024, 11:09 AM

#

marsh sand Im redownloading the embeddings, and probably would be able to share the results...

Rip your internet bro. At least I’ll be able to play league of legends on good ping since I’m not downloading it. O7 thanks for your sacrifice ultraberk

marsh sand Apr 1, 2024, 5:04 AM

#

oops

#

im trying to run the code and

#

the pythia embeddings + pickle files dumped from the notebook take about 360gb in my storage

#

and im out of disk space lol

gaunt plaza Apr 1, 2024, 5:25 AM

#

blobsad

#

Maybe we have to compute the embeddings and graph their plots on the fly then.

#

If we save the precomputed cos sims it’s going to take up a lot of hard disk space

twin heron Apr 1, 2024, 5:59 AM

#

You are trying to create a visualization like that but for all embeddings?

#

https://www.lesswrong.com/posts/2JJtxitp6nqu6ffak/basic-facts-about-language-models-during-training-1#Tokenizer_Embeddings_are_rapidly_learnt_then_stabilize

Basic facts about language models during training — LessWrong

We thank Eric Winsor, Lee Sharkey, Dan Braun, Carlos Ramon Guevara, and Misha Wagner for helpful suggestions and comments on this post. …

gaunt plaza Apr 1, 2024, 6:30 AM

#

twin heron https://www.lesswrong.com/posts/2JJtxitp6nqu6ffak/basic-facts-about-language-mod...

Lol wait a minute… @charred bluff

marsh sand Apr 1, 2024, 6:31 AM

#

twin heron You are trying to create a visualization like that but for all embeddings?

woah this looks coool

marsh sand Apr 1, 2024, 6:36 AM

#

gaunt plaza blobsad

I have access to a h100 server with bigger storage ill try again later at night

gaunt plaza Apr 1, 2024, 6:36 AM

#

marsh sand woah this looks coool

I swear to god we even linked that article waaaaay back and just forgot about it

marsh sand Apr 1, 2024, 6:37 AM

#

lol

gaunt plaza Apr 1, 2024, 8:36 AM

#

twin heron You are trying to create a visualization like that but for all embeddings?

i realise i forgot to respond to you. sorry, that's really inconsiderate of me.

Yes we are doing something similar but on an individual embedding scale across training steps. Initially when we plotted the cos-sim between the current step of a token representation and its final representation (fig1) you can see that the embeddings seem to converge towards the final representation throughout the entire training sequence. However, we used a linear probe (several different classifier models on a toy task) to determine at which step information relevant to a token's attributes (in our case the task was part of speech classification) is learned in the embedding (fig2).

Now note the discrepancy between our initial conclusion and what fig2 implies. By step 1000-2000 the embeddings seem to have gained a majority of the token information, and this plateaus at 10,000 steps. In fig1, the embeddings only begin converging towards their final representation at 10,000 steps, but before that the embeddings remain "constant" (at least relative to their final representations).

As Gyges has pointed out, this might imply that the embeddings actually begin to just rotate in embedding space once they have completed their training, and we are currently trying to verify this hypothesis of rotation by plotting pairwise cos-sims between tokens, using a singular token as an anchor point (which we will repeat multiple times in case said singular token for some reason has a skill issue). If this is true, what we might see in the pairwise cos-sim line graphs is many lines that zigzag wildly and overlap (tokens are updating their representations) for the first 10,000 steps, before the lines all flatten out and become straight (the entire space, and all tokens, rotate (fig3)).

#

at least, i think that's what we're doing drinkies

marsh sand Apr 1, 2024, 10:59 AM

#

marsh sand I have access to a h100 server with bigger storage ill try again later at night

ive started re-running, this time it seems fine

gaunt plaza Apr 1, 2024, 11:29 AM

#

marsh sand ive started re-running, this time it seems fine

🙏

#

any reviewer responses?

marsh sand Apr 1, 2024, 11:30 AM

#

gaunt plaza 🙏

got one and is keeping the original score lol

gaunt plaza Apr 1, 2024, 11:31 AM

#

marsh sand got one and is keeping the original score lol

hope the original score is a good one...? harold

marsh sand Apr 1, 2024, 11:32 AM

#

slightly?

#

haha

#

I would have to fix and resubmit the paper anyway, no worries

marsh sand Apr 1, 2024, 12:31 PM

#

just finished dumping the sim matrices

gaunt plaza Apr 1, 2024, 12:45 PM

#

marsh sand just finished dumping the sim matrices

awesome. do you want me to update the code to run for multiple embedding indices at once? i should right... more efficient.

marsh sand Apr 1, 2024, 12:47 PM

#

what do you mean by multiple embeddings at once?

gaunt plaza Apr 1, 2024, 12:48 PM

#

the current plotting code only creates 1 graph when it loads and unloads each calculated cos-sim matrix into memory

#

at the same time i am also really worried about overloading CPU memory by storing too many linecollection instances...

charred bluff Apr 1, 2024, 12:56 PM

#

gaunt plaza Lol wait a minute… @charred bluff

yeah i have seen that one

charred bluff Apr 1, 2024, 12:56 PM

#

marsh sand I have access to a h100 server with bigger storage ill try again later at night

i have storage

#

i just uh have a poor well of spare focus rn

gaunt plaza Apr 1, 2024, 12:56 PM

#

hows it going

marsh sand Apr 1, 2024, 12:57 PM

#

im trying to push the extracted sim matrices but its taking a while

marsh sand Apr 1, 2024, 12:57 PM

#

gaunt plaza the current plotting code only creates 1 graph when it loads and unloads each ca...

this will help definitely

charred bluff Apr 1, 2024, 12:58 PM

#

it is going ... okay

gaunt plaza Apr 1, 2024, 12:58 PM

#

marsh sand im trying to push the extracted sim matrices but its taking a while

yeah i wonder why...? ultraberk

gaunt plaza Apr 1, 2024, 12:58 PM

#

charred bluff it is going ... okay

good to hear man

charred bluff Apr 1, 2024, 12:58 PM

#

i assume the extracted sim matrixes are uncomfortably large

marsh sand Apr 1, 2024, 12:59 PM

#

yup its 9.6gb per file

#

and I have 153 of those

charred bluff Apr 1, 2024, 1:00 PM

#

1.5 tb? yeah that's a decent size

charred bluff Apr 1, 2024, 1:01 PM

#

twin heron You are trying to create a visualization like that but for all embeddings?

i think we're doing a random ish sampling but basically yes, and basically we want to see what the rate of change for that looks like over this entire thing

#

clyde's breakdown is more detailed and good, and also he actually wrote the code for this step

gaunt plaza Apr 1, 2024, 3:17 PM

#

you know, if the tokens really are just rotating... why?

#

would freezing the tokens during their rotary stage impact the training in any way?

#

and after that... then what? any direction for the project actually, i'm starting to see the light at the end of tunnel here.

marsh sand Apr 1, 2024, 3:53 PM

#

#

im not sure how this will help but

#

each line in the plot means the average similarity of token "i" and every other token at a given step

#

Im assuming the straight lines there are the tokens that arent trained over the entire training loop

gaunt plaza Apr 1, 2024, 3:57 PM

#

marsh sand each line in the plot means the average similarity of token "i" and every other ...

x axis being step?

#

no wait

marsh sand Apr 1, 2024, 3:59 PM

#

yup x-axis by checkpoint

gaunt plaza Apr 1, 2024, 3:59 PM

#

ah i see

marsh sand Apr 1, 2024, 3:59 PM

#

oh no

gaunt plaza Apr 1, 2024, 3:59 PM

#

interesting graph

marsh sand Apr 1, 2024, 3:59 PM

#

wait a sec

gaunt plaza Apr 1, 2024, 4:01 PM

#

that graph isn't possible...

#

cosine similarity is between 0 to 1

#

bonkies

marsh sand Apr 1, 2024, 4:04 PM

#

i think you get values between -1 and 1

#

isnt it..?

#

assuming I have the right numbers... im just trying random things but this seems a bit interesting

#

the two images are the same plots with different scaling (the second in log scale) but it seems that actually the av cos-sim actively changes after the 10,000th step? this might be due to some human-set parameter

#

im still trying to push the calculated cos-sims to hf but its not working, ill try again tomorrow

gaunt plaza Apr 1, 2024, 4:12 PM

#

marsh sand isnt it..?

huh yeah it is my bad G
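A quick check of the range just settled on: cosine similarity runs from -1 (opposite directions) through 0 (orthogonal) to 1 (same direction). The helper below is a plain-Python sketch written for this note.

```python
import math

def cos_sim(u, v):
    """Cosine similarity of two equal-length vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

same = cos_sim([1.0, 2.0], [2.0, 4.0])        # same direction
orthogonal = cos_sim([1.0, 0.0], [0.0, 3.0])  # right angle
opposite = cos_sim([1.0, 2.0], [-1.0, -2.0])  # opposite direction

print(same, orthogonal, opposite)
```

Negative averages in a plot of embedding similarities are therefore possible in principle and not, by themselves, a sign of a bug.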

gaunt plaza Apr 1, 2024, 4:13 PM

#

marsh sand im still trying to push the calculated cos-sims to hf but its not working ill tr...

ngl its perfectly fine to be unable to do that. even i don't want to download it. maybe once i fix up the multi-indexing plotting you can run them locally for like multiple indices.

#

i haven't implemented norm graphing yet. but its getting late where i am as well and i have stuff to do in the morning.

#

hey what the heck @marsh sand isn't it 1am where you are

marsh sand Apr 1, 2024, 4:17 PM

#

haha yes surprising you know where i live

#

which tz do you live?

gaunt plaza Apr 1, 2024, 4:17 PM

#

marsh sand haha yes surprising you know where i live

yonsei uni its in your twitter and ur in korea i assume

gaunt plaza Apr 1, 2024, 4:18 PM

#

marsh sand which tz do you live?

1 hour behind u, sg time

gaunt plaza Apr 1, 2024, 4:19 PM

#

marsh sand assuming I have the right numbers... im just trying random things but this seems...

i really don't know what to make of these graphs lol. @charred bluff any ideas?

#

@marsh sand just pushed to my branch an updated version with n_samples parameter to plot multiple graphs at once.
https://github.com/Clyde013/pythia-embedding-analysis/blob/main/notebooks/embed_pairwise_cos_sims.ipynb
you might not want to set n_samples too high, its not memory optimized in any way, you might have to wrap the whole cell in a loop if you intend on running it overnight.


charred bluff Apr 1, 2024, 4:22 PM

#

marsh sand the two images are the same plots with different scaling (the second in log scal...

that is so incredibly weird

#

nothing special happens at 10k

marsh sand Apr 1, 2024, 4:22 PM

#

gaunt plaza @marsh sand just pushed to my branch an updated version with n_samples...

thanks ill try it out tomorrow

charred bluff Apr 1, 2024, 4:23 PM

#

i would 100% want to just do the calculation and do plotting separately but definitely cannot today

marsh sand Apr 1, 2024, 4:24 PM

#

by separately u mean one plot per token?

gaunt plaza Apr 1, 2024, 4:25 PM

#

marsh sand by separately u mean one plot per token?

eyyy guess what my code cell does…

#

That’s right, plots every token independently!

#

ultraberk

#

Goodnight gson

#

Goodnight gyges

twin heron Apr 1, 2024, 4:49 PM

#

marsh sand yup its 9.6gb per file

Use float16 instead of 32

#

And raw binary instead of pickle, helps a little bit
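A small sketch of that suggestion, assuming the similarity matrices are NumPy arrays (the toy matrix, file names, and temporary directory here are made up). Raw files store no shape or dtype, so both have to be supplied again on load.

```python
import os
import tempfile
import numpy as np

rng = np.random.default_rng(0)
unit = rng.normal(size=(512, 32))
unit /= np.linalg.norm(unit, axis=1, keepdims=True)
sims = (unit @ unit.T).astype(np.float32)  # toy similarity matrix

with tempfile.TemporaryDirectory() as tmp:
    path32 = os.path.join(tmp, "sims_f32.bin")
    path16 = os.path.join(tmp, "sims_f16.bin")
    sims.tofile(path32)                     # raw bytes, no pickle framing
    sims.astype(np.float16).tofile(path16)  # half the bytes per entry
    size32, size16 = os.path.getsize(path32), os.path.getsize(path16)
    # Shape and dtype are not stored in a raw file; pass them explicitly.
    restored = np.fromfile(path16, dtype=np.float16).reshape(sims.shape)

max_err = np.abs(restored.astype(np.float32) - sims).max()
print(size32, size16, max_err)
```

For values in [-1, 1] the float16 rounding error stays below about 1e-3, which is probably acceptable for plotting trends; at half the size, a 9.6 GB file would drop to roughly 4.8 GB.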

twin heron Apr 1, 2024, 4:53 PM

#

gaunt plaza i realise i forgot to respond to you. sorry, that's really inconsiderate of me. ...

fig1 is weird. Did you double-check if it is correct? Like doing the same calculation but comparing with the first embeddings instead of the last one?

charred bluff Apr 1, 2024, 5:25 PM

#

marsh sand assuming I have the right numbers... im just trying random things but this seems...

figured this out. loss gets rescaled at increments of 1000 so we will sometimes expect to see discontinuities there

charred bluff Apr 1, 2024, 5:29 PM

#

twin heron fig1 is weird. Did you double-check if it is correct? Like doing the same calcul...

https://github.com/segyges/pythia-embedding-analysis/blob/main/notebooks/cal_cos_sim.ipynb it's pretty straightforward, i think

GitHub

pythia-embedding-analysis/notebooks/cal_cos_sim.ipynb at main · seg...

Contribute to segyges/pythia-embedding-analysis development by creating an account on GitHub.

#

i guess maybe i am being fooled by the log scale

#

but like, it sure seems like no?

#

like: that does look like an actual discontinuity, we just can't see more than one of them at precision-scaling cliffs because everything past 10^3 gets compressed along the x axis

twin heron Apr 1, 2024, 5:39 PM

#

charred bluff https://github.com/segyges/pythia-embedding-analysis/blob/main/notebooks/cal_cos...

I don't get this code, what should it do exactly?

#

What is data.keys?

twin heron Apr 1, 2024, 5:50 PM

#

twin heron What is data.keys?

Got it, its the steps

twin heron Apr 1, 2024, 5:54 PM

#

charred bluff https://github.com/segyges/pythia-embedding-analysis/blob/main/notebooks/cal_cos...

But thats not the code for fig1, right?

charred bluff Apr 1, 2024, 6:08 PM

#

twin heron But thats not the code for fig1, right?

for this one? it is. like, doesn't have the visualization but that is how that was extracted

twin heron Apr 1, 2024, 6:10 PM

#

But I thought the plot was comparing the current embedding of a token against its final embedding

#

And thats not what the code is doing, correct?

charred bluff Apr 1, 2024, 7:03 PM

#

there are two blocks doing compares, one is current against final and one is current against next

twin heron Apr 1, 2024, 7:10 PM

#

I see

charred bluff Apr 1, 2024, 8:48 PM

#

charred bluff for this one? it is. like, doesn't have the visualization but that is how that w...

wait, no, i am wrong

#

the reason the line diverges at 1000 is the sample rate

#

points are further apart

#

this means inter point variation is much higher

twin heron Apr 1, 2024, 9:44 PM

#

charred bluff the reason the line diverges at 1000 is the sample rate

Can you elaborate?

#

And what are the lines on the graph? All tokens in the vocab, random, or specific ones?

#

Because this plot is very weird by itself, I would say that something is wrong right away. But the point that the lines go up is different for each token, which indicates that it was because of a gradient update, not a problem with the code

charred bluff Apr 1, 2024, 9:47 PM

#

twin heron Can you elaborate?

pythia checkpoints at powers of two up to 512 and every 1000 steps thereafter

#

so resolution is much finer below step 1000
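The schedule described here can be written out explicitly. Two details in this sketch are assumptions not stated in the thread: that step 0 is also saved, and that the run ends at step 143,000.

```python
FINAL_STEP = 143_000  # assumption: length of the run, not stated in the thread

early = [2 ** k for k in range(10)]             # 1, 2, 4, ..., 512
late = list(range(1000, FINAL_STEP + 1, 1000))  # 1000, 2000, ...
steps = [0] + early + late                      # assumption: step 0 is also saved

gaps = [b - a for a, b in zip(steps, steps[1:])]

print(len(steps), "checkpoints")
print("largest gap up to step 1000:", max(gaps[:11]))
print("gap after step 1000:", gaps[-1])
```

Plotting against `steps` rather than against the checkpoint index keeps the x-axis honest about how unevenly the checkpoints are spaced.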

twin heron Apr 1, 2024, 9:47 PM

#

charred bluff pythia checkpoints at powers of two up to 512 and every 1000 steps thereafter

oh

#

I didn't know that

#

But still, that wouldn't explain the plot

charred bluff Apr 1, 2024, 9:48 PM

#

it explains the start of the plot

twin heron Apr 1, 2024, 9:48 PM

#

The straight lines?

charred bluff Apr 1, 2024, 9:49 PM

#

yes

twin heron Apr 1, 2024, 9:49 PM

#

yeah, it does

charred bluff Apr 1, 2024, 9:49 PM

#

What else strikes you as weird about the plot?

twin heron Apr 1, 2024, 9:49 PM

#

Nothing

charred bluff Apr 1, 2024, 9:49 PM

#

yeah i think that's what that is

twin heron Apr 1, 2024, 9:49 PM

#

charred bluff pythia checkpoints at powers of two up to 512 and every 1000 steps thereafter

This explains the straight lines

charred bluff Apr 1, 2024, 9:50 PM

#

fwiw does suggest the plot needs rescaling along x-axis, that we have steps along the bottom mislabelled a touch because that looks further out than 1,000 on that graph

twin heron Apr 1, 2024, 9:50 PM

#

And the fact that the point is different for each line is probably because some tokens didn't appear on those checkpoints

charred bluff Apr 1, 2024, 9:51 PM

#

so the x-axis label assumes constant distance per step, which is not accurate

twin heron Apr 1, 2024, 9:52 PM

#

gaunt plaza i realise i forgot to respond to you. sorry, that's really inconsiderate of me. ...

So what are you guys doing rn based on that?

charred bluff Apr 1, 2024, 9:52 PM

#

the current idea is to check if tokens' cos-sim with each other becomes much more constant later which would indicate that all embeddings are changing more or less uniformly

#

or "the embedding space is rotating, but tokens are not changing"

#

this is i think the last idea i have for why the fit with part of speech tags would be good when the embeddings are definitely still absorbing grads

#

the alternative is just "information is going into the embeddings, but it's information that is encoded in some way we can't read off easily"

#

there's some code upwards for that but i have not been doing much with it, you'd want to wait for clyde to be online for a better breakdown

twin heron Apr 1, 2024, 9:56 PM

#

charred bluff the current idea is to check if tokens' cos-sim with each other becomes much mor...

So for example if the token cat cos sim with dog is changing the same as cat cos sim with house?

charred bluff Apr 1, 2024, 9:56 PM

#

twin heron So for example if the token cat cos sim with dog is changing the same as cat cos...

more if it's changing at all

twin heron Apr 1, 2024, 9:57 PM

#

If embeddings are changing?

charred bluff Apr 1, 2024, 9:57 PM

#

all the tokens can be changing but their relationships can be the same

twin heron Apr 1, 2024, 9:57 PM

#

Ok

charred bluff Apr 1, 2024, 9:58 PM

#

it should hopefully eventually end up as basically the same type of thing to analyze/visualize but it would have a lot more points so we are likely to need to sample

twin heron Apr 1, 2024, 9:58 PM

#

Well, they barely change actually

#

At the beginning they change a lot because of how attention works at the beginning

#

When the model starts training, attention doesn't know what to attend to, so it attends to everything (or some heads to closer tokens, due to RoPE). So attention makes the model backpropagate the signal to a lot of embeddings at the beginning

charred bluff Apr 1, 2024, 9:59 PM

#

oh, yeah, they almost totally converge like ... midway through the run? changes after that are very small. but, the "how well can we fit a linear probe for being a noun" converges way earlier

twin heron Apr 1, 2024, 9:59 PM

#

charred bluff oh, yeah, they almost totally converge like ... midway through the run? changes ...

Way earlier

charred bluff Apr 1, 2024, 10:00 PM

#

comically earlier. so if the relationships between embeddings stop changing that suggests that subsequent changes are basically noise, it's the embedding layer adjusting to keep up with the network itself

#

but not storing any additional info

#

but also: it can just be storing more info in some way that's not amenable to linear probe and that has negligible impact on cos sim

twin heron Apr 1, 2024, 10:01 PM

#

Well, they do change, just not as much. In simple terms at the beginning they realise what they are, after that, they enrich themselves

charred bluff Apr 1, 2024, 10:01 PM

#

so a positive result (cos sims continue to change) would kind of be more informative

#

because there's only one hypothesis fitting that, which is that embeddings are still training

charred bluff Apr 1, 2024, 10:02 PM

#

twin heron Well, they do change, just not as much. In simple terms at the beginning they re...

i mean, we could also test this by freezing embeddings about 2x the number of steps it takes them to seem to converge to see if it harms the network at all

twin heron Apr 1, 2024, 10:02 PM

#

And nothing is that simple with embeddings. For example the token "dog" will contain information about the animal dog. But with attention it may change abruptly when it realizes the word is actually "underdog"

#

Now how much information of underdog is in dog, I would love to know

charred bluff Apr 1, 2024, 10:03 PM

#

sure, but how much of that information is in the embedding layer vs the attn layer?

twin heron Apr 1, 2024, 10:03 PM

#

Or if it is mostly stored on dense attention weights

charred bluff Apr 1, 2024, 10:03 PM

#

yeah

twin heron Apr 1, 2024, 10:03 PM

#

yeah

charred bluff Apr 1, 2024, 10:04 PM

#

... actually, and I am trying to convince myself that I do not just think this because I have an ongoing grudge against embeddings, it seems like it shouldn't be stored in the embedding itself

twin heron Apr 1, 2024, 10:04 PM

#

I have some ideas on how to test that with checkpoints. You can use the embeddings of an earlier checkpoint on a later model. And see what tokens/words change the context vector the most

charred bluff Apr 1, 2024, 10:05 PM

#

because: for gradient about that to reach the embedding it has to first go through all the dense layers, which have plenty of "space"

#

the early part when the embeddings are still training significantly the FC layers are all still basically random

charred bluff Apr 1, 2024, 10:05 PM

#

twin heron I have some ideas on how to test that with checkpoints. You can use the embeddin...

oh that's clever

#

but: i think it neglects the "entire embedding space is shifted" possibility

twin heron Apr 1, 2024, 10:06 PM

#

charred bluff the early part when the embeddings are still training significantly the FC layer...

Yeah, I generally avoid doing research on early checkpoints. I find the problems there too hard to solve

twin heron Apr 1, 2024, 10:06 PM

#

charred bluff but: i think it neglects the "entire embedding space is shifted" possibility

Wdym by shifted?

#

rotated?

charred bluff Apr 1, 2024, 10:06 PM

#

rotated or rescaled, yeah

#

if you uniformly dilate or rotate the embedding vectors it will break the network totally but they have the same amount of information
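A minimal sketch of the point above, on toy data rather than real Pythia embeddings: if every vector gets the same orthogonal transform and the same rescale, pairwise cosine similarities are unchanged, while each vector's cosine similarity with its own earlier copy can drop a lot. The matrix sizes, the scale factor 1.7, and the helper `cos_sim` are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for an embedding matrix: 1000 "tokens", 64 dimensions.
E = rng.normal(size=(1000, 64))

# Random orthogonal matrix (rotation or reflection) from a QR decomposition,
# combined with a uniform rescale of every vector.
Q, _ = np.linalg.qr(rng.normal(size=(64, 64)))
E_rot = 1.7 * (E @ Q)

def cos_sim(a, b):
    """Row-wise cosine similarity between two (n, d) arrays."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return np.sum(a * b, axis=1)

# Randomly sampled token pairs, compared before and after the transform.
i = rng.integers(0, len(E), size=500)
j = rng.integers(0, len(E), size=500)
pair_before = cos_sim(E[i], E[j])
pair_after = cos_sim(E_rot[i], E_rot[j])

# Each token compared with its own transformed copy.
self_sim = cos_sim(E, E_rot)

print("max change in pairwise cos sim:", np.abs(pair_before - pair_after).max())
print("mean cos sim of each token with its transformed self:", self_sim.mean())
```

So a per-token "cos sim with the final checkpoint" curve and a "pairwise cos sim between tokens" curve answer different questions, which is roughly the distinction being argued about here.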

twin heron Apr 1, 2024, 10:06 PM

#

I don't think that would be the case

#

Simply because mathematically there is no reason for that to happen

charred bluff Apr 1, 2024, 10:07 PM

#

it would be interesting to test by doing your test and then seeing if you can normalize the earlier embedding matrix to align with the later one by a single pure rotation
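One way this check could be sketched is the orthogonal Procrustes solution: row-normalize both matrices, find the orthogonal matrix that best maps the earlier one onto the later one, and look at the leftover error. This is only a sketch on random toy data; `best_rotation` and `alignment_error` are hypothetical helper names, and the solution allows reflections as well as pure rotations.

```python
import numpy as np

def best_rotation(A, B):
    """Orthogonal R minimising ||A @ R - B||_F (orthogonal Procrustes via SVD)."""
    U, _, Vt = np.linalg.svd(A.T @ B)
    return U @ Vt

def alignment_error(A, B):
    """Relative residual after row-normalising both and mapping A onto B."""
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    R = best_rotation(A, B)
    return np.linalg.norm(A @ R - B) / np.linalg.norm(B)

rng = np.random.default_rng(0)
early = rng.normal(size=(500, 32))  # toy "early checkpoint" embeddings

# Case 1: the later matrix is the early one, rotated and rescaled.
Q, _ = np.linalg.qr(rng.normal(size=(32, 32)))
later_rotated = 2.0 * (early @ Q)

# Case 2: the later matrix is unrelated to the early one.
later_changed = rng.normal(size=(500, 32))

err_rotated = alignment_error(early, later_rotated)
err_changed = alignment_error(early, later_changed)
print("residual if only rotated/rescaled:", err_rotated)
print("residual if genuinely different:  ", err_changed)
```

A residual near zero would support "same information, different orientation"; a large residual would mean the embeddings changed in ways a single rotation cannot explain.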

charred bluff Apr 1, 2024, 10:07 PM

#

twin heron Simply because mathematically there is no reason for that to happen

eh, so figure the residual is at some angle wrt the origin

#

as the network trains this angle should basically brownian drift

#

because it's not under any particular pressure, it's not the goal of training

twin heron Apr 1, 2024, 10:08 PM

#

What is the origin?

#

embeddings?

charred bluff Apr 1, 2024, 10:08 PM

#

just all-ones vector

#

... i should not have called this the origin, whoops

#

"some basis vector"

twin heron Apr 1, 2024, 10:08 PM

#

charred bluff as the network trains this angle should basically brownian drift

Ok, but then they wouldnt rotate

charred bluff Apr 1, 2024, 10:08 PM

#

they would a little bit

#

they wouldn't a lot but it can have basically noise rotation

twin heron Apr 1, 2024, 10:09 PM

#

idk

#

I wouldnt bet on that

charred bluff Apr 1, 2024, 10:09 PM

#

i think that's actually testable

twin heron Apr 1, 2024, 10:09 PM

#

It is possible, but I would say that it is unlikely

#

And personally, I would do research without thinking about that

#

Too many problems already

charred bluff Apr 1, 2024, 10:09 PM

#

yeah fair

twin heron Apr 1, 2024, 10:09 PM

#

Don't give me one more

#

please

charred bluff Apr 1, 2024, 10:10 PM

#

the number of things you can check mathematically is absurdly high

#

actually i can think of a fairly amusing way of testing this

twin heron Apr 1, 2024, 10:11 PM

#

charred bluff the number of things you can check mathematically is absurdly high

yeah, but models act in weird ways. because data is weird

charred bluff Apr 1, 2024, 10:11 PM

#

twin heron yeah, but models act in weird ways. because data is weird

yeah you cannot check all the things and only discretion keeps you from just running statistics forever

charred bluff Apr 1, 2024, 10:12 PM

#

charred bluff actually i can think of a fairly amusing way of testing this

just give it a one-token input for some good stretch of common tokens. see if for any given common token the later model's residual is the same as the earlier one's with some rotation applied

#

but, yeah

#

i think after this specific check is done i am willing to try to put a bow on this one

twin heron Apr 1, 2024, 10:13 PM

#

twin heron yeah, but models act in weird ways. because data is weird

For example, look at that

#

Dimensions values for q and q_rope, Pythia 1B

#

The model already has information of the position before RoPE is applied

#

charred bluff Apr 1, 2024, 10:15 PM

#

... that is insane

twin heron Apr 1, 2024, 10:15 PM

#

Thats why I am very humble when thinking about models

charred bluff Apr 1, 2024, 10:15 PM

#

that basically means the model is learning to reverse the rope transformation before going into it, no?

#

it 1) knows what the rope transformation will be and 2) can apply the reverse operation

twin heron Apr 1, 2024, 10:16 PM

#

Well, it encodes some positional information, but not entirely

twin heron Apr 1, 2024, 10:17 PM

#

twin heron

For example here, this sample is actually three samples, separated by <|endoftext|>, and you can see that the model also separates the positional information per sample

#

Some dimensions have very clear patterns

#

anyway, Im going off-topic here

marsh sand Apr 2, 2024, 1:15 AM

#

there's always a lot to learn from the big bro talks

marsh sand Apr 2, 2024, 1:16 AM

#

twin heron I have some ideas on how to test that with checkpoints. You can use the embeddin...

think this is an easy idea to test?

#

we can just switch the embedding layers and run some evaluations to see how it changes

gaunt plaza Apr 2, 2024, 1:37 AM

#

Augh I hate waking up

marsh sand Apr 2, 2024, 1:37 AM

#

haha im already at class

twin heron Apr 2, 2024, 4:39 AM

#

marsh sand think this is an easy idea to test?

I don't want to divert you guys from what you're doing, you know

gaunt plaza Apr 2, 2024, 5:11 AM

#

twin heron I don't want to divert you guys from what you're doing, you know

It’s not like we’re doing anything particularly interesting. And that’s also relevant somewhat to what we’re doing. Actually what we’re doing kind of has no direction… thinkies

charred bluff Apr 2, 2024, 12:40 PM

#

gaunt plaza It’s not like we’re doing anything particularly interesting. And that’s also rel...

i actually agree i think we're roughly done

#

these are just interesting threads to chase

#

the original question was more or less whether embeddings contain part of speech information early in training. they do. while we have a big ole block of embedding layers, there are other things it seems to make sense to check, but none of those seem essential

#

ideally should wrap all existing files in a notebook and throw them somewhere sometime soon

gaunt plaza Apr 4, 2024, 7:51 AM

#

Well I’m back from the dead berk
@marsh sand how’s it going

marsh sand Apr 4, 2024, 7:58 AM

#

marsh sand assuming I have the right numbers... im just trying random things but this seems...

im trying to understand what happens after the 1000th step. just averaging everything doesnt seem to give much insight but im not sure what other things I can try

gaunt plaza Apr 4, 2024, 8:08 AM

#

gaunt plaza @marsh sand just pushed to my branch an updated version with n_samples...

Can you run the last cell in this notebook

#

Unless you already did and I missed the plots

#

How’s the review going btw

marsh sand Apr 4, 2024, 8:12 AM

#

gaunt plaza How’s the review going btw

the other reviewers didn't give me an answer so i guess ill just fix and resubmit

gaunt plaza Apr 4, 2024, 8:22 AM

#

marsh sand the other reviewers didn't give me an answer so i guess ill just fix and resub...

sadsnek

marsh sand Apr 4, 2024, 12:58 PM

#

gaunt plaza Unless you already did and I missed the plots

sorry i missed this

#

these are the plots but they look like a huge mess

gaunt plaza Apr 4, 2024, 1:27 PM

#

Well that’s to be expected

#

Maybe subsampling for plots is a good idea

#

Gyges does seem to be correct, we do see lots of fluctuation at the start then flattening out at the end. So I guess the embedding space really is slightly rotating?

marsh sand Apr 5, 2024, 4:08 AM

#

seems like it

#

as the attention learns better to concentrate on related tokens

#

it looks like tokens with lower cos-sim show lower fluctuation later in the training

charred bluff Apr 5, 2024, 5:29 AM

#

i think the key metric would be comparing how much this changes vs the absolute cos sim with final

gaunt plaza Apr 5, 2024, 6:25 AM

#

charred bluff i think the key metric would be comparing how much this changes vs the absolute ...

How would we do that

#

actually i dont get it what does "this" refer to

charred bluff Apr 5, 2024, 6:43 AM

#

yeah come to think this isn't really easy to reduce to a rate of change per embedding

gaunt plaza Apr 5, 2024, 7:51 AM

#

I’m gonna like “phone a friend”

charred bluff Apr 5, 2024, 8:47 AM

#

i am gonna ... maybe get around to looking at this again at some point

gaunt plaza Apr 5, 2024, 8:48 AM

#

charred bluff i am gonna ... maybe get around to looking at this again at some point

ultraberk

#

Too busy getting .geese’d in #off-topic ?

#

I’ll ask around and maybe someone in another discord server provides a wonderful idea because I’m sure as hell not smart enough to come up with one

charred bluff Apr 5, 2024, 8:51 AM

#

gaunt plaza I’ll ask around and maybe someone in another discord server provides a wonderful...

i just have insomnia at the moment so i am not doing anything very smart

gaunt plaza Apr 5, 2024, 8:57 AM

#

charred bluff i just have insomnia at the moment so i am not doing anything very smart

Hope it gets better man

cold garnet Apr 5, 2024, 8:59 AM

#

Is this still the TODO list? Is there an updated one?

gaunt plaza Apr 5, 2024, 9:01 AM

#

cold garnet Is this still the TODO list? Is there an updated one?

I think the TODO on the GitHub README is the most updated one, but even that would be quite far behind

cold garnet Apr 5, 2024, 9:04 AM

#

gaunt plaza I think the TODO on the GitHub README is the most updated one, but even that wou...

Going through the repo. Anything off the top of your head that you all need more eyes on?

gaunt plaza Apr 5, 2024, 9:04 AM

#

gaunt plaza i realise i forgot to respond to you. sorry, that's really inconsiderate of me. ...

Read this ^

gaunt plaza Apr 5, 2024, 9:05 AM

#

marsh sand these are the plots but they look like a huge mess

Look at these ^

gaunt plaza Apr 5, 2024, 9:05 AM

#

charred bluff yeah come to think this isn't really easy to reduce to a rate of change per embe...

Think like gyges ^
Because I can’t…

cold garnet Apr 5, 2024, 9:05 AM

#

Thanks!

gaunt plaza Apr 5, 2024, 9:05 AM

#

Ideas welcome

#

berk

marsh sand Apr 5, 2024, 9:20 AM

#

I guess we can try to see if we can rotate the embedding from the 1000th step to get the final one?

#

Im not sure how though

#

Try decomposing?

#

Or dimensional reduction?

gaunt plaza Apr 5, 2024, 9:32 AM

#

marsh sand I guess we can try to see if we can rotate the embedding from the 1000th step t...

That’s a good idea

gaunt plaza Apr 5, 2024, 9:36 AM

#

marsh sand Im not sure how though

https://math.stackexchange.com/questions/598750/finding-the-rotation-matrix-in-n-dimensions

Mathematics Stack Exchange

Finding the rotation matrix in n-dimensions

Suppose that we know two real vectors with n components, which are linked by some arbitrary transformation/scaling/rotation/shearing...

Now, I think that it is possible to know which is the scaling

#

trollyikes

#

what the fuck

gaunt plaza Apr 5, 2024, 1:02 PM

#

Ok is anyone able to decipher this into English drinkies

heavy flame Apr 5, 2024, 8:45 PM

#

what server is that

#

you can just use torch’s orthogonal parametrizations

gaunt plaza Apr 6, 2024, 2:36 AM

#

heavy flame what server is that

I just hang out in yannic kilchers yt discord

gaunt plaza Apr 6, 2024, 2:37 AM

#

heavy flame you can just use torch’s orthogonal parametrizations

Yea the implementation doesn’t seem like the hard part, understanding what I write is the hard part, and unfortunately for me the latter part seems to occur too often

heavy flame Apr 6, 2024, 3:17 AM

#

gaunt plaza I just hang out in yannic kilchers yt discord

it’s weird that you’d get a response like that then

gaunt plaza Apr 6, 2024, 3:20 AM

#

heavy flame it’s weird that you’d get a response like that then

Nah zickzack is the GOAT man

#

Just depends on who you interact with in every server, there’s always good people. Just gotta know where to look and who to look for.

marsh sand Apr 6, 2024, 12:57 PM

#

heavy flame you can just use torch’s orthogonal parametrizations

@gaunt plaza then i guess our future today is trying that orth. parameterizations or switching the embedding layer of different checkpoints to see if it works?

gaunt plaza Apr 6, 2024, 12:59 PM

#

marsh sand @gaunt plaza then i guess our future today is trying that orth. paramet...

I haven’t the foggiest. I’m not home yet and won’t be until quite late.

#

I would like to try the SO(N) matrices, zickzack is pretty much always right about this math stuff

#

How’s the review. Resubmitted the paper?

marsh sand Apr 6, 2024, 1:48 PM

#

Not yet im busy with other works so probably work on the paper some time next week

#

Then ill try the orth. parametrization thing im new to it but im basically new to most things so it wont really matter

gaunt plaza Apr 12, 2024, 9:04 AM

#

@marsh sand I’ll get back to work tonight, any progress?
None is fine as well not like I’ve done jack shit either.

marsh sand Apr 12, 2024, 12:09 PM

#

gaunt plaza @marsh sand I’ll get back to work tonight, any progress? None is fine ...

Totally forgot about this sorry

gaunt plaza Apr 12, 2024, 1:50 PM

#

marsh sand Totally forgot about this sorry

ultraberk how’s the paper man

charred bluff Apr 12, 2024, 2:40 PM

#

i might work on this the weekend/next week <_<

marsh sand Apr 13, 2024, 1:30 AM

#

gaunt plaza ultraberk how’s the paper man

I just got the meta reviews its a 3/5. Probably will have to fix and resubmit somewhere

marsh sand Apr 14, 2024, 1:11 AM

#

@gaunt plaza you interested in multilingual topics by any chance?

gaunt plaza Apr 14, 2024, 1:15 AM

#

marsh sand @gaunt plaza you interested in multilingual topics by any chance?

Erm, yeah I did some multilingual stuff before for a school project thinkies

#

I think… sus

slender spoke Apr 16, 2024, 6:16 PM

#

Sorry I've been away for a bit. I'm not entirely convinced by the "it rotates" argument, at least for after ~90,000 steps, because it is very much stable from that point:

charred bluff Apr 16, 2024, 7:33 PM

#

slender spoke Sorry I've been away for a bit. I'm not entirely convinced by the "it rotates" a...

when i have four spare brain cells i'm gonna try to wrap all these up in one block of markdown, i think for the rotaty bits there might be some postprocessing necessary because those graphs are nigh unreadable but

slender spoke Apr 18, 2024, 12:30 AM

#

Interestingly, the results for token 0 (<|endoftext|>) and token 1 (<|padding|>) are very different from "normal" tokens. Note especially the different scales for the end of text token. I don't think they use the padding token much in training pythia:

charred bluff Apr 18, 2024, 2:32 AM

#

i don't think they use the padding token at all

#

i guess they must if it's receiving grad but that looks like it's maybe just decay

twin heron Apr 18, 2024, 3:54 AM

#

charred bluff i don't think they use the padding token at all

It was probably in the data already

twin heron Apr 18, 2024, 4:07 AM

#

slender spoke Interestingly, the results for token 0 (<|endoftext|>) and token 1 (<|padding|>...

<|endoftext|> doesn't carry any information and can be used in any context, tokens like that need sparser activations. I bet you will see similar things on tokens like \n, comma, dot, parentheses and connectors like "and", "or", "from", "is", "a"

charred bluff Apr 18, 2024, 4:10 AM

#

twin heron It was probably in the data already

that would make sense

#

that would not explain why it looks different than any other arbitrary token

twin heron Apr 18, 2024, 4:12 AM

#

charred bluff that would not explain why it looks different than any other arbitrary token

weight decay, as you said

charred bluff Apr 18, 2024, 4:18 AM

#

twin heron weight decay, as you said

i mean: the trajectory for "I" reflects that it's trained over because it's irregular, the trajectory for that one seems to indicate just decay as if it's not in the data, but if it's not in the data it shouldn't be in the tokenizer

#

i am also not sure weight decay would touch it if it's never activated, maybe?

#

depends on implementation

#

i am not extremely motivated to chase down these edge cases though

acoustic elm Apr 18, 2024, 4:19 AM

#

I don't know of a WD implementation that leaves un-activated weights alone

charred bluff Apr 18, 2024, 4:22 AM

#

acoustic elm I don't know of a WD implementation that leaves un-activated weights alone

You'd know better than me

twin heron Apr 18, 2024, 4:50 AM

#

charred bluff i mean: the trajectory for "I" reflects that it's trained over because it's irre...

"but if it's not in the data it shouldn't be in the tokenizer"

Not in this case, since it is an added token, not one created by the tokenization process

#

https://huggingface.co/EleutherAI/pythia-6.9b/raw/main/tokenizer.json

charred bluff Apr 18, 2024, 5:12 AM

#

twin heron "but if it's not in the data it shouldn't be in the tokenizer" Not in this case...

I think I misunderstood an earlier bit

slender spoke Apr 18, 2024, 6:37 AM

#

twin heron "but if it's not in the data it shouldn't be in the tokenizer" Not in this case...

yeah, it's a special token added by hand.

charred bluff Apr 18, 2024, 3:56 PM

#

twin heron weight decay, as you said

also, and i may be totally wrong, but does it not seem maybe problematic if tokens not being activated are being weight decayed per timestep? if a token appears very rarely it will decay substantially between optimization steps

twin heron Apr 18, 2024, 4:18 PM

#

charred bluff also, and i may be totally wrong, but does it not seem maybe problematic if toke...

A token that appears very rarely should not be a token

#

But it is a good point. If someone creates a token that will only be used during fine-tuning, that could be a problem

charred bluff Apr 18, 2024, 4:22 PM

#

twin heron A token that appears very rarely should not be a token

assuming tokens are distributed sort of zipf-ily with no weird outliers you still have a pretty vast gap between how much of the token's delta will be weight decay vs gradient for the top and bottom of the distribution

#

they are in pretty different training regimes

twin heron Apr 18, 2024, 4:24 PM

#

To avoid that problem you could apply weight decay to the gradient instead

#

But it is rare that a token doesn’t appear for tens of consecutive updates

#

Interesting problem, needs more research

charred bluff Apr 18, 2024, 5:23 PM

#

twin heron But it is rare that a token doesn’t appear for tens of consecutive updates

if it appears in every update once but another token appears in every update ten times, the less common token will have a much noisier grad relative to its decay, basically the question would be if noise + decay is higher than signal and if so how much higher
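A rough worked number for the weight-decay question above, assuming decoupled (AdamW-style) weight decay, where every weight is multiplied by `(1 - lr * weight_decay)` on each step whether or not it received gradient. The learning rate and decay values below are illustrative only and are not taken from any Pythia config; `decay_factor` is a hypothetical helper.

```python
def decay_factor(lr, weight_decay, steps):
    """Fraction of a weight's magnitude left after `steps` updates of
    decoupled weight decay with zero gradient."""
    return (1.0 - lr * weight_decay) ** steps

# Illustrative hyperparameters, not taken from any real training config.
lr, wd = 1e-3, 0.01

for steps in (10, 1_000, 100_000):
    print(f"{steps:>7} steps without gradient -> x{decay_factor(lr, wd, steps):.4f}")
```

Under these assumed values a token that is missing for tens of updates barely shrinks, but one that never appears over a run of roughly 100,000 steps would lose most of its norm, which would be consistent with a smooth, decay-only trajectory.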

slender spoke Apr 19, 2024, 5:34 AM

#

twin heron <|endoftext|> doesn't carry any information and can be used in any context, toke...

Your hypothesis is correct. Comparisons of the end of text token, " and", "\n" and " developers", noting the scales are different between the different tokens.

slender spoke Apr 28, 2024, 4:43 PM

#

I don't know if this is helpful at all, but here is a gif of changing dimension values over time for the token " Mon", token id 4200. points on the 2d space are adjacent dimensions plotted as x and y. Not the most useful, but interesting nonetheless.

gaunt plaza Jul 10, 2024, 10:11 PM

#

Back from the dead guys, finally POPd and will have block leave next week where I can spend all my days wrapping up the project. If you guys are still interested that is, because I’m definitely a bit brainrotted not having touched any code for the past 13 weeks

gaunt plaza Jul 16, 2024, 10:28 AM

#

sadge
