#Embedding-space semantic pollution: measurement and alternative tokeniz


obsidian igloo
#

You could take a big corpus like the pile and sample a random subset to use for your experiments
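
A minimal sketch of that kind of subsampling, assuming the corpus can be streamed one document at a time. Reservoir sampling is one common way to do it, not necessarily what was meant here, and the function name and `seed` parameter are made up for illustration.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(items: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Return up to k items drawn uniformly at random from a stream.

    Only k items are held in memory at once, so the corpus does not
    need to fit in RAM. (Hypothetical helper, for illustration only.)
    """
    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    reservoir: List[T] = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir
```

Any iterable of documents works as input, for example the lines of a locally downloaded shard. Fixing the seed means every experiment sees the same subset.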

whole kite
#

It feels like this project is based on a false premise to me.

I disagree with the contention that “azure” and “Microsoft” are semantically unrelated. They’re semantically extremely closely related, as are “apple” and “iPhone”

#

We can have a debate about whether “azure” should be more similar to “blue” than to “Microsoft”, but the idea that there isn’t a semantic connection between “azure” and “Microsoft” is nonsense.

#

The fact that sometimes we have the same set of symbols and they refer to different concepts is widely viewed as an inherent phenomenon of language. You can find this in every natural language. We generally expect AIs – like humans – to use context to disambiguate

#

There is also not really any such thing as a language-agnostic concept. If you take the word “blue” and translate it into Hebrew, the swath of colors it refers to (according to native speakers) changes. Russian puts the boundary between green and yellow in a different place than English does. Turkish uses a different word for a sibling when they’re older than you vs younger than you. In Japanese the hand and the arm are considered the same body part, not two separate connected body parts the way they are in English.

#

Your post is entirely about language models, but this is a vision data set

whole kite
#

Why does it matter if the relationship between azure and Microsoft is of a different kind than the relationship between azure and blue?

#

I don’t see why that would be relevant to how an LLM is designed.

#

“That nonsense” means “the phenomenon of different words being represented by the same symbols” right?

#

But those things that you claim should be different tokens are represented in English using the same symbols

#

How are you supposed to do this disambiguation anyways? Even if I grant your premises, when you feed text into the model it needs to know which set of tokens to use for the string “azure”

#

If I wanted to hear an AI make shit up about your project I would have asked an AI.

#

Do you have a small functional model trained using this?

#

Why are you asking Claude what you should say

#

Having anxiety doesn’t affect your reputability but asking Claude what you should say does

#

Why are you in an open source community if you’re not willing to share your implementation

#

This forum is for doing open source projects with other people in the community.

#

What’s a leaf branch

#

What are you talking about

#

It looks like the SOTA on WSD (word sense disambiguation) is ~85%… it seems to me like a 15% error rate in your preprocessing would be catastrophic: https://arxiv.org/abs/2503.08662
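
To put rough numbers on why that error rate matters, here is a back-of-the-envelope sketch. It assumes each ambiguous word is disambiguated independently with the same accuracy, which is a simplification and not a measured property of any real WSD system. The function names are made up for illustration.

```python
def prob_all_correct(per_word_accuracy: float, n_ambiguous: int) -> float:
    """Chance that every ambiguous word in a passage gets the right sense,
    assuming independent errors (an illustrative assumption)."""
    return per_word_accuracy ** n_ambiguous


def expected_errors(per_word_accuracy: float, n_ambiguous: int) -> float:
    """Expected number of wrongly tagged words in the passage."""
    return (1.0 - per_word_accuracy) * n_ambiguous


if __name__ == "__main__":
    for n in (1, 5, 10, 20):
        print(n, round(prob_all_correct(0.85, n), 3), round(expected_errors(0.85, n), 2))
```

Under those assumptions, a passage with 10 ambiguous words comes through preprocessing fully correct only about 20% of the time at 85% per-word accuracy.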

outer escarp
#

I have some notes, but even Claude Sonnet is too dumb to run the toolchain and generate reliable data

whole kite
#

A 70M model trained for 10B tokens should get meaningfully non-trivial performance on LAMBADA

#

Note that random chance performance is 0% so you won’t see signal for a bit. But Pythia 70M (which isn’t very good for a 70M model) gets 20% after 10B tokens
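
For context on why chance is about 0%: LAMBADA asks the model to produce the exact final word of a passage, so a guess only counts if it matches that word. A minimal scoring sketch, assuming you already have one predicted word per passage (real evaluation harnesses may score at the token level instead, and `last_word_accuracy` is a made-up name):

```python
from typing import Sequence


def last_word_accuracy(predictions: Sequence[str], targets: Sequence[str]) -> float:
    """Fraction of passages where the predicted final word exactly
    matches the reference final word (whitespace-trimmed)."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    correct = sum(p.strip() == t.strip() for p, t in zip(predictions, targets))
    return correct / len(targets)
```

Since the answer can be almost any word, a model that hasn't learned anything yet scores essentially zero, which is why early checkpoints show no signal.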

#

I am sure you’ll learn a lot about whether this idea is promising by training some models