#Embedding-space semantic pollution: measurement and alternative tokeniz
It feels like this project is based on a false premise to me.
I disagree with the contention that “azure” and “Microsoft” are semantically unrelated. They’re semantically extremely closely related, as are “apple” and “iPhone”
We can have a debate about whether “azure” should be more similar to “blue” than to “Microsoft”, but the idea that there isn’t a semantic connection between “azure” and “Microsoft” is nonsense.
The fact that the same set of symbols sometimes refers to different concepts is widely viewed as an inherent phenomenon of language. You can find this in every natural language. We generally expect AIs – like humans – to use context to disambiguate.
There is also not really such a thing as a language-agnostic concept. If you take the word “blue” and translate it into Hebrew the swath of colors it refers to (according to native speakers) changes. Russian puts the boundary between green and yellow in a different place than English does. Turkish uses a different word for a sibling when they’re older than you vs younger than you. In Japanese the hand and the arm are considered the same body part, not two separate connected body parts the way they are in English.
Your post is entirely about language models, but this is a vision data set
Why does it matter if the relationship between azure and Microsoft is of a different kind than the relationship between azure and blue?
I don’t see why that would be relevant to how an LLM is designed.
“That nonsense” means “the phenomenon of different words being represented by the same symbols” right?
But those things that you claim should be different tokens are represented in English using the same symbols
How are you supposed to do this disambiguation anyways? Even if I grant your premises, when you feed text into the model it needs to know which set of tokens to use for the string “azure”
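To make the problem concrete, here's a toy sketch of what a sense-split vocabulary forces you to do at input time. Everything in it is made up for illustration: the token ids, the sense labels, and especially `toy_disambiguate`, which is a keyword hack standing in for whatever real disambiguation step the project would need.

```python
# Hypothetical sense-split vocabulary: one surface string, several token ids.
SENSE_VOCAB = {"azure": {"color": 101, "microsoft_cloud": 102}}
# Hypothetical ordinary vocabulary for unambiguous words.
PLAIN_VOCAB = {"the": 1, "sky": 2, "is": 3, "deploy": 4, "to": 5}
UNKNOWN_ID = 0


def toy_disambiguate(word, context_words):
    """Placeholder only: a keyword heuristic, NOT a real WSD system."""
    cloud_cues = {"deploy", "cloud", "vm", "microsoft"}
    return "microsoft_cloud" if cloud_cues & set(context_words) else "color"


def encode(text):
    """Map text to token ids; ambiguous words need a sense chosen first."""
    words = text.lower().split()
    ids = []
    for word in words:
        if word in SENSE_VOCAB:
            # The tokenizer cannot pick an id until a sense has been decided.
            sense = toy_disambiguate(word, words)
            ids.append(SENSE_VOCAB[word][sense])
        else:
            ids.append(PLAIN_VOCAB.get(word, UNKNOWN_ID))
    return ids


print(encode("the sky is azure"))
print(encode("deploy to azure"))
```

The lookup can't happen until something upstream has already picked a sense, and that upstream step is the part I'm asking about.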
If I wanted to hear an AI make shit up about your project I would have asked an AI.
Do you have a small functional model trained using this?
Why are you asking Claude what you should say
Having anxiety doesn’t affect your reputability, but asking Claude what you should say does.
Why are you in an open source community if you’re not willing to share your implementation
This forum is for doing open source projects with other people in the community.
What’s a leaf branch
What are you talking about
It looks like the SOTA on WSD is ~85%… it seems to me like a 15% error rate in your preprocessing would be catastrophic: https://arxiv.org/abs/2503.08662
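Back-of-the-envelope version of why that worries me, assuming (and it's only an assumption) that each ambiguous word is disambiguated independently at the same ~85% accuracy. The word counts are arbitrary.

```python
def prob_all_correct(per_word_accuracy, n_ambiguous):
    """Chance that every ambiguous word in a passage gets the right sense.

    Assumes independent errors with identical per-word accuracy.
    """
    return per_word_accuracy ** n_ambiguous


def expected_errors(per_word_accuracy, n_ambiguous):
    """Expected number of mis-tagged words under the same assumption."""
    return (1.0 - per_word_accuracy) * n_ambiguous


for n in (1, 5, 10, 20, 50):
    clean = prob_all_correct(0.85, n)
    wrong = expected_errors(0.85, n)
    print(f"{n:>3} ambiguous words: P(all correct)={clean:.4f}, expected errors={wrong:.1f}")
```

With 10 ambiguous words in a passage that's roughly a 20% chance the input is tagged cleanly, and the model trains on the mis-tagged tokens as if they were ground truth.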
I have some notes but even Claude Sonnet is too dumb to run the toolchain and generate reliable data.
A 70M model trained for 10B tokens should get non-trivial performance on LAMBADA.
Note that random chance performance is 0% so you won’t see signal for a bit. But Pythia 70M (which isn’t very good for a 70M model) gets 20% after 10B tokens
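For reference, the LAMBADA number is roughly just exact-match accuracy on the final word of each passage. A minimal sketch of the metric, where `predict_last_word` is a hypothetical stand-in for your model and the two examples are made up (not taken from the dataset):

```python
def last_word_accuracy(examples, predict_last_word):
    """Fraction of passages where the predicted final word matches exactly.

    examples: iterable of (context, target_word) pairs.
    predict_last_word: hypothetical callable, context -> predicted word.
    """
    correct = 0
    total = 0
    for context, target in examples:
        total += 1
        if predict_last_word(context).strip() == target:
            correct += 1
    return correct / total if total else 0.0


# Made-up examples for illustration only.
TOY_EXAMPLES = [
    ("She picked up the phone and called her", "mother"),
    ("He opened the door and stepped into the", "garden"),
]


def dummy_model(context):
    """Stand-in predictor that always answers 'mother'."""
    return "mother"


print(last_word_accuracy(TOY_EXAMPLES, dummy_model))
```

Since the target can be basically any word, guessing gets you nowhere, which is why you only see signal once the model is actually using the context.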
I am sure you’ll learn a lot about whether this idea is promising by training some models