#How does a sentence transformer actually work?


keen orchid Nov 30, 2022, 3:09 AM

Hello all!

Could someone please explain to me, intuitively and at a high level, how NLP works?

My main gap is in understanding how BERT produces its output for an input like "Great news for the enemies". How do you get a vector out of that text, and how many dimensions will it have?

I would really appreciate someone's valuable feedback on that 🙏

gentle totem Dec 1, 2022, 1:43 PM

So I finally got back to this. There's actually a lot to unpack here. First, let's talk about how NLP works in general, because everything that happened with the transformer revolution is a subset of how NLP is done in the ML field. When dealing with text, the major steps are 1) convert the text into tokens, 2) convert the tokens into vectors that a machine can read, 3) perform whatever NLP task you'd like. You can tokenize text into separate words, into characters, into 2-grams (the, united, states, the united, united states are 5 different tokens once you combine single words with 2-grams), into 3-grams, or, as most transformer models do, into subwords with byte-pair encoding or a similar method. The set of tokens makes up the vocabulary. A token that was not seen when the vocabulary was built is called out-of-vocabulary. So the first big decision you make is the tokenization method.
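To make the tokenization options concrete, here is a minimal sketch in plain Python. The helper names (`word_tokens`, `char_tokens`, `ngram_tokens`) are made up for illustration, and the whitespace splitting is deliberately naive; real tokenizers, including the subword ones used by transformer models, are more involved.

```python
def word_tokens(text):
    """Lowercase and split on whitespace (a deliberately naive word tokenizer)."""
    return text.lower().split()


def char_tokens(text):
    """Treat every character, including spaces, as its own token."""
    return list(text)


def ngram_tokens(words, n):
    """Join each run of n consecutive words into a single token."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


words = word_tokens("The United States")
unigrams_and_bigrams = words + ngram_tokens(words, 2)
print(unigrams_and_bigrams)
# ['the', 'united', 'states', 'the united', 'united states']

# The vocabulary is the set of tokens seen so far;
# anything outside it is out-of-vocabulary.
vocab = set(unigrams_and_bigrams)
oov = [t for t in word_tokens("the united kingdom") if t not in vocab]
print(oov)
# ['kingdom']
```

Character-level tokens almost never go out-of-vocabulary but produce long sequences, while word-level tokens are short but hit unseen words often. Subword methods sit in between.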

Then you decide how to vectorize your tokens. In the simplest form, you can vectorize with Bag-of-Words, which basically counts the number of occurrences of each token in each document. Word2vec is still quite useful in many industrial cases, especially when length is a concern. Each approach has different trade-offs, and you can read more about the evolution of word vectors here: https://medium.com/co-learning-lounge/nlp-word-embedding-tfidf-bert-word2vec-d7f04340af7f
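As a rough illustration of Bag-of-Words, the sketch below counts tokens per document using only the standard library. The two example documents and the `bag_of_words` helper are made up for this example.

```python
from collections import Counter

docs = ["great news for the enemies", "great news great day"]

# Vocabulary: every distinct token across all documents, in a fixed order.
vocab = sorted({tok for doc in docs for tok in doc.split()})


def bag_of_words(doc):
    """Return one count per vocabulary entry for this document."""
    counts = Counter(doc.split())
    return [counts[tok] for tok in vocab]


vectors = [bag_of_words(d) for d in docs]
print(vocab)
# ['day', 'enemies', 'for', 'great', 'news', 'the']
print(vectors)
# [[0, 1, 1, 1, 1, 1], [1, 0, 0, 2, 1, 0]]
```

Note that the vector length equals the vocabulary size, and word order is thrown away. That is the main thing denser methods like Word2vec and BERT improve on.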

Now, these vectors are basically features. When deciding which ones to use, first consider what data they were trained on. If you have a domain-specific task, you'd probably want something that was trained on domain-specific data, such as legal text. You should also consider the nature of the task. Certain short-text classification tasks can be done with a rule-based system or simple training with fastText, in which case you won't need a transformer model.

If you're doing classification, like NER or sequence classification, that's more in the discriminative realm. If you're doing something like summarization or text generation (copywriting, code generation), that is generative. If you're doing something like topic modelling, you'd still get an embedding and then cluster. Recall that vectors are basically features, so the featurization method you opt for dictates your clustering results. An LDA topic-modelling pipeline typically tokenizes into n-grams, vectorizes with token counts or TF-IDF, and then fits LDA to group documents by topic.

Now, specific to transformer embeddings: this is more about how embeddings work in general than about NLP. In a nutshell, an embedding layer is like a lookup table: for Key1, return an N-dimensional vector. N is a hyperparameter set before training, and the weights in the lookup table are updated continuously during training. In a transformer, each token has a token ID, which is basically the key. If you pass 5 tokens in with a 768-dimensional embedding, you get five 768-dimensional vectors. There is some related discussion in #💬・general-discussions. The embedding represents the latent features of that token. Since you can set N, the higher the N, the more latent features you're trying to capture. But if you don't have much variance in the data (say your vocabulary has only 10 tokens and only so many combinations of keys), then too high an N does you a disservice, much like overfitting.
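The lookup-table idea can be sketched in a few lines of plain Python. Everything here is an assumption for illustration: the vocabulary size of 10, the token IDs, and the random initial weights, which a real model would learn during training.

```python
import random

random.seed(0)

VOCAB_SIZE = 10   # hypothetical tiny vocabulary
EMBED_DIM = 768   # N, chosen before training

# One row of N numbers per token ID. In a real model these weights are
# updated during training; here they are just random placeholders.
embedding_table = [
    [random.gauss(0.0, 0.02) for _ in range(EMBED_DIM)]
    for _ in range(VOCAB_SIZE)
]


def embed(token_ids):
    """Look up one row of the table per token ID."""
    return [embedding_table[i] for i in token_ids]


token_ids = [3, 1, 4, 1, 5]           # five tokens, ID 1 appears twice
vectors = embed(token_ids)
print(len(vectors), len(vectors[0]))  # 5 768
print(vectors[1] == vectors[3])       # True: same ID, same row
```

This lookup is only the input layer. In BERT the transformer layers that follow mix information across tokens, so the final vector for a token depends on its context, but the shape stays the same: one 768-dimensional vector per token.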

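Lastly, sentence-transformers are a special case, because they do further training on top of BERT embeddings. Recall that BERT embeddings are just a featurizer based on BERT's training data. SBERT runs each sentence through BERT, pools the token embeddings into a single fixed-size vector, and fine-tunes the model in a Siamese network setup so that similar sentences end up with similar vectors. The output is one vector per sentence with the same dimension regardless of sentence length (768 for a BERT-base model), and you can compare two sentences with cosine similarity: https://www.sbert.net/docs/training/overview.html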
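Here is a toy sketch of how per-token vectors can become one sentence vector and then be compared. It assumes mean pooling, which is one common choice, and uses made-up 4-dimensional token vectors in place of real BERT outputs. It is not the sentence-transformers library code.

```python
import math


def mean_pool(token_vectors):
    """Average the token vectors dimension by dimension."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up token vectors: sentence A has 2 tokens, sentence B has 3.
sent_a = [[1.0, 0.0, 2.0, 0.0], [3.0, 0.0, 0.0, 0.0]]
sent_b = [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 1.0, 0.0], [0.0, 3.0, 1.0, 0.0]]

vec_a = mean_pool(sent_a)  # [2.0, 0.0, 1.0, 0.0]
vec_b = mean_pool(sent_b)  # [0.0, 1.0, 1.0, 0.0]

# Both sentence vectors have the same length even though the
# sentences have different numbers of tokens.
print(len(vec_a), len(vec_b))  # 4 4
print(cosine(vec_a, vec_b))
```

So for "Great news for the enemies", the tokens each get a vector, pooling collapses them into one, and the dimension of that one vector is fixed by the model rather than by the sentence.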

There are a lot of topics I'm not covering here. If you're interested, also check out:
SpaCy free course: https://course.spacy.io/en/
Huggingface free course: https://huggingface.co/course/chapter1/1
