So I finally got back to this. There's actually a lot to unpack here. First, let's talk about how NLP works in general, because everything that happened with the transformer revolution is a subset of how NLP is done in the ML field. When dealing with text, the major steps are 1) convert the text into tokens, 2) convert the tokens into vectors that machines can read, and 3) perform whatever NLP task you have in mind. You can tokenize text at the word or character level, with 2-grams on top of single words ("the united states" then gives 5 different tokens: the, united, states, the united, united states), with 3-grams, or with byte-pair encoding, as many transformer models do. These tokens also make up the vocabulary. A token that isn't in the vocabulary is called out-of-vocabulary. So the first big decision you make is the tokenization method.
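To make the n-gram example concrete, here is a minimal sketch in plain Python. The function name `ngram_tokens` and the whitespace splitting are my own simplifications for illustration, not how any particular library tokenizes.

```python
def ngram_tokens(text, max_n=2):
    """Return all word n-grams of length 1..max_n, shortest first."""
    words = text.lower().split()  # naive whitespace tokenization (assumption)
    tokens = []
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            tokens.append(" ".join(words[i:i + n]))
    return tokens


# Unigrams plus 2-grams for the example phrase: five tokens in total.
tokens = ngram_tokens("the united states", max_n=2)

# The set of tokens seen so far is the vocabulary.
vocabulary = set(tokens)

# Tokens from new text that are missing from the vocabulary are out-of-vocabulary.
oov = [t for t in ngram_tokens("the united kingdom", max_n=2) if t not in vocabulary]
```

Here `tokens` holds the five tokens from the example, and `oov` picks up "kingdom" and "united kingdom" because neither was seen when the vocabulary was built.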
Then you decide how to vectorize your tokens. In the simplest form, you can use Bag-of-Words, which basically counts the number of occurrences of each token in each document. Beyond that, Word2vec is still quite useful in many industrial cases, especially when length is a concern. Each option comes with different trade-offs, and you can read more about the evolution of word vectors here: https://medium.com/co-learning-lounge/nlp-word-embedding-tfidf-bert-word2vec-d7f04340af7f
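As a rough illustration of Bag-of-Words, the sketch below counts token occurrences per document using only the standard library. The helper name `bag_of_words`, the sample documents, and the whitespace tokenization are assumptions made for the example.

```python
from collections import Counter


def bag_of_words(documents):
    """Return (vocabulary, vectors), one count vector per document."""
    tokenized = [doc.lower().split() for doc in documents]
    # Fixed, sorted vocabulary so every vector has the same column order.
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        # Counter returns 0 for tokens that do not occur in this document.
        vectors.append([counts[tok] for tok in vocabulary])
    return vocabulary, vectors


docs = ["the cat sat", "the cat saw the dog"]
vocab, vectors = bag_of_words(docs)
# vocab   -> ['cat', 'dog', 'sat', 'saw', 'the']
# vectors -> [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

Each document becomes a vector as long as the vocabulary, which is why these vectors get large and sparse on real corpora.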
Now, these vectors are basically features. When deciding which ones to use, first consider what data they were trained on. If you have a domain-specific task, you probably want something that was trained on domain-specific data, such as legal text. You should also consider the nature of the task. Certain short-text classification tasks can be handled with a rule-based system or a simple model trained with fastText, in which case you won't need a transformer model.
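To give a feel for how little a rule-based system can need, here is a minimal keyword-matching sketch. The labels, keyword sets, and the `classify` helper are all hypothetical and only meant to illustrate the idea; a real rule set would come from your own data.

```python
# Hypothetical labels and keywords, purely for illustration.
RULES = {
    "billing": {"invoice", "refund", "charge"},
    "shipping": {"delivery", "tracking", "package"},
}


def classify(text, rules=RULES, default="other"):
    """Pick the label whose keyword set overlaps most with the text."""
    words = set(text.lower().split())
    scores = {label: len(words & keywords) for label, keywords in rules.items()}
    best = max(scores, key=scores.get)
    # Fall back to the default label when no keyword matched at all.
    return best if scores[best] > 0 else default


label = classify("where is my package tracking number")
```

If something like this already covers your short texts well enough, a transformer is probably overkill; if it doesn't, a small trained classifier is the natural next step.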