I'd like to embed a few sentences that contain a number of out-of-vocabulary words, for example "LFT_GUG is 5 inches".
In this case, LFT_GUG is split into multiple sub-tokens or characters (due to the nature of the tokenizer) and loses its integrity as a single unit when computing cosine similarity.
I would love to see how I can embed tokens instead of sentences, so that I can manually make sure such words are not broken down and stay intact during tokenization.
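
To make the problem concrete, here is a toy sketch in plain Python. It is not any real library's tokenizer: the vocabulary and the `tokenize` helper are made up for illustration, and it assumes a simple greedy longest-match split with a single-character fallback. It shows the term being split when it is missing from the vocabulary and staying whole once it is added.

```python
def tokenize(word, vocab):
    """Greedy longest-match split of `word` into pieces found in `vocab`.

    Positions where no vocab entry matches fall back to a single character.
    """
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest candidate substring first, shrinking until one matches.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # Nothing in the vocab starts here: emit the single character.
            pieces.append(word[i])
            i += 1
    return pieces


# Hypothetical toy vocabulary; a real model's vocabulary is far larger.
vocab = {"is", "5", "inches", "L", "F", "T", "G", "U", "_"}

sentence = "LFT_GUG is 5 inches"
before = [tokenize(w, vocab) for w in sentence.split()]

# Register the domain term as a whole token (the step I would like to do manually).
vocab_with_term = vocab | {"LFT_GUG"}
after = [tokenize(w, vocab_with_term) for w in sentence.split()]

print(before[0])  # ['L', 'F', 'T', '_', 'G', 'U', 'G']
print(after[0])   # ['LFT_GUG']
```

The sketch only covers the splitting step; it does not show how an embedding model would assign a vector to the newly added token.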
Thanks in advance!