Semantic search relies on embeddings (vector representations of text), which let us compute a similarity score between documents.
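To make the similarity score concrete, here is a minimal sketch of cosine similarity, one common way to compare two embeddings. The three-dimensional vectors are made up for illustration; real embeddings would come from a model and have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)

# Made-up "embeddings", for illustration only
query = [1.0, 0.0, 1.0]
doc_a = [2.0, 0.0, 2.0]  # same direction as the query
doc_b = [0.0, 3.0, 0.0]  # orthogonal to the query

print(cosine_similarity(query, doc_a))
print(cosine_similarity(query, doc_b))
```

Cosine similarity ignores vector length, so doc_a scores as a perfect match for the query even though its components are twice as large.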
Suppose we have an array of blog posts, and the goal is to implement semantic search over it using embeddings. If we use a single embedding for the whole post, it may encode too much irrelevant information, or the topics covered in the post may be too broad for one vector to capture.
A few ideas come to mind:
- Create a summary of each post and embed the summary
- Segment each post semantically and use several vectors, one per segment, to encode the semantics of the post
- Create a summary of each semantic segment and embed those summaries
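The segmentation idea can be sketched roughly as follows. Everything here is an assumption made for illustration: `toy_embed` is a hypothetical word-count stand-in for a real embedding model, `split_segments` simply splits on blank lines instead of doing real semantic segmentation, and scoring a post by its best-matching segment is only one possible way to aggregate.

```python
import math
import re
from collections import Counter

def toy_embed(text):
    """Hypothetical placeholder for a real embedding model: sparse word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def split_segments(post):
    """Stand-in for semantic segmentation: split on blank lines."""
    return [s.strip() for s in post.split("\n\n") if s.strip()]

def build_index(posts):
    """One (post_id, segment_vector) entry per segment of every post."""
    return [
        (post_id, toy_embed(segment))
        for post_id, post in enumerate(posts)
        for segment in split_segments(post)
    ]

def search(query, index, top_k=3):
    """Rank posts by the score of their best-matching segment."""
    q = toy_embed(query)
    best = {}
    for post_id, vec in index:
        best[post_id] = max(cosine(q, vec), best.get(post_id, 0.0))
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

posts = [
    "How to bake sourdough bread at home.\n\nNotes on my trip to Lisbon last spring.",
    "Training neural networks with gradient descent.\n\nChoosing a learning rate schedule.",
]
index = build_index(posts)
results = search("learning rate for gradient descent", index)
print(results)
```

The summary-based ideas would fit the same structure: replace the segment text passed to `toy_embed` with a summary of that segment (or of the whole post), produced by whatever summarizer is available.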
I don't think this is a trivial question, and I would be very grateful if you shared your opinion.