# Ways to optimize semantic search


tacit charm
#

Semantic search relies on creating embeddings (vector representations of text) so that we can compute a similarity score between documents.

Suppose we have an array of blog posts. Our goal is to implement semantic search over this array by creating embeddings. If we use a single embedding for the whole post, we may encode too much irrelevant information, or the topics covered in the post may be too broad for one vector to capture.

Several ideas just came to my mind:

  1. Creating a summary of each post and embedding the summary
  2. Splitting each post into semantic segments and using several vectors to encode the semantics of that post
  3. Creating a summary for each semantic segment and embedding those summaries
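A minimal sketch of idea 2, under loud assumptions: `toy_embed` is a hashed bag-of-words stand-in for a real embedding model, segments are simply paragraphs split on blank lines, and a post is scored by its best-matching segment. All names here are hypothetical.

```python
import math
import re
import zlib

DIM = 256  # size of the toy embedding space (arbitrary choice)

def toy_embed(text):
    """Stand-in for a real embedding model: hashed bag-of-words, L2-normalised."""
    vec = [0.0] * DIM
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        vec[zlib.crc32(word.encode()) % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cosine(a, b):
    # Both vectors are unit length, so the dot product equals the cosine.
    return sum(x * y for x, y in zip(a, b))

def build_segment_index(posts):
    """Split each post on blank lines and embed every segment separately."""
    index = []  # entries are (post_id, segment_text, vector)
    for post_id, body in posts.items():
        for segment in re.split(r"\n\s*\n", body):
            if segment.strip():
                index.append((post_id, segment.strip(), toy_embed(segment)))
    return index

def search(index, query, top_k=3):
    """Rank posts by the score of their best-matching segment."""
    q = toy_embed(query)
    best = {}
    for post_id, segment, vec in index:
        score = cosine(q, vec)
        if post_id not in best or score > best[post_id][0]:
            best[post_id] = (score, segment)
    ranked = sorted(best.items(), key=lambda kv: kv[1][0], reverse=True)
    return [(pid, score, seg) for pid, (score, seg) in ranked[:top_k]]

posts = {
    "post-1": "How I bake sourdough bread at home.\n\nA note on tuning vector search indexes.",
    "post-2": "My favourite hiking trails this summer.\n\nPacking light for long hikes.",
}
index = build_segment_index(posts)
results = search(index, "vector search indexes")
```

Keeping the matching segment next to the post id also gives you a snippet to show the user. Ideas 1 and 3 would fit the same structure by embedding a summary instead of the raw segment text.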

I don't think this is a trivial question, and I would be very grateful if you shared your opinion.

echo ferry
#

I tried doing exactly this - encoding the entire document - and it worked, but it worked better after splitting the document into sentences.

#

I think the most important thing is to be able to quantify the quality of the results; otherwise you are flying blind. There are different models to choose from, different ways to calculate similarity, and different ways to split text, and your text might have some specific properties too, so there will not be a single best solution.
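One way to quantify result quality, as a sketch: hand-label a few queries with the documents that should come back, then compute recall@k and mean reciprocal rank (MRR) for each variant you try. The document ids and rankings below are made up for illustration.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)

def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def evaluate(runs, k=3):
    """Average both metrics over (ranked_ids, relevant_ids) pairs."""
    n = len(runs)
    mean_recall = sum(recall_at_k(r, rel, k) for r, rel in runs) / n
    mrr = sum(reciprocal_rank(r, rel) for r, rel in runs) / n
    return {"recall@k": mean_recall, "mrr": mrr}

# Hypothetical search output for two hand-labelled queries.
runs = [
    (["d3", "d1", "d7"], {"d1"}),        # the relevant doc is at rank 2
    (["d2", "d9", "d4"], {"d4", "d5"}),  # one of two relevant docs, at rank 3
]
metrics = evaluate(runs, k=3)
```

Running the same labelled queries against each combination of model, similarity function, and splitting strategy gives numbers you can compare directly.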

dire herald
#

Is this for a specific use case or general?

still igloo
#

Option 1 seems the most straightforward, though it requires you to have a summary of each post. You can use a model like BERT to embed everything into a vector space and use something like faiss or scann for fast semantic search between vectors. It's the example I see the most online.

echo ferry
#

or you could maybe just average embeddings for all sentences or paragraphs?

echo ferry
#

it's simple, and it (sometimes) works.

#

you can also try maximum or other ways to "pool." I think maximum would make sense if the numbers represented "strength of signal" - it would give you an aggregated sense of which signals are present. CNNs use max pooling layers a lot, and it works well.
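For illustration, mean and max pooling over a document's sentence vectors could look like the sketch below. The 4-dimensional vectors are made up; real sentence embeddings would come from whatever model you use.

```python
def mean_pool(vectors):
    """Element-wise average of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def max_pool(vectors):
    """Element-wise maximum of equally sized vectors."""
    return [max(col) for col in zip(*vectors)]

# Made-up 4-dimensional sentence embeddings for one document.
sentence_vectors = [
    [0.9, 0.0, 0.1, 0.0],
    [0.1, 0.0, 0.8, 0.0],
    [0.2, 0.0, 0.0, 0.1],
]
doc_mean = mean_pool(sentence_vectors)
doc_max = max_pool(sentence_vectors)
```

In this toy case the max-pooled vector keeps the strong 0.9 and 0.8 components that each came from a single sentence, while the mean dilutes them across all three sentences.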

still igloo
#

Wow! I didn't know that

#

Most of the optimization I'm aware of for semantic search is about how the KNN search over the database of embeddings is performed
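As a reference point, here is a sketch of exact (brute-force) KNN by cosine similarity in plain Python. This is the baseline that approximate-nearest-neighbour libraries such as faiss or scann are meant to speed up; it is not how those libraries work internally, and the tiny database is invented for the example.

```python
import heapq
import math

def normalise(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def knn(query, database, k=2):
    """Exact k-nearest-neighbour search by cosine similarity, O(N * d) per query."""
    q = normalise(query)
    scored = (
        (sum(a * b for a, b in zip(q, normalise(vec))), doc_id)
        for doc_id, vec in database.items()
    )
    # nlargest returns (similarity, doc_id) pairs, best first.
    return heapq.nlargest(k, scored)

# Made-up 3-dimensional embeddings keyed by document id.
database = {
    "a": [1.0, 0.0, 0.0],
    "b": [0.9, 0.1, 0.0],
    "c": [0.0, 1.0, 0.0],
    "d": [0.0, 0.0, 1.0],
}
neighbours = knn([1.0, 0.05, 0.0], database, k=2)
```

The cost grows linearly with the number of stored vectors, which is why large collections usually move to an approximate index and accept a small loss in recall.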

echo ferry
#

as far as I know, if you use transformers, for example, you can get a pooled output for classification. BERT gives you the [CLS] token, but apparently it's just as good to take all the token outputs and average them - I've read something about that.
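A sketch of the two options on made-up token outputs, assuming the usual layout where position 0 holds the [CLS] vector and an attention mask marks padding with 0. The numbers are invented; a real model would produce them.

```python
def cls_pool(token_vectors):
    """Use the vector at position 0, where BERT places the [CLS] token."""
    return list(token_vectors[0])

def masked_mean_pool(token_vectors, attention_mask):
    """Average token vectors, skipping padding positions (mask == 0)."""
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m]
    return [sum(col) / len(kept) for col in zip(*kept)]

# Made-up 3-dimensional outputs for "[CLS] hello world [PAD]".
token_vectors = [
    [0.2, 0.7, 0.1],  # [CLS]
    [1.0, 0.0, 0.0],  # hello
    [0.0, 1.0, 0.0],  # world
    [9.0, 9.0, 9.0],  # [PAD], must not influence the result
]
attention_mask = [1, 1, 1, 0]
cls_vec = cls_pool(token_vectors)
mean_vec = masked_mean_pool(token_vectors, attention_mask)
```

Respecting the mask matters in batches: without it, padding vectors leak into the average and short texts end up with skewed embeddings.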

#

but yeah I think the performance could be bad if your query is short and the document is long.

#

no harm trying, but like I said, when I tried it, it was far better to encode sentence by sentence, rather than document by document.

#

I like your "paraphrasing" idea but it sounds complicated. Is there a reason you don't want to encode sentence by sentence or paragraph by paragraph?

severe goblet
#

Another strategy would be to use a "global memory" in the attention structure of the transformer. That's what Longformer does.

#

(The figures are from Tianyang Lin et al. (2021).  A Survey of Transformers. In: CoRR abs/2106.04554. arXiv: 2106.04554)
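To make the "global memory" idea concrete, here is a sketch of a Longformer-style attention pattern: every token attends to a local sliding window, and a few designated global positions attend to everything and are attended to by everything. The function name and parameters are hypothetical; this only builds the boolean pattern, not the attention computation itself.

```python
def longformer_style_mask(seq_len, window, global_positions):
    """Return mask where mask[i][j] is True when token i may attend to token j."""
    half = window // 2
    g = set(global_positions)
    return [
        # Local window around i, plus full rows and columns for global tokens.
        [abs(i - j) <= half or i in g or j in g for j in range(seq_len)]
        for i in range(seq_len)
    ]

def render(mask):
    """Draw the pattern with 'x' for allowed attention and '.' otherwise."""
    return "\n".join("".join("x" if cell else "." for cell in row) for row in mask)

# Position 0 (for example a [CLS]-like token) acts as the global memory.
mask = longformer_style_mask(seq_len=8, window=2, global_positions=[0])
print(render(mask))
```

The number of allowed pairs grows roughly linearly with sequence length instead of quadratically, which is what makes long documents feasible.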

#

Now I understand that "semantic search" in 2023 means what you asked. My gripe is that in the 2000s it meant something different, and what we now call "semantic search" was called QBE (query-by-example). The semantic search work I did involved queries that specify open-ended semantic types of the information being sought, like searching for an email containing a phone number (which is still not available in Gmail as of now).
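A toy illustration of that older, type-based notion of search, assuming a deliberately loose regular expression as the "phone number" detector. The pattern, the `TYPE_DETECTORS` registry, and the messages are all hypothetical, and the pattern will produce false positives on things like dates.

```python
import re

# Loose pattern: optional "+", then at least 7 characters that start and end
# with a digit and otherwise consist of digits, spaces, dots, dashes, or parens.
PHONE = re.compile(r"\+?\d[\d\s().-]{5,}\d")

# Registry mapping a semantic type name to its detector.
TYPE_DETECTORS = {"phone_number": PHONE}

def search_by_type(messages, semantic_type):
    """Return ids of messages containing something of the requested type."""
    detector = TYPE_DETECTORS[semantic_type]
    return [m_id for m_id, body in messages.items() if detector.search(body)]

messages = {
    "m1": "Call me at +1 (555) 010-9999 tomorrow.",
    "m2": "Lunch at 12?",
    "m3": "Thanks, see you then.",
}
hits = search_by_type(messages, "phone_number")
```

The query here names a kind of thing to find rather than an example text to be similar to, which is the distinction from embedding-based search.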

#

I wonder how we can add this type of semantics to the embeddings stuff. Maybe with some disentangled representation learning, but I haven't seen that with transformers (I haven't looked for it either).

echo ferry
#

oh longformer. Cool.