Ways to optimize semantic search | Learn AI Together | Page 1

tacit charm Nov 28, 2022, 1:38 PM

#

Semantic search involves creating embeddings (vector representations of text) so that we can compute a similarity score between documents.

Let's suppose we have an array of blog posts. Our goal is to implement semantic search over this array by creating embeddings. If we use a single embedding for the whole post, we may encode too much irrelevant information, or the topics covered in the post may be too broad.

Several ideas came to mind:

1. Create a summary of each post and embed the summary
2. Split the post into semantic segments and use several vectors to encode the semantics of that post
3. Create a summary for each semantic segment

I don't think this question is trivial, and I would be very grateful if you shared your opinion.

echo ferry Nov 28, 2022, 8:32 PM

#

I tried doing exactly this - encoding the entire document - and it worked, but it worked better after splitting the document into sentences.

#

I think the most important thing is to be able to quantify the quality of the results, otherwise you are flying blind. There are different models to choose from, different ways to calculate similarity, and different ways to split text, and your text might have some specific properties too, so there will not be a single best solution.
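
One way to make "quantify the quality" concrete is a small labelled set of queries plus a couple of ranking metrics. The sketch below is only an illustration, not something from this thread: the document ids and relevance labels are made up, and recall@k and mean reciprocal rank are just two common choices among many.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    top_k = set(ranked_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    relevant = set(relevant_ids)
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(runs, k=3):
    """Average recall@k and reciprocal rank over (ranked_ids, relevant_ids) pairs."""
    recalls = [recall_at_k(ranked, rel, k) for ranked, rel in runs]
    rrs = [reciprocal_rank(ranked, rel) for ranked, rel in runs]
    return sum(recalls) / len(runs), sum(rrs) / len(runs)


# Hypothetical data: what the search returned vs. what a human marked relevant.
runs = [
    (["d3", "d1", "d7"], ["d1"]),
    (["d2", "d5", "d9"], ["d9", "d4"]),
]
mean_recall, mrr = evaluate(runs, k=3)
print(f"recall@3={mean_recall:.3f}  MRR={mrr:.3f}")
```

With numbers like these you can compare models, similarity functions, and splitting strategies on the same queries instead of eyeballing results.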

dire herald Nov 29, 2022, 1:22 PM

#

Is this for a specific use case or general?

still igloo Dec 6, 2022, 10:39 PM

#

Option 1 seems the most straightforward, though it requires you to have a summary of each post. You can use a model like BERT to embed everything into a vector space and use something like Faiss or ScaNN for fast semantic search between vectors. It's the example I see the most online.

echo ferry Dec 6, 2022, 11:18 PM

#

or you could maybe just average embeddings for all sentences or paragraphs?

still igloo Dec 7, 2022, 12:29 AM

#

Would that even work? I feel like that would actually ruin the meaning of the embeddings and result in lower accuracy for the search.

echo ferry Dec 7, 2022, 12:35 AM

#

averaging is a time-tested strategy for dealing with having too many things 🙂 For example https://www.sbert.net/docs/training/overview.html says:

Different pooling options are available, the most basic one is mean-pooling: We simply average all contextualized word embeddings BERT is giving us.

#

it's simple, and it (sometimes) works.

#

you can also try maximum or other ways to "pool." I think maximum would make sense if the numbers represented "strength of signal" - it would give you an aggregated sense of which signals are present. CNNs use max pooling layers a lot, and it works well.
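
To make the two pooling options concrete, here is a minimal sketch in plain Python. The three "sentence vectors" are invented toy numbers, not the output of any real model; in practice they would come from whatever encoder you use.

```python
def mean_pool(vectors):
    """Element-wise average of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def max_pool(vectors):
    """Element-wise maximum of equally sized vectors."""
    return [max(column) for column in zip(*vectors)]


# Toy sentence embeddings for one document (made-up values).
sentence_vectors = [
    [0.9, 0.0, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.1, 0.0],
]

doc_mean = mean_pool(sentence_vectors)  # every sentence contributes equally
doc_max = max_pool(sentence_vectors)    # keeps the strongest value per dimension
print(doc_mean, doc_max)
```

Mean pooling blends all sentences together, while max pooling keeps a dimension high if any single sentence had it high, which matches the "strength of signal" intuition.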

still igloo Dec 7, 2022, 12:59 AM

#

Wow! I didn't know that

#

Most of the optimization I'm aware of for semantic search is about performing the KNN search over the database of embeddings
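
For reference, the unoptimized baseline is an exact, brute-force KNN that scores the query against every stored vector. The sketch below assumes cosine similarity and a tiny in-memory dict as the "database" (both are assumptions for illustration); libraries like Faiss or ScaNN exist to avoid this full scan on large collections.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity of two equally sized vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn(query, index, k):
    """Exact k-nearest-neighbour search.

    index maps a document id to its embedding. Returns up to k
    (doc_id, score) pairs, most similar first.
    """
    scored = ((cosine(query, vec), doc_id) for doc_id, vec in index.items())
    return [(doc_id, score) for score, doc_id in heapq.nlargest(k, scored)]


# Hypothetical two-dimensional embeddings.
index = {"a": [1.0, 0.0], "b": [0.7, 0.7], "c": [0.0, 1.0]}
neighbours = knn([1.0, 0.1], index, k=2)
print(neighbours)
```

This costs one similarity computation per stored vector per query, which is the part approximate indexes try to cut down.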

echo ferry Dec 7, 2022, 1:10 AM

#

as far as I know, if you use transformers, for example, you can get a pooled output for classification. BERT gives you the [CLS] token, but apparently it's just as good to take all the outputs and average them - I've read something about that.

#

but yeah I think the performance could be bad if your query is short and the document is long.

#

no harm trying, but like I said, when I tried it, it was far better to encode sentence by sentence, rather than document by document.
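
Here is a rough sketch of why a short query can get lost against a long document. It is not the setup I actually used: `toy_embed` is a bag-of-words stand-in for a real sentence encoder, the sentence splitter is naive, and the example post is invented. It only shows the scoring difference between one vector per document and the best-matching sentence.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Stand-in for a real encoder: a sparse bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def split_sentences(text):
    """Naive split after ., ! or ?; a real pipeline would use a proper tokenizer."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def score_whole(query, doc):
    """One vector for the whole document."""
    return cosine(toy_embed(query), toy_embed(doc))


def score_best_sentence(query, doc):
    """One vector per sentence; the document scores as its best sentence."""
    q = toy_embed(query)
    return max((cosine(q, toy_embed(s)) for s in split_sentences(doc)), default=0.0)


post = (
    "I spent the weekend hiking in the mountains. "
    "The weather was great and the food was better. "
    "Back home I tuned faiss index parameters."
)
query = "faiss index parameters"
whole = score_whole(query, post)
best = score_best_sentence(query, post)
print(f"whole document: {whole:.3f}  best sentence: {best:.3f}")
```

The unrelated sentences dilute the whole-document vector, so the one relevant sentence scores higher on its own.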

#

I like your "paraphrasing" idea but it sounds complicated. Is there a reason you don't want to encode sentence by sentence/ paragraph by paragraph?

severe goblet Dec 8, 2022, 12:40 AM

#

Another strategy would be to use a "global memory" in the attention structure of the transformer. That's what Longformer does.

#

(The figures are from Tianyang Lin et al. (2021). A Survey of Transformers. In: CoRR abs/2106.04554. arXiv: 2106.04554)
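
As a rough illustration of the "global memory" idea, the sketch below builds the kind of attention pattern Longformer combines: a sliding local window plus a few global positions that attend to everything and are attended to by everything. The function name and parameters are made up for this example, and it only constructs the boolean pattern; it is not the Longformer implementation.

```python
def longformer_style_mask(seq_len, window, global_positions):
    """Return mask where mask[i][j] is True if token i may attend to token j.

    window is the total width of the local band (window // 2 tokens on
    each side); global_positions are token indices with global attention.
    """
    global_set = set(global_positions)
    half = window // 2
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            local = abs(i - j) <= half                    # sliding-window attention
            is_global = i in global_set or j in global_set  # global "memory" tokens
            row.append(local or is_global)
        mask.append(row)
    return mask


mask = longformer_style_mask(seq_len=8, window=2, global_positions=[0])
allowed = sum(cell for row in mask for cell in row)
print(f"{allowed} of {8 * 8} token pairs may attend")
for row in mask:
    print("".join("x" if cell else "." for cell in row))
```

The number of allowed pairs grows roughly linearly with sequence length instead of quadratically, which is what makes long documents tractable.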

#

Now I understand that "semantic search" in 2023 means what you asked. My gripe is that in the 2000s it meant something different, and what we now call "semantic search" was called QBE (query-by-example). The semantic search work I did involved queries that specify open-ended semantic types of the information being sought. Like searching for an email containing a phone number (which is still not available in Gmail as of now).

#

This type of stuff: http://duboue.net/papers/Semantic Search with XML Fragments Columbia.pdf

#

I wonder how we can add this type of semantics to the embeddings stuff. Maybe with some disentangled representation learning, but I haven't seen that with transformers (I haven't looked for it either).

echo ferry Dec 8, 2022, 12:49 AM

#

oh longformer. Cool.
