#Getting started with Embeddings
Hi @keen iron. I'm kapa.ai, your friendly support bot. I will try to answer your question by searching through the knowledge base. :wave:
Sure, I can guide you on how to get started with embeddings for searching a large body of text. Here's a step-by-step guide based on the information extracted from OpenAI's resources:
- Prepare your search data: This is a one-time process per document. You need to collect your data, chunk it into smaller sections, embed each section using the OpenAI API, and then store these embeddings. For large datasets, consider using a vector database. This process is detailed in the OpenAI Cookbook example (https://cookbook.openai.com/examples/question_answering_using_embeddings#full-procedure).
- Perform the search: Given a user query, generate an embedding for the query using the OpenAI API. Then, using these embeddings, rank the text sections by relevance to the query. This is also explained in the OpenAI Cookbook example (https://cookbook.openai.com/examples/question_answering_using_embeddings#full-procedure).
Here's a more detailed breakdown of the process:
- Before the search (precompute):
  - Split your text corpus into chunks smaller than the token limit (8,191 tokens for text-embedding-ada-002)
  - Embed each chunk of text
  - Store those embeddings in your own database or in a vector search provider like Pinecone, Weaviate, or Qdrant
- At the time of the search (live compute):
  - Embed the search query
  - Find the closest embeddings in your database
  - Return the top results
This process is explained in the OpenAI Cookbook article (https://cookbook.openai.com/articles/text_comparison_examples).
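To make the two phases concrete, here is a minimal sketch of the precompute and live-compute steps. It does not call the OpenAI API: `toy_embed` is a hypothetical stand-in (a simple bag-of-words count) so the example runs on its own, and `build_index` and `search` are illustrative names, not part of any library. In a real setup you would replace `toy_embed` with a call to your embedding model and the in-memory list with your database or vector store.

```python
import math
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedding API call: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as dict-like counters."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_index(chunks, embed=toy_embed):
    """Precompute: embed every chunk once and keep (chunk, embedding) pairs."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def search(query, index, top_n=3, embed=toy_embed):
    """Live compute: embed the query, rank chunks by similarity, return the top results."""
    query_vec = embed(query)
    scored = [(cosine_similarity(query_vec, vec), chunk) for chunk, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]


chunks = [
    "Embeddings map text to vectors of numbers.",
    "A vector database stores embeddings for fast lookup.",
    "Bananas are rich in potassium.",
]
index = build_index(chunks)
results = search("vector database for embeddings", index, top_n=2)
for score, chunk in results:
    print(f"{score:.3f}  {chunk}")
```

The index is built once, while `search` runs for every query, which mirrors the precompute versus live-compute split described above.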
- Use the embeddings: Embeddings can be used for semantic search, recommendations, cluster analysis, near-duplicate detection, and more. They can be used either by themselves or as a feature in a larger system. The cosine similarity of embeddings can be used as one feature among many in ranking search results.
For more information, you can read OpenAI's blog post announcements:
- Introducing Text and Code Embeddings (Jan 2022) (https://openai.com/blog/introducing-text-and-code-embeddings/)
- New and Improved Embedding Model (Dec 2022) (https://openai.com/blog/new-and-improved-embedding-model/)
For comparison with other embedding models, see the Massive Text Embedding Benchmark (MTEB) Leaderboard (https://huggingface.co/spaces/mteb/leaderboard).
If you go on to pass the search results to a chat model to answer questions, keep in mind that exact costs depend on system specifics and usage patterns. With gpt-3.5-turbo at ~1,000 tokens per query, a query costs ~$0.002, or ~500 queries per dollar (as of Apr 2023). With gpt-4, again assuming ~1,000 tokens per query, a query costs ~$0.03, or ~30 queries per dollar (as of Apr 2023), as mentioned in the OpenAI Cookbook example (https://cookbook.openai.com/examples/question_answering_using_embeddings#full-procedure).
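The arithmetic behind those figures can be sketched as follows. The function names are illustrative, and the sketch assumes a single flat price per 1,000 tokens derived from the Apr 2023 numbers quoted above; actual pricing may differ between prompt and completion tokens and changes over time.

```python
def cost_per_query(tokens_per_query, price_per_1k_tokens):
    """Dollar cost of one query at a flat per-1,000-token price (an assumption)."""
    return tokens_per_query / 1000 * price_per_1k_tokens


def queries_per_dollar(tokens_per_query, price_per_1k_tokens):
    """How many queries one dollar buys at that cost."""
    return 1 / cost_per_query(tokens_per_query, price_per_1k_tokens)


# Apr 2023 figures from the text above: ~1,000 tokens per query.
gpt35 = queries_per_dollar(1000, 0.002)
gpt4 = queries_per_dollar(1000, 0.03)
print(round(gpt35), round(gpt4))
```

For gpt-4 this works out to roughly 33 queries per dollar, which the quoted figure rounds to ~30.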
I hope this helps you get started with embeddings for text search!
Note: you can ask me a follow up question by @ mentioning me again :speech_balloon:
@keen iron kapa.ai is still learning and improving, please let us know how it did by reacting below
@silk willow how should I create chunks to embed? How do I basically decide the context length of each chunk?
Let me check the knowledge base for you. One moment! :books:
Creating chunks for embedding involves dividing your text into smaller sections that are within the token limit of the model you're using. For the text-embedding-ada-002 model, the maximum context length is 8191 tokens.
Here's a Python function from the OpenAI Cookbook (https://cookbook.openai.com/examples/embedding_long_inputs#2-chunking-the-input-text) that breaks up a sequence into chunks:
from itertools import islice

def batched(iterable, n):
    """Batch data into tuples of length n. The last batch may be shorter."""
    # batched('ABCDEFG', 3) --> ABC DEF G
    if n < 1:
        raise ValueError('n must be at least one')
    it = iter(iterable)
    while (batch := tuple(islice(it, n))):
        yield batch
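You can use this function to split a sequence of tokens into chunks. It works on any iterable, so tokenize your text first (the Cookbook example does this with tiktoken) and pass the resulting token sequence in, not the raw string. The n parameter is the size of each chunk in tokens. It must not exceed the maximum context length of your model (8191 for text-embedding-ada-002).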
The context length of each chunk is essentially the number of tokens in each chunk. You decide this based on the maximum context length of the model you're using. If your text is longer than this maximum context length, you'll need to divide it into chunks that are within this limit.
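As a rough illustration of chunking by token count, the sketch below tokenizes first and then batches the tokens. `simple_tokenize` is a hypothetical whitespace tokenizer used only so the example runs without extra dependencies; a real model tokenizer produces different tokens and counts, so swap it in before relying on the limits. `chunked_tokens` here is also just an illustrative helper.

```python
from itertools import islice


def batched(iterable, n):
    """Batch data into tuples of length n. The last batch may be shorter."""
    if n < 1:
        raise ValueError('n must be at least one')
    it = iter(iterable)
    while (batch := tuple(islice(it, n))):
        yield batch


def simple_tokenize(text):
    """Hypothetical stand-in tokenizer: splits on whitespace (not a model tokenizer)."""
    return text.split()


def chunked_tokens(text, chunk_length, tokenize=simple_tokenize):
    """Tokenize text, then yield consecutive chunks of at most chunk_length tokens."""
    yield from batched(tokenize(text), chunk_length)


text = "one two three four five six seven"
chunks = list(chunked_tokens(text, chunk_length=3))
print(chunks)
```

Each chunk holds at most `chunk_length` tokens, and only the last one may be shorter.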
Remember, the choice of chunk size can depend on the nature of your text and what you're trying to achieve. For example, if you're working with a document like The Bible, which has many verses, you might want to consider reducing the chunk size to capture more granularity. However, there's no one-size-fits-all answer here, and you might need to experiment to see what works best for your specific use case.
@silk willow what if while creating chunks, words end up breaking at the end and the start of two consecutive chunks? Would it not hurt the context and coherence of the text?
Let me check the knowledge base for you. One moment! :books:
Yes, you're correct. If words are split between chunks, it can disrupt the context and coherence of the text. This is because the model treats each chunk independently, and it won't understand that a word from the end of one chunk is connected to the beginning of the next chunk.
To avoid this, you can implement a strategy to ensure that chunks are split at natural boundaries in the text, such as at the end of sentences or paragraphs. This way, each chunk is a coherent piece of text in itself.
Here's a simple Python function that splits a text into chunks at sentence boundaries:
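import re

def chunk_text(text, max_tokens, count_tokens=lambda s: len(s.split())):
    """Split text into chunks of at most max_tokens at sentence boundaries.

    count_tokens defaults to a rough whitespace word count (an approximation);
    pass a counter based on your model's tokenizer for exact limits.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks = []
    current_chunk = ''
    for sentence in sentences:
        candidate = f'{current_chunk} {sentence}'.strip()
        if current_chunk and count_tokens(candidate) > max_tokens:
            # Adding this sentence would exceed the limit: close the chunk.
            chunks.append(current_chunk)
            current_chunk = sentence
        else:
            current_chunk = candidate
    # Keep the final chunk, skipping it if the input was empty.
    if current_chunk:
        chunks.append(current_chunk)
    return chunks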
In this function, max_tokens is the maximum number of tokens you want in each chunk. The function uses a regular expression to split the text into sentences, then iterates over the sentences, adding each one to the current chunk until adding another sentence would exceed the maximum number of tokens. At that point, it starts a new chunk. Sentences are never split, so a single sentence longer than max_tokens still ends up as its own oversized chunk.
Please note that this is a simple example and might not work perfectly for all types of text. For example, it assumes that sentences always end with a period, exclamation mark, or question mark, which might not be true in all cases. You might need to adjust the function to better suit your specific needs.
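The same idea applies to the paragraph boundaries mentioned earlier. Here is a minimal sketch, assuming paragraphs are separated by blank lines and using a whitespace word count as a rough proxy for tokens. `chunk_by_paragraph` and `max_words` are illustrative names, not from the Cookbook.

```python
import re


def chunk_by_paragraph(text, max_words):
    """Group whole paragraphs into chunks of at most max_words words.

    Words are a rough proxy for tokens here (an assumption). A single
    paragraph longer than max_words is kept whole as its own chunk.
    """
    # Paragraphs are assumed to be separated by one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r'\n\s*\n', text) if p.strip()]
    chunks = []
    current = []
    current_words = 0
    for paragraph in paragraphs:
        n_words = len(paragraph.split())
        if current and current_words + n_words > max_words:
            # The next paragraph would not fit: close the current chunk.
            chunks.append('\n\n'.join(current))
            current = []
            current_words = 0
        current.append(paragraph)
        current_words += n_words
    if current:
        chunks.append('\n\n'.join(current))
    return chunks


text = (
    "First paragraph here.\n\n"
    "Second one.\n\n"
    "Third paragraph is a bit longer than the others."
)
chunks = chunk_by_paragraph(text, max_words=5)
for chunk in chunks:
    print(repr(chunk))
```

Paragraph-level chunks tend to be larger than sentence-level ones, so which boundary works better depends on your text and is worth testing on your own data.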