# PDF Extraction
Hi @ivory magnet. I'm kapa.ai, your friendly support bot. I will try to answer your question by searching through the knowledge base. :wave:
To extract information from PDFs while maintaining document structure—especially tables, hierarchical text, and nested tables—OpenAI’s GPT-4o vision modality offers a robust solution. Here’s a summary of the recommended approach based on the knowledge sources:
1. Convert PDF Pages to Images
Since GPT-4o does not natively process PDFs, first convert each page to an image (e.g., PNG). This can be done using libraries like pdf2image or PyMuPDF (fitz). Example code for conversion:
import base64

import fitz  # PyMuPDF


def pdf_to_base64_images(pdf_path):
    """Render each PDF page to PNG and return the pages as base64 strings."""
    base64_images = []
    with fitz.open(pdf_path) as pdf_document:
        for page_num in range(len(pdf_document)):
            page = pdf_document.load_page(page_num)
            pix = page.get_pixmap()
            # Encode the PNG bytes in memory; no temporary file is needed
            png_bytes = pix.tobytes("png")
            base64_images.append(base64.b64encode(png_bytes).decode("utf-8"))
    return base64_images
See full example and explanation
2. Use GPT-4o Vision for Structured Extraction
Pass each base64-encoded image to GPT-4o with a detailed system prompt instructing it to extract all data, preserving tables, hierarchies, and nested structures. The prompt should specify:
- Extract tables as JSON arrays, preserving rows and columns (including blanks as null).
- Maintain hierarchical relationships in the output JSON.
- Do not interpolate or make up data.
Example extraction function:
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def extract_invoice_data(base64_image):
    system_prompt = """
    Extract all data from this document page, preserving tables (as arrays), hierarchical text, and nested structures.
    Output as JSON. Include blank fields as null. Do not make up data.
    """
    response = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": system_prompt},
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Extract all structured data from this page."},
                    {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{base64_image}", "detail": "high"}},
                ],
            },
        ],
        temperature=0.0,
    )
    # The content is a JSON string; parse it with json.loads before use
    return response.choices[0].message.content
See full code and prompt details
3. Combine and Store Results
If your document spans multiple pages, extract each page and combine the results into a single JSON file. This approach preserves the structure across the entire document.
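A minimal sketch of the combining step, assuming each page's extraction comes back as a JSON string (as extract_invoice_data returns above). The function name combine_pages and the "page"/"data" keys are illustrative choices, not part of the Cookbook code.

```python
import json


def combine_pages(page_json_strings):
    """Parse each page's JSON string and collect the results in page order."""
    combined = []
    for page_number, raw in enumerate(page_json_strings, start=1):
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            # Keep a placeholder so page numbering stays aligned with the PDF
            data = None
        combined.append({"page": page_number, "data": data})
    return combined


# Toy input: one valid page and one page whose output failed to parse
pages = ['{"items": [["Widget", 2, null]]}', "not json"]
print(json.dumps(combine_pages(pages), indent=2))
```

The combined list can then be written to a single file with json.dump. Pages that failed to parse stay visible as null entries, so they can be retried.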
4. Chunking and Embedding (for RAG)
For downstream tasks like Retrieval-Augmented Generation (RAG), chunk the extracted content thoughtfully:
- Chunk by logical sections or table boundaries, not just by token count.
- Preserve metadata about the chunk’s position and hierarchy.
- Store embeddings for each chunk for efficient retrieval.
See chunking and embedding best practices
Summary:
By converting PDF pages to images and using GPT-4o’s vision capabilities with a carefully crafted prompt, you can extract structured data—including complex tables and hierarchies—into JSON. This method is robust for multilingual and visually complex documents, and is well-suited for further processing or RAG pipelines.
Full workflow and code examples here.
Note: you can ask me a follow up question by @ mentioning me again :speech_balloon:
@ivory magnet Kapa.ai is still learning and improving, please let me know how I did by reacting below.
@fluid garden I will share the PDF. Will you be able to extract it?
Let me check the knowledge base for you again. One moment! :books:
I'm sorry, but I cannot directly process or extract information from a PDF file that you upload here. My capabilities are limited to providing guidance, code examples, and best practices for PDF extraction using OpenAI tools, based on the knowledge sources provided.
If you share the PDF file, I cannot access or process it myself. However, I can show you how to extract information from your PDF using OpenAI's GPT-4o vision capabilities and Python code. If you need step-by-step instructions or sample code, please let me know!
For example, you can use the workflow described here to extract structured data from your PDF using GPT-4o.
@fluid garden please check this code. When I ask a question, the answer shows:
Querying: 'Parking Management'
<ipython-input-40-d29cadbc92dc>:38: LangChainDeprecationWarning: The method BaseRetriever.get_relevant_documents was deprecated in langchain-core 0.1.46 and will be removed in 1.0. Use :meth:~invoke instead.
bm25_results = bm25_retriever.get_relevant_documents(query)
Hybrid Search Results:
Document 1:
File: ZKBio CVSecurity_V6.4.0_R_Datasheet_202411.pdf
Page: 1
Hybrid Score: 0.8707
BM25 Score: 1.0000
Vector Score: 0.6767
Content (first 100 chars): START OF PAGE: 1 # ZKBio CVSecurity ## Version 6.4.0_R All-in-One Biometric Security Advanced Soluti...
Document 2:
File: ZKBio CVSecurity_V6.4.0_R_Datasheet_202411.pdf
Page: 7
Hybrid Score: 0.7155
BM25 Score: 0.8750
Vector Score: 0.4762
Content (first 100 chars): START OF PAGE: 7 # System Topology ```mermaid graph TD A[Standby Server] --> B[Main Server] B --> C[...
Document 3:
File: ZKBio CVSecurity_V6.4.0_R_Datasheet_202411.pdf
Page: 2
Hybrid Score: 0.5988
BM25 Score: 0.7500
Vector Score: 0.1220
Content (first 100 chars): START OF PAGE: 2 # Features of Main Modules Module Features Access Control Access Level: Precisely m...
Document 4:
File: ZKBio CVSecurity_V6.4.0_R_Datasheet_202411.pdf
Page: 5
Hybrid Score: 0.5153
BM25 Score: 0.5000
Vector Score: 0.5383
Content (first 100 chars): START OF PAGE: 5 # Software Specifications ## Operating System Server-Side Windows7/8/10/11, Windows...
Document 5:
File: ZKBio CVSecurity_V6.4.0_R_Datasheet_202411.pdf
Page: 6
Hybrid Score: 0.4835
BM25 Score: 0.2500
Vector Score: 0.8337
Content (first 100 chars): START OF PAGE: 6 # Server Configuration Application Scenario Device Protocol Device Quantity Deploym...
Can you check the PDF and see whether my RAG is correct?
Let me check the knowledge base for you again. One moment! :books:
I'm sorry, but I cannot directly check or access your PDF file or the actual content of your RAG pipeline results, as you have not provided the PDF itself or its full content. Based on your message, you have shared the output of a hybrid search (BM25 + vector search) over a PDF, and you are asking if your Retrieval-Augmented Generation (RAG) setup is correct.
Based on the knowledge sources, here are some points to consider for validating your RAG pipeline:
- Hybrid Search Approach: Combining BM25 (keyword-based) and vector (embedding-based) search is a recommended practice. Hybrid search can improve retrieval accuracy by leveraging both semantic and keyword matches. Your output shows both BM25 and vector scores, which is a good sign that you are using a hybrid approach as described in best practices for retrieval in RAG systems (see technical patterns for retrieval).
- Result Presentation: Displaying the file name, page number, hybrid score, BM25 score, vector score, and a content preview for each result is helpful for debugging and validation. This matches the recommended approach for inspecting retrieval quality.
- Deprecation Warning: The warning about BaseRetriever.get_relevant_documents being deprecated in LangChain means you should update your code to use the new .invoke method (for example, bm25_retriever.invoke(query)), as future versions may remove the old method.
- Content Relevance: The actual relevance of the retrieved content to your query ("Parking Management") depends on the content of your PDF. If the top results are indeed about parking management, your RAG is working as intended. If not, you may need to adjust chunking, embedding, or retrieval parameters.
- Chunking and Structure: Make sure your PDF was chunked in a way that preserves logical sections and context, as this impacts retrieval quality. The knowledge sources recommend chunking by logical units (e.g., sections, tables) and adding metadata for better retrieval (see chunking and embedding best practices).
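For reference, four of the five hybrid scores in the pasted output (documents 1, 2, 4, and 5) match a weighted sum of 0.6 × BM25 + 0.4 × vector score. That weighting is inferred from the numbers and is not stated anywhere in the thread, so the sketch below is an assumption about how the scores were combined:

```python
def hybrid_score(bm25, vector, bm25_weight=0.6):
    """Weighted sum of normalized BM25 and vector scores (the weight is assumed)."""
    return bm25_weight * bm25 + (1 - bm25_weight) * vector


# (BM25, vector) pairs copied from the pasted output
reported = {
    1: (1.0000, 0.6767),
    2: (0.8750, 0.4762),
    4: (0.5000, 0.5383),
    5: (0.2500, 0.8337),
}
for doc_id, (bm25, vector) in reported.items():
    print(doc_id, round(hybrid_score(bm25, vector), 4))
```

Document 3 is the exception: the same formula gives 0.4988 rather than the reported 0.5988, which may be worth checking in the scoring code.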
Summary:
Your RAG pipeline appears to follow recommended practices for hybrid search and result presentation. However, without access to your actual PDF or the full content of the retrieved chunks, I cannot definitively confirm if the results are correct or relevant to your query. Please review the retrieved content to ensure it matches your expectations for "Parking Management." If you notice irrelevant results, consider refining your chunking strategy, retrieval parameters, or prompt engineering as suggested in the knowledge sources.
If you need more specific help, please provide the relevant text chunks or clarify what aspect of the RAG pipeline you want to validate.
This answer is based solely on the information and best practices described in the provided OpenAI Cookbook sources. I cannot directly access or review your PDF file or its content.
@fluid garden so what do I need to do? Can you provide code? I will try it.
Let me check the knowledge base for you again. One moment! :books:
Sorry @ivory magnet, I am experiencing especially high traffic right now, please try again. :technologist:
@fluid garden I also shared the PDF with you, please look into it. I shared all the prompts too, please check them.
Let me check the knowledge base for you again. One moment! :books:
I'm sorry, but I do not have the capability to directly access or review PDF files that you mention or upload. My responses are based solely on the information provided in the text of your messages and the knowledge sources available to me.
From your previous messages, you shared the output of your RAG pipeline and described your process, but you did not provide the actual content of the PDF or the full prompts used. Without the actual PDF content or a detailed transcript of the relevant sections, I cannot verify if your RAG system is retrieving the correct information for your query.
What I can confirm based on your description and the knowledge sources:
- Your use of hybrid search (BM25 + vector search) aligns with best practices for RAG pipelines, as it combines keyword and semantic retrieval for improved accuracy.
- Displaying hybrid, BM25, and vector scores, along with content previews, is a recommended approach for debugging and validation.
- The deprecation warning suggests you should update your code to use the .invoke method instead of get_relevant_documents in LangChain, as per recent updates.
However, I cannot:
- Check the actual content of your PDF.
- Confirm if the retrieved results are truly relevant to your query without seeing the full text of the retrieved chunks and the original question.
If you want a more specific review, please paste the relevant text chunks or the full prompt and answer as text in your message. Otherwise, based on the knowledge sources, your approach appears correct in structure, but the final validation of content relevance must be done by reviewing the actual retrieved text yourself or by sharing it here in text form for review.
If you need further guidance on best practices for RAG pipelines, chunking, or prompt engineering, let me know!
[This answer is based solely on the information in the provided knowledge sources and your text.]
@fluid garden see, this is my PDF. Can you help me build an advanced and powerful RAG?
Let me check the knowledge base for you again. One moment! :books:
Certainly! Here’s a step-by-step guide to building an advanced and powerful Retrieval-Augmented Generation (RAG) pipeline for PDF documents, based strictly on the best practices and technical patterns from the OpenAI Cookbook knowledge sources:
1. Data Preparation
Chunking
- Why: Chunking breaks large documents into smaller, manageable pieces, reducing cost and latency, and improving retrieval accuracy.
- How:
- Decide on chunk size and overlap. Consider logical boundaries (sections, tables) rather than just fixed token counts.
- Add metadata (e.g., section titles, page numbers) to each chunk for better context and filtering.
- Avoid too much low-quality content or overly large documents, as these hinder performance.
- Example:
Chunk by page, section, or table, and prepend each chunk with its title or description for context.
See chunking best practices
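As a rough illustration of page-level chunking with metadata, the sketch below splits text on the "START OF PAGE: n" markers visible in the retrieval output earlier in this thread. It assumes the markers and markdown-style "# " headings sit on their own lines in the full text; chunk_by_page and its dictionary keys are hypothetical names.

```python
import re

PAGE_MARKER = re.compile(r"START OF PAGE: (\d+)")
FIRST_HEADING = re.compile(r"^#\s+(.+)$", flags=re.MULTILINE)


def chunk_by_page(text, source_file):
    """Split text on page markers and attach file, page, and title metadata."""
    # With a capturing group, split() returns [preamble, page_no, body, page_no, body, ...]
    parts = PAGE_MARKER.split(text)
    chunks = []
    for page_no, body in zip(parts[1::2], parts[2::2]):
        body = body.strip()
        if not body:
            continue
        heading = FIRST_HEADING.search(body)
        chunks.append({
            "file": source_file,
            "page_number": int(page_no),
            "title": heading.group(1).strip() if heading else None,
            "body": body,
        })
    return chunks


sample = "START OF PAGE: 1\n# System Topology\nbody one\nSTART OF PAGE: 2\nno heading here"
for chunk in chunk_by_page(sample, "example.pdf"):
    print(chunk["page_number"], chunk["title"])
```

Keeping the title and page number as separate fields makes it easy to prepend them to the text before embedding, and to filter on them at query time.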
Augmenting Content
- How: Use LLMs (like GPT-4) to rephrase, summarize, or add bullet points to each chunk. Add descriptive metadata and keywords users might search for.
- Multimodality: Use models like Whisper or GPT-4V to convert audio/visual content to text if your PDFs contain such data.
2. Embedding
- What to Embed: Not just the text, but also metadata (titles, tags, product specs, etc.) to improve searchability.
- How: Use OpenAI’s embedding models to convert each chunk (with metadata) into a vector representation.
- Store: Save these embeddings in a vector database for efficient semantic search.
3. Retrieval
Hybrid Search
- Combine: Use both semantic search (vector similarity) and keyword/entity search (BM25 or similar).
- How:
- Extract entities using NER for deterministic queries.
- Use hybrid search to combine results from both methods, and re-rank them for relevance.
- Re-ranking: Use rule-based or algorithmic re-ranking (e.g., BERT-based, cross-encoder, TF-IDF) to surface the most relevant results.
- Multi-step Retrieval: For complex queries, use a chain-of-thought approach—break down the query, perform multiple retrievals, and combine results as needed.
- Cache: Cache outputs where possible to improve efficiency.
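One common way to combine ranked lists without comparing raw scores is reciprocal rank fusion (RRF). RRF is not mentioned in the knowledge sources; it is swapped in here purely as an illustration of the "combine and re-rank" step, and the document ids are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of ids: each list contributes 1 / (k + rank) per id."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)


vector_ranking = ["p6", "p1", "p5"]   # hypothetical semantic-search order
keyword_ranking = ["p1", "p7", "p2"]  # hypothetical BM25 order
print(reciprocal_rank_fusion([vector_ranking, keyword_ranking]))
```

An id that appears in both lists (here "p1") rises to the top even though it is not first in the semantic ranking.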
4. Answer Generation
- Context Window: Only include the most relevant chunks in the LLM’s context window. Be mindful of the max context size and the “lost in the middle” problem.
- Prompt Engineering: Experiment with prompts to guide the LLM in using the retrieved context effectively. Provide few-shot examples if needed.
- Safety Checks: Implement guardrails and use frameworks like RAGAS to evaluate faithfulness, answer relevancy, context precision, and recall.
- Optimization: Try prompt engineering and few-shot examples before resorting to fine-tuning.
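The context-window point can be sketched as a greedy selection under a token budget. This assumes each retrieved chunk is a dict with "body" and "score" keys, and it uses a whitespace word count as a crude stand-in for a real tokenizer. Both are assumptions for illustration.

```python
def select_context(chunks, max_tokens, count_tokens=lambda text: len(text.split())):
    """Greedily keep the highest-scoring chunks that still fit in the budget."""
    selected, used = [], 0
    for chunk in sorted(chunks, key=lambda c: c["score"], reverse=True):
        cost = count_tokens(chunk["body"])
        if used + cost <= max_tokens:
            selected.append(chunk)
            used += cost
    return selected


candidates = [
    {"body": "a b c", "score": 0.9},
    {"body": "d e f g", "score": 0.5},
    {"body": "h i", "score": 0.7},
]
print([c["body"] for c in select_context(candidates, max_tokens=5)])
```

Passing a real tokenizer as count_tokens would make the budget accurate for a specific model.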
5. Best Practices & Common Pitfalls
- Best Practices:
- Pre-process content for LLM consumption (summaries, headers).
- Curate high-quality, relevant data sources.
- Evaluate performance after each experiment and iterate.
- Common Pitfalls:
- Too much low-quality content.
- Overly large documents or context windows.
- Skipping prompt engineering and jumping straight to fine-tuning.
Summary Workflow
- Parse and chunk your PDF (by logical sections, with metadata).
- Augment and embed each chunk (text + metadata).
- Store embeddings in a vector DB.
- At query time:
- Use hybrid retrieval (semantic + keyword/entity).
- Re-rank results.
- Select top chunks for the LLM context window.
- Generate answer using LLM, with prompt engineering and safety checks.
You can find detailed code examples and further explanations in the OpenAI Cookbook RAG PDF example and data extraction with GPT-4o vision.
If you have a specific part of your pipeline you want to improve (e.g., chunking, retrieval, answer generation), let me know and I can provide more targeted guidance!
Let me check the knowledge base for you again. One moment! :books:
Your question—"is this good enough for chunking"—is a bit open-ended, but based on the knowledge sources, here’s how to evaluate whether your chunking approach is effective:
Best Practices for Chunking:
- Chunk by logical boundaries: Chunking by page is a simple starting point, but for more advanced RAG, consider chunking by sections, tables, or headings to preserve context and structure.
- Add metadata: Include section titles, page numbers, or descriptions at the start of each chunk. This helps retrieval and provides context to the LLM.
- Chunk size and overlap: Choose a chunk size that balances context and efficiency. Overlapping chunks can help avoid missing information at boundaries, but too much overlap increases redundancy and cost.
- Avoid too large or too small chunks: Large chunks may exceed context window limits or dilute relevance; small chunks may lose context or coherence.
- Quality over quantity: Avoid including low-quality or irrelevant content in your chunks.
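To make the size and overlap trade-off concrete, here is a small sliding-window sketch. It counts words rather than model tokens, which is a simplification, and chunk_with_overlap is a hypothetical helper name.

```python
def chunk_with_overlap(words, chunk_size, overlap):
    """Split a list of words into windows of chunk_size sharing `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


words = "one two three four five six seven eight nine ten".split()
for window in chunk_with_overlap(words, chunk_size=4, overlap=1):
    print(" ".join(window))
```

A larger overlap lowers the chance of cutting a sentence or table row in half, at the cost of embedding more duplicated text.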
Common Pitfalls:
- Chunking only by fixed size or page without considering logical document structure can lead to poor retrieval results.
- Not including metadata or context can make it harder for the model to understand the chunk’s relevance.
Summary:
Chunking by page is a reasonable baseline, but for a more powerful and accurate RAG system, you should consider chunking by logical sections, adding metadata, and tuning chunk size and overlap for your specific use case. This aligns with the best practices outlined in the OpenAI Cookbook sources see chunking best practices.
If you share your exact chunking code or strategy, I can provide more specific feedback!
@fluid garden can you help me make my RAG more advanced?
Let me check the knowledge base for you again. One moment! :books:
Absolutely! To make your RAG (Retrieval-Augmented Generation) system more advanced, you should focus on improvements in data preparation, retrieval, and answer generation, as recommended in the OpenAI Cookbook. Here are actionable steps, strictly based on the knowledge sources:
1. Advanced Chunking and Data Preparation
- Chunk by logical structure: Instead of just splitting by page or fixed size, chunk your documents by sections, tables, or headings. This preserves context and improves retrieval relevance.
- Add metadata: Prepend each chunk with section titles, page numbers, or descriptions. This helps both retrieval and answer generation.
- Augment content: Use LLMs (like GPT-4) to rephrase, summarize, or add bullet points to each chunk. Add keywords or tags users might search for. For multimodal documents, use models like Whisper or GPT-4V to convert audio/visual content to text and generate tags or descriptions for images and slides.
- Avoid common pitfalls: Don’t include too much low-quality content or overly large documents, as this can hinder performance and increase costs.
2. Hybrid and Multi-step Retrieval
- Hybrid search: Combine semantic search (vector embeddings) with keyword/entity search (BM25 or similar). This leverages both semantic and exact matches for better results.
- Re-ranking: After retrieval, re-rank results using rule-based methods (e.g., recency, tags) or algorithms (e.g., BERT-based, cross-encoder, TF-IDF) to surface the most relevant content.
- Multi-step retrieval: For complex queries, break down the retrieval process into multiple steps or use a chain-of-thought approach, guiding the system through sequential actions if needed.
- Cache outputs: Where possible, cache retrieval results to improve efficiency.
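A minimal caching sketch using functools.lru_cache from the standard library. Here slow_retrieve is a hypothetical stand-in for whatever retrieval call is expensive in your pipeline, and CALLS exists only to show how often it actually runs.

```python
from functools import lru_cache

CALLS = {"count": 0}


def slow_retrieve(query):
    """Stand-in for an expensive retrieval call (hypothetical)."""
    CALLS["count"] += 1
    return (f"result for {query}",)  # a tuple, so cached values are immutable


@lru_cache(maxsize=256)
def cached_retrieve(normalized_query):
    return slow_retrieve(normalized_query)


def retrieve(query):
    # Normalize case and whitespace so near-identical queries share one entry
    return cached_retrieve(" ".join(query.lower().split()))
```

Calling retrieve("Parking Management") and then retrieve("  parking   management ") triggers only one call to slow_retrieve. The cache should be cleared (cached_retrieve.cache_clear()) whenever the underlying index is rebuilt.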
3. Optimized Answer Generation
- Context window management: Only include the most relevant chunks in the LLM’s context window. Be mindful of the maximum context size and avoid the “lost in the middle” problem.
- Prompt engineering: Experiment with different prompts and provide few-shot examples to guide the LLM in using the retrieved context effectively.
- Safety checks: Implement guardrails and use frameworks like RAGAS to evaluate answer faithfulness, relevancy, context precision, and recall.
- Iterative optimization: Try prompt engineering and few-shot examples before resorting to fine-tuning.
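RAGAS computes its metrics with LLM-based judgments. As a much simpler starting point, retrieval precision and recall can be computed against a small hand-labeled set of relevant chunk ids. The function below is that simplified set-based version, not the RAGAS implementation, and the ids are illustrative.

```python
def context_precision_recall(retrieved_ids, relevant_ids):
    """Set-based precision and recall of retrieved chunks against labeled ones."""
    retrieved, relevant = set(retrieved_ids), set(relevant_ids)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Example: five pages retrieved, two pages labeled as relevant by hand
precision, recall = context_precision_recall([1, 7, 2, 5, 6], [2, 3])
print(precision, recall)
```

Tracking these two numbers across experiments shows whether a change to chunking or retrieval actually helped.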
4. Best Practices
- Pre-process content: Add summaries and headers for each part to make it easier for the LLM to consume.
- Curate data sources: Ensure your knowledge base is high-quality and relevant.
- Evaluate and iterate: After each experiment, assess performance and iterate on your approach.
For more details and code examples, refer to the OpenAI Cookbook RAG PDF example.
If you have a specific part of your pipeline you want to improve (e.g., chunking, retrieval, answer generation), let me know and I can provide more targeted guidance!
@fluid garden is this chunked correctly?
Let me check the knowledge base for you again. One moment! :books:
Sorry @ivory magnet, I am experiencing especially high traffic right now, please try again. :technologist:
@fluid garden please check above and tell me whether my chunking is good.
Let me check the knowledge base for you again. One moment! :books:
Based on the knowledge sources, a good chunking strategy for RAG should:
- Chunk by logical boundaries (such as sections, tables, or headings), not just by page or fixed size.
- Add metadata (like section titles, page numbers, or descriptions) to each chunk to provide context and improve retrieval.
- Balance chunk size and overlap to ensure each chunk is neither too large (which can dilute relevance and exceed context limits) nor too small (which can lose important context).
- Avoid low-quality or irrelevant content in your chunks, as this can negatively impact retrieval and answer generation.
If your current chunking is only by page, it is a reasonable starting point, but the knowledge sources recommend more advanced strategies—such as chunking by logical document structure and including metadata—to make your RAG system more powerful and accurate. You should consider these improvements for better results see chunking best practices.
If you provide your exact chunking code or a sample chunk, I can give more specific feedback. Otherwise, based on the information provided, chunking by page is functional but not optimal for advanced RAG use cases.
@fluid garden can you provide code for that?
Let me check the knowledge base for you again. One moment! :books:
Certainly! Here is a code example for advanced chunking of PDF documents, based strictly on the OpenAI Cookbook knowledge sources. This approach chunks by logical boundaries (such as slide titles and descriptions) and adds metadata to each chunk, as recommended in the sources:
# Example: chunking PDF content page by page and adding metadata.
# `docs` is assumed to be a list of dicts with 'text' (pages separated by '\f')
# and 'pages_description' (a list of per-page description strings).
content = []
for doc in docs:
    # Split text by page ('\f' is the page separator); [1:] drops the first page
    text = doc['text'].split('\f')[1:]
    description = doc['pages_description']
    description_indexes = []
    for i in range(len(text)):
        slide_content = text[i] + '\n'
        # Try to find a matching slide description by comparing first lines
        slide_title = text[i].split('\n')[0]
        for j in range(len(description)):
            description_title = description[j].split('\n')[0]
            if slide_title.lower() == description_title.lower():
                slide_content += description[j].replace(description_title, '')
                description_indexes.append(j)
        # Add the slide content + matching slide description to the content pieces
        content.append({
            "title": slide_title,
            "body": slide_content,
            # text[0] is page 2 because page 1 was dropped above
            # (assuming the text does not start with '\f')
            "page_number": i + 2
        })
    # Add slide descriptions that weren't used
    for j in range(len(description)):
        if j not in description_indexes:
            content.append({
                "title": description[j].split('\n')[0],
                "body": description[j],
                "page_number": None
            })

# Print the chunks for inspection
for c in content:
    print(f"Title: {c['title']}\nPage: {c['page_number']}\nContent:\n{c['body']}\n")
    print("\n-------------------------------\n")
This code:
- Splits the document into pages and matches each page to its description by title (the first line of each).
- Adds metadata (title and page number) to each chunk.
- Ensures each chunk is context-rich and ready for embedding or retrieval.
This approach is directly adapted from the OpenAI Cookbook's RAG PDF chunking example.
If your document is not a slide deck, you can adapt the logic to chunk by section headings, tables, or other logical separators relevant to your PDF structure.
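For a document that is not a slide deck, one possible adaptation is to start a new chunk at every heading. The sketch below assumes the extracted text uses markdown-style headings ("#", "##", ...) on their own lines, as the content previews earlier in this thread suggest. chunk_by_headings is a hypothetical helper, not Cookbook code.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_headings(text):
    """Start a new chunk at every markdown-style heading line."""
    chunks = []
    current = {"title": None, "level": 0, "lines": []}

    def flush():
        body = "\n".join(current["lines"]).strip()
        # Skip the initial empty chunk when the text starts with a heading
        if current["title"] is not None or body:
            chunks.append({"title": current["title"], "level": current["level"], "body": body})

    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            current = {"title": match.group(2).strip(), "level": len(match.group(1)), "lines": []}
        else:
            current["lines"].append(line)
    flush()
    return chunks


sample = "intro\n# A\ntext a\n## B\ntext b"
for chunk in chunk_by_headings(sample):
    print(chunk["level"], chunk["title"], repr(chunk["body"]))
```

The heading level is kept so that a sub-section can later be linked back to its parent section if the hierarchy matters for retrieval.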
If you need further customization, please specify your document structure!
@fluid garden this is my RAG. Can you provide a more advanced RAG for my project?
Let me check the knowledge base for you again. One moment! :books:
Certainly! To make your RAG (Retrieval-Augmented Generation) system more advanced, here are concrete, source-based recommendations and technical patterns you can implement, strictly following the OpenAI Cookbook guidance:
1. Advanced Chunking and Data Preparation
- Chunk by Logical Boundaries: Instead of only splitting by page or fixed size, chunk your documents by sections, tables, or headings. This preserves context and improves retrieval relevance. For example, use section titles, table boundaries, or slide headers as chunk delimiters.
- Add Metadata: Prepend each chunk with section titles, page numbers, or descriptions. This helps both retrieval and answer generation.
- Augment Content: Use LLMs (like GPT-4) to rephrase, summarize, or add bullet points to each chunk. Add keywords or tags users might search for. For multimodal documents, use models like Whisper or GPT-4V to convert audio/visual content to text and generate tags or descriptions for images and slides.
- Avoid Common Pitfalls: Don’t include too much low-quality content or overly large documents, as this can hinder performance and increase costs.
"Chunking by logical sections, adding metadata, and augmenting content with summaries or keywords are best practices for RAG data preparation."
See chunking best practices
2. Hybrid and Multi-step Retrieval
- Hybrid Search: Combine semantic search (vector embeddings) with keyword/entity search (BM25 or similar). This leverages both semantic and exact matches for better results.
- Re-ranking: After retrieval, re-rank results using rule-based methods (e.g., recency, tags) or algorithms (e.g., BERT-based, cross-encoder, TF-IDF) to surface the most relevant content.
- Multi-step Retrieval: For complex queries, break down the retrieval process into multiple steps or use a chain-of-thought approach, guiding the system through sequential actions if needed.
- Cache Outputs: Where possible, cache retrieval results to improve efficiency.
"Use a combination of semantic search and deterministic queries where possible. Re-ranking is important to ensure accurate results."
See retrieval and re-ranking patterns
3. Optimized Answer Generation
- Context Window Management: Only include the most relevant chunks in the LLM’s context window. Be mindful of the maximum context size and avoid the “lost in the middle” problem.
- Prompt Engineering: Experiment with different prompts and provide few-shot examples to guide the LLM in using the retrieved context effectively.
- Safety Checks: Implement guardrails and use frameworks like RAGAS to evaluate answer faithfulness, relevancy, context precision, and recall.
- Iterative Optimization: Try prompt engineering and few-shot examples before resorting to fine-tuning.
"Evaluate performance after each experiment. Implement guardrails if applicable. Avoid jumping straight to fine-tuning without trying other approaches."
See answer generation best practices
4. Example: Advanced Chunking Code
Here’s a code snippet for chunking by logical sections and adding metadata, as recommended:
# `docs` is assumed to be a list of dicts with 'text' (pages separated by '\f')
# and 'pages_description' (a list of per-page description strings).
content = []
for doc in docs:
    # [1:] drops the first page
    text = doc['text'].split('\f')[1:]
    description = doc['pages_description']
    description_indexes = []
    for i in range(len(text)):
        slide_content = text[i] + '\n'
        slide_title = text[i].split('\n')[0]
        for j in range(len(description)):
            description_title = description[j].split('\n')[0]
            if slide_title.lower() == description_title.lower():
                slide_content += description[j].replace(description_title, '')
                description_indexes.append(j)
        content.append({
            "title": slide_title,
            "body": slide_content,
            # text[0] is page 2 because page 1 was dropped above
            "page_number": i + 2
        })
    # Keep descriptions that did not match any page title
    for j in range(len(description)):
        if j not in description_indexes:
            content.append({
                "title": description[j].split('\n')[0],
                "body": description[j],
                "page_number": None
            })
See full example and explanation
Summary Workflow
- Parse and chunk your PDF (by logical sections, with metadata).
- Augment and embed each chunk (text + metadata).
- Store embeddings in a vector DB.
- At query time:
- Use hybrid retrieval (semantic + keyword/entity).
- Re-rank results.
- Select top chunks for the LLM context window.
- Generate answer using LLM, with prompt engineering and safety checks.
If you have a specific part of your pipeline you want to improve (e.g., chunking, retrieval, answer generation), let me know and I can provide more targeted guidance!
@fluid garden can you give me code for an advanced RAG that I can connect to my chunks?
Let me check the knowledge base for you again. One moment! :books:
Certainly! Here’s an advanced RAG pipeline code pattern—from chunking to retrieval and answer generation—based strictly on the OpenAI Cookbook knowledge sources. This example assumes you have already chunked your PDF into logical sections with metadata (as in previous code snippets).
1. Embedding Chunks with Metadata
from openai import OpenAI

client = OpenAI(api_key="YOUR_OPENAI_API_KEY")


def embed_chunks(chunks):
    embeddings = []
    for chunk in chunks:
        # Combine metadata and content for embedding
        text_to_embed = f"Title: {chunk['title']}\nPage: {chunk['page_number']}\n{chunk['body']}"
        response = client.embeddings.create(
            model="text-embedding-3-large",
            input=text_to_embed
        )
        embeddings.append({
            "embedding": response.data[0].embedding,
            "metadata": chunk
        })
    return embeddings
This follows the best practice of embedding both content and metadata for improved retrieval
See embedding best practices
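Before moving to a vector database, a small in-memory search over the output of embed_chunks can be enough for testing. The sketch below assumes the list-of-dicts shape that embed_chunks returns ("embedding" and "metadata" keys). similarity_search is a hypothetical helper and uses plain cosine similarity.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_search(query_embedding, embedded_chunks, top_k=5):
    """Rank embedded chunks by cosine similarity to the query embedding."""
    scored = [
        {
            "id": index,
            "score": cosine_similarity(query_embedding, item["embedding"]),
            "metadata": item["metadata"],
        }
        for index, item in enumerate(embedded_chunks)
    ]
    return sorted(scored, key=lambda r: r["score"], reverse=True)[:top_k]


# Toy two-dimensional vectors in place of real embeddings
toy = [
    {"embedding": [1.0, 0.0], "metadata": {"title": "A"}},
    {"embedding": [0.0, 1.0], "metadata": {"title": "B"}},
]
print([r["metadata"]["title"] for r in similarity_search([0.9, 0.1], toy)])
```

A linear scan like this is fine for a single datasheet; a dedicated vector store only becomes necessary as the number of chunks grows.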
2. Hybrid Retrieval (Semantic + Keyword Search) and Re-ranking
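# Sketch of hybrid retrieval. `vector_db` and `keyword_index` are placeholders
# for your own stores; each is assumed to return a list of dicts with "id",
# "score" (normalized to [0, 1]), and "metadata" keys.
def hybrid_retrieve(query, vector_db, keyword_index, alpha=0.5, top_k=5):
    # Semantic search
    query_embedding = client.embeddings.create(
        model="text-embedding-3-large",
        input=query
    ).data[0].embedding
    vector_results = vector_db.similarity_search(query_embedding)
    # Keyword search (e.g., BM25)
    keyword_results = keyword_index.search(query)
    # Merge by chunk id so a chunk found by both searches is scored once,
    # weighting the two scores by alpha and (1 - alpha)
    merged = {}
    for weight, results in ((alpha, vector_results), (1 - alpha, keyword_results)):
        for r in results:
            entry = merged.setdefault(r["id"], {**r, "score": 0.0})
            entry["score"] += weight * r["score"]
    # Re-rank by combined score and keep the top results
    return sorted(merged.values(), key=lambda x: x["score"], reverse=True)[:top_k]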
Combining semantic and keyword search, then re-ranking, is a recommended pattern
See hybrid retrieval and re-ranking
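For the keyword side, a compact BM25 index can be written with the standard library alone. This is a sketch of the usual Okapi BM25 formula with whitespace tokenization. The class name BM25Index, its search method, and the sample documents are illustrative, and a library implementation would normally be preferable in production.

```python
import math
from collections import Counter


class BM25Index:
    def __init__(self, documents, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.tokens = [doc.lower().split() for doc in documents]
        self.term_counts = [Counter(t) for t in self.tokens]
        total = sum(len(t) for t in self.tokens)
        # Fall back to 1.0 to avoid dividing by zero on an empty corpus
        self.avg_len = (total / len(self.tokens) if self.tokens else 0.0) or 1.0
        # Number of documents containing each term
        self.doc_freq = Counter(term for t in self.tokens for term in set(t))

    def _idf(self, term):
        n = self.doc_freq.get(term, 0)
        return math.log((len(self.tokens) - n + 0.5) / (n + 0.5) + 1)

    def search(self, query, top_k=5):
        results = []
        for doc_id, counts in enumerate(self.term_counts):
            length_norm = 1 - self.b + self.b * len(self.tokens[doc_id]) / self.avg_len
            score = 0.0
            for term in query.lower().split():
                tf = counts.get(term, 0)
                score += self._idf(term) * tf * (self.k1 + 1) / (tf + self.k1 * length_norm)
            if score > 0:
                results.append({"id": doc_id, "score": score})
        return sorted(results, key=lambda r: r["score"], reverse=True)[:top_k]


index = BM25Index(["parking management module", "access control module", "visitor management"])
print(index.search("parking management"))
```

Raw BM25 scores are unbounded, so they need to be normalized (for example, divided by the top score) before being mixed with vector similarities in a weighted sum.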
3. Answer Generation with Retrieved Chunks
def generate_answer(query, retrieved_chunks):
    # Concatenate top chunks for context
    context = "\n\n".join(
        [f"Title: {c['metadata']['title']}\nPage: {c['metadata']['page_number']}\n{c['metadata']['body']}" for c in retrieved_chunks]
    )
    prompt = f"Use the following context to answer the question:\n\n{context}\n\nQuestion: {query}\nAnswer:"
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a helpful assistant that answers questions using the provided context only."},
            {"role": "user", "content": prompt}
        ],
        temperature=0.0
    )
    return response.choices[0].message.content
This approach ensures the LLM uses only the retrieved, relevant context for answer generation
See answer generation best practices
Summary Workflow
- Chunk your PDF by logical sections, adding metadata.
- Embed each chunk (content + metadata).
- Store embeddings in a vector database and build a keyword index.
- At query time:
- Retrieve using both semantic and keyword search.
- Re-rank results.
- Pass top chunks as context to the LLM for answer generation.
This code pattern is directly derived from the OpenAI Cookbook’s advanced RAG recommendations and is suitable for connecting with your chunked data.
If you need more details on a specific step (e.g., vector DB setup, keyword index), let me know!