chat with PDF | Learn AI Together | Page 1

hollow grail Feb 9, 2025, 6:59 PM

#

the embeddings are wrong, I guess

#

https://paste.pythondiscord.com/KKHQ

#

hey @dusky iron

#

prompt = "what is the title of the document?"

# generate an embedding for the input and retrieve the most relevant doc
response = ollama.embed(
  model="nomic-embed-text",
  input=prompt
)
# print(f"Query embedding: {response['embeddings'][:5]}")

results = collection.query(
  query_embeddings=response["embeddings"],
  n_results=5
)
data = results['documents'][0]
for i, d in enumerate(data):
    print(i, " -> ", d)

#

I checked this using print statements, but even the "relevant searches" are bad

#

so the problem is here, out of the whole code

#

any suggestions?

dusky iron Feb 9, 2025, 7:01 PM

#

don't use embeddings to try to find the title of a document

hollow grail Feb 9, 2025, 7:01 PM

#

so I mean, the prompt could be anything, this is just an example

#

https://ollama.com/blog/embedding-models I am following this, actually

dusky iron Feb 9, 2025, 7:01 PM

#

because similarity of "what is the title of the document?" to "Some Article Title is All you need" will ALWAYS be low

#

also, what are you embedding?

#

it's really difficult for me to find a problem in the code if I see only 10 lines of it

hollow grail Feb 9, 2025, 7:02 PM

#

I am creating chunks from the PDF files, and then converting into embeddings

hollow grail Feb 9, 2025, 7:03 PM

#

dusky iron it's really difficult for me to find a problem in the code if I see only 10 line...

I have also sent the whole code!

#

just above it

dusky iron Feb 9, 2025, 7:03 PM

#

ok but what are the queries you are running?

#

because the one you showed me so far is just bad

hollow grail Feb 9, 2025, 7:04 PM

#

so for now, the PDF has a short story in it, with the first page as its title

so the query is just to fetch the title of the document

dusky iron Feb 9, 2025, 7:05 PM

#

it will never fetch that

dusky iron Feb 9, 2025, 7:05 PM

#

dusky iron because similarity of "what is the title of the document?" to "Some Article Titl...

because of this

hollow grail Feb 9, 2025, 7:05 PM

#

ohh, so what should I do?
I mean the ideal workflow?

dusky iron Feb 9, 2025, 7:05 PM

#

the query you are trying to use just won't work with semantic search

#

I mean... maybe to some extent it could since titles do often have some special properties

#

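but titles also usually do not end with anything that '(?<=[.!?]) +' matches (the spaces after a '.', '!' or '?'), so a sentence splitter built on that pattern won't cut the title off from the text that follows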

#

have you looked at your chunks?

hollow grail Feb 9, 2025, 7:07 PM

#

lemme share a few

dusky iron Feb 9, 2025, 7:08 PM

#

if you want to find a title specifically, you might need to generate / copy a bunch of titles, embed those, find the mean

hollow grail Feb 9, 2025, 7:09 PM

#

1  ->  It appeared that the Traveler had responded to the invitation of the Commandant only out of politeness, when he had been invited to attend the execution of a soldier condemned for disobeying and insulting his superior.
2  ->  Of course, interest in the execution was not very high, not even in the penal colony itself.
3  ->  At least, here in the small, deep, sandy valley, closed in on all sides by barren slopes, apart from the Officer and the Traveler ther e were present only the Condemned, a vacant -looking man with a b road mouth and dilapidated hair and face, and the Soldier, who held the heavy chain to which were connected the small chains which bound the Condemned Man by his feet and wrist bones, as well as by his neck, and which were also linked to each other by connecting chains.
4  ->  The Condemned Man had an expression of such dog -like resignation that it looked as if one could set him free to roam around the slopes and would only have to whistle at the start of the execution for him to return.
5  ->  The Traveler had little i nterest in the apparatus and walked back and forth behind the Condemned Man, almost visibly indifferent, while the Officer took care of the final preparations.

#

so these are the chunks before converting them into embeddings

dusky iron Feb 9, 2025, 7:09 PM

#

yeah. so embeddings are not magic

#

you can compute cosine similarity between all of these
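
(A minimal sketch of that check, assuming the chunk embeddings are already available as plain lists of floats. The toy vectors and helper names below are made up for illustration; real `nomic-embed-text` vectors would be much longer.)

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)

def pairwise_similarities(vectors):
    """Return an n x n matrix of cosine similarities."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]

# Toy 3-dimensional vectors standing in for real chunk embeddings.
toy_vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
matrix = pairwise_similarities(toy_vectors)
for row in matrix:
    print([round(value, 3) for value in row])
```

Running the same comparison on the query embedding and the chunk embeddings would show directly how far "what is the title of the document?" sits from every chunk.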

hollow grail Feb 9, 2025, 7:10 PM

#

but the main goal was, to chat with PDF using llama models

dusky iron Feb 9, 2025, 7:10 PM

#

also, just fyi, vector databases are also returning approximate top results

hollow grail Feb 9, 2025, 7:10 PM

#

dusky iron also, just fyi, vector databases are also returning _approximate top results_

yeah, I noticed this also

dusky iron Feb 9, 2025, 7:11 PM

#

hollow grail but the main goal was, to chat with PDF using llama models

you can, just not about the title

#

if you want to be able to do that, you will need to do more preprocessing either on the query or the text itself

#

e.g. just adding <title>IN THE PENAL COLONY BY FRANZ KAFKA TRANSLATED BY IAN JOHNSTON 1919 IN THE PENAL COLONY</title> would probably work
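
(A rough sketch of that preprocessing step, assuming, as in this PDF, that the first page holds nothing but the title. `add_title_chunk` and its arguments are hypothetical names, not part of any library.)

```python
def add_title_chunk(first_page_text, chunks):
    """Prepend an explicitly tagged title chunk to the list of text chunks."""
    # Collapse line breaks and repeated spaces left over from PDF extraction.
    title = " ".join(first_page_text.split())
    return [f"<title>{title}</title>"] + list(chunks)

# Assumed raw text of the first page; the line breaks are illustrative.
first_page = "IN THE PENAL COLONY\nBY FRANZ KAFKA\nTRANSLATED BY IAN JOHNSTON"
story_chunks = [
    "Of course, interest in the execution was not very high, "
    "not even in the penal colony itself.",
]
tagged_chunks = add_title_chunk(first_page, story_chunks)
print(tagged_chunks[0])
```

The tagged chunk is then embedded and stored like any other chunk, so a query that mentions "title" has at least one chunk containing that word.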

#

or using HyDE might work

dusky iron Feb 9, 2025, 7:13 PM

#

hollow grail yeah, I noticed this also

then you should probably start asking for more top results and then reweighting them
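
(One possible way to do that, as a sketch only: over-fetch candidates, then re-order them with a blended score. The blend of vector distance and keyword overlap, the `alpha` weight, and the function names are all assumptions for illustration, not a recommended formula.)

```python
import re

def keyword_overlap(query, text):
    """Fraction of distinct query words that also occur in the text."""
    query_words = set(re.findall(r"\w+", query.lower()))
    text_words = set(re.findall(r"\w+", text.lower()))
    return len(query_words & text_words) / len(query_words) if query_words else 0.0

def rerank(query, documents, distances, top_k=5, alpha=0.5):
    """Re-order over-fetched candidates; a smaller distance means a closer match."""
    scored = []
    for doc, distance in zip(documents, distances):
        vector_score = 1.0 / (1.0 + distance)  # map distance into (0, 1]
        score = alpha * vector_score + (1 - alpha) * keyword_overlap(query, doc)
        scored.append((score, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]

# Toy candidates and distances standing in for an over-fetched query result.
candidates = [
    "The Traveler had little interest in the apparatus.",
    "The Soldier held the heavy chain.",
]
toy_distances = [0.40, 0.45]
reranked = rerank("who held the heavy chain", candidates, toy_distances, top_k=2)
print(reranked[0])
```

With the query code above, this would mean raising `n_results` well past 5 and passing the returned documents and their distances through `rerank` before building the prompt.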

hollow grail Feb 9, 2025, 7:15 PM

#

so in the current scenario:

-> I am converting sentences (chunks) into embeddings
-> storing them in a vector database
-> retrieving relevant docs (which is going wrong)
-> and then adding those as data to the query for the llama model

#

but HyDE seems to be interesting

#

https://medium.com/@juanc.olamendy/revolutionizing-retrieval-the-mastering-hypothetical-document-embeddings-hyde-b1fc06b9a6cc

#

but the goal should remain the same:
we have to chat with PDFs

dusky iron Feb 9, 2025, 7:17 PM

#

https://arxiv.org/pdf/2212.10496

hollow grail Feb 9, 2025, 7:17 PM

#

so the user will just upload the PDF and then start asking it questions (answered using a llama model)

dusky iron Feb 9, 2025, 7:17 PM

#

great. well, then you need to improve every single part of your pipeline

#

because most of this stuff works roughly 50% of the time in the default configuration

hollow grail Feb 9, 2025, 7:18 PM

#

does HyDE also need embeddings?

#

because here they are just using the OpenAI API

#

also thanks for your time!

#

it uses this

dusky iron Feb 9, 2025, 7:20 PM

#

hollow grail does HyDE also need embeddings?

yes

dusky iron Feb 9, 2025, 7:21 PM

#

hollow grail because here they are just using the OpenAI API

and?

hollow grail Feb 9, 2025, 7:21 PM

#

got it

#

I should use langchain here

dusky iron Feb 9, 2025, 7:21 PM

#

only if you want to have more pain

hollow grail Feb 9, 2025, 7:22 PM

#

come on 🙂

#

then what's the other way?

dusky iron Feb 9, 2025, 7:22 PM

#

it's called reading

#

this abstract contains the whole technique

#

https://github.com/langchain-ai/langchain/tree/bc5fafa20e0e6eb6619e5f025863d6595a8cdf10/libs/langchain/langchain/chains/hyde

#

the whole implementation is in 2 files, with like 20 relevant lines of code at most

hollow grail Feb 9, 2025, 7:25 PM

#

dusky iron

this is good to understand

#

wait, what is the generated document?

#

and how is it being generated?
is this due to Contriever?

dusky iron Feb 9, 2025, 7:27 PM

#

in its default form HyDE generates a document to answer your query

#

which is not what you want at all

#

you can use langchain blindly following braindead medium articles if you want, but you will only encounter pain ;)

hollow grail Feb 9, 2025, 7:28 PM

#

ohh, so what's the best way?

dusky iron Feb 9, 2025, 7:29 PM

#

thinking

#

for yourself

#

reading

#

and understanding what the code you are using is doing

hollow grail Feb 9, 2025, 7:30 PM

#

dusky iron in its default form HyDE generates a document to answer your query

but why did it generate a different document then?

dusky iron Feb 9, 2025, 7:30 PM

#

so that it can embed them

#

average the embeddings

#

and use that average embedding as a surrogate query

#

Q -> Q*

#

and search for that query instead

hollow grail Feb 9, 2025, 7:32 PM

#

ahh, I am still confused about this

dusky iron Feb 9, 2025, 7:32 PM

#

https://chatgpt.com/share/67a902c1-1efc-800d-b20e-6633acbd1ded

#

if you embed A COMPREHENSIVE OVERVIEW OF MACHINE LEARNING APPLICATIONS, you are much more likely to find 0 -> IN THE PENAL COLONY BY FRANZ KAFKA TRANSLATED BY IAN JOHNSTON 1919 IN THE PENAL COLONY “It’s a peculiar apparatus,” said the Officer to the Traveler, gazing with a certain admiration at the device, with which he was, of course, thoroughly familiar.

hollow grail Feb 9, 2025, 7:34 PM

#

which is wrong?

dusky iron Feb 9, 2025, 7:34 PM

#

which is correct

#

your question was "what is the title of the document?"

#

I embed a hypothetical answer A COMPREHENSIVE OVERVIEW OF MACHINE LEARNING APPLICATIONS

#

and find the real answer

hollow grail Feb 9, 2025, 7:35 PM

#

do I need to do this every time?

dusky iron Feb 9, 2025, 7:35 PM

#

idk, you need to test and see

#

if you want to be able to answer "what is the title of the document?", you need to do something to improve the quality of search

#

I gave you a couple ideas

#

there is no medium article that will teach you how to do it with high accuracy

#

first of all because medium article writers are all slop peddlers nowadays

#

and second because it is actually really hard and people who get to say 90% accuracy for such tasks won't be spilling the beans

hollow grail Feb 9, 2025, 7:38 PM

#

ohkk

#

but I am just confused, as you said langchain will increase the pain

dusky iron Feb 9, 2025, 7:39 PM

#

yes

hollow grail Feb 9, 2025, 7:39 PM

#

can you give me an abstract workflow for this?

dusky iron Feb 9, 2025, 7:39 PM

#

I already did?

#

you start with a query

hollow grail Feb 9, 2025, 7:39 PM

#

you gave lots of

dusky iron Feb 9, 2025, 7:39 PM

#

you generate N hypothetical answers to that query

hollow grail Feb 9, 2025, 7:39 PM

#

where to start?

dusky iron Feb 9, 2025, 7:39 PM

#

you average their embeddings out

#

you use that average embedding to search
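
(The four steps above can be sketched as follows. `generate_answer` and `embed` are placeholders for whatever LLM and embedding calls the pipeline uses; the canned answers and 2-dimensional "embeddings" are stand-ins so the sketch runs without a model, and the second canned title is made up.)

```python
def mean_vector(vectors):
    """Element-wise average of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

def hyde_query_embedding(query, generate_answer, embed, n_hypotheses=4):
    """Q -> Q*: average the embeddings of N hypothetical answers to the query."""
    hypotheses = [generate_answer(query) for _ in range(n_hypotheses)]
    return mean_vector([embed(text) for text in hypotheses])

# Stand-ins for a real generator and embedder (assumptions for illustration).
canned_answers = iter([
    "A COMPREHENSIVE OVERVIEW OF MACHINE LEARNING APPLICATIONS",
    "AN INTRODUCTION TO INFORMATION RETRIEVAL",
])
toy_embeddings = {
    "A COMPREHENSIVE OVERVIEW OF MACHINE LEARNING APPLICATIONS": [1.0, 0.0],
    "AN INTRODUCTION TO INFORMATION RETRIEVAL": [0.0, 1.0],
}

q_star = hyde_query_embedding(
    "what is the title of the document?",
    generate_answer=lambda query: next(canned_answers),
    embed=toy_embeddings.__getitem__,
    n_hypotheses=2,
)
print(q_star)  # the surrogate query vector to search with
```

In the pipeline from the top of this thread, `q_star` would presumably be wrapped in a list and passed as `query_embeddings` to `collection.query` in place of the embedding of the raw question.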

hollow grail Feb 9, 2025, 7:40 PM

#

so I should implement this HyDE right? I mean first test it

dusky iron Feb 9, 2025, 7:40 PM

#

you should start by implementing some kind of test suite, ideally.

#

because testing on 1 query that is difficult for semantic search is kind of a bad idea

#

most people will know what the title is

#

and they will be asking about the contents of the story

hollow grail Feb 9, 2025, 7:42 PM

#

dusky iron and they will be asking about the contents of the story

yeah

dusky iron Feb 9, 2025, 7:42 PM

#

so start by making a set of questions, documents, and answers

hollow grail Feb 9, 2025, 7:42 PM

#

like most queries like

give me context of the document
what is in the document

and all

hollow grail Feb 9, 2025, 7:43 PM

#

dusky iron so start by making a set of questions, documents, and answers

how can I create documents then?

dusky iron Feb 9, 2025, 7:43 PM

#

summarization is also impossible with semantic search btw

dusky iron Feb 9, 2025, 7:43 PM

#

hollow grail how can I create documents then?

just download some pdfs

#

find/download/use existing ones in your system idk

hollow grail Feb 9, 2025, 7:44 PM

#

ohkk, got it
so I do have a few PDFs to test with

so as you said,
create questions (on my own) and appropriate answers for them

right?

dusky iron Feb 9, 2025, 7:44 PM

#

yes

hollow grail Feb 9, 2025, 7:44 PM

#

and how will it help?

dusky iron Feb 9, 2025, 7:44 PM

#

and some questions that are irrelevant/can't be answered in the context of the document

#

by giving you an estimate of the quality

#

so that when you change things, you can see whether the quality improved

#

and not do things blindly

hollow grail Feb 9, 2025, 7:46 PM

#

and what comes after that, actually?
because this is just manual work, which will take me a few minutes

dusky iron Feb 9, 2025, 7:46 PM

#

after that you start making changes

hollow grail Feb 9, 2025, 7:47 PM

#

changes? where?

#

assuming I have just created some questions (relevant and non-relevant) and answers too

dusky iron Feb 9, 2025, 7:48 PM

#

https://hamel.dev/blog/posts/llm-judge/

Creating a LLM-as-a-Judge That Drives Business Results –

A step-by-step guide with my learnings from 30+ AI implementations.

#

read this

#

it doesn't tell you exactly what to do, but should give you some good ideas

dusky iron Feb 9, 2025, 7:50 PM

#

hollow grail assuming, I have just created some question ( relevant and non-relevant ) and an...

you will be evaluating on that set of triples, which means that ideally you will need a way to quickly test your system. which means you will need a way to compare answers with your current system.

#

you can do it manually, but that will be somewhat painful

#

so ideally you would use a sufficiently smart LLM as a judge

#

or you would just pattern match answers somehow/use semantic similarity on those

#

the end goal is to have a way to evaluate performance of your RAG

#

until you have that, all improvements are largely meaningless, since it is impossible to tell if they even work
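
(A small sketch of the "pattern match answers" option, assuming test cases are stored as question / document / expected-answer triples. `answer_fn` is a placeholder for the whole RAG pipeline, the canned answers below are fake, and substring matching is only a crude proxy that an LLM judge or semantic similarity could replace.)

```python
import re

def normalize(text):
    """Lowercase the text and keep only word characters, single-spaced."""
    return " ".join(re.findall(r"\w+", text.lower()))

def answer_matches(expected, actual):
    """True if the normalized expected answer appears inside the actual answer."""
    return normalize(expected) in normalize(actual)

def evaluate(cases, answer_fn):
    """Fraction of cases where the system's answer contains the expected one."""
    if not cases:
        return 0.0
    hits = sum(
        answer_matches(case["expected"], answer_fn(case["document"], case["question"]))
        for case in cases
    )
    return hits / len(cases)

# Two triples built from the chunks shared earlier; the file name is made up.
cases = [
    {"document": "penal_colony.pdf", "question": "Who held the heavy chain?",
     "expected": "the Soldier"},
    {"document": "penal_colony.pdf", "question": "Why was the soldier condemned?",
     "expected": "disobeying and insulting his superior"},
]

# Fake system under test: one right answer, one wrong answer.
fake_answers = {
    "Who held the heavy chain?": "The Soldier held the heavy chain.",
    "Why was the soldier condemned?": "I don't know.",
}
accuracy = evaluate(cases, lambda document, question: fake_answers[question])
print(accuracy)
```

Re-running `evaluate` after each change to chunking, retrieval, or prompting gives a number to compare against, which is the point of building the set first.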

#

if you want to evaluate JUST the search, you can instead do something different

#

you can collect a list of (query, document, relevant passage and its location)

#

and then measure the performance of your search

hollow grail Feb 9, 2025, 7:54 PM

#

ohhk that's cool then

dusky iron Feb 9, 2025, 7:55 PM

#

in particular, using https://en.wikipedia.org/wiki/Mean_reciprocal_rank or any other method

Mean reciprocal rank

The mean reciprocal rank is a statistic measure for evaluating any process that produces a list of possible responses to a sample of queries, ordered by probability of correctness. The reciprocal rank of a query response is the multiplicative inverse of the rank of the first correct answer: 1 for first place, 1⁄2 for second place, 1⁄3 for third ...
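
(A minimal sketch of that metric for a search-only evaluation, assuming each test query has exactly one relevant chunk. The chunk ids and rankings are invented for illustration.)

```python
def reciprocal_rank(ranked_ids, relevant_id):
    """1 / rank of the relevant item, or 0.0 if it was not retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == relevant_id:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (ranked_ids, relevant_id) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ranked, relevant) for ranked, relevant in runs) / len(runs)

# Three toy queries: relevant chunk ranked 2nd, ranked 1st, and missing.
runs = [
    (["c3", "c1", "c7"], "c1"),
    (["c2", "c5"], "c2"),
    (["c4", "c9"], "c8"),
]
mrr = mean_reciprocal_rank(runs)
print(mrr)
```

Here the reciprocal ranks are 1/2, 1 and 0, so the mean comes out at 0.5.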

hollow grail Feb 9, 2025, 7:55 PM

#

so I will now take a document and try the HyDE method

dusky iron Feb 9, 2025, 7:55 PM

#

https://nlp.stanford.edu/IR-book/information-retrieval-book.html

hollow grail Feb 9, 2025, 7:57 PM

#

https://smallpdf.com/

#

so considering example websites like this one, which methods do they use?

#

because this one also fails to retrieve the title of the document, which is fine, because it is able to explain the story in the document

dusky iron Feb 9, 2025, 8:05 PM

#

well if I was doing it from scratch, I would either use docling + custom chunker + semantic search + keyword search
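
(One common way to combine a semantic ranking with a keyword ranking is reciprocal rank fusion. That technique is swapped in here as an illustration and is not named in this chat; `k = 60` is a conventional default, and the chunk ids are invented.)

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked id lists; each list adds 1 / (k + rank) per item."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Toy rankings standing in for the two search back ends.
semantic_ranking = ["a", "b", "c"]
keyword_ranking = ["c", "a", "d"]
fused = reciprocal_rank_fusion([semantic_ranking, keyword_ranking])
print(fused)
```

Chunks that appear near the top of both lists rise to the top of the fused list, while chunks found by only one search still get a chance.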

#

or... gemini flash

hollow grail Feb 9, 2025, 8:06 PM

#

wait what gemini flash?

#

I read it has a long context length as well?

dusky iron Feb 9, 2025, 8:08 PM

#

yes
