#chat with PDF
hey @dusky iron
prompt = "what is the title of the document?"
# generate an embedding for the input and retrieve the most relevant docs
response = ollama.embed(
    model="nomic-embed-text",
    input=prompt
)
# print(f"Query embedding: {response['embeddings'][:5]}")
results = collection.query(
    query_embeddings=response["embeddings"],
    n_results=5
)
data = results['documents'][0]
for i, d in enumerate(data):
    print(i, " -> ", d)
I checked this using print statements, but even the "relevant searches" are bad
so out of the whole code, the problem is here
any suggestions?
don't use embeddings to try to find the title of a document
I mean, the prompt could be anything, this is just an example
https://ollama.com/blog/embedding-models I am following this, actually
because similarity of "what is the title of the document?" to "Some Article Title is All you need" will ALWAYS be low
also, what are you embedding?
it's really difficult for me to find a problem in the code if I see only 10 lines of it
I am creating chunks from the PDF files, and then converting them into embeddings
I have also sent the whole code!
just above it
ok but what are the queries you are running?
because the one you showed me so far is just bad
so for now!
the pdf has a short story in it, with the title on the first page
so the query is just to fetch the title of the document
it will never fetch that
because of this
ohh, so what should I do?
I mean, what is the ideal workflow?
the query you are trying to use just won't work with semantic search
I mean... maybe to some extent it could since titles do often have some special properties
but titles also usually do not end with sentence punctuation, so a split on '(?<=[.!?]) +' will not separate them into their own chunk
have you looked at your chunks?
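A minimal sketch of what a splitter built on that regex does to a title page. The sample text is abbreviated from the first chunk quoted in this thread. The assumption that the chunker is a plain `re.split` on this pattern is a guess, since the full chunking code is not shown here.

```python
import re

# Assumed chunker: split after sentence-ending punctuation followed by spaces.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?]) +")

# Title page text followed by the opening of the story (abbreviated).
text = (
    "IN THE PENAL COLONY BY FRANZ KAFKA TRANSLATED BY IAN JOHNSTON 1919 "
    "IN THE PENAL COLONY “It’s a peculiar apparatus,” said the Officer "
    "to the Traveler. Of course, interest in the execution was not very high."
)

chunks = SENTENCE_BOUNDARY.split(text)
for i, chunk in enumerate(chunks):
    print(i, "->", chunk)
```

The title has no `.`, `!` or `?` after it, so it stays glued to the first sentence of the story rather than becoming a chunk of its own.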
lemme share a few
if you want to find a title specifically, you might need to generate / copy a bunch of titles, embed those, find the mean
1 -> It appeared that the Traveler had responded to the invitation of the Commandant only out of politeness, when he had been invited to attend the execution of a soldier condemned for disobeying and insulting his superior.
2 -> Of course, interest in the execution was not very high, not even in the penal colony itself.
3 -> At least, here in the small, deep, sandy valley, closed in on all sides by barren slopes, apart from the Officer and the Traveler there were present only the Condemned, a vacant-looking man with a broad mouth and dilapidated hair and face, and the Soldier, who held the heavy chain to which were connected the small chains which bound the Condemned Man by his feet and wrist bones, as well as by his neck, and which were also linked to each other by connecting chains.
4 -> The Condemned Man had an expression of such dog-like resignation that it looked as if one could set him free to roam around the slopes and would only have to whistle at the start of the execution for him to return.
5 -> The Traveler had little interest in the apparatus and walked back and forth behind the Condemned Man, almost visibly indifferent, while the Officer took care of the final preparations.
so these are the chunks before converting them into embeddings
yeah. so embeddings are not magic
you can compute cosine similarity between all of these
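A small sketch of that check in plain Python. The toy vectors stand in for real chunk embeddings. With the actual pipeline you would pass in the vectors returned by the embedding model instead.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_matrix(vectors: Sequence[Sequence[float]]) -> List[List[float]]:
    """Pairwise cosine similarity between every pair of vectors."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


# Toy stand-ins for chunk embeddings (illustrative values only).
toy = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
matrix = similarity_matrix(toy)
for row in matrix:
    print([round(value, 3) for value in row])
```

Looking at the similarity between the query embedding and each chunk embedding the same way shows directly how far the title chunk is from the question.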
but the main goal was to chat with a PDF using llama models
also, just fyi, vector databases return only approximate top results
yeah, I noticed this also
you can, just not about the title
if you want to be able to do that, you will need to do more preprocessing either on the query or the text itself
e.g. just adding <title>IN THE PENAL COLONY BY FRANZ KAFKA TRANSLATED BY IAN JOHNSTON 1919 IN THE PENAL COLONY</title> would probably work
or using HyDE might work
then you should probably start asking for more top results and then reweighing them
so the current setup is:
-> I am converting sentences (chunks) into embeddings
-> storing them in a vector database
-> retrieving the relevant docs (which is going wrong)
-> and then adding those as context to the query for the llama model
but HyDE seems interesting
but the goal should remain the same
that we have to chat with PDFs
so the user will just upload a PDF and then start asking questions about it (using a llama model)
great. well, then you need to improve every single part of your pipeline
because most of this stuff works roughly 50% of the time in the default configuration
does HyDE also need embeddings?
because here they are just using the OpenAI API
also thanks for your time!
it uses this
yes
and?
only if you want to have more pain
it's called reading
this abstract contains the whole technique
the whole implementation is in 2 files, with like 20 relevant lines of code at most
this is good to understand
wait, what is the generated document?
and how is it being generated?
is this due to Contriever?
in its default form HyDE generates a document to answer your query
which is not what you want at all
you can use langchain blindly following braindead medium articles if you want, but you will only encounter pain ;)
ohh so what's the best way?
thinking
for yourself
reading
and understanding what the code you are using is doing
but why did it generate a different document then?
so that it can embed them
average the embeddings
and use that average embedding as a surrogate query
Q -> Q*
and search for that query instead
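The steps above can be sketched roughly as follows. This is not the reference HyDE implementation. `generate` and `embed` are hypothetical callables you would back with your own LLM and embedding model (for example the `ollama.embed` call from the start of this thread), and the fake versions below exist only so the sketch runs on its own.

```python
from typing import Callable, List

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Element-wise average of equally sized vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def hyde_query_embedding(
    query: str,
    generate: Callable[[str], str],   # hypothetical: LLM that writes a made-up answer
    embed: Callable[[str], Vector],   # hypothetical: embedding model wrapper
    n: int = 4,
) -> Vector:
    """Q -> Q*: embed n hypothetical answers and average them."""
    hypothetical_docs = [generate(query) for _ in range(n)]
    return mean_vector([embed(doc) for doc in hypothetical_docs])


# Stand-ins so the sketch is runnable without a model.
def fake_generate(query: str) -> str:
    return "A COMPREHENSIVE OVERVIEW OF MACHINE LEARNING APPLICATIONS"


def fake_embed(text: str) -> Vector:
    return [float(len(text)), float(text.count(" "))]


surrogate = hyde_query_embedding(
    "what is the title of the document?", fake_generate, fake_embed, n=3
)
print(surrogate)
```

The surrogate vector is then used in place of the embedding of the original question, for example as `query_embeddings=[surrogate]` in the `collection.query` call shown earlier.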
ahh I am still confused about this
if you embed A COMPREHENSIVE OVERVIEW OF MACHINE LEARNING APPLICATIONS, you are much more likely to find 0 -> IN THE PENAL COLONY BY FRANZ KAFKA TRANSLATED BY IAN JOHNSTON 1919 IN THE PENAL COLONY “It’s a peculiar apparatus,” said the Officer to the Traveler, gazing with a certain admiration at the device, with which he was, of course, thoroughly familiar.
which is wrong?
which is correct
your question was "what is the title of the document?"
I embed a hypothetical answer A COMPREHENSIVE OVERVIEW OF MACHINE LEARNING APPLICATIONS
and find the real answer
do I need to do this every time?
idk, you need to test and see
if you want to be able to answer "what is the title of the document?", you need to do something to improve the quality of search
I gave you a couple ideas
there is no medium article that will teach you how to do it with high accuracy
first of all because medium article writers are all slop peddlers nowadays
and second because it is actually really hard and people who get to say 90% accuracy for such tasks won't be spilling the beans
yes
can you give me an abstract workflow for this?
you gave lots of ideas
you generate N hypothetical answers to that query
where to start?
so I should implement this HyDE right? I mean first test it
you should start by implementing some kind of test suite, ideally.
because testing on 1 query that is difficult for semantic search is kind of a bad idea
most people will know what the title is
and they will be asking about the contents of the story
yeah
so start by making a set of questions, documents, and answers
like most queries like
give me context of the document
what is in the document
and all
how can I create documents then?
summarization is also impossible with semantic search btw
just download some pdfs
find/download/use existing ones in your system idk
ohkk , got it
so I do have a few PDFs to test with
so as you said
I create questions (on my own) and the appropriate answers for them
right?
yes
and how will it help?
and some questions that are irrelevant/can't be answered in the context of the document
by giving you an estimate of the quality
so that when you change things, you can see whether the quality improved
and not do things blindly
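One possible shape for such a test set, as a sketch. The `EvalCase` class and the file name are made up for illustration. The two answerable questions are based on the chunks quoted earlier in this thread, and `answer=None` marks a question the document cannot answer.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class EvalCase:
    question: str
    document: str           # path or ID of the PDF the question is about
    answer: Optional[str]   # None means "cannot be answered from this document"


# Hypothetical file name; replace with your own PDFs.
DOC = "in_the_penal_colony.pdf"

EVAL_SET: List[EvalCase] = [
    EvalCase("Who invited the Traveler to attend the execution?", DOC,
             "the Commandant"),
    EvalCase("What was the soldier condemned for?", DOC,
             "disobeying and insulting his superior"),
    EvalCase("What is the capital of France?", DOC, None),
]

answerable = [case for case in EVAL_SET if case.answer is not None]
unanswerable = [case for case in EVAL_SET if case.answer is None]
print(len(answerable), "answerable,", len(unanswerable), "unanswerable")
```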
and what comes after that?
because this is just manual work, which will take me a few minutes
after that you start making changes
changes? where?
assuming I have just created some questions (relevant and non-relevant) and their answers
read this
it doesn't tell you exactly what to do, but should give you some good ideas
you will be evaluating on that set of triples, which means that ideally you will need a way to quickly test your system. which means you will need a way to compare answers with your current system.
you can do it manually, but that will be somewhat painful
so ideally you would use a sufficiently smart LLM as a judge
or you would just pattern match answers somehow/use semantic similarity on those
the end goal is to have a way to evaluate performance of your RAG
until you have that, all improvements are largely meaningless, since it is impossible to tell if they even work
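A rough sketch of the pattern-matching variant of that evaluation loop. Everything here is an assumption for illustration: `answer_fn` stands for your whole RAG pipeline, the refusal phrase is whatever your prompt tells the model to say when the context has no answer, and the substring judge is the crudest possible check. An LLM judge would be swapped in by passing a different `judge` function.

```python
from typing import Callable, List, Optional, Tuple

# (question, expected answer, or None if the document cannot answer it)
Case = Tuple[str, Optional[str]]

# Assumed refusal phrase the RAG prompt asks the model to use.
REFUSAL = "i don't know"


def substring_judge(expected: Optional[str], actual: str) -> bool:
    """Correct if the expected text appears in the answer (or a refusal is given)."""
    actual = actual.lower()
    if expected is None:
        return REFUSAL in actual
    return expected.lower() in actual


def accuracy(
    cases: List[Case],
    answer_fn: Callable[[str], str],
    judge: Callable[[Optional[str], str], bool] = substring_judge,
) -> float:
    """Fraction of cases the system answers correctly according to the judge."""
    if not cases:
        raise ValueError("need at least one case")
    correct = sum(judge(expected, answer_fn(question)) for question, expected in cases)
    return correct / len(cases)


# Stand-in for the real pipeline so the sketch runs on its own.
def fake_rag(question: str) -> str:
    canned = {"Who invited the Traveler?": "The Commandant invited him."}
    return canned.get(question, "I don't know.")


cases: List[Case] = [
    ("Who invited the Traveler?", "the Commandant"),
    ("What was the soldier condemned for?", "disobeying"),
    ("What is the capital of France?", None),
]
score = accuracy(cases, fake_rag)
print(f"accuracy: {score:.2f}")
```

Rerunning this after every change to chunking, retrieval or prompting gives one number to compare against the previous run.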
if you want to evaluate JUST the search, you can instead do something different
you can collect a list of (query, document, relevant passage and its location)
and then measure the performance of your search
ohhk that's cool then
in particular, using https://en.wikipedia.org/wiki/Mean_reciprocal_rank or any other method
The mean reciprocal rank is a statistic measure for evaluating any process that produces a list of possible responses to a sample of queries, ordered by probability of correctness. The reciprocal rank of a query response is the multiplicative inverse of the rank of the first correct answer: 1 for first place, 1⁄2 for second place, 1⁄3 for third ...
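A minimal sketch of mean reciprocal rank for the search-only evaluation. The chunk IDs are invented. In practice each run would pair the IDs your vector database returned for a query, in order, with the IDs of the passages you marked as relevant.

```python
from typing import List, Sequence, Set, Tuple


def reciprocal_rank(ranked_ids: Sequence[str], relevant_ids: Set[str]) -> float:
    """1/rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: List[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average reciprocal rank over all (ranking, relevant set) pairs."""
    if not runs:
        raise ValueError("need at least one query")
    return sum(reciprocal_rank(ranked, relevant) for ranked, relevant in runs) / len(runs)


# Invented chunk IDs: (search results in rank order, relevant chunk IDs).
runs = [
    (["c3", "c0", "c7"], {"c0"}),  # first relevant hit at rank 2
    (["c5", "c1", "c2"], {"c5"}),  # first relevant hit at rank 1
    (["c4", "c8", "c9"], {"c0"}),  # relevant chunk not retrieved
]
mrr = mean_reciprocal_rank(runs)
print(mrr)
```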
so now I will take a document and try the HyDE method
so considering these example websites, which methods do they use?
because this one also fails to retrieve the title of the document, which is fine, because it is able to explain the story in the document
well if I was doing it from scratch, I would either use docling + custom chunker + semantic search + keyword search
or... gemini flash
yes
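The thread does not say how the semantic and keyword results would be combined. One common option is reciprocal rank fusion (RRF), sketched below as an assumption rather than as the method meant above. The chunk IDs are invented, and `k=60` is the constant usually quoted for RRF.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Merge several ranked lists; each list adds 1 / (k + rank) to a document's score."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Invented results from the two searches, best match first.
semantic_hits = ["c3", "c0", "c7"]
keyword_hits = ["c0", "c9", "c3"]

fused = reciprocal_rank_fusion([semantic_hits, keyword_hits])
print(fused)
```

Chunks that both searches rank highly end up at the top, while a chunk found by only one search still stays in the list.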