#chat with PDF

hollow grail
#

the embeddings are wrong, I guess

#

hey @dusky iron

#
import ollama

prompt = "what is the title of the document?"

# Embed the query, then retrieve the most similar chunks.
# `collection` is the vector-store collection built earlier in the script (not shown here).
response = ollama.embed(
    model="nomic-embed-text",
    input=prompt,
)
# print(f"Query embedding: {response['embeddings'][0][:5]}")

results = collection.query(
    query_embeddings=response["embeddings"],
    n_results=5,
)
# results["documents"] holds one list of chunks per query embedding
for i, d in enumerate(results["documents"][0]):
    print(i, " -> ", d)

#

I checked this using print statements, but even the "relevant searches" are bad

#

so out of the whole code, the problem is in this part

#

any suggestions?

dusky iron
#

don't use embeddings to try to find the title of a document

hollow grail
#

I mean, the prompt could be anything, this is just an example

dusky iron
#

because similarity of "what is the title of the document?" to "Some Article Title is All you need" will ALWAYS be low

#

also, what are you embedding?

#

it's really difficult for me to find a problem in the code if I see only 10 lines of it

hollow grail
#

I am creating chunks from the PDF files, and then converting them into embeddings

hollow grail
#

just above it

dusky iron
#

ok but what are the queries you are running?

#

because the one you showed me so far is just bad

hollow grail
#

so for now:
the PDF has a short story in it, with the title on the first page

so the query is just meant to fetch the title of the document

dusky iron
#

it will never fetch that

hollow grail
#

ohh, so what should I do?
I mean, what is the ideal workflow?

dusky iron
#

the query you are trying to use just won't work with semantic search

#

I mean... maybe to some extent it could since titles do often have some special properties

#

but titles also usually do not end with a '(?<=[.!?]) +'

#

have you looked at your chunks?
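
The pattern '(?<=[.!?]) +' quoted above reads like a sentence-splitting regex. A minimal sketch, assuming the chunks are produced by splitting on it (the actual chunking code is not shown in this thread; `split_sentences` and the shortened sample text are hypothetical): a title with no closing punctuation is never cut off, so it ends up fused with the first sentence of the story.

```python
import re

# Assumed splitter: break after '.', '!' or '?' when followed by one or more spaces.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?]) +")


def split_sentences(text: str) -> list[str]:
    """Split text into sentence chunks, dropping empty pieces."""
    return [piece for piece in SENTENCE_BOUNDARY.split(text) if piece]


# Toy page text: a title with no final punctuation, followed by two sentences.
page = (
    "IN THE PENAL COLONY BY FRANZ KAFKA "
    "It appeared that the Traveler had responded only out of politeness. "
    "Of course, interest in the execution was not very high."
)

chunks = split_sentences(page)
for i, chunk in enumerate(chunks):
    print(i, "->", chunk)
```

Printing the chunks like this is a quick way to check whether the title ever exists as a chunk of its own.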

hollow grail
#

let me share a few

dusky iron
#

if you want to find a title specifically, you might need to generate / copy a bunch of titles, embed those, find the mean

hollow grail
#
1  ->  It appeared that the Traveler had responded to the invitation of the Commandant only out of politeness, when he had been invited to attend the execution of a soldier condemned for disobeying and insulting his superior.
2  ->  Of course, interest in the execution was not very high, not even in the penal colony itself.
3  ->  At least, here in the small, deep, sandy valley, closed in on all sides by barren slopes, apart from the Officer and the Traveler ther e were present only the Condemned, a vacant -looking man with a b road mouth and dilapidated hair and face, and the Soldier, who held the heavy chain to which were connected the small chains which bound the Condemned Man by his feet and wrist bones, as well as by his neck, and which were also linked to each other by connecting chains.
4  ->  The Condemned Man had an expression of such dog -like resignation that it looked as if one could set him free to roam around the slopes and would only have to whistle at the start of the execution for him to return.
5  ->  The Traveler had little i nterest in the apparatus and walked back and forth behind the Condemned Man, almost visibly indifferent, while the Officer took care of the final preparations.
#

so these are the chunks before converting them into embeddings

dusky iron
#

yeah. so embeddings are not magic

#

you can compute cosine similarity between all of these
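
A minimal sketch of that check, using plain Python and toy 3-dimensional vectors in place of real chunk embeddings (the helper names `cosine_similarity` and `similarity_matrix` are illustrative, not from any library):

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def similarity_matrix(vectors: list[list[float]]) -> list[list[float]]:
    """Pairwise cosine similarity between all vectors."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


# Toy vectors standing in for real embeddings.
toy = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 2.0]]
matrix = similarity_matrix(toy)
for row in matrix:
    print(" ".join(f"{value:.2f}" for value in row))
```

With real embeddings, putting the query vector in the same list as the chunk vectors shows directly how the query scores against each chunk.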

hollow grail
#

but the main goal was, to chat with PDF using llama models

dusky iron
#

also, just FYI, vector databases typically return approximate top results, not exact ones

dusky iron
#

if you want to be able to do that, you will need to do more preprocessing either on the query or the text itself

#

e.g. just adding <title>IN THE PENAL COLONY BY FRANZ KAFKA TRANSLATED BY IAN JOHNSTON 1919 IN THE PENAL COLONY</title> would probably work

#

or using HyDE might work

hollow grail
#

so in the current setup:

-> I am converting sentences (chunks) into embeddings
-> storing them in a vector database
-> retrieving the relevant docs (this is the step that goes wrong)
-> and then adding those docs as context to the query for the llama model

#

but HyDE seems to be interesting

#

but the goal should remain the same:
we have to chat with PDFs

hollow grail
#

so the user will just upload a PDF and then start asking it questions (using a llama model)

dusky iron
#

great. well, then you need to improve every single part of your pipeline

#

because most of this stuff works roughly 50% of the time in the default configuration

hollow grail
#

does HyDE also need embeddings?

#

because here they are just using the OpenAI API

#

also thanks for your time!

#

it uses this

hollow grail
#

got it

#

I should use langchain here

dusky iron
#

only if you want to have more pain

hollow grail
#

come on 🙂

#

then what's the other way?

dusky iron
#

it's called reading

#

this abstract contains the whole technique

#

the whole implementation is in 2 files, with like 20 relevant lines of code at most

hollow grail
#

wait, what is the generated document?

#

and how is it being generated?
is this due to Contriever?

dusky iron
#

in its default form HyDE generates a document to answer your query

#

which is not what you want at all

#

you can use langchain blindly following braindead medium articles if you want, but you will only encounter pain ;)

hollow grail
#

ohh, so what's the best way?

dusky iron
#

thinking

#

for yourself

#

reading

#

and understanding what the code you are using is doing

dusky iron
#

so that it can embed them

#

average the embeddings

#

and use that average embedding as a surrogate query

#

Q -> Q*

#

and search for that query instead

hollow grail
#

ahh, I am still confused about this

dusky iron
#

if you embed "A COMPREHENSIVE OVERVIEW OF MACHINE LEARNING APPLICATIONS", you are much more likely to find: 0 -> IN THE PENAL COLONY BY FRANZ KAFKA TRANSLATED BY IAN JOHNSTON 1919 IN THE PENAL COLONY “It’s a peculiar apparatus,” said the Officer to the Traveler, gazing with a certain admiration at the device, with which he was, of course, thoroughly familiar.

hollow grail
#

which is wrong?

dusky iron
#

which is correct

#

your question was "what is the title of the document?"

#

I embed a hypothetical answer, "A COMPREHENSIVE OVERVIEW OF MACHINE LEARNING APPLICATIONS"

#

and find the real answer

hollow grail
#

do I need to do this every time?

dusky iron
#

idk, you need to test and see

#

if you want to be able to answer "what is the title of the document?", you need to do something to improve the quality of search

#

I gave you a couple ideas

#

there is no medium article that will teach you how to do it with high accuracy

#

first of all because medium article writers are all slop peddlers nowadays

#

and second because it is actually really hard and people who get to say 90% accuracy for such tasks won't be spilling the beans

hollow grail
#

ohkk

#

but I am just confused, as you said langchain will increase the pain

dusky iron
#

yes

hollow grail
#

can you give me an abstract workflow for this?

dusky iron
#

I already did?

#

you start with a query

hollow grail
#

you gave lots of

dusky iron
#

you generate N hypothetical answers to that query

hollow grail
#

where to start?

dusky iron
#

you average their embeddings out

#

you use that average embedding to search
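
The workflow just described (query, N hypothetical answers, averaged embedding Q*, search) can be sketched as follows. This is an illustration under assumptions, not the HyDE reference implementation: `mean_embedding`, `surrogate_query` and `toy_embed` are hypothetical names, the hypothetical answers are hard-coded instead of generated by an LLM, and the toy embedder only stands in for a real embedding model.

```python
from typing import Callable, Sequence

Vector = list[float]


def mean_embedding(vectors: Sequence[Vector]) -> Vector:
    """Element-wise average of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same dimension")
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def surrogate_query(
    hypothetical_answers: Sequence[str],
    embed: Callable[[str], Vector],
) -> Vector:
    """Embed each hypothetical answer and average the results into Q*."""
    return mean_embedding([embed(text) for text in hypothetical_answers])


def toy_embed(text: str) -> Vector:
    """Stand-in embedder (assumption): [word count, share of uppercase letters]."""
    letters = [c for c in text if c.isalpha()]
    upper = sum(c.isupper() for c in letters)
    return [float(len(text.split())), upper / max(len(letters), 1)]


# In practice these would be N answers generated by an LLM for the user's query.
answers = [
    "A COMPREHENSIVE OVERVIEW OF MACHINE LEARNING APPLICATIONS",
    "NOTES ON GARDENING",
]
q_star = surrogate_query(answers, toy_embed)
print(q_star)
```

In the setup from this thread, `embed` could presumably wrap `ollama.embed(model="nomic-embed-text", input=text)["embeddings"][0]`, and Q* would then be passed to the vector store as `query_embeddings=[q_star]` in place of the embedding of the raw question.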

hollow grail
#

so I should implement this HyDE right? I mean first test it

dusky iron
#

you should start by implementing some kind of test suite, ideally.

#

because testing on 1 query that is difficult for semantic search is kind of a bad idea

#

most people will know what the title is

#

and they will be asking about the contents of the story

dusky iron
#

so start by making a set of questions, documents, and answers

hollow grail
#

so most queries would be something like:

give me the context of the document
what is in the document

and so on

dusky iron
#

summarization is also impossible with semantic search btw

dusky iron
#

find/download/use existing ones in your system idk

hollow grail
#

ohkk, got it
so I do have a few PDFs to test with

so as you said:
create questions (on my own) and appropriate answers for them

right?

dusky iron
#

yes

hollow grail
#

and how will it help?

dusky iron
#

and some questions that are irrelevant/can't be answered in the context of the document

#

by giving you an estimate of the quality

#

so that when you change things, you can see whether the quality improved

#

and not do things blindly

hollow grail
#

and what comes after that, actually?
because this is just manual work, which will take me a few minutes

dusky iron
#

after that you start making changes

hollow grail
#

changes? where?

#

assuming I have just created some questions (relevant and non-relevant) and answers too

dusky iron
#

read this

#

it doesn't tell you exactly what to do, but should give you some good ideas

dusky iron
#

you can do it manually, but that will be somewhat painful

#

so ideally you would use a sufficiently smart LLM as a judge

#

or you would just pattern match answers somehow/use semantic similarity on those

#

the end goal is to have a way to evaluate performance of your RAG

#

until you have that, all improvements are largely meaningless, since it is impossible to tell if they even work

#

if you want to evaluate JUST the search, you can instead do something different

#

you can collect a list of (query, document, relevant passage and its location)

#

and then measure the performance of your search

hollow grail
#

ohhk that's cool then

dusky iron
#

in particular, using https://en.wikipedia.org/wiki/Mean_reciprocal_rank or any other method

The mean reciprocal rank is a statistic measure for evaluating any process that produces a list of possible responses to a sample of queries, ordered by probability of correctness. The reciprocal rank of a query response is the multiplicative inverse of the rank of the first correct answer: 1 for first place, 1⁄2 for second place, 1⁄3 for third ...
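
A minimal sketch of mean reciprocal rank over a small labelled set, assuming each test case has exactly one relevant chunk and a query whose relevant chunk is not retrieved at all scores 0 (a common convention). The chunk ids and the `runs` data are made up for illustration.

```python
from typing import Sequence


def reciprocal_rank(ranked_ids: Sequence[str], relevant_id: str) -> float:
    """1/rank of the relevant item in the ranking, or 0.0 if it is missing."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == relevant_id:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: Sequence[tuple[Sequence[str], str]]) -> float:
    """Average reciprocal rank over (ranked results, relevant id) pairs."""
    if not runs:
        raise ValueError("need at least one query")
    return sum(reciprocal_rank(ranked, relevant) for ranked, relevant in runs) / len(runs)


# One entry per test query: what the search returned, and which chunk was correct.
runs = [
    (["c3", "c0", "c7"], "c0"),  # correct chunk at rank 2
    (["c1", "c4", "c2"], "c1"),  # correct chunk at rank 1
    (["c5", "c6", "c8"], "c9"),  # correct chunk not retrieved
]
mrr = mean_reciprocal_rank(runs)
print(f"MRR = {mrr:.3f}")
```

Re-running the same labelled set after each change to chunking or querying gives one number to compare before and after.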

hollow grail
#

so now I will take a document and try the HyDE method

hollow grail
#

so considering these example websites, which methods do they use?

#

because this one also fails to retrieve the title of the document, which is fine, since it is able to explain the story in the document

dusky iron
#

well if I was doing it from scratch, I would either use docling + custom chunker + semantic search + keyword search

#

or... gemini flash
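
For the "semantic search + keyword search" option, the thread does not say how the two result lists would be combined. One common choice is reciprocal rank fusion (RRF), sketched here as an assumption. The chunk ids and both rankings are made up, and `k=60` is just the constant commonly used with RRF.

```python
from collections import defaultdict
from typing import Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> list[str]:
    """Merge several rankings: each item scores sum(1 / (k + rank)) across lists."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


semantic = ["c2", "c0", "c5"]  # hypothetical vector-search results, best first
keyword = ["c0", "c9", "c2"]   # hypothetical keyword-search results, best first
fused = reciprocal_rank_fusion([semantic, keyword])
print(fused)
```

Chunks that rank well in both lists rise to the top, while chunks found by only one method are still kept further down.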

hollow grail
#

wait what gemini flash?

#

I read it also has a long context length?

dusky iron
#

yes