#How to approach extracting the same data from 40K word documents using RAG?


wind sparrow
#

I'm a noob in RAG stuff.

I have 40K word documents. First I need to check whether each one contains a key phrase like "as X citizen", where X is a given country. Then I need to extract the name, previous name (if there is one, otherwise leave it empty), date of birth, father's name and mother's name. All the documents contain this information, but the wording and context vary, since each author wrote in their own style, in Romanian or English.

I used LlamaIndex + llama.cpp to set up a RAG workflow, but I could not get it to return anything relevant, so I ditched it for llmware. In llmware I tried ChromaDB + SQLite with this embedding model: "all-mpnet-base-v2": 300, to build a library and index the document vectors. When I query the vector DB/index with "as X citizen", it returns only the short passage of text containing "as X citizen" instead of the whole paragraph or the whole document text, or it misses the passage entirely.

The local LLM I feed the vector query results into is dragon-yi-answer-tool. But it never received conclusive data, so I could not test it properly: sometimes it works, sometimes it does not. The prompt (for Romanian) is:
Extract the following information from the text (if it contains 'ca cetățean român') and provide the response in the specified format:

Response Format:
{
Nume: [Name],
Nume anterior: [Previous Name],
Data nasterii: [Birth Date],
Nume tata: [Father's Name],
Nume mama: [Mother's Name],
}

Note:
- The birth date usually appears after the phrase "născut la data de" or "născută la data de".
- The previous name, if present, appears between "născut"/"născută" and "la data de".
- The names of the parents usually appear after "fiul"/"fiica", with the first name being the father's and the second name being the mother's.
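
The three notes in this prompt are regular enough that they could also be tried as plain regular expressions before (or instead of) asking the LLM. The following is only a minimal sketch under several assumptions that are not stated in the thread: dates are written numerically (for example `12.03.1985`), the parents are joined by "și", and `extract_fields`, `BIRTH_RE`, `PARENTS_RE` and the sample sentence are all made up for illustration. The person's own name is not covered by the notes, so it is left out here.

```python
import re

# Note 1 and 2: "născut"/"născută", optional previous name, "la data de", date.
# ASSUMPTION: the date is numeric (dd.mm.yyyy, dd/mm/yyyy or dd-mm-yyyy).
BIRTH_RE = re.compile(
    r"\bnăscut[ăa]?\s+(?P<previous>.*?)\s*la data de\s+"
    r"(?P<date>\d{1,2}[./-]\d{1,2}[./-]\d{4})",
    re.IGNORECASE,
)

# Note 3: "fiul"/"fiica", then the father's name, then the mother's name.
# ASSUMPTION: the two names are single words joined by "și" (optionally
# preceded by "lui" and followed by "al"/"a").
PARENTS_RE = re.compile(
    r"\bfi(?:ul|ica)\s+(?:lui\s+)?(?P<father>\w+)\s+(?:și|şi|si)\s+"
    r"(?:al?\s+)?(?P<mother>\w+)",
    re.IGNORECASE,
)


def extract_fields(text: str) -> dict:
    """Return the fields described in the notes; missing fields are empty strings."""
    birth = BIRTH_RE.search(text)
    parents = PARENTS_RE.search(text)
    return {
        "Nume anterior": birth.group("previous").strip() if birth else "",
        "Data nasterii": birth.group("date") if birth else "",
        "Nume tata": parents.group("father") if parents else "",
        "Nume mama": parents.group("mother") if parents else "",
    }


# Made-up sample sentence, not taken from any real document.
SAMPLE = (
    "Popescu Ion, născut Ionescu la data de 12.03.1985, "
    "fiul lui Vasile și al Mariei, ca cetățean român"
)
print(extract_fields(SAMPLE))
```

Names come out exactly as written in the text (possibly in an inflected form), and any document that does not follow the assumed wording simply yields empty fields, which makes it easy to route only those documents to the LLM.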

proud finch
#

I don't get it. You want to get the name plus some other information for each person? Everyone wrote an article about themselves, and you want an LLM to extract the relevant facts?
Does each document contain only one person, or can there be several?

To me it does not make sense to use RAG here. Why don't you parse all the documents in a loop to get the data into the format you want? What is the RAG used for?
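
A rough sketch of what such a loop could look like, assuming the text of every document has already been extracted into plain strings. The names `KEY_PHRASES`, `normalize`, `contains_key_phrase` and `process_documents` are hypothetical, and the English phrasing "as a Romanian citizen" is only a guess at how "as X citizen" appears in the English documents. Diacritics are stripped before matching so that "cetățean", "cetăţean" and "cetatean" are all treated the same.

```python
import unicodedata

# Romanian phrase from the prompt in the thread; the English one is an ASSUMPTION.
KEY_PHRASES = ("ca cetățean român", "as a Romanian citizen")


def normalize(text: str) -> str:
    """Strip diacritics, lowercase, and collapse all whitespace to single spaces."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(stripped.casefold().split())


def contains_key_phrase(text: str, phrases=KEY_PHRASES) -> bool:
    """True if any key phrase occurs in the text after normalization."""
    haystack = normalize(text)
    return any(normalize(phrase) in haystack for phrase in phrases)


def process_documents(documents: dict, extract, phrases=KEY_PHRASES) -> dict:
    """Loop over {doc_id: text}, skip documents without a key phrase,
    and apply `extract` (any callable taking the text) to the rest."""
    results = {}
    for doc_id, text in documents.items():
        if not contains_key_phrase(text, phrases):
            continue
        results[doc_id] = extract(text)
    return results
```

Here `extract` can be anything that takes the document text: a regex-based function, or a call that sends the text together with the prompt to the local LLM.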

wind sparrow
#

It can be multiple persons. For example, one document can give data about the subject's grandmother in one bullet point (name, birth date, citizenship), to prove that the subject is or is not of the said citizenship, and then start with the subject's own data in another bullet point.
In other cases only the person of interest is listed, with just their own data (birth date, father's and mother's names, name, previous name).
Each document contains data about a different person.
I used RAG to filter all the documents by the key phrase ("is citizen of X", where X is a country) and to try to extract the paragraph/chunk of text relevant to the subject's data, ignoring the document title, date, signatures, names and all the government gibberish found at the top and bottom of the document. Then I tried to feed that result as a query to the LLM.
I ran the query document by document, or in batch on the whole dataset, and then iterated over the results.
I thought that using RAG would be 10x faster than manually parsing the documents one by one.
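
For comparison, the "keep only the relevant paragraph, drop the header and footer" step can also be approximated without a vector index. This is only a sketch: it assumes paragraphs (or bullet points) are separated by blank lines in the extracted text, and `paragraphs_with_phrase` is a hypothetical helper name.

```python
import re


def paragraphs_with_phrase(text: str, phrase: str) -> list:
    """Return every blank-line-separated paragraph that contains `phrase`.

    Matching ignores case and line breaks inside a paragraph, so a phrase
    that wraps across two lines is still found.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    needle = " ".join(phrase.split()).casefold()
    return [p for p in paragraphs if needle in " ".join(p.split()).casefold()]
```

Because it returns every matching paragraph, a document that also mentions a relative's citizenship would yield more than one paragraph, and choosing the one about the subject would still be up to the caller or the LLM.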