I'm a noob in RAG stuff.
I have 40K word documents. For each one I first need to check whether it contains a key phrase like "as X citizen", where X is a given country, and then extract the name, previous name (if there is one, otherwise leave it empty), date of birth, father's name and mother's name. They all contain this information, but the wording and context vary because each person wrote it in their own style, in Romanian or English.
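To make the first step concrete, here is a minimal sketch of the key-phrase check using plain string matching rather than retrieval. It assumes the document text has already been extracted to a Python string. The function names and the idea of passing the nationality adjective in both languages are hypothetical, and the Romanian pattern only covers the "ca cetățean(ă) X" wording.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Lowercase, strip diacritics and collapse whitespace,
    so that 'cetățean' and 'cetatean' compare equal."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.sub(r"\s+", " ", stripped).casefold()


def has_citizen_phrase(text: str, adjective_en: str, adjective_ro: str) -> bool:
    """Return True if the text contains 'as (a/an) X citizen'
    or 'ca cetatean(a) X(a)' after normalization."""
    norm = normalize(text)
    en = re.escape(normalize(adjective_en))
    ro = re.escape(normalize(adjective_ro))
    pattern = rf"\bas (?:an? )?{en} citizen\b|\bca cetatean[a]? {ro}[a]?\b"
    return re.search(pattern, norm) is not None


# Example call (assumed inputs): the adjective is given once per language.
print(has_citizen_phrase("Subsemnatul X, ca cetățean român, declar", "Romanian", "român"))
```

Documents that fail this check can be skipped before any embedding or LLM call is made.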
I used LlamaIndex + llama.cpp to set up a RAG workflow, but I could not get it to return anything relevant, so I ditched it for Llmware. In Llmware I tried chromadb + sqlite with this embedding model: "all-mpnet-base-v2": 300, to build a library and index the document vectors. When I query the vector DB/index with "as X citizen", it returns only the passage of the document that contains "as X citizen" instead of the whole paragraph or the whole document text, or it misses it entirely.
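To illustrate what "the whole paragraph" would mean here, below is a minimal sketch that bypasses the vector index and scans paragraphs directly for the exact phrase. This is a different technique from the embedding query described above, not a Llmware or chromadb feature. It assumes paragraphs are separated by blank lines; the function name and the `neighbors` parameter are hypothetical.

```python
import re


def paragraphs_with_phrase(document: str, phrase: str, neighbors: int = 0) -> list[str]:
    """Return every paragraph containing `phrase` (case-insensitive),
    plus `neighbors` paragraphs of context on each side."""
    # Assumption: a blank line marks a paragraph boundary.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]
    needle = phrase.casefold()
    keep = set()
    for i, para in enumerate(paragraphs):
        if needle in para.casefold():
            lo = max(0, i - neighbors)
            hi = min(len(paragraphs), i + neighbors + 1)
            keep.update(range(lo, hi))
    # Preserve the original paragraph order.
    return [paragraphs[i] for i in sorted(keep)]
```

With `neighbors=0` only the matching paragraphs come back; a larger value widens the context handed to the LLM.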
The local LLM I feed the vector query results into is dragon-yi-answer-tool. But it never received conclusive data, so I could not test it properly: sometimes it works, sometimes it does not. The prompt (for Romanian) is:
Extract the following information from the text (if it contains 'ca cetățean român') and provide the response in the specified format:
Response Format:
{
Nume: [Name],
Nume anterior: [Previous Name],
Data nasterii: [Birth Date],
Nume tata: [Father's Name],
Nume mama: [Mother's Name],
}
Note:
- The birth date usually appears after the phrase "născut la data de" or "născută la data de".
- The previous name, if present, appears between "născut"/"născută" and "la data de".
- The names of the parents usually appear after "fiul"/"fiica", with the first name being the father's and the second name being the mother's.
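
Outside the prompt, the cues listed in the notes can also be tried as plain regular expressions, which gives a baseline to compare the LLM output against. The following is a minimal sketch under several assumptions: the sample sentence wording is invented for illustration, fields end at a comma, semicolon or line break, and parent names are returned exactly as written (for example in the genitive form). The notes give no cue for the person's own name, so that field is left out here.

```python
import re
import unicodedata

# Birth date follows "născut(ă) ... la data de"; a previous name, if any,
# sits between "născut"/"născută" and "la data de".
BIRTH_RE = re.compile(
    r"\bn[ăa]scut[ăa]?\b\s*(?P<prev>[^,;\n]*?)\s*\bla data de\s+(?P<date>[^,;\n]+)",
    re.IGNORECASE,
)
# Parents follow "fiul"/"fiica": father first, then mother after "și".
PARENTS_RE = re.compile(
    r"\b(?:fiul|fiica)\s+(?:lui\s+)?(?P<father>[^,;\n]+?)\s+[șşs]i\s+"
    r"(?:al?\s+)?(?P<mother>[^,;\n]+)",
    re.IGNORECASE,
)


def extract_fields(text: str) -> dict[str, str]:
    """Apply the note heuristics; missing fields stay as empty strings."""
    text = unicodedata.normalize("NFC", text)  # compose diacritics consistently
    result = {"Nume anterior": "", "Data nasterii": "", "Nume tata": "", "Nume mama": ""}
    birth = BIRTH_RE.search(text)
    if birth:
        result["Nume anterior"] = birth.group("prev").strip()
        result["Data nasterii"] = birth.group("date").strip(" .")
    parents = PARENTS_RE.search(text)
    if parents:
        result["Nume tata"] = parents.group("father").strip()
        result["Nume mama"] = parents.group("mother").strip(" .")
    return result
```

Documents where these patterns find nothing are the ones where free-form wording defeats the heuristics, and those are the cases worth sending to the LLM.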