#AI-driven Colombian Law Agent


orchid wing
#

I am trying to build an AI-driven legal support project. The workflow of the project is as follows:

  1. The user uploads a PDF document that describes a legal matter.
    I attached a sample PDF file.
  2. The service extracts information such as the background, relevant facts, evidence, etc.
  3. The extracted information is used to search for similar judicial precedents, which are stored in MongoDB.
    The data is scraped from websites like https://www.corteconstitucional.gov.co/relatoria/2024/T-310-24.htm
    I scraped nearly 40k records.
  4. The service generates a new document from the extracted information, based on the retrieved precedents.

Now I want to build this project with RAG or LangChain, and I want to run a local LLM. So far I have no clear idea how to approach it. I need your help.

scarlet parcel
#

Like many before me, I would advise you NOT to use LangChain, primarily because your data sources are not well centralized. You'll also find that downloading the data may take significant resources.

orchid parcel
#

Building a local LLM can be challenging. However, taking an existing open-source LLM like Llama 3.3 70B and running it locally could be a cost-effective solution.

scarlet parcel
# orchid wing could you please guide me how to implement this task?

The components of RAG are pretty simple. You have your data (organized in some way) stored locally on your machine. You also need a vector DB (I used LanceDB) to store the vector embeddings. And you'll need an embedding model (e.g. BERT or sentence-transformers) to convert text into dense vector embeddings.
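
To make those three pieces concrete, here is a minimal sketch using only the Python standard library. It is not LanceDB or a real embedding model: `toy_embed` is a hashed bag-of-words stand-in, `TinyVectorStore` is a hypothetical in-memory class, and the document ids and texts are invented for illustration. The point is only that each embedding is stored next to the text it came from.

```python
import hashlib
import math
import re

DIM = 256  # size of the toy embedding vector (an arbitrary choice for this sketch)


def toy_embed(text):
    """Stand-in for a real embedding model: hashed bag-of-words, L2-normalised."""
    vec = [0.0] * DIM
    for token in re.findall(r"\w+", text.lower()):
        bucket = int(hashlib.sha256(token.encode("utf-8")).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


class TinyVectorStore:
    """Keeps each embedding next to the chunk text and document id it came from."""

    def __init__(self):
        self.rows = []  # list of (doc_id, chunk_text, vector)

    def add(self, doc_id, chunk_text):
        self.rows.append((doc_id, chunk_text, toy_embed(chunk_text)))

    def search(self, query, top_m=3):
        """Return the top_m rows as (cosine similarity, doc_id, chunk_text)."""
        q = toy_embed(query)
        scored = [
            (sum(a * b for a, b in zip(q, vec)), doc_id, text)
            for doc_id, text, vec in self.rows
        ]
        scored.sort(key=lambda row: row[0], reverse=True)
        return scored[:top_m]


# Invented sample chunks, not real rulings.
store = TinyVectorStore()
store.add("doc-1", "accion de tutela por derecho a la salud")
store.add("doc-2", "derecho a la educacion de menores")
store.add("doc-3", "demanda de inconstitucionalidad contra ley tributaria")

best = store.search("tutela derecho a la salud", top_m=1)
print(best[0][1], best[0][2])
```

In a real setup you would swap `toy_embed` for an actual model and the list for a vector DB, but the stored row would still carry the original text alongside the vector.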

#

Given the scale of your legal data, I'd recommend you use a standard 2-stage approach to your search engine: stage 1 being a TF-IDF search engine and stage 2 being your vector DB.

#

I would also highly recommend you precompute the TF-IDF statistics (i.e. term frequencies for each document and inverse document frequencies for each word), and implement the TF-IDF search in something fast like Rust rather than Python or Node.js.
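
As a rough Python prototype of that precomputation (before porting anything to Rust), something like the following could work. The function names, the sample documents, and the smoothed IDF formula are my own assumptions; other TF-IDF variants or BM25 would slot in the same way.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"\w+", text.lower())


def build_index(docs):
    """Precompute per-document term frequencies and per-word IDF once, up front."""
    term_freqs = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    doc_count = len(docs)
    doc_freq = Counter()
    for counts in term_freqs.values():
        doc_freq.update(counts.keys())  # each word counted once per document
    # Smoothed IDF: log((1 + N) / (1 + df)) + 1, one common variant among several.
    idf = {w: math.log((1 + doc_count) / (1 + df)) + 1.0 for w, df in doc_freq.items()}
    return term_freqs, idf


def search(query, term_freqs, idf, top_n=2):
    """Score each document by summed tf * idf over the query terms."""
    q_terms = tokenize(query)
    scores = {}
    for doc_id, counts in term_freqs.items():
        length = sum(counts.values()) or 1
        score = sum((counts[t] / length) * idf.get(t, 0.0) for t in q_terms)
        if score > 0:
            scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Invented sample documents, not real rulings.
docs = {
    "doc-1": "tutela salud tratamiento medico negado",
    "doc-2": "pension vejez reconocimiento",
    "doc-3": "tutela educacion cupo escolar",
}
term_freqs, idf = build_index(docs)
hits = search("tutela salud", term_freqs, idf)
print(hits)
```

The index is built once and reused for every query, which is the part that matters at 40k documents.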

orchid wing
#

Many thanks. I am familiar with Python and will develop it as you suggest.

scarlet parcel
#

Python is great for rapid prototyping and scripting the preprocessing, BUT depending on the scale (number of documents and length of documents), you may want to consider faster alternatives OR speeding up parts of your Python with Rust extensions.

#

Also, the reason I steer people away from LangChain is that it's a library that is constantly evolving, with documentation going out of date really quickly. When you dissect their code on GitHub, you find so many layers of abstraction that it's tough to follow where everything goes. It's just much easier to build your own RAG system than to rely on LangChain.

To summarize: LangChain is good for people who want a toy project they can put on GitHub/LinkedIn and call it a day. Build your own RAG system if you are passionate about or invested in a project and want to understand how RAG works.

orchid wing
#

Thanks, now I can start the work. I am planning to proceed as follows:

  1. Build a vector DB from the existing database.
  2. Build a RAG system to search and retrieve relevant data.

But I have questions for you:

  1. Is there a local vector DB, or should I use a cloud service like Pinecone?
  2. When I retrieve data with RAG, is the result a single record that is most relevant to the uploaded document, or a combination of multiple records that dealt with similar issues?
scarlet parcel
#

Regarding what is retrieved, what is usually fed into the LLM is the actual text chunks that provide relevant context. Here's what this looks like with a two-stage RAG system:

  1. The query goes to a TF-IDF/BM25 search engine. The search returns the top N relevant documents based on that method.
  2. Those top N documents are turned into vectors via an embedding model and stored in a vector DB. Depending on the size of the data (and the amount of storage available to you), it may be better to compute those embeddings on the fly instead of precomputing them.
  3. The query then goes to the vector DB, filtered down to only the vectors from the top N documents of stage 1. The query is embedded and vector similarity is used to get the top M embeddings. Those embeddings (and their associated text metadata) are mapped back to the actual text chunks from your knowledge base. In other words, your vector DB should hold both the embeddings and information about the original text, so you can retrieve exactly which text chunk each embedding came from.
  4. Those top M texts are returned to the LLM, and everything is passed into one large template prompt that looks something like:

"Given the following context:
{context text 1}
{context text 2}
{context text M}

Answer the following question:
{user entered query}"
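
In Python, filling that template could be as simple as the sketch below. `build_prompt` is a hypothetical helper name, and the sample chunks and question are made up.

```python
def build_prompt(context_chunks, user_query):
    """Fill the template above with the top M text chunks and the user's query."""
    context = "\n".join(context_chunks)
    return (
        "Given the following context:\n"
        f"{context}\n\n"
        "Answer the following question:\n"
        f"{user_query}"
    )


prompt = build_prompt(["chunk one", "chunk two"], "What did the court decide?")
print(prompt)
```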

#

If your model does not have sufficient context window for processing the data, you would probably set M = 1. If your model can handle larger contexts, M can be larger.

#

It is advised that you have M > 1 in case your retrieval did not return the "needed"/"correct" text chunk in the first result.
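
One way to pick M automatically is to keep adding ranked chunks until a token budget is used up, while always keeping at least the top one. This is only a sketch: `select_chunks` is a hypothetical helper, and counting whitespace-separated words is a crude assumption standing in for your model's real tokenizer.

```python
def select_chunks(ranked_chunks, token_budget, count_tokens=lambda text: len(text.split())):
    """Keep the highest-ranked chunks that fit the budget; always keep at least one."""
    selected, used = [], 0
    for chunk in ranked_chunks:
        cost = count_tokens(chunk)
        # Stop at the first chunk that would overflow, unless nothing is selected yet.
        if selected and used + cost > token_budget:
            break
        selected.append(chunk)
        used += cost
    return selected


ranked = ["a b c", "d e", "f g h i"]  # already sorted best-first
print(select_chunks(ranked, token_budget=5))
print(select_chunks(ranked, token_budget=1))
```

With a small budget this degrades to M = 1, and with a larger one M grows on its own.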