# How can we read images/tables in PDF and CSV/Excel files using LangChain?
Yes, you can use PyPDFLoader, and it is well documented: https://python.langchain.com/docs/modules/data_connection/document_loaders/pdf
- Load with PyPDF
- Split the documents using RecursiveCharacterTextSplitter
- Store them in a vector store (FAISS, Chroma, etc) with embeddings
- Retrieve from the vector store
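For reference, the four steps above could look roughly like this. It's only a sketch: it assumes the `langchain-community`, `langchain-text-splitters`, `pypdf` and `faiss-cpu` packages are installed. `build_retriever`, the chunk sizes and `k` are names and values made up for illustration, and you pass in whatever embeddings object you already use.

```python
def build_retriever(pdf_path, embeddings, k=4):
    """Load a PDF, split it, index it in FAISS, and return a retriever.

    `embeddings` is any LangChain embeddings object you already have;
    this sketch does not pick a provider for you.
    """
    # Imports are kept inside the function so the sketch can be pasted
    # anywhere; they assume the packages listed above are installed.
    from langchain_community.document_loaders import PyPDFLoader
    from langchain_community.vectorstores import FAISS
    from langchain_text_splitters import RecursiveCharacterTextSplitter

    # 1. Load with PyPDF (one Document per page, text only).
    pages = PyPDFLoader(pdf_path).load()

    # 2. Split into overlapping chunks (sizes here are arbitrary examples).
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    chunks = splitter.split_documents(pages)

    # 3. Embed the chunks and store them in a FAISS index.
    store = FAISS.from_documents(chunks, embeddings)

    # 4. Expose the index as a retriever returning the top-k chunks.
    return store.as_retriever(search_kwargs={"k": k})
```

You would then call something like `build_retriever("report.pdf", my_embeddings).invoke("your question")`, where the file name and `my_embeddings` are placeholders.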
I did that, but it doesn't work for complex PDFs that include tables and images.
You can convert the PDF into Markdown and then use the Markdown document loaders. I don't know about the tables part, but it should work for the images.
Yes, I can confirm it works with tables.
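A rough sketch of the Markdown route, in case it helps. The thread doesn't say which converter was used, so `pymupdf4llm` here is an assumption (any PDF-to-Markdown tool would do). `pdf_to_markdown` and `extract_markdown_tables` are hypothetical helper names. The second one is a quick way to check whether the tables survived the conversion as pipe tables.

```python
def pdf_to_markdown(pdf_path: str) -> str:
    """Convert a PDF to a Markdown string.

    Assumption: the third-party `pymupdf4llm` package is installed;
    the thread does not name a specific converter.
    """
    import pymupdf4llm  # imported lazily because it is an optional dependency

    return pymupdf4llm.to_markdown(pdf_path)


def extract_markdown_tables(markdown: str) -> list[str]:
    """Return every pipe table found in the Markdown, one string per table."""
    tables, current = [], []
    for line in markdown.splitlines():
        stripped = line.strip()
        # A pipe-table row starts and ends with "|".
        if stripped.startswith("|") and stripped.endswith("|"):
            current.append(stripped)
            continue
        # A non-table line closes the current run of rows, if any.
        if len(current) >= 2:  # need at least a header row and a separator row
            tables.append("\n".join(current))
        current = []
    if len(current) >= 2:  # table that runs to the end of the text
        tables.append("\n".join(current))
    return tables
```

If `extract_markdown_tables(pdf_to_markdown("report.pdf"))` comes back non-empty (the file name is a placeholder), the tables made it through, and the Markdown can go on to a Markdown loader or splitter as usual.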