Hi All,
I'm fairly new to NLP. I have a PDF file containing both text and multiple tables with varying column lengths. I tried a couple of libraries to extract both efficiently, e.g. LangChain's PDFLoader and PyPDF2.
However, the output of the former was very unstructured, and the latter sometimes treated text as tabular or vice versa. My concern is that if the extracted text is highly unstructured (as with LangChain's loader), how would the model know what a table row relates to when doing the similarity lookup?
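To make the concern concrete, here is a rough sketch of what I imagine might help: serializing each table row together with its column headers, so every chunk is self-describing before it gets embedded. It assumes the table has already been pulled out into headers and rows by some other tool. The helper name `rows_to_chunks`, the sample table, and its title are all made up for illustration and are not part of LangChain or PyPDF2.

```python
def rows_to_chunks(headers, rows, table_title=""):
    """Turn each table row into one self-describing text chunk (hypothetical helper)."""
    chunks = []
    for row in rows:
        # Pair each cell with its column header and skip empty cells.
        # zip() stops at the shorter sequence, so short rows are tolerated.
        pairs = [f"{h}: {c}" for h, c in zip(headers, row) if c not in (None, "")]
        text = "; ".join(pairs)
        if table_title:
            # Prefix the table title so the row keeps its context on its own.
            text = f"{table_title} | {text}"
        chunks.append(text)
    return chunks


# Made-up sample data, only to show the shape of the output.
headers = ["Product", "Region", "Revenue"]
rows = [["Widget", "EU", "1200"], ["Gadget", "US", ""]]

for chunk in rows_to_chunks(headers, rows, table_title="Sales 2023"):
    print(chunk)
# Sales 2023 | Product: Widget; Region: EU; Revenue: 1200
# Sales 2023 | Product: Gadget; Region: US
```

Each string would then be embedded as its own chunk. I don't know whether this is how people usually handle it, which is partly why I'm asking.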
How are you guys extracting data from multiple data formats to create embeddings and then store them in a vector DB?