#Extracting text and tabular data from a PDF efficiently for NLP


rough dock

Hi All,
I'm fairly new to NLP. I have a PDF file containing both text and multiple tables with varying column lengths. I tried a couple of libraries to extract both efficiently, e.g. LangChain's PDFLoader and PyPDF2.
However, the output of the former was very unstructured, and the latter sometimes treated text as tabular data or vice versa. My concern is that if the extracted text is highly unstructured, how would the model know what a table row relates to when looking up similarity (with LangChain's loader)?

How are you guys extracting data from multiple data formats to create embeddings to then store in some vector db?

waxen sluice

@rough dock were you able to figure it out?

wintry cave
serene island
# rough dock Hi All, I'm fairly new to NLP, I have a pdf file with both text and multiple ta...

Just a guess, but since you mention a vector db, I wonder if you are conflating embedding (vectorization before feeding into a model) with classification/summarisation (vectors representing text produced by a model). If the latter, then you either need to preprocess the text so that the model you have can handle it, or find/train a model that's resilient to formatting peculiarities.
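
For the preprocessing route, here is a rough sketch of one option, not a definitive recipe: once the tables have been pulled out of the PDF by whatever extractor you settle on, turn each row into its own piece of text that repeats the column names (and the table caption, if there is one), so a row still makes sense when it is embedded on its own. The function name `table_to_chunks`, its parameters, and the sample data are all made up for illustration; it assumes each table arrives as a header list plus a list of rows, with `None` or empty strings for missing cells.

```python
from typing import List, Optional


def table_to_chunks(
    header: List[str],
    rows: List[List[Optional[str]]],
    caption: str = "",
) -> List[str]:
    """Turn each table row into a standalone string that carries its column names.

    Hypothetical helper: assumes the table was already extracted from the PDF.
    """
    chunks = []
    prefix = f"{caption.strip()} | " if caption.strip() else ""
    for row in rows:
        # Pair every cell with its column name and skip empty cells.
        # zip() stops at the shorter of header/row, so ragged rows do not crash.
        pairs = [
            f"{col.strip()}: {str(cell).strip()}"
            for col, cell in zip(header, row)
            if cell is not None and str(cell).strip()
        ]
        if not pairs:
            continue  # nothing useful in this row
        chunks.append(prefix + "; ".join(pairs))
    return chunks


if __name__ == "__main__":
    # Illustrative data only, not taken from any real document.
    header = ["Region", "Quarter", "Revenue"]
    rows = [["EMEA", "Q1", "1.2M"], ["APAC", "Q1", None]]
    for chunk in table_to_chunks(header, rows, caption="Table 3: Revenue by region"):
        print(chunk)
```

Each resulting string can then be embedded and stored like any other text chunk. The trade-off is some repetition of headers across rows in exchange for every row being interpretable without its neighbours.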

waxen sluice