Extracting text and tabular data from a PDF efficiently for NLP | Learn AI Together

rough dock Apr 13, 2023, 7:06 AM

Hi All,
I'm fairly new to NLP. I have a PDF file with both text and multiple tables with varying column lengths. I tried a couple of libraries to extract both efficiently, e.g. LangChain's PDFLoader and PyPDF2.
However, the output of the former was very unstructured, and the latter sometimes treated text as tabular data or vice versa. My concern is that if the extracted text is highly unstructured, how would the model know what a table row relates to when looking up similarity (with LangChain's loader)?

How are you guys extracting data from multiple data formats to create embeddings to then store in some vector db?
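
One way to address the concern about a table row losing its context is to attach the column headers (and a table caption, if there is one) to every row before embedding it. The following is only a minimal sketch under stated assumptions: the table is assumed to be already extracted as a header list plus a list of rows, `rows_to_chunks` is a hypothetical helper name, and the sample data is made up for illustration. It is not tied to any particular PDF library or vector DB.

```python
def rows_to_chunks(header, rows, caption=""):
    """Turn each table row into a self-describing string suitable for embedding."""
    chunks = []
    for row in rows:
        # Pad short rows so tables with varying column counts still align with the header.
        padded = list(row) + [""] * (len(header) - len(row))
        # Pair every non-empty cell with its column name so the row carries its own context.
        pairs = [
            f"{h}: {v}"
            for h, v in zip(header, padded)
            if v is not None and str(v).strip()
        ]
        prefix = f"{caption} | " if caption else ""
        chunks.append(prefix + "; ".join(pairs))
    return chunks


# Made-up example data (assumption): a table whose second row is missing a cell.
header = ["Product", "Region", "Revenue"]
rows = [["Widget", "EU", "1200"], ["Gadget", "US"]]
chunks = rows_to_chunks(header, rows, caption="Table 1: Sales")
```

Each resulting string can then be embedded on its own, so a similarity lookup that hits a row also retrieves the column names and the table it came from.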

waxen sluice Apr 27, 2023, 5:02 PM

@rough dock were you able to figure it out?

wintry cave Apr 27, 2023, 6:02 PM

rough dock Hi All, I'm fairly new to NLP, I have a pdf file with both text and multiple ta...

Your ask isn't clear to me yet, specifically the part about the relationship.

serene island Apr 27, 2023, 9:35 PM

rough dock Hi All, I'm fairly new to NLP, I have a pdf file with both text and multiple ta...

Just a guess, but since you mention a vector db, I wonder if you are conflating embedding (vectorization before feeding into a model) with classification/summarisation (vectors representing text produced by a model). If it's the latter, then you either need to preprocess the text so that the model you have can handle it, or find/train a model that's resilient to formatting peculiarities.

waxen sluice Apr 28, 2023, 1:48 AM

serene island Just a guess, but since you are mentioning a vector db, I wonder if you are conf...

I think they're having trouble getting the data out of the PDF and arranging it in a proper format for later use.

As they said, the PDF has textual + tabular data, and the data extractor they're using sometimes misreads textual data as tabular.
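
For the misreading problem described above, one possible workaround is to sanity-check every "table" the extractor returns and send the implausible ones back to the plain-text stream. This is a rough sketch, not a definitive fix: it assumes the extractor yields each candidate table as a list of rows (lists of cell strings or `None`), and the function names and thresholds (`min_rows`, `min_cols`, `min_fill`) are hypothetical values that would need tuning on real documents.

```python
def looks_like_table(candidate, min_rows=2, min_cols=2, min_fill=0.5):
    """Heuristic check: enough rows, enough columns, and mostly non-empty cells."""
    rows = [r for r in candidate if r]
    if len(rows) < min_rows:
        return False
    width = max(len(r) for r in rows)
    if width < min_cols:
        return False
    filled = sum(1 for r in rows for c in r if c is not None and str(c).strip())
    # Paragraphs misread as tables tend to produce grids that are mostly empty.
    return filled / (width * len(rows)) >= min_fill


def split_candidates(candidates):
    """Separate plausible tables from grids that are probably ordinary text."""
    tables, text_blocks = [], []
    for cand in candidates:
        if looks_like_table(cand):
            tables.append(cand)
        else:
            # Flatten the rejected grid back into plain text, one line per row.
            lines = [
                " ".join(str(c) for c in row if c is not None and str(c).strip())
                for row in cand
            ]
            text_blocks.append("\n".join(line for line in lines if line))
    return tables, text_blocks


# Made-up candidates (assumption): one real table and one misread paragraph.
real = [["Name", "Qty"], ["Bolt", "4"], ["Nut", "7"]]
fake = [["This paragraph was", None, None], ["misread as a table", None, None]]
tables, text_blocks = split_candidates([real, fake])
```

The fill-ratio test is only a heuristic; sparse but genuine tables would need a lower `min_fill` or an additional check.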
