#Extract text using X and Y coordinates


kind wadi
#

Hello everyone, sorry for bothering you.
I'm having a problem and need some help, please.
I'm using a library called ng2-pdf-viewer to display PDF documents in my web application.
My requirement is to upload a PDF file and extract text from specific areas selected by the user.
Searching the internet, I found a way to draw a rectangle and store its coordinates, but these rectangles are purely visual: they don't interact with the PDF.
So my question is:
Is there a way to extract the text inside these rectangles, given that I have saved their X and Y coordinates (second image)?

Please help

twilit orchid
#

Don't know that library, but a quick glance at its README gave me this

opaque hollow
#

My gut feeling is that this is not possible. As far as I know, there is no metadata that tells you where the text starts and ends based on coordinates.
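
For PDFs that do carry a text layer, there may be something to work with: ng2-pdf-viewer is built on PDF.js, and PDF.js text items (from `page.getTextContent()`) carry a string plus a position. Below is a minimal sketch of filtering such items by a rectangle. The helper name `textInRect` and the `PdfRect` type are hypothetical, and it assumes unrotated text and a rectangle already converted to PDF page coordinates (for example with the viewport's `convertToPdfPoint`), not raw screen pixels.

```typescript
// Minimal shape of a PDF.js text item; only the fields used here are declared.
interface TextItem {
  str: string;
  transform: number[]; // [a, b, c, d, x, y]; x and y locate the item in PDF units
  width: number;
  height: number;
}

// Hypothetical rectangle type, in PDF page coordinates (origin at bottom-left).
interface PdfRect {
  x1: number;
  y1: number;
  x2: number;
  y2: number;
}

// Hypothetical helper: join the text of every item whose box overlaps the rectangle.
// Assumes unrotated, horizontal text.
function textInRect(items: TextItem[], rect: PdfRect): string {
  // Normalize so the rectangle works regardless of drag direction.
  const left = Math.min(rect.x1, rect.x2);
  const right = Math.max(rect.x1, rect.x2);
  const bottom = Math.min(rect.y1, rect.y2);
  const top = Math.max(rect.y1, rect.y2);

  return items
    .filter((item) => {
      const x = item.transform[4];
      const y = item.transform[5];
      // Standard axis-aligned box overlap test.
      return x < right && x + item.width > left && y < top && y + item.height > bottom;
    })
    .map((item) => item.str)
    .join(" ");
}
```

Items come back as whole runs of text, so a rectangle that only clips part of a run would still return the full run; splitting inside a run would need extra work.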

rigid tusk
#

Also keep in mind that some PDFs are just images.

kind wadi
#

it's a bit tricky but I think it will work

opaque hollow
#

Usually those PDF renderers generate a canvas. You can then crop the canvas and send the cropped image to an OCR library.
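
A rough sketch of the cropping step, assuming the rectangle was recorded in CSS pixels relative to the canvas's top-left corner. The canvas backing store is often larger than its displayed size, so the coordinates need scaling first. The name `toCropRect` and the types are made up for illustration.

```typescript
interface Point {
  x: number;
  y: number;
}

interface Size {
  width: number;
  height: number;
}

interface CropRect {
  left: number;
  top: number;
  width: number;
  height: number;
}

// Hypothetical helper: turn two drag corners (CSS pixels) into a crop
// rectangle in canvas backing-store pixels.
function toCropRect(a: Point, b: Point, displayed: Size, backing: Size): CropRect {
  // Ratio between the canvas's real pixel size and its on-screen size.
  const scaleX = backing.width / displayed.width;
  const scaleY = backing.height / displayed.height;

  return {
    // min/abs make the result independent of the drag direction.
    left: Math.round(Math.min(a.x, b.x) * scaleX),
    top: Math.round(Math.min(a.y, b.y) * scaleY),
    width: Math.round(Math.abs(a.x - b.x) * scaleX),
    height: Math.round(Math.abs(a.y - b.y) * scaleY),
  };
}
```

The result could then feed the nine-argument form of `drawImage` on a second canvas, `ctx.drawImage(source, left, top, width, height, 0, 0, width, height)`, and that second canvas is what goes to the OCR library.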

azure geyser
#

If the user is selecting text maybe use getSelection().toString()?

kind wadi
#

I found a library called tesseract.js that works as OCR

azure geyser
#

you can have as many text selections as you want

#

Keep in mind that this only selects text. If the user clicks elsewhere, the selections go away, so you probably want to do some DOM transform to wrap the selected text in a <mark> tag or something.

azure geyser
#

that's fine, selections can span multiple elements

kind wadi
#

Sorry for being so repetitive,
but how can I locate those selections in the HTML in order to wrap them?

azure geyser
#

When you use the Selection/Range API, you can extract the contents of the selection, iterate over them, and transform the DOM.
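
Something like this, as a minimal sketch. It relies only on `createElement`, `extractContents`, `appendChild` and `insertNode`, which the browser's `Document`, `Range` and `Node` objects provide. The small `...Like` interfaces and the name `wrapRangeInMark` are made up here so the snippet stays self-contained.

```typescript
// Minimal structural types; the browser's Document, Range and Node
// objects have these methods, so they can be passed in directly.
interface NodeLike {
  appendChild(child: any): unknown;
}

interface DocumentLike {
  createElement(tag: string): NodeLike;
}

interface RangeLike {
  extractContents(): unknown;
  insertNode(node: any): void;
}

// Hypothetical helper: move the range's contents into a new <mark>
// element and put that element back where the contents used to be.
function wrapRangeInMark(doc: DocumentLike, range: RangeLike): NodeLike {
  const mark = doc.createElement("mark");
  // extractContents() removes the selected nodes from the document
  // and returns them as a fragment.
  mark.appendChild(range.extractContents());
  // insertNode() places the <mark> at the start of the now-collapsed range.
  range.insertNode(mark);
  return mark;
}
```

In the browser this would be called with something like `wrapRangeInMark(document, selection.getRangeAt(0))` after checking that `window.getSelection()` is not null and `rangeCount > 0`. When a selection spans several block elements, wrapping everything in one `<mark>` can produce awkward markup, so iterating over the extracted nodes and wrapping each text node separately may be the better route.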

kind wadi
#

Ohh I see
This is exactly what I'm looking for

I will do some tests and read the documentation for the Range API
Thank you, your information helps me a lot