Extracting PDF text by subjects [closed] - python

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 6 years ago.
I'm trying to extract the text from a PDF by subjects.
In order to do so, I'm trying to identify the labels/headlines in the PDF.
So far I have converted the PDF into an XML file, in order to get at the text data more easily, and then I use the font/size of each line to decide whether it is a label or not.
The main problem with this approach is that each PDF can have its own layout, and what works for one PDF will not necessarily work for another.
I would be glad if someone has an idea how to overcome this problem, so that it is possible to extract the labels (text by subjects) without depending on the particular PDF (most of the PDFs I work with are articles/books).
Different ways to extract text by subjects are also welcome.
(As the tag indicates, I'm trying to do this in Python)
Edit:
At the moment I'm doing two things:
check the font of each line
check the text size of each line
I concluded that regular text will have the most lines in its font (there are more than 10x as many lines in this font as in any other), and that the median text size will be the size of the regular text.
From the first I can remove all the regular text, and from the second I can take all text that is bigger; all the labels will be in this list.
The problem now is to extract only the labels from this list, since there is usually text that is bigger than the regular text yet isn't a label.
I tried to use the number of times each font appears in the text to identify the label fonts, but without much success; the counts vary from PDF to PDF.
I'm looking for ideas on how to solve this problem, or for a tool that can do it more easily.
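Not a definitive answer, just a minimal sketch of the heuristic described above, using PyMuPDF (pip install pymupdf) as one possible extractor; the 1.15 size threshold and the file name are assumptions:

```python
import statistics

import fitz  # PyMuPDF


def heading_candidates(pdf_path):
    doc = fitz.open(pdf_path)
    lines = []  # (text, max span size, first span font) per line
    for page in doc:
        for block in page.get_text("dict")["blocks"]:
            for line in block.get("lines", []):  # image blocks have no lines
                spans = line.get("spans", [])
                text = "".join(s["text"] for s in spans).strip()
                if text:
                    lines.append((text, max(s["size"] for s in spans), spans[0]["font"]))
    # The median span size stands in for the regular (body) text size.
    body_size = statistics.median(size for _, size, _ in lines)
    # Anything noticeably larger than the body text is a label candidate.
    return [(t, s, f) for t, s, f in lines if s > body_size * 1.15]


for text, size, font in heading_candidates("article.pdf"):
    print(f"{size:5.1f}  {font:<20}  {text}")
```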

I would suggest studying many PDFs and writing down the label text size of each. Then you can average the top 5 highest sizes and the top 5 lowest sizes, make a range between them, and check whether a piece of text falls within that size range.
This method will not always work, but it will cover the majority of PDFs.
(The more PDFs you study, the better.)
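A sketch of that range idea; the label_sizes values below are hypothetical stand-ins for the sizes you would collect from the PDFs you study:

```python
# Label text sizes collected from studied PDFs (hypothetical sample data).
label_sizes = sorted([14.5, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 22.0])

low = sum(label_sizes[:5]) / 5    # average of the 5 smallest label sizes
high = sum(label_sizes[-5:]) / 5  # average of the 5 largest label sizes


def looks_like_label(size):
    # True when a text size falls inside the observed label-size range.
    return low <= size <= high
```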

Related

Is there a way to automatically extract chapters from Pdf files?

Currently I am searching for a way to efficiently extract certain sections from around 500 PDF files.
Concretely, I have around 500 annual reports from which I want to extract the Management Report section.
I tried using a regular expression with the heading name as the start position and the following chapter as the end position, but it (of course) always just matches the table of contents.
I am happy for any suggestions.
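One hedged tweak to the regex approach described in the question: since the table of contents usually comes first, take the last occurrence of each heading instead of the first. The heading names used in the comment are assumptions:

```python
import re


def extract_section(text, start_heading, end_heading):
    # e.g. start_heading="Management Report",
    #      end_heading="Consolidated Financial Statements" (assumed names).
    # Take the LAST occurrence of each heading: the first one is usually
    # just the table-of-contents entry.
    starts = [m.start() for m in re.finditer(re.escape(start_heading), text)]
    ends = [m.start() for m in re.finditer(re.escape(end_heading), text)]
    if starts and ends and starts[-1] < ends[-1]:
        return text[starts[-1]:ends[-1]]
    return None
```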

How to extract particular block of text from pdf file with Python?

I wanted to extract blocks of text from a PDF as images (here, the questions and options in the PDF) and then store each block of text as a separate image.
The length of the questions varies a lot, so using PyPDF2 to split the PDF at regular intervals is out of the question.
Any suggestions? I have read a few posts that mention using OCR to get the question numbers and then splitting, but I didn't completely get what they were trying to say. Is there any easier method?
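A sketch of the OCR-splitting idea those posts seem to describe: locate the vertical position of each question number on a rendered page, then crop between consecutive numbers. The page file name and the question-number regex are assumptions:

```python
import re

import pytesseract
from PIL import Image

page = Image.open("exam_page.png")  # one page pre-rendered from the PDF
data = pytesseract.image_to_data(page, output_type=pytesseract.Output.DICT)

# y-coordinates of words that look like question numbers, e.g. "1." or "2)".
tops = sorted(
    data["top"][i]
    for i, word in enumerate(data["text"])
    if re.fullmatch(r"\d+[.)]", word.strip())
)
tops.append(page.height)

# Crop the strip between each question number and the next.
for n, (y0, y1) in enumerate(zip(tops, tops[1:]), start=1):
    page.crop((0, y0, page.width, y1)).save(f"question_{n}.png")
```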

How to extract charts/tables/graphs from PDF files using Python?

I searched quite a bit, but since I couldn't find a solution for this kind of problem, I am posting a clear question about it. Most answers cover image/text extraction, which is comparatively easier.
I have a requirement to extract tables and graphs from PDFs, as text (CSV) and images respectively.
Can anyone help me with efficient Python 3.6 code to solve this?
So far I have managed to extract JPEGs using startmark = b"\xff\xd8" and endmark = b"\xff\xd9", but not all tables and graphs in a PDF are plain JPEGs, so my code fails badly at that.
For example, from the link below I want to extract the table on page 11 and the graphs on page 12 as images, or in any other feasible form. How do I go about it?
https://hartmannazurecdn.azureedge.net/media/2369/annual-report-2017.pdf
For extracting tables you can use camelot.
Here is an article about it.
For images, I've found this question and answer: Extract images from PDF without resampling, in python?
Try using PyMuPDF (https://github.com/pymupdf/PyMuPDF/tree/1.18.3) for an amalgamation of text, bars, lines, and axes. It has many extra utilities.
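A hedged sketch combining the two suggestions: camelot (pip install camelot-py[cv]) for the tables and PyMuPDF for the embedded images. The file name follows the linked report and the page numbers follow the question; note that charts drawn as vector graphics will not appear as embedded images:

```python
import camelot
import fitz  # PyMuPDF

# Tables on page 11 -> CSV files.
tables = camelot.read_pdf("annual-report-2017.pdf", pages="11")
for i, table in enumerate(tables):
    table.to_csv(f"table_p11_{i}.csv")

# Embedded raster images on page 12 -> image files (page index is 0-based).
doc = fitz.open("annual-report-2017.pdf")
for xref, *_ in doc[11].get_images(full=True):
    img = doc.extract_image(xref)
    with open(f"page12_img{xref}.{img['ext']}", "wb") as fh:
        fh.write(img["image"])
```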

How to do OCR for PDF text extraction WHILE maintaining text structure (header/subtitle/body)

I have been endlessly searching for a tool that can extract text from a PDF while maintaining structure. That is, given a text like this:
Title
Subtitle1
Body1
Subtitle2
Body2
OR
Title
Subtitle1. Body1
Subtitle2. Body2
I want a tool that can output a list of titles, subtitles, and bodies. Or, if anybody knows how to achieve this directly, that would also be useful :)
This would be easier if these three categories were in the same format, but sometimes the subtitles can be bold, italic, underlined, or a random combination of the three. The same goes for the titles. The problem with simply parsing HTML/PDF/Docx is that these texts follow no standard, so quite often we encounter sentences divided across several tags (in the case of HTML) that are really hard to parse. As you can see, the subtitles are not always above a given paragraph, and sometimes they are in bullet points. There are so many possible combinations of formatting...
So far I have encountered similar inquiries here using Tesseract and here using OpenCV, yet none of them quite answers my question.
I know that there are some machine learning tools to extract "Table of Contents" sections from scientific papers, but that also does not cut it.
Does anyone know of a package/library, or if such thing has been implemented yet? Or does anyone know an approach to solve this problem, preferably in Python?
Thank you!
Edit:
The documents I am referring to are 10-Ks from companies, such as this one: https://www.sec.gov/Archives/edgar/data/789019/000119312516662209/d187868d10k.htm#tx187868_10
And say I want to extract Item 7 in a programmatic and structured way, as I mentioned above. But not all of them are standardized enough for HTML parsing. (The PDF document is just this HTML saved as a PDF.)
There are certain tools that can accomplish your requested feature up to a certain extent. By "a certain extent", I mean that the heading and title font properties will be retained after the OCR conversion.
Take a look at Adobe's Document Cloud platform. It is still in the launch stage and will be launching in early 2020. However, developers can have early access by signing up for the early access program. All the information is available in the following link:
https://www.adobe.com/devnet-docs/dcsdk/servicessdk/index.html
I have personally tried out the service, and the outputs seem promising. All headings and titles get recognised as they are in the input document. The micro-service that offers this exact feature is the "ExportPDF" service, which converts a scanned PDF document to a Microsoft Word document.
Sample code is available at: https://www.adobe.com/devnet-docs/dcsdk/servicessdk/howtos.html#export-a-pdf
There is a lot of coding to do here, but let me give you a description of what I would do in Python. This is based on there being some structure in terms of font size and style:
Use the Tesseract OCR software (open source, free), use OEM 1, PSM 11 in Pytesseract
Preprocess your PDF to an image and apply other relevant preprocessing
Get the output as a dataframe and combine individual words into lines of words by word_num
Compute the thickness of every line of text (by the use of the image and tesseract output)
Convert image to grayscale and invert the image colors
Perform Zhang-Suen thinning on the selected area of text on the image (opencv contribution: cv2.ximgproc.thinning)
Sum where there are white pixels in the thinned image, i.e. where values are equal to 255 (white pixels are letters)
Sum where there are white pixels in the inverted image
Finally compute the thickness: (sum_inverted_pixels - sum_skeleton_pixels) / sum_skeleton_pixels (sometimes there will be a zero division error; check when the sum of the skeleton is 0 and return 0 instead)
Normalize the thickness by minimum and maximum values
Get headers by applying a threshold for when a line of text is bold, e.g. 0.6 or 0.7
To distinguish between a title and a subtitle, you have to rely either on enumerated titles and subtitles or on the sizes of the title and subtitle.
Calculate the font size of every word by converting height in pixels to height in points
The median font size becomes the local font size for every line of text
Finally, you can categorize titles, subtitles, and everything in between can be text.
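Not the answer author's exact code, just a minimal sketch of the thickness steps above, assuming OpenCV with the ximgproc contribution module (pip install opencv-contrib-python) and a grayscale crop of one line of text as input:

```python
import cv2
import numpy as np


def line_thickness(gray_line_crop):
    # Invert so letters become white (255) on a black background.
    _, inverted = cv2.threshold(
        gray_line_crop, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU
    )
    # Zhang-Suen thinning reduces each stroke to a 1-pixel-wide skeleton.
    skeleton = cv2.ximgproc.thinning(inverted)
    sum_skeleton = np.count_nonzero(skeleton == 255)
    sum_inverted = np.count_nonzero(inverted == 255)
    if sum_skeleton == 0:
        return 0.0  # guard against the zero-division case mentioned above
    return (sum_inverted - sum_skeleton) / sum_skeleton
```

After normalizing these values over all lines, a threshold such as 0.6 or 0.7 flags the bold header candidates.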
Note that there are ways to detect tables, footers, etc., which I will not dive deeper into here. Look for research papers like the ones below.
Relevant research papers:
An Unsupervised Machine Learning Approach to Body Text and Table of Contents Extraction from Digital Scientific Articles. DOI: 10.1007/978-3-642-40501-3_15.
Image-based logical document structure recognition.
I did some research and experiments on this topic, so let me try to give a few of the hints I took from the job, which is still far from perfect.
I haven't found any reliable library to do it, although, given the time and possibly the competence (I am still relatively inexperienced at reading other people's code), I would have liked to check out some of the work out there, one project in particular (parsr).
I did reach some decent results in header/title recognition by applying filters to Tesseract's hOCR output. It requires extensive work, i.e.:
1. OCR the PDF.
2. Properly parse the resulting hOCR, so that you can access its paragraphs, lines, and words.
3. Scan each line's height by splitting its bounding box.
4. Scan each word's width and height, again splitting bounding boxes, and keep track of them.
5. Word heights are needed to intercept false positives, because line heights are sometimes inflated.
6. Find the most frequent line height, so that you have a baseline for the general base font.
7. Start by identifying the lines whose height is greater than the baseline found in #6.
8. Eliminate false positives by checking whether the max height of a line's words matches the line's own height; otherwise, use the max word height of each line to compare against the #6 baseline.
9. Now you have a few candidates, and you want to check that:
a. the candidate line does not belong to a paragraph whose other lines do not have the same height, unless it's the first line (sometimes Tesseract joins the heading with the paragraph);
b. the line does not end with "." or "," or other markers that rule out a title/heading.
The list runs quite a bit longer. E.g., you might also want to apply other criteria,
like comparing the widths of the same word: if in a line you find more than a certain share of words (I use >= 50%) that are larger than average, compared to the same words elsewhere in the document, you almost certainly have a good candidate header or title. (Titles and headers typically contain words that also appear elsewhere in the document, often multiple times.)
Another criterion is checking for all-caps lines, and a reinforcement can be single-liners (lines that belong to a paragraph with just one line).
Sorry I can't post any code (*), but hopefully you get the gist.
It's not exactly an easy feat and requires a lot of work if you don't use ML. I'm not sure how much ML would speed it up either, because there is a ton of PDFs out there, and the big players (Adobe, Google, Abbyy, etc.) have probably been training their models for quite a while.
(*) My code is in JS, and it's deeply intertwined with a large converting application, which so far I can't publish as open source. I am reasonably sure you can do the job in Python, although the JS DOM manipulation might be somewhat of an advantage there.
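Since the answer's own code isn't available, here is a minimal Python sketch of steps 1, 6, and 7 above, using pytesseract's dataframe output (which requires pandas) instead of raw hOCR; the file name and the 1.2 threshold are assumptions:

```python
import pytesseract
from PIL import Image

img = Image.open("page.png")  # one pre-rendered PDF page (step 1)
data = pytesseract.image_to_data(
    img, config="--oem 1 --psm 11", output_type=pytesseract.Output.DATAFRAME
)
words = data[(data.conf > 0) & data.text.notna()]
words = words.assign(text=words.text.astype(str))

# Approximate each line's height with its tallest word (steps 3-5).
lines = words.groupby(["block_num", "par_num", "line_num"]).agg(
    text=("text", " ".join), height=("height", "max")
)

baseline = lines.height.mode().iloc[0]  # step 6: most frequent line height
candidates = lines[lines.height > baseline * 1.2]  # step 7
print(candidates.text.tolist())
```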

Using set of images as an input

I am trying to take a set of tagged images as input and remove noisy tags from each image. I have written Python code that takes a text file as input, each line of which corresponds to the tags of one image, and removes the noisy tags from each line. I have used the structure and similarity measures in WordNet to do so.
But I don't know how to take a set of images as input. Could someone please tell me how to do this?
Well, I suppose you take a bunch of image paths as strings; you can get them as command-line arguments or read them from stdin. Then you just have to open the files and do whatever you want with them.
Does that answer your question?
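A minimal sketch of that suggestion, assuming Pillow for opening the images; the tag-denoising step itself is left as a placeholder:

```python
import sys

from PIL import Image

# Image paths from command-line arguments, falling back to stdin.
paths = sys.argv[1:] or [line.strip() for line in sys.stdin if line.strip()]

for path in paths:
    img = Image.open(path)
    print(path, img.size)  # replace with the WordNet tag-denoising step
```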
