Is there a way to automatically extract chapters from PDF files? - python

Currently I am searching for a way to efficiently extract certain sections from around 500 PDF files. Concretely, I have around 500 annual reports from which I want to extract the Management Report section.
I tried using a regular expression with the heading name as the start position and the heading of the following chapter as the end position, but it (of course) always just matches the entries in the table of contents.
I am happy for any suggestions.
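One approach that usually gets around the table-of-contents problem is to use the PDF bookmarks instead of (or before) a regex. Below is a minimal sketch using PyMuPDF; the heading text and file name are placeholders, and the section boundary is approximated at page granularity:

import re
import fitz  # PyMuPDF (pip install pymupdf)

def extract_section(path, heading="Management Report"):
    doc = fitz.open(path)
    toc = doc.get_toc()  # list of [level, title, page] bookmark entries
    for i, (level, title, page) in enumerate(toc):
        if heading.lower() in title.lower():
            # The section ends where the next bookmark at the same
            # (or a higher) level begins.
            end = next((p for lvl, _, p in toc[i + 1:] if lvl <= level),
                       doc.page_count + 1)
            return "".join(doc[p].get_text() for p in range(page - 1, end - 1))
    # Fallback for files without bookmarks: take the *last* occurrence of
    # the heading, since the first one is usually the table of contents.
    # (You would still need to cut the text off at the next chapter heading.)
    text = "".join(page.get_text() for page in doc)
    matches = list(re.finditer(heading, text, re.IGNORECASE))
    return text[matches[-1].end():] if matches else None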

Related

How to extract data from messy PDF file with no standard formatting?

I am working on this PDF file to parse the tabular data out of it. I was hoping to use tabula or PyPDF2 to extract the tables, but the data in the PDF is not stored in tables, so I chose pdfplumber to extract the text instead. So far, I am able to read the text line by line, but I cannot figure out a universal pattern that I can use to extract the pricing-list rows, which I would then store in a pandas dataframe and write to an Excel file.
Can you help me decide whether I should construct a regular expression, or is there anything else I can use to extract the pricing list from this PDF? Since I cannot think of any particular regular expression that would fit the messy nature of the data inside the PDF, is there a better approach to take? Or is it simply not possible?
Code
Using the following code, I am able to extract all lines of text, but the problem is that one price entry is spread across two rows. Given that the current row lists most of the details about an entry, how can I decide whether the previous or the next row also holds information related to the current entry?
If I could somehow figure that out, what would be the right approach to deal with the column values? There can be 6-13 of them per line, so how can I decide whether a column value resides at a particular position in the current line?
import pdfplumber as scrapper

text = []
with scrapper.open('./report.pdf') as pdf:
    for page in pdf.pages:
        # Collect the raw text of every page, one string per page.
        text.append(page.extract_text())
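One common heuristic for the two-row problem (a sketch, not a guaranteed fit for this PDF) is to define a pattern that marks the start of an entry and treat every non-matching line as a continuation of the previous entry. The item-code pattern below is an assumption; adjust it after inspecting the extracted lines:

import re

ENTRY_START = re.compile(r"^\s*\d{4,}")  # hypothetical item-code pattern

def merge_rows(lines):
    entries = []
    for line in lines:
        if ENTRY_START.match(line) or not entries:
            entries.append(line)       # line opens a new record
        else:
            entries[-1] += " " + line  # continuation of the previous record
    return entries

# Usage: merge_rows(text[0].split("\n")) for the first page's lines.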
The PDF file I am working with:
https://drive.google.com/file/d/1GtjBf9FcKJCOJVNcGA9mvAshJ6t0oFca/view?usp=sharing
Sample pictures demonstrating which data should fit in which fields were attached to the original question.

Extraction of financial statements from PDF reports

I have been trying to pull financial statements embedded in annual-report PDFs and export them to Excel/CSV format using Python, but I am encountering some problems:
1. A specific financial statement can be on any page of the report. If I were to process hundreds of PDFs, I would have to specify page numbers, which takes a lot of time. Is there any way for the scraper to know where the exact statement is?
2. Some statements span multiple pages, and the end result after scraping such a PDF isn't what I want.
3. Different annual reports have different financial statement formats. Is there any way to process them and convert them to a specific standard format?
I would also appreciate it if anyone who has done something like this could share examples.
P.S. I am working with Python and have used tabula and Camelot.
I had a similar case where the problem was to extract specific form information from PDFs (name, date of birth, and so on). I used the Tesseract open-source software with pytesseract to perform OCR on the files. Since I did not need the whole PDFs, but only specific information from them, I designed an algorithm to find that information: in my case I used simple heuristics (specific fields, specific line numbers and some other domain-specific cues), but you can also take a machine-learning approach and train a classifier that can find the needed text parts. You could use domain-specific heuristics here as well, because I am sure that a financial statement has special vocabulary or text markers which indicate its beginning and end.
I hope I could at least give you some ideas on how to approach the problem.
P.S.: With Tesseract you can also process multi-page PDFs. Regarding 3): a machine-learning approach would need some samples to learn a good generalization of what a financial statement may look like.
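For illustration, a minimal sketch of the OCR step described above, assuming the tesseract binary, pytesseract, pdf2image and poppler are installed (the file name is a placeholder):

import pytesseract
from pdf2image import convert_from_path

# Rasterize each PDF page, then OCR it; this handles multi-page PDFs.
pages = convert_from_path("report.pdf", dpi=300)  # one PIL image per page
text = "\n".join(pytesseract.image_to_string(page) for page in pages)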

How to do OCR for PDF text extraction WHILE maintaining text structure (header/subtitle/body)

I have been endlessly searching for a tool that can extract text from a PDF while maintaining structure. That is, given a text like this:
Title
Subtitle1
Body1
Subtitle2
Body2
OR
Title
Subtitle1. Body1
Subtitle2. Body2
I want a tool that can output a list of titles, subtitles and bodies. Or, if anybody knows how to do this, that would also be useful :)
This would be easier if these 3 categories were in the same format, but sometimes the subtitles can be bold, italic, underlined, or a random combination of the three. Same for the titles. The problem with simple parsing of HTML/PDF/Docx is that these texts have no standard, so quite often we encounter sentences divided across several tags (in the case of HTML), which makes them really hard to parse. As you can see, the subtitles are not always above a given paragraph and are sometimes in bullet points. So many possible combinations of formatting...
So far I have encountered similar inquiries here using Tesseract and here using OpenCV, yet none of them quite answers my question.
I know that there are some machine learning tools to extract "Table of Contents" sections from scientific papers, but that also does not cut it.
Does anyone know of a package/library, or if such thing has been implemented yet? Or does anyone know an approach to solve this problem, preferably in Python?
Thank you!
Edit:
The documents I am referring to are 10-Ks from companies, such as this one: https://www.sec.gov/Archives/edgar/data/789019/000119312516662209/d187868d10k.htm#tx187868_10
And say I want to extract Item 7 in a programmatic and structured way, as I mentioned above. But not all of them are standardized enough for HTML parsing. (The PDF document is just this HTML saved as a PDF.)
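For the 10-K case specifically, one workable trick after extracting the plain text is to take the last occurrence of "Item 7" and slice up to the last occurrence of "Item 7A". This is a sketch; it assumes each heading occurs once in the table of contents and once in the body, which is common but not guaranteed:

import re

def item_7(text):
    # First occurrences are usually the table of contents, so use the last.
    starts = [m.start() for m in re.finditer(r"Item\s+7\.", text)]
    ends = [m.start() for m in re.finditer(r"Item\s+7A\.", text)]
    if starts and ends and starts[-1] < ends[-1]:
        return text[starts[-1]:ends[-1]]
    return None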
There are certain tools that can accomplish your requested feature up to a certain extent. By "a certain extent", I mean that the heading and title font properties will be retained after the OCR conversion.
Take a look at Adobe's Document Cloud platform. It is still in the launch stage and will be launching in early 2020. However, developers can have early access by signing up for the early access program. All the information is available in the following link:
https://www.adobe.com/devnet-docs/dcsdk/servicessdk/index.html
I have personally tried out the service and the outputs seem promising. All headings and titles get recognised as they appear in the input document. The microservice that offers this exact feature is the "ExportPDF" service, which converts a scanned PDF document to a Microsoft Word document.
Sample code is available at: https://www.adobe.com/devnet-docs/dcsdk/servicessdk/howtos.html#export-a-pdf
There is a lot of coding to do here, but let me give you a description of what I would do in Python. This is based on the document having some structure in terms of font size and style:
1. Use the Tesseract OCR software (open source, free); use OEM 1, PSM 11 in pytesseract.
2. Preprocess your PDF into images and apply any other relevant preprocessing.
3. Get the output as a dataframe and combine individual words into lines of words by word_num.
4. Compute the thickness of every line of text (using the image and the Tesseract output):
a. Convert the image to grayscale and invert the image colors.
b. Perform Zhang-Suen thinning on the selected area of text in the image (OpenCV contribution: cv2.ximgproc.thinning).
c. Sum the white pixels in the thinned image, i.e. where values are equal to 255 (white pixels are letters).
d. Sum the white pixels in the inverted image.
e. Finally, compute the thickness: (sum_inverted_pixels - sum_skeleton_pixels) / sum_skeleton_pixels. (There will sometimes be a zero-division error; check whether the sum of the skeleton is 0 and return 0 instead.)
f. Normalize the thickness by its minimum and maximum values.
5. Get headers by applying a threshold for when a line of text is bold, e.g. 0.6 or 0.7.
To distinguish between a title and a subtitle, you have to rely either on enumerated titles and subtitles or on the sizes of the titles and subtitles:
Calculate the font size of every word by converting its height in pixels to a height in points.
The median font size becomes the local font size for every line of text.
Finally, you can categorize titles, subtitles, and everything in between as body text.
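A condensed sketch of the thickness measure from step 4 (it assumes opencv-contrib-python for cv2.ximgproc.thinning; the input is a grayscale crop of a single line of text):

import cv2

def line_thickness(gray_line):
    # Invert so the letters become white (255) on a black background.
    _, inverted = cv2.threshold(gray_line, 0, 255,
                                cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)
    skeleton = cv2.ximgproc.thinning(inverted)  # Zhang-Suen by default
    sum_skeleton = (skeleton == 255).sum()
    sum_inverted = (inverted == 255).sum()
    if sum_skeleton == 0:
        return 0  # guard against division by zero on empty crops
    return (sum_inverted - sum_skeleton) / sum_skeleton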
Note that there are ways to detect tables, footers, etc., which I will not dive deeper into. Look for research papers like the ones below.
Relevant research papers:
An Unsupervised Machine Learning Approach to Body Text and Table of Contents Extraction from Digital Scientific Articles. DOI: 10.1007/978-3-642-40501-3_15.
Image-based logical document structure recognition. DOI: 10.1007/978-3-642-40501-3_15.
I did some research and experiments on this topic, so let me try to give a few of the hints I took away from the job, which is still far from perfect.
I haven't found any reliable library to do it, although given the time, and possibly the competence (I am still relatively inexperienced at reading other people's code), I would have liked to check out some of the work out there, one project in particular (parsr).
I did reach some decent results in header/title recognition by applying filters to Tesseract's hOCR output. It requires extensive work, i.e.:
1. OCR the PDF.
2. Properly parse the resulting hOCR, so that you can access its paragraphs, lines and words.
3. Scan each line's height, by splitting their bounding boxes.
4. Scan each word's width and height, again splitting bounding boxes, and keep track of them.
5. Keep the heights around to intercept false positives, because line heights are sometimes inflated.
6. Find the most frequent line height, so that you have a baseline for the general base font.
7. Start by identifying the lines whose height is greater than the baseline found in #6.
8. Eliminate false positives by checking whether the max height among the line's words matches the line's own height; otherwise use the max word height of each line to compare against the #6 baseline.
9. Now you have a few candidates, and you want to check that:
a. The candidate line does not belong to a paragraph whose other lines do not respect the same height, unless it's the first line (sometimes Tesseract joins the heading with the paragraph).
b. The line does not end with "." or "," and possibly other markers that rule out a title/heading.
The list runs quite a bit longer. E.g., you might also want to apply some other criteria, like comparing word widths: if in a line you find more than a certain proportion of words (I use >= 50%) that are wider than average, compared to the same words elsewhere in the document, you almost certainly have a good candidate header or title. (Titles and headers typically contain words that also appear in the body of the document, often multiple times.)
Another criterion is checking for all-caps lines, and a reinforcement can be single-liners (lines that belong to a paragraph with just one line).
Sorry I can't post any code (*), but hopefully you get the gist.
It's not exactly an easy feat, and it requires a lot of work if you don't use ML. I'm not sure how much faster ML would make it either, because there's a ton of PDFs out there, and the big players (Adobe, Google, Abbyy, etc.) have probably been training their models for quite a while.
(*) My code is in JS, and it's seriously intertwined with a large converting application, which so far I can't publish as open source. I am reasonably sure you can do the job in Python, although the JS DOM manipulation might be somewhat of an advantage there.
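As a rough Python illustration of steps 1-7 (not the author's JS code; it assumes hOCR produced by, e.g., tesseract page.png out hocr, and uses bounding-box heights only):

import re
from collections import Counter
from bs4 import BeautifulSoup

soup = BeautifulSoup(open("out.hocr", encoding="utf-8"), "html.parser")
lines = []
for line in soup.select(".ocr_line"):
    # Bounding boxes live in the title attribute, e.g. "bbox 10 20 300 45; ...".
    x0, y0, x1, y1 = map(int, re.search(
        r"bbox (\d+) (\d+) (\d+) (\d+)", line["title"]).groups())
    lines.append((line.get_text(" ", strip=True), y1 - y0))

# Step 6: the most frequent line height approximates the base font.
baseline = Counter(h for _, h in lines).most_common(1)[0][0]
# Step 7: lines noticeably taller than the baseline are heading candidates.
candidates = [text for text, h in lines if h > baseline * 1.2]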

Searching for data in a PDF

I've got a PDF file that I'm trying to obtain specific data from.
I've been able to parse the PDF via PyPDF2 into one long string, but searching for specific data is difficult because of (I assume) formatting in the original PDF.
What I am looking to do is to retrieve specific known fields and the data that immediately follows them (as formatted in the PDF), and then store these in separate variables.
The PDFs are bills and hence are all presented in exactly the same way, with defined fields and images. So what I am looking to do is to extract these fields.
What would be the best way to achieve this?
I've got a PDF file that I'm trying to obtain specific data from.
In general, it is probably impossible (or extremely difficult), and details (that you don't mention) are very important. Study the complex PDF specification in detail. Notice that PDF is (more or less accidentally) Turing-complete, so your problem is undecidable in general, since it is equivalent to the halting problem.
For example, a normal human reader could read digits in the document whether they are encoded as text or as a JPEG image, etc., and in practice many PDF documents contain such data. Practically speaking, PDF is an output-only format, designed for screen display and printing, not for extracting data.
You need to understand how exactly that PDF file was generated (with what exact software, from what actual data). That could take a lot of time (maybe several years of full-time reverse-engineering work) without help.
A much better approach is to contact the person or entity providing that PDF file and negotiate some way of accessing the actual data (or at least get a detailed explanation of how that particular PDF file is generated). For example, if the PDF file is computed from some database, you would be better off accessing that database.
Perhaps using metadata or comments in your PDF file might help in guessing how it was generated.
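For example, a quick look at the metadata with PyPDF2 (which the question already uses) often reveals the producing software; a minimal sketch, assuming a recent PyPDF2 version and a placeholder file name:

from PyPDF2 import PdfReader

reader = PdfReader("bill.pdf")
print(reader.metadata)  # often includes /Producer and /Creator entries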
The source of the data might produce various kinds of PDF files. For example, my cheap scanner is able to produce PDF, but your program would have a hard time extracting numerical data from that kind of file (because such a PDF essentially wraps a pixelated image à la JPEG) and would need image-recognition techniques (i.e. OCR) to do so.

Python - Best way to parse specific, standardized information in PDF documents?

I am trying to parse these PDF "Arms Sale Notification" letters, found here:
http://www.dsca.mil/pressreleases/36-b/36b_index.htm
Here is a specific PDF document example, of a proposed arms sale to Oman:
http://www.dsca.mil/pressreleases/36-b/2013/Oman13-07.pdf
Since I have 600 of these documents, the information I want to extract from the example includes the country name (Oman), the list of articles to be sold ("AN/AAQ-24(V) Large Aircraft Infrared Countermeasures (LAIRCM) Systems"), the cost of the sale ("$100 million") and the primary contractor ("Northrop Grumman Corporation of Rolling Meadows, Illinois").
What sort of regular expressions or split() function specifications could I use to isolate these pieces of information from a document like this?
You need to read the converted text first to determine the regular expression; PDFs can be quirky about text conversion. I would recommend ReportLab over pyPDF as the PDF parsing library of choice.
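As an illustration of reading the converted text first, here is a sketch using PyPDF2 (which the question already mentions); the exact wording of the letters is an assumption, so inspect a few extracted texts and refine the patterns:

import re
from PyPDF2 import PdfReader

text = "\n".join(page.extract_text() or ""
                 for page in PdfReader("Oman13-07.pdf").pages)

# Hypothetical patterns; adjust to the phrasing actually used in the letters.
cost = re.search(r"estimated cost of \$[\d.]+ ?(?:million|billion)", text)
contractor = re.search(r"principal contractor will be ([^.]+)\.", text)
print(cost.group(0) if cost else "cost not found")
print(contractor.group(1) if contractor else "contractor not found")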
