Extraction of financial statements from PDF reports - Python

I have been trying to pull financial statements embedded in annual reports (PDF) and export them to Excel/CSV format using Python, but I am encountering some problems:
1. A specific financial statement can be on any page of the report. If I were to process hundreds of PDFs, I would have to specify page numbers manually, which takes a lot of time. Is there any way for the scraper to know where the exact statement is?
2. Some statements span multiple pages, and the end result after scraping the PDF isn't what I want.
3. Different annual reports have different financial statement formats. Is there any way to process them and convert them to a single standard format?
I would also appreciate it if anyone who has done something like this could share examples.
P.S. I am working with Python and have used tabula and Camelot.

I had a similar case where the problem was to extract specific form information from PDFs (name, date of birth, and so on). I used the open-source Tesseract engine with pytesseract to perform OCR on the files. Since I did not need the whole PDF, but only specific information from it, I designed an algorithm to find that information. In my case I used simple heuristics (specific fields, specific line numbers, and some other domain-specific rules), but you could also take a machine-learning approach and train a classifier that finds the needed text parts. Domain-specific heuristics should work well here too, because a financial statement has a special vocabulary and text markers that indicate where it begins and ends.
I hope this at least gives you some ideas on how to approach the problem.
P.S. Tesseract can also process multi-page PDFs. Regarding 3): a machine-learning approach would need a number of samples to learn a good generalization of what a financial statement may look like.
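For the heuristic approach, a minimal sketch might look like the following. It assumes pdfplumber for per-page text extraction alongside Camelot (which the question already uses); the marker list is purely illustrative and would need to be tuned to your reports:

```python
import camelot      # table extraction, as already used in the question
import pandas as pd
import pdfplumber   # per-page text extraction (assumed choice)

# Illustrative markers; real annual reports will need a tuned list.
MARKERS = ["consolidated balance sheet", "statement of financial position"]

def find_statement_pages(pdf_path):
    """Return 1-based page numbers whose text contains one of the markers."""
    pages = []
    with pdfplumber.open(pdf_path) as pdf:
        for i, page in enumerate(pdf.pages, start=1):
            text = (page.extract_text() or "").lower()
            if any(marker in text for marker in MARKERS):
                pages.append(i)
    return pages

def extract_statement(pdf_path, out_csv):
    pages = find_statement_pages(pdf_path)
    if not pages:
        return None
    # Camelot accepts a comma-separated page string, e.g. "87,88".
    tables = camelot.read_pdf(pdf_path, pages=",".join(map(str, pages)))
    # Concatenate tables that span several pages into one CSV.
    df = pd.concat([t.df for t in tables], ignore_index=True)
    df.to_csv(out_csv, index=False, header=False)
    return df
```

This only addresses points 1 and 2; mapping different statement layouts onto one standard format (point 3) still needs either per-publisher rules or a trained model.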

Related

OCR PDF image to Excel by template

I need to convert a lot of poor-quality scanned PDF tables to Excel tables. The only solution I can see is to train Tesseract or some other framework on pre-generated images (the tables in the PDFs are mostly identical). Is it realistic to reach around 70-80% accuracy with a home-grown setup, and what would you advise? I will appreciate any advice other than ABBYY FineReader or a similar solution (I tested it on my dataset; the result was poor and there are few opportunities for automation).
All table structures need to be correct in the result for further manual work.
You should use a PDF parser for that.
Here's the parsed result using Parsio (https://parsio.io). It looks correct to me. You can export the parsed data to Sheets / Excel / CSV / Zapier.
When the input image is of very poor quality, the dirt tends to get in the way of text recognition. This is exacerbated when looking for areas without dictionary entries, so pure numbers can be the worst type of text to train on, given every twist and turn that bad scanning produces.
If the electronic source before the manual stamping and scanning is available, it might be possible to merge the text with the distorted image, but that is a highly manual task which defeats the aim.
The documents need to be rescanned by a trained operator with a good eye for detail. That, with a decent OCR scanning device, will be faster than tuning images that are never likely to produce reasonably trustworthy output. There are too many cases of numeric failures that would make any single page worthless for reading or computation.
I recently scanned some accounts and spent more time checking and correcting than if it had been typed, but it needed to be a "legal" copy, which clearly it was not, as I did it after the event.
The best result I could squeeze from Adobe PDF to Excel was "Pants".
I made some improvements to image contrast and noise reduction (by hand). It had some effect, but nothing obvious.
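If you want to automate that kind of preprocessing instead of doing it by hand, a minimal sketch with OpenCV and pytesseract could look like this; the denoising and thresholding parameters are illustrative and would need tuning on your scans:

```python
import cv2
import pytesseract

def ocr_with_preprocessing(image_path):
    """Denoise and binarize a poor scan before handing it to Tesseract."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    # Non-local means denoising; the strength (30) is a guess to tune per scan.
    denoised = cv2.fastNlMeansDenoising(gray, None, 30, 7, 21)
    # Adaptive thresholding copes better with uneven lighting than a global cut-off.
    binary = cv2.adaptiveThreshold(denoised, 255,
                                   cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                   cv2.THRESH_BINARY, 31, 15)
    # --psm 6 assumes a single uniform block of text, reasonable for a table region.
    return pytesseract.image_to_string(binary, config="--psm 6")

print(ocr_with_preprocessing("scanned_table.png"))
```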

Split a pack of text files into multiple subsets according to the content of the files

I have a lot of PDF, DOC[X], TIFF, and other files (scans from a shared folder). Each file is converted into a pack of text files: one text file per page.
Each pack of files can contain multiple documents (for example, three contracts), and the document kind is not limited to contracts.
While processing a pack of files, I don't know what kinds of documents the current pack contains, and it is possible that one pack contains multiple document kinds (contracts, invoices, etc.).
I'm looking for possible approaches to solve this programmatically.
I tried to search for something like this but without any success.
UPD: I tried to create a binary classifier with scikit-learn and am now looking for another solution.
Since these are scans, at its core this sounds like something that could be approached with computer vision, although that is currently far above my own level of programming.
E.g. projects like SimpleCV may be a good starting point:
http://www.simplecv.org/
Or possibly you could get away with OCR-reading the scans and working from the contents. pytesseract seems popular for this type of task:
https://pypi.org/project/pytesseract/
However, that still leaves open how you would tell your program that this part of the pack is three separate contracts. Is there anything about these files in particular that makes this clear, e.g. "1 of 3" on the pages, a logo, or otherwise? That will be the main thing that determines how complex a problem you are trying to solve; a small sketch of that idea follows.
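As a minimal sketch of that marker-based idea (the "1 of N" regex and the file layout are assumptions, not something known about your data):

```python
import re
from pathlib import Path

# Hypothetical marker: pages carrying "1 of N" are treated as document starts.
FIRST_PAGE_MARKER = re.compile(r"\b(?:page\s*)?1\s+of\s+\d+\b", re.IGNORECASE)

def split_pack(page_files):
    """Group a pack of per-page text files into documents.

    page_files: paths to the page texts, already sorted in page order.
    Returns a list of documents, each a list of page paths.
    """
    documents = []
    for path in page_files:
        text = Path(path).read_text(errors="ignore")
        if FIRST_PAGE_MARKER.search(text) or not documents:
            documents.append([])        # marker found: start a new document
        documents[-1].append(path)
    return documents

pack = sorted(Path("pack_01").glob("page_*.txt"))
for i, doc in enumerate(split_pack(pack), start=1):
    print(f"document {i}: {len(doc)} pages")
```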
The best solution turned out to be a binary classifier (SGDClassifier) trained on the classes first-page and not-first-page. Each item in the dataset was trimmed to 100 tokens (words).
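A minimal sketch of that setup with scikit-learn; the toy data and feature choices here are only illustrative, not the original configuration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

def trim(text, n_tokens=100):
    """Keep only the first n_tokens words of a page, as described above."""
    return " ".join(text.split()[:n_tokens])

# Toy training data; in practice, load the per-page text files and their labels.
pages = [
    "CONTRACT AGREEMENT page 1 of 5 between the parties named below ...",
    "continued terms and conditions of the agreement ...",
    "INVOICE No. 123 page 1 of 2 issued to ...",
    "total amount due and payment instructions ...",
]
labels = [1, 0, 1, 0]   # 1 = first page of a document, 0 = continuation page

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    SGDClassifier(max_iter=1000, random_state=0),
)
model.fit([trim(p) for p in pages], labels)

# Every page predicted as a first page starts a new document in the pack.
new_pack = ["CONTRACT AGREEMENT page 1 of 3 ...", "continued terms ..."]
print(model.predict([trim(p) for p in new_pack]))
```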

Searching for data in a PDF

I've got a PDF file that I'm trying to obtain specific data from.
I've been able to parse the PDF via PyPDF2 into one long string, but searching for specific data is difficult because of, I assume, formatting in the original PDF.
What I am looking to do is retrieve specific known fields and the data that immediately follows them (as formatted in the PDF) and then store these in separate variables.
The PDFs are bills, and hence are all presented in exactly the same way, with defined fields and images. So what I am looking to do is extract these fields.
What would be the best way to achieve this?
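For bills with a fixed layout, one common approach is a regular expression per known field run over the extracted text. A minimal sketch, assuming hypothetical field labels "Account Number" and "Amount Due" (substitute the labels printed on your bills):

```python
import re
from PyPDF2 import PdfReader   # older PyPDF2 releases use PdfFileReader/extractText

def extract_fields(pdf_path):
    reader = PdfReader(pdf_path)
    text = "\n".join(page.extract_text() or "" for page in reader.pages)
    # Hypothetical field labels; replace with the ones on your bills.
    patterns = {
        "account_number": r"Account Number[:\s]+(\S+)",
        "amount_due": r"Amount Due[:\s]+\$?([\d,]+\.\d{2})",
    }
    return {name: (m.group(1) if (m := re.search(p, text)) else None)
            for name, p in patterns.items()}

print(extract_fields("bill.pdf"))
```

If the extracted string runs words together or reorders them because of the PDF's internal layout, a layout-aware extractor such as pdfplumber may give a cleaner string to run the same patterns over.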
"I've got a PDF file that I'm trying to obtain specific data from."
In general, this is probably impossible (or extremely difficult), and details (that you don't mention) matter a great deal. Study the complex PDF specification in detail. Notice that PDF is (more or less accidentally) Turing complete, so your problem is undecidable in general, since it is equivalent to the halting problem.
For example, a human reader could read digits in the document whether they are encoded as text or as a JPEG image, etc., and in practice many PDF documents contain that kind of data. Practically speaking, PDF is an output-only format, designed for screen display and printing, not for extracting data from it.
You need to understand how exactly that PDF file was generated (with what exact software, from what actual data). That could take a lot of time (maybe several years of full-time reverse-engineering work) without help.
A much better approach is to contact the person or entity providing that PDF file and negotiate some way of accessing the actual data (or at least get a detailed explanation of how that particular PDF file is generated). For example, if the PDF file is computed from some database, you would be better off accessing that database.
Perhaps the metadata or comments in your PDF file might help you guess how it was generated.
The source of the data might produce various kinds of PDF file. For example, my cheap scanner can produce PDF, but your program would have a hard time extracting numerical data from it (because that kind of PDF essentially wraps a pixelated image à la JPEG) and would need image-recognition techniques (i.e. OCR) to do so.

Sentiment analysis of various lines of data

I'm new to programming and do not have much experience yet. I understand some Python code, but not in detail.
I have an Excel file containing log entries of problems people encountered. The description of each problem is pasted in as an email (so it's a block of text). I want to analyze all of these texts (almost 1,000 rows in Excel) at once, and I think Python can do this.
The type of analysis I want to do is sentiment analysis (positive, neutral, negative), or I want to extract the main problem from the text. I don't know if the second one is possible.
I copied the emails listed in the Excel file to a .txt file, so now every line is one message. How can I use Python to analyze every single line as one message and show me the sentiment or the main problem?
I'd appreciate the help.
Sentiment analysis is a fairly large problem in computer science and computational linguistics. How specific do you want to get?
I'd recommend looking into Text-Processing for simple sentiment analysis.
Their API docs are here: http://text-processing.com/docs/sentiment.html. It returns simple pos and neg scores for your text.
If you want anything more specific, I'd recommend looking into IBM Watson, specifically Natural Language Understanding: https://www.ibm.com/watson/developercloud/natural-language-understanding.html
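A minimal sketch of calling that API for each line of the .txt file; the endpoint and field names follow their docs, and the free tier is rate-limited, so check the terms before sending all 1,000 lines:

```python
import requests

API_URL = "http://text-processing.com/api/sentiment/"   # endpoint per their docs

def sentiment(text):
    """Return the label ('pos', 'neg' or 'neutral') for one message."""
    response = requests.post(API_URL, data={"text": text}, timeout=10)
    response.raise_for_status()
    return response.json()["label"]

with open("messages.txt", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if line:
            print(sentiment(line), "\t", line[:60])
```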

split a table in an image into rows by whitespace using computer vision applications

I am trying to solve what I have realized is quite a hard problem, due to my lack of expertise in the subject. Suppose I have an image of a table with 3 rows and 5 columns. Each row contains text (let's assume only English for now) or numbers (standard Indo-Arabic numerals). There is nothing but whitespace between the columns and between the rows. Now, assuming all rows and all columns are aligned, my task is to get an algorithm to recognize and extract each row from the document (I don't know if I'm articulating this well enough).
Could someone suggest a good starting point (a library, a similar example, a textbook chapter that deals with something like this, etc.) for me to get started?
My background is data science, but I have just never been exposed to computer vision.
Any help would be appreciated.
You should start off with OpenCV, as Racialz suggested. This library contains a Hough lines/Hough transform method, which should be the primary and easiest way for you to find and crop text from table sections. People use this algorithm for many different line-finding tasks (like THIS or THIS), but your task should be easier, because your lines should be much clearer and simpler than in those examples. After you do the extraction, you will then need to OCR the text; for this I would suggest the Tesseract OCR engine. It is free, really easy to use, gives pretty decent results, and allows you to train it on specific types of letters.
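Because the question's table has only whitespace between rows (no ruling lines for a Hough transform to detect), a simpler complementary approach is a horizontal projection profile: sum the ink in each pixel row and cut wherever the sum drops to zero. A minimal sketch with OpenCV and pytesseract (thresholds and page-segmentation mode are illustrative):

```python
import cv2
import pytesseract

def split_rows(image_path):
    """Split a table image into row images using the whitespace between rows."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    # Invert + Otsu so that text pixels are non-zero.
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    ink_per_row = binary.sum(axis=1)       # horizontal projection profile
    has_ink = ink_per_row > 0

    rows, start = [], None
    for y, ink in enumerate(has_ink):
        if ink and start is None:
            start = y                      # a band of text begins
        elif not ink and start is not None:
            rows.append(gray[start:y])     # the band ends at the first blank row
            start = None
    if start is not None:
        rows.append(gray[start:])
    return rows

for i, row_img in enumerate(split_rows("table.png")):
    text = pytesseract.image_to_string(row_img, config="--psm 7")  # single text line
    print(i, text.strip())
```

The same idea applied to vertical projections (summing within each row image along the other axis) would split the columns as well.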
