I have the annual report of a company (in .pdf format) and I want to fetch the balance sheet and other related reports from the annual report using Python. I tried the PyPDF2 lib but it extracts highly unstructured text. Is there any way?
You should use textract
https://github.com/deanmalmgren/textract
It supports various file types for text extraction.
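For a PDF, a minimal sketch looks something like this (the filename is a placeholder; for PDFs textract typically delegates to tools such as pdftotext, so the output is plain text rather than structured tables):

```python
# Minimal sketch: extract the raw text of a PDF with textract.
# "annual_report.pdf" is a placeholder filename.
import textract

# textract.process returns the extracted text as bytes
text = textract.process("annual_report.pdf")
print(text.decode("utf-8"))
```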
Your question is not very clear, but I think I understand it, as I've done a lot of work on extracting data from UK annual reports. To explain to others: what you're asking for sounds straightforward, when in reality it's a nightmare. Annual reports come in PDF format and none of the firms producing them follow any standard, which makes it difficult to analyse these reports even manually, and PDFs lose their structure when you convert them to text. I have a Java tool that reads and detects the structure of UK PDF annual reports (similar to the one you provided in the link). It took me 5 years to come up with a solution that can process up to 95% of all UK annual reports despite the huge differences between them. Have a look: https://github.com/drelhaj/CFIE-FRSE; there are links there to papers on how we did it.
Can anyone help me with how to convert a PDF file to an XML file using Python code? My PDF contains:
Unstructured data
It has images
Mathematical equations
Chemical Equations
Table Data
Logos, tags, etc.
I tried using PDFMiner, but my PDF data was not converted into .xml/.json file format. Are there any libraries other than PDFMiner? PyPDF2, tabula-py, PDFQuery, Camelot, PyMuPDF, pdf2docx, pandas: these other libraries/utilities were all not suitable for my requirement.
Please advise me on any other options. Thank you.
The first thing I would recommend trying is GROBID (see here for the full documentation). You can play with an online demo here to see if it fits your needs (select TEI -> Process Fulltext Document, and upload a PDF). You can also check out this tool from the Allen Institute (it is based on GROBID and has a handy function for converting TEI.XML to JSON).
The other package which (obviously) does a good job is the Adobe PDF Extract API (see here). It's of course a paid service, but when you register for an account you get 1,000 document transactions for free. It's easy to implement in Python, well documented, and a good way to experiment and get a feel for the difficulties of reliable data extraction from PDFs.
I worked with both options to extract text, figures, tables etc. from scientific papers. Both yielded good results. The main problem with out-of-the-box solutions is that, when you work with complex formats (or badly formatted docs), erroneously identified document elements are quite common (for example a footnote or a header gets merged with the main text). Both options are based on machine learning models and, at least for GROBID, it is possible to retrain these models for your specific task (I haven't tried this so far, so I don't know how worthwhile it is).
However, if your target PDFs are all of the same (simple) format (or if you can control their format) you should be fine with either option.
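If you decide to self-host GROBID, the simplest way to call it from Python is through its REST API. A minimal sketch, assuming a default local install on port 8070 and a placeholder filename:

```python
# Hedged sketch: send a PDF to a locally running GROBID service and get TEI XML back.
# Assumes GROBID is already running at http://localhost:8070 and that
# "paper.pdf" is a placeholder filename.
import requests

GROBID_URL = "http://localhost:8070/api/processFulltextDocument"

with open("paper.pdf", "rb") as f:
    response = requests.post(GROBID_URL, files={"input": f})

response.raise_for_status()
tei_xml = response.text  # TEI XML, which you can post-process into JSON if needed
print(tei_xml[:500])
```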
I am currently developing a proprietary PDF parser that can read multiple types of documents with various types of data. Before starting, I was wondering whether reading PowerPoint slides was possible. My employer uses presentation guidelines that require imagery and background designs: is it possible to build a parser that can read the data from these PowerPoint PDFs without the slide decor getting in the way?
So the workflow would basically be this:
At the end of a project, the project report is delivered in the form of a presentation.
The presentation would be converted to PDF.
The PDF would be submitted to my application.
The application would read the slides and create a data-focused report for quick review.
The goal of the application is to significantly cut down on the amount of reading that needs to be done, as some of these presentation reports can be many pages long and there is not enough time in the day.
Parsing PDFs into structured data is always tricky, as the format is geared towards precise printing, rather than ease of editing or data extraction.
A PDF essentially contains information like "there's a label with such text at such an (x, y) position on a certain page", and things like that.
You will therefore very likely need some heuristics in order to turn that into structured data; it will basically be a form of scraping.
Searching your favorite search engine for "PDF scraping" would be a good start.
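To get a feel for what that raw material looks like, here is a minimal sketch using PyMuPDF (the fitz module) that dumps every word on a page together with its bounding box; grouping those boxes into rows and columns is where the heuristics come in. The filename is a placeholder:

```python
# Minimal sketch: list every word on the first page with its bounding box,
# using PyMuPDF. Turning these boxes into tables/sections is the scraping part.
import fitz  # PyMuPDF

doc = fitz.open("slides.pdf")   # placeholder filename
page = doc[0]

# get_text("words") returns tuples: (x0, y0, x1, y1, word, block, line, word_no)
for x0, y0, x1, y1, word, *_ in page.get_text("words"):
    print(f"({x0:.0f}, {y0:.0f}) {word}")
```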
Also, you may want to look at those similar posts:
PDF Data and Table Scraping to Excel
How to extract table as text from the PDF using Python?
A "PowerPoint PDF" isn't a special type of PDF; a PDF exported from PowerPoint is just an ordinary PDF.
There isn't going to be anything natively in the PDF that identifies elements on the page as being 'slide' graphics that originated from a PowerPoint file, for example.
You could try building an algorithm that makes decisions about which content to drop from the created PDF, but that would be tricky and seems like the wrong approach to me.
A better approach would be to export the PPT to text first, e.g. in Microsoft PowerPoint export it to an RTF file so you get all of the text out and can use that directly, or then convert that to PDF.
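If you can get hold of the original .pptx instead of the exported PDF, pulling the text directly is usually more reliable. A minimal sketch with the python-pptx library (the filename is a placeholder):

```python
# Minimal sketch: extract the raw text from a .pptx deck with python-pptx.
# Images and decorative shapes are skipped automatically (they have no text frame).
from pptx import Presentation

prs = Presentation("project_report.pptx")   # placeholder filename

for i, slide in enumerate(prs.slides, start=1):
    print(f"--- Slide {i} ---")
    for shape in slide.shapes:
        if shape.has_text_frame:
            for para in shape.text_frame.paragraphs:
                text = "".join(run.text for run in para.runs)
                if text.strip():
                    print(text)
```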
I am at an absolutely basic level with Python 3 (everything I currently know is from TheMonkeyLords) and my main focus is to integrate Python 3 with XBRLware so that I can extract financial information from the SEC EDGAR database with accuracy and reliability.
How can I use the xbrlware framework with Python 3? I have absolutely no idea how to use such a framework with Python 3.
Any suggestions on what I should learn, code for me to study, clues, etc. would be a great help!
Thank you
Don't do it. Based on personal experience, it is very difficult to extract useful financial data from XBRL. XBRLWare does work, but there is a lot of work to do afterwards to extract the data into something useful.
XBRL has over 100 definitions of "revenue". Each industry reports differently. Each company makes 100s of filings and you have to string together data from different reports. It's an incredibly frustrating process.
I have used XBRLWare as a Ruby Gem on Windows. (It is no longer supported.) It does "work". It downloads and formats the reports nicely, but it operates as a viewer. Most filings contain two quarters of data. (Probably not the quarters you want either.)
You can use the XBRL viewer on the SEC's website to accomplish the same thing. Or you can go to the company's 10-Qs.
Also, XBRL uses CIK codes for the companies. As far as I know, the SEC doesn't have a central database to match CIK codes to ticker symbols (if you can believe that!). So it can be frustrating to find the companies you want to download.
If you want to download all the XBRL filings, I've been told it's like 6 TB a month.
You can't pull historical financial data from XBRL either; you have to string two quarters at a time together. So, to build IBM's history, you would pull every IBM filing and string together all the 10-Qs. XBRL is only about three years old for the large accelerated filers, so historical data is limited.
There is a reason why Wall Street still charges $25k/year for financial data. XBRL is very difficult to use and difficult to extract data from.
You could try: XBRLcloud.com or findynamics.com
I'd like to tokenise Wikipedia pages of interest with a Python library or libraries. I'm most interested in tables and listings. I want to be able to then import this data into Postgres or Neo4j.
For example, here are three data sets that I'd be interested in:
How many points each country awarded one another in the Eurovision Song contest of 2008:
http://en.wikipedia.org/wiki/Eurovision_Song_Contest_2008#Final
List of currencies and the countries in which they circulate (a many-to-many relationship):
http://en.wikipedia.org/wiki/List_of_circulating_currencies
Lists of solar plants around the world: http://en.wikipedia.org/wiki/List_of_solar_thermal_power_stations
The source of each of these is written in Wikipedia's brand of markup, which is used to render them. There is a lot of Wikipedia-specific tagging and syntax in the raw data form. The HTML might almost be the easier option, as I can just use BeautifulSoup.
Does anyone know of a better way of tokenizing? I feel that I'd be reinventing the wheel if I took the final HTML and parsed it with BeautifulSoup. Also, if I could find a way to output these pages as XML, the table data might not be tokenized enough and would require further processing.
Since Wikipedia is built on MediaWiki, there is an API you can exploit. There is also Special:Export that you can use.
Once you have the raw data, then you can run it through mwlib to parse it.
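A minimal sketch of fetching a page's raw wikitext through the MediaWiki API with requests; you could then feed the markup into mwlib or another wikitext parser (the page title is just one of the examples from the question):

```python
# Minimal sketch: fetch the raw wikitext of a page through the MediaWiki API.
import requests

API_URL = "https://en.wikipedia.org/w/api.php"

params = {
    "action": "parse",
    "page": "Eurovision_Song_Contest_2008",
    "prop": "wikitext",
    "format": "json",
    "formatversion": 2,
}

resp = requests.get(API_URL, params=params)
resp.raise_for_status()
wikitext = resp.json()["parse"]["wikitext"]
print(wikitext[:500])
```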
This goes more in the semantic web direction, but DBpedia allows querying parts of Wikipedia data (a community conversion effort) with SPARQL. This makes it theoretically straightforward to extract the needed data, although dealing with RDF triples might be cumbersome.
Furthermore, I don't know whether DBpedia yet contains any of the data that is of interest to you.
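A minimal sketch of such a query from Python with the SPARQLWrapper library; the query and the dbo:currency property are only illustrative, so you would need to check which properties DBpedia actually exposes for your data:

```python
# Minimal sketch: query DBpedia's public SPARQL endpoint with SPARQLWrapper.
# The dbo:currency property here is an assumption for illustration only.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX dbo: <http://dbpedia.org/ontology/>
    SELECT ?country ?currency WHERE {
        ?country dbo:currency ?currency .
    } LIMIT 20
""")
sparql.setReturnFormat(JSON)

results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["country"]["value"], "->", row["currency"]["value"])
```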
I am trying to apply the knowledge I learnt during statistics courses to real-world datasets.
I am looking for some real databases/tables. It would be helpful if a link to the page were added as well. Format is not a constraint: I use Python and I can easily convert to SQLite.
One example would be [one medium-sized table] for identifying the country for a given IP address: http://ip-to-country.webhosting.info/node/view/6
Well, since your profile says you're from India, I thought some Indian Government statistics would help, so a quick google search yields this site:
http://mospi.nic.in/dwh/index.htm
Click on 'Tables', and you'll have a list of more data/tables than you could possibly need.
...these files all seem to be in Microsoft XLS format, but another quick google search yields a free converter: http://download.cnet.com/XLS-Converter/3000-2077_4-10401513.html
...or you could use the Python package xlrd ( http://pypi.python.org/pypi/xlrd ) and read the files directly.
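If you go the xlrd route, a minimal sketch of reading one of those XLS files looks something like this (the filename is a placeholder):

```python
# Minimal sketch: read an XLS file with xlrd and dump it row by row.
# "statistics.xls" stands in for one of the downloaded tables.
import xlrd

book = xlrd.open_workbook("statistics.xls")
sheet = book.sheet_by_index(0)

for row_idx in range(sheet.nrows):
    print(sheet.row_values(row_idx))
```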