Python - Best way to parse specific, standardized information in PDF documents? - python

I am trying to parse these PDF "Arms Sale Notification" letters, found here:
http://www.dsca.mil/pressreleases/36-b/36b_index.htm
Here is a specific PDF document example, of a proposed arms sale to Oman:
http://www.dsca.mil/pressreleases/36-b/2013/Oman13-07.pdf
Since I have 600 of these documents, the information I want to extract in the example include the country name (Oman), the list of articles to be sold ("AN/AAQ-24(V) Large Aircraft Infrared Countermeasures (LAIRCM) Systems", the cost of the sale ("$100 million") and the primary contractor ("Northrop Grumman Corporation of Rolling Meadows, Illinois").
What sort of regular expressions or split() function specifications could I use to isolate these pieces of information from a document like this?

You need to read the converted text first to determine the regular expression, since PDFs can be quirky about text conversion. For parsing I would recommend pdfminer (or its Python 3 fork, pdfminer.six) over pyPDF as the library of choice; note that ReportLab is for generating PDFs, not parsing them.
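Once the text has been extracted, the boilerplate phrasing of these letters makes regex extraction workable. A minimal sketch, assuming the converted text contains phrases like the ones below (the sample string here is invented to mirror the letter's wording; inspect your actual converted output first and adapt the patterns):

```python
import re

# Hypothetical snippet of converted letter text; real output from a PDF
# parser will differ and the patterns should be tuned against it.
text = (
    "The Government of Oman has requested a possible sale of "
    "AN/AAQ-24(V) Large Aircraft Infrared Countermeasures (LAIRCM) Systems. "
    "The estimated cost is $100 million. "
    "The principal contractor will be Northrop Grumman Corporation of "
    "Rolling Meadows, Illinois."
)

country = re.search(r"Government of ([A-Z][\w\s]+?) has requested", text)
cost = re.search(r"estimated cost is (\$[\d.]+ (?:million|billion))", text)
contractor = re.search(r"principal contractor will be (.+?)\.", text)

print(country.group(1))     # Oman
print(cost.group(1))        # $100 million
print(contractor.group(1))  # Northrop Grumman Corporation of Rolling Meadows, Illinois
```

Because all 600 letters follow the same template, a handful of anchored patterns like these usually covers the whole corpus; log the documents where a pattern fails to match and handle them by hand.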

Related

Is there a way to automatically extract chapters from Pdf files?

Currently I am searching for a way to efficiently extract certain sections from around 500 PDF files.
Concretely, I have around 500 annual reports from which I want to extract the Management Report section.
I tried using a regular expression with the heading name as the start position and the following chapter heading as the end position, but it (of course) always just matches the table of contents.
I would be happy for any suggestions.
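The table-of-contents problem described above arises because a regex search stops at the first occurrence of the heading. One simple workaround, sketched below on invented stand-in text, is to anchor on the *last* occurrence of the start heading and then look for the end heading after it:

```python
# Toy stand-in for extracted report text: the heading appears once in the
# table of contents and once in the body.
text = (
    "Contents\n"
    "Management Report ........ 12\n"
    "Notes ........ 40\n"
    "Management Report\n"
    "The fiscal year was marked by strong growth.\n"
    "Notes\n"
    "Note 1: accounting policies.\n"
)

# Take the LAST occurrence of the start heading, then the first end heading
# after it, so the table-of-contents entry is skipped.
start = text.rfind("Management Report")
end = text.find("Notes", start)
section = text[start:end].strip()
print(section)
```

A more robust variant would require the body heading to sit alone on its line (TOC entries usually carry dot leaders and a page number), but the last-occurrence trick is often enough.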

Extraction of financial statements from pdf reports

I have been trying to pull out financial statements embedded in annual reports in PDF form and export them to Excel/CSV format using Python, but I am encountering some problems:
1. A specific financial statement can be on any page in the report. If I were to process hundreds of PDFs, I would have to specify page numbers, which takes a lot of time. Is there any way for the scraper to know where the exact statement is?
2. Some statements span multiple pages, and the end result after scraping a PDF isn't what I want.
3. Different annual reports have different financial statement formats. Is there any way to process them and convert them to a single standard format?
I would also appreciate it if anyone who has done something like this could share examples.
P.S.: I am working with Python and have used tabula and Camelot.
I had a similar case where the problem was to extract specific form information from PDFs (name, date of birth, and so on). I used the Tesseract open-source software with pytesseract to perform OCR on the files. Since I did not need the whole PDFs, but specific information from them, I designed an algorithm to find the information: in my case I used simple heuristics (specific fields, specific line numbers, and some other domain-specific rules), but you could also take a machine-learning approach and train a classifier that finds the needed text parts. Domain-specific heuristics are a good fit here, because a financial statement has a special vocabulary and text markers that indicate its beginning and end.
I hope I could at least give you some ideas on how to approach the problem.
P.S.: With Tesseract you can also process multi-page PDFs. Regarding 3): a machine-learning approach would need some samples to learn a good generalization of what a financial statement may look like.
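The marker heuristic described above can be sketched in a few lines. The marker strings here are assumptions chosen for illustration; you would tune them to the vocabulary of your own reports:

```python
# Minimal sketch of the marker heuristic: financial statements tend to use
# stable headings, so we slice the text between a begin and an end marker.
def extract_between(text, begin_marker, end_marker):
    """Return the block starting at begin_marker and ending before
    end_marker, or None if the begin marker is not present."""
    start = text.find(begin_marker)
    if start == -1:
        return None
    end = text.find(end_marker, start + len(begin_marker))
    if end == -1:
        end = len(text)
    return text[start:end].strip()

# Invented stand-in for OCR output from one report.
ocr_text = (
    "Cover page...\n"
    "Consolidated Balance Sheet\n"
    "Assets 100\n"
    "Liabilities 60\n"
    "Notes to the accounts\n"
)
stmt = extract_between(ocr_text, "Consolidated Balance Sheet",
                       "Notes to the accounts")
print(stmt)
```

In practice you would try several begin/end marker candidates per statement type and fall back to the classifier approach for reports where none of them match.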

Methods to extract keywords from large documents that are relevant to a set of predefined guidelines using NLP/ Semantic Similarity

I'm in need of suggestions on how to extract keywords from a large document. The keywords should be in line with what we have defined as the intended search results.
For example,
I need the owner's name, where the office is situated, and what the operating industry is when a document about a company is given, and the defined set of words would be,
{owner, director, office, industry...}-(1)
the intended output has to be something like,
{Mr.Smith James, ,Main Street, Financial Banking}-(2)
I was looking for a method based on semantic similarity, where sentences containing words similar to the given set (1) would be extracted, and POS tagging would then be used to extract nouns from those sentences.
It would be useful if further resources could be provided that support this approach.
What you want to do is referred to as Named Entity Recognition.
In Python there is a popular library called spaCy that can be used for this. The standard models are able to detect 18 different entity types, which is a fairly good range.
Persons and company names should be extracted easily, while whole addresses and the industry might be more difficult; you may have to train your own model on these entity types. spaCy also provides an API for training your own models.
Please note that you need quite a lot of training data to get decent results. Start with about 1000 examples per entity type and see if that is sufficient for your needs. POS tags can be used as a feature.
If your data is unstructured, this is probably one of the most suitable approaches. If you have more structured data, you could take advantage of that instead.
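Training a custom model starts with annotated examples: spaCy-style training data pairs each raw text with character-offset entity spans. The sketch below uses only plain Python so it runs without spaCy installed; the "INDUSTRY" label (and the exact example) are assumptions for the custom entity types you would have to train yourself:

```python
# Sketch of spaCy-style NER training annotations: each example is the raw
# text plus (start, end, label) character spans. "INDUSTRY" is a custom
# label, not one of the 18 standard types.
TRAIN_DATA = [
    (
        "Mr. Smith James is the owner of a firm on Main Street "
        "operating in financial banking.",
        {"entities": [(0, 15, "PERSON"),
                      (42, 53, "LOCATION"),
                      (67, 84, "INDUSTRY")]},
    ),
]

# Sanity-check that each offset pair actually covers the intended span;
# off-by-one offsets are the most common annotation bug.
text, ann = TRAIN_DATA[0]
for start, end, label in ann["entities"]:
    print(label, "->", text[start:end])
```

Hand-checking spans like this before training saves a lot of debugging, since spaCy silently drops entities whose offsets do not align with token boundaries.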

How to extract numbers from PDF?

I want to extract numbers from a PDF file. I want to create a histogram depicting the scores of students who were approved by a university; these scores are stored in a PDF file. What are some ways I can extract them?
You first need a PDF parser, since Python by default is not capable of reading PDFs. An SO answer posted here, Python module for converting PDF to text, suggested using PDFMiner: http://www.unixuser.org/~euske/python/pdfminer/index.html
However, you've not provided any examples of how the numbers are represented. You need to make some kind of custom line parser using regex patterns that define rules for extracting the numbers. The difficulty mainly depends on whether the PDF contains only raw statistical data; if not, you also need to be careful not to take in all numbers, i.e. the ones that do not actually refer to any statistical data but are just in a sentence.
You can learn more about regular expressions in python from here https://docs.python.org/3/library/re.html
If regex is new to you, you can learn and experiment with it here: http://regexr.com/
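As an illustration of such a custom line parser, here is a sketch that pulls scores out of converted text. The line format is invented (your converted PDF will look different), so the pattern and the "Approved" filter are assumptions to adapt:

```python
import re

# Hypothetical line format from the converted PDF; inspect your own
# converted text and adjust the pattern accordingly.
lines = [
    "Candidate 1023  Score: 87.5  Approved",
    "Candidate 1047  Score: 62.0  Rejected",
    "Candidate 1102  Score: 91.25 Approved",
]

# Anchor on the "Score:" label so candidate IDs and other stray numbers
# in the text are not picked up; keep only approved candidates.
scores = [float(m.group(1))
          for line in lines
          if "Approved" in line
          and (m := re.search(r"Score:\s*([\d.]+)", line))]
print(scores)  # [87.5, 91.25]
```

The resulting list can be fed straight into a plotting library's histogram function.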

Hand tagging a training set with customized tags

I would like to perform some natural language processing on cooking recipes, in particular the ingredients (perhaps preparation later on). Basically I am looking to create my own set of POS tags to help me determine the meaning of an ingredient line.
For example, if one of the ingredients was:
3/4 cup (lightly packed) flat-leaf parsley leaves, divided
I would want tags to express the ingredient being listed and the quantity, which is usually a number followed by some unit of measurement. For example:
3\NUM-QTY/\FRACTION4\NUM-QTY cup\N-MEAS (lightly\ADV packed\VD) [flat-leaf\ADJ parsley\N]\INGREDIENT leaves\N, divided\VD
The tags are the ones I found here.
I am uncertain about a few things:
Should I be using custom tags, or should I be doing some sort of post-tagging processing after using a pre-existing tagger?
If I do use custom tags, is the best way to make a training text to just go through an ingredient list and tag everything by hand?
I feel like this language processing is so specific that it would be beneficial to train a tagger on an applicable set, but I'm not exactly sure how to proceed.
Thanks!
Use the pattern.search library.
The Python pattern library supports many tags [1], including a cardinal number tag (CD).
Once you have tagged cardinals, fractions are "cardinal/cardinal" or something like "cardinal cardinal/cardinal".
Regarding quantities, you should build a taxonomy of cooking quantities; the Python pattern library also supports lemmatization [2].
I think that using pattern.search [2] you could build a Constraint that fits your data and run pattern searches on text with it.
[1]http://www.clips.ua.ac.be/pages/mbsp-tags
[2]http://www.clips.ua.ac.be/pages/pattern-search
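Note that the pattern library targets Python 2 and can be hard to install today. The cardinal/fraction/unit idea itself is easy to prototype with the standard library; below is a minimal sketch, where the unit taxonomy is a tiny assumed sample you would have to extend:

```python
import re

# Tiny stand-in for the cardinal/fraction/unit idea: tag bare numbers,
# fractions like "3/4", and words from a small (assumed) unit taxonomy.
UNITS = {"cup", "cups", "tablespoon", "teaspoon", "gram", "ounce"}

def tag(tokens):
    tagged = []
    for tok in tokens:
        if re.fullmatch(r"\d+/\d+", tok):
            tagged.append((tok, "FRACTION"))
        elif re.fullmatch(r"\d+(\.\d+)?", tok):
            tagged.append((tok, "NUM-QTY"))
        elif tok.lower() in UNITS:
            tagged.append((tok, "N-MEAS"))
        else:
            tagged.append((tok, "OTHER"))
    return tagged

print(tag("3/4 cup flat-leaf parsley".split()))
```

A real tagger would layer this quantity/unit pass on top of a general POS tagger, so that the remaining "OTHER" tokens still get standard tags like ADJ and N.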
