How to extract numbers from PDF? - python

I want to extract numbers from a PDF file. I want to create a histogram depicting the scores of students who were admitted to a university; these scores are stored in a PDF file. What are some ways I can extract them?

You first need a PDF parser, since Python cannot read PDFs out of the box. An SO answer posted at Python module for converting PDF to text suggests using PDFMiner for this: http://www.unixuser.org/~euske/python/pdfminer/index.html
However, you haven't provided any examples of how the numbers are represented. You will need to write a custom line parser using regular expressions to define rules for extracting them. The difficulty depends mainly on whether the PDF contains only raw statistical data; if not, you also need to be careful not to capture every number, i.e. to skip the ones that don't refer to any statistical data but simply appear in a sentence.
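As an illustration, here is a minimal sketch assuming the maintained pdfminer.six fork (which provides a one-call extract_text helper); the file name is a placeholder, and the regular expression would need to be tightened to match however the scores actually appear:

import re

from pdfminer.high_level import extract_text  # pip install pdfminer.six

# "scores.pdf" is a placeholder for your own file
text = extract_text("scores.pdf")

# Grab integers and decimals; adjust the pattern to your score format,
# e.g. anchor it to a label such as "Score:" to skip numbers in sentences
scores = [float(m) for m in re.findall(r"\b\d+(?:\.\d+)?\b", text)]
print(scores)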
You can learn more about regular expressions in Python here: https://docs.python.org/3/library/re.html. If regex is new to you, you can learn and experiment with it at http://regexr.com/.

Related

How can I convert a dataset to GloVe or word2vec format?

I have downloaded my Twitter archive and want to run word2vec on it to experiment with most-similar words, analogies, etc.
But I am stuck at the first step: how do I convert a given dataset/CSV/document so that it can be fed to word2vec? That is, what is the process for converting data to the GloVe/word2vec format?
Typically, implementations of the word2vec and GloVe algorithms do one or both of the following:
- accept a plain text file, where tokens are delimited by (one or more) spaces, and text is considered one newline-delimited line at a time (with lines that aren't "too long": usually a short article, paragraph, or sentence per line)
- offer some language/library-specific interface for feeding texts (lists of tokens) to the algorithm as a stream/iterable
The Python Gensim library offers both options for its Word2Vec class.
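For instance, a minimal sketch with Gensim showing both options (parameter names follow Gensim 4.x; the corpus file name and query word are hypothetical):

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# Option 1: a plain text file, one whitespace-tokenized text per line
sentences = LineSentence("tweets.txt")

# Option 2: any restartable iterable of token lists works as well
# sentences = [["just", "setting", "up", "my", "twttr"], ...]

model = Word2Vec(sentences, vector_size=100, window=5, min_count=5)
print(model.wv.most_similar("twitter"))  # assumes "twitter" survived min_count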
You should generally try working through one or more tutorials to get a working overview of the steps involved, from raw data to interesting results, before applying such libraries to your own data. And, by examining the formats used by those tutorials – and the extra steps they perform to massage the data into the formats needed by exactly the libraries you're using – you'll also see ideas for how your data needs to be prepared.

Extraction of financial statements from PDF reports

I have been trying to pull financial statements embedded in annual reports out of PDFs and export them to Excel/CSV format using Python, but I am encountering some problems:
1. A specific financial statement can be on any page of the report. If I were to process hundreds of PDFs, I would have to specify page numbers, which takes a lot of time. Is there any way for the scraper to know where exactly the statement is?
2. Some statements span multiple pages, and the end result after scraping a PDF isn't what I want.
3. Different annual reports have different financial statement formats. Is there any way to process them and convert them to a single standard format?
I would also appreciate it if anyone who has done something like this could share examples.
P.S. I am working with Python and have used Tabula and Camelot.
I had a similar case where the problem was to extract specific form information from PDFs (name, date of birth, and so on). I used the open-source Tesseract software via pytesseract to perform OCR on the files. Since I did not need the whole PDFs, but only specific information from them, I designed an algorithm to find that information: in my case I used simple heuristics (specific fields, specific line numbers, and some other domain-specific rules), but you could also use a machine-learning approach and train a classifier that finds the needed text parts. Domain-specific heuristics could work well here too, because I am sure a financial statement has special vocabulary or text markers that indicate its beginning and end.
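For example, a minimal OCR sketch, assuming the pdf2image and pytesseract packages (and the underlying Poppler and Tesseract binaries) are installed; the file name is a placeholder:

import pytesseract                        # pip install pytesseract
from pdf2image import convert_from_path   # pip install pdf2image

# Render each PDF page to an image, then OCR it
pages = convert_from_path("annual_report.pdf")
text = "\n".join(pytesseract.image_to_string(page) for page in pages)

# Downstream: apply heuristics (keywords, line positions) to locate the statement
print(text[:500])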
I hope this at least gives you some ideas for how to approach the problem.
P.S.: With Tesseract you can also process multi-page PDFs. Regarding (3): a machine-learning approach would need some samples to learn a good generalization of what a financial statement may look like.

Split pack of text files into multiple subsets according to the content of the files

I have a lot of PDF, DOC[X], TIFF, and other files (scans from a shared folder). Each file is converted into a pack of text files: one text file per page.
Each pack of files can contain multiple documents (for example, three contracts), and the document kind is not limited to contracts.
While processing a pack of files, I don't know what kinds of documents the current pack contains, and it's possible that one pack contains multiple document kinds (contracts, invoices, etc.).
I'm looking for some possible approaches to solve this programmatically.
I've tried to search for something like this, but without any success.
UPD: I tried to create a binary classifier with scikit-learn and am now looking for another solution.
Since these are "scans", at its basis this sounds like something that could be approached with computer vision; however, that is currently far, far above my level of programming.
E.g., projects like SimpleCV may be a good starting point: http://www.simplecv.org/
Or possibly you could get away with OCR: reading the "scans" and working from their contents. pytesseract seems popular for this type of task: https://pypi.org/project/pytesseract/
However, that still leaves the question of how you would tell your program that this part of the image means there are 3 separate contracts. Is there anything about these files in particular that makes this clear, e.g. "1 of 3" on the pages, a logo, or otherwise? That will be the main factor in how complex a problem you are trying to solve.
The best solution was to create a binary classifier (SGDClassifier) and train it on the classes first-page and not-first-page, with each item in the dataset trimmed to 100 tokens (words).
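A sketch of that setup with scikit-learn, where page_texts, page_labels, and new_pages are hypothetical placeholders for your own per-page data:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

# Trim each page to its first 100 tokens, as described above
pages = [" ".join(text.split()[:100]) for text in page_texts]

clf = make_pipeline(TfidfVectorizer(), SGDClassifier())
clf.fit(pages, page_labels)  # labels: 1 = first-page, 0 = not-first-page

# Split a pack: a new document starts wherever a page is predicted as a first page
trimmed = [" ".join(text.split()[:100]) for text in new_pages]
is_first_page = clf.predict(trimmed)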

Searching for data in a PDF

I've got a PDF file that I'm trying to obtain specific data from.
I've been able to parse the PDF via PyPDF2 into one long string but searching for specific data is difficult because of - I assume - formatting in the original PDF.
What I am looking to do is retrieve specific known fields and the data that immediately follows (as formatted in the PDF), and then store these in separate variables.
The PDFs are bills and hence are all presented in the exact same way, with defined fields and images. So what I am looking to do is to extract these fields.
What would be the best way to achieve this?
In general, this is probably impossible (or extremely difficult), and details (that you don't mention) matter a great deal. Study the complex PDF specification in detail. Notice that PDF is (more or less accidentally) Turing-complete, so your problem is undecidable in general, since it is equivalent to the halting problem.
For example, a human reader could see digits in the document rendered as text, or as a JPEG image, etc., and in practice many PDF documents contain such data. Practically speaking, PDF is an output-only format, designed for screen display and printing, not for extracting data.
You need to understand how exactly that PDF file was generated (with what exact software, from what actual data). Without help, that could take a lot of time (maybe several years of full-time reverse-engineering work).
A much better approach is to contact the person or entity providing that PDF file and negotiate some way of accessing the actual data (or at least get a detailed explanation of how that particular PDF file is generated). For example, if the PDF file is computed from some database, you would be better off accessing that database.
Perhaps the metadata or comments in your PDF file might help you guess how it was generated.
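A quick way to inspect that, assuming the pypdf/PyPDF2 library the question already uses (the .metadata attribute is available in recent releases; the file name is a placeholder):

from pypdf import PdfReader  # older releases: from PyPDF2 import PdfReader

reader = PdfReader("mystery.pdf")
print(reader.metadata)  # /Producer and /Creator often reveal the generating software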
The source of the data might produce various kinds of PDF files. For example, my cheap scanner can produce PDF, but your program would have a hard time extracting numerical data from it (because that kind of PDF essentially wraps a pixelated image à la JPEG) and would need image-recognition techniques (i.e. OCR) to do so.

Programmatically change font color of text in PDF

I'm not familiar with the PDF specification at all. I was wondering if it's possible to directly manipulate a PDF file so that certain blocks of text that I've identified as important are highlighted in colors of my choice. Language of choice would be python.
It's possible, but not necessarily easy, because the PDF format is so rich. You can find a document describing it in detail here. The first elementary example it gives of how PDFs display text is:
BT
/F13 12 Tf
288 720 Td
(ABC) Tj
ET
BT and ET are commands to begin and end a text object; Tf is a command to use external font resource F13 (which happens to be Helvetica) at size 12; Td is a command to position the cursor at the given coordinates; Tj is a command to write the glyphs for the preceding string. The flavor is somewhat "reverse Polish notation"-oid, and indeed quite close to the flavor of PostScript, one of Adobe's other great contributions to typesetting.
The problem is, there is nothing in the PDF specs that says that text that "looks" like it belongs together on the page as displayed must actually "be" together; since precise coordinates can always be given, if the PDF is generated by a sophisticated typography layout system, it might position text precisely, character by character, by coordinates. Reconstructing text in form of words and sentences is therefore not necessarily easy -- it's almost as hard as optical text recognition, except that you are given the characters precisely (well -- almost... some alleged "images" might actually display as characters...;-).
pyPdf is a very simple pure-Python library that's a good starting point for playing around with PDF files. Its "text extraction" function is quite elementary and does nothing but concatenate the arguments of a few text-drawing commands; you'll see that this suffices on some docs and is quite unusable on others, but at least it's a start. As distributed, pyPdf does just about nothing with colors, but with some hacking that could be remedied.
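For a feel of that elementary extraction, here is a sketch using pypdf, the maintained successor of pyPdf (the file name is hypothetical):

from pypdf import PdfReader

reader = PdfReader("doc.pdf")
text = "\n".join(page.extract_text() for page in reader.pages)
print(text)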
reportlab's powerful Python library is entirely focused on generating new PDFs, not on interpreting or modifying existing ones. At the other extreme, the pure-Python library pdfminer focuses entirely on parsing PDF files; it even does some clustering to try to reconstruct text in cases where simpler libraries would be stumped.
I don't know of an existing library that performs the transformational tasks you desire, but it should be feasible to mix and match some of these existing ones to get most of it done... good luck!
Highlighting is possible in a PDF file using PDF annotations, but doing it natively is not an easy job. Whether any of the libraries mentioned provides such a facility is something you may want to look into.
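As a concrete illustration of the annotation route, here is a sketch using PyMuPDF (a library not mentioned above, but one that exposes highlight annotations directly); the file names and search phrase are placeholders:

import fitz  # PyMuPDF: pip install pymupdf

doc = fitz.open("input.pdf")
for page in doc:
    for rect in page.search_for("important phrase"):
        annot = page.add_highlight_annot(rect)
        annot.set_colors(stroke=(1.0, 0.8, 0.0))  # a custom highlight color
        annot.update()
doc.save("highlighted.pdf")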
