Best way to get text of a pdf file? - python

How can I read a PDF in Python? I know one way of converting it to text, but I want to read the content directly from the PDF.
Can anyone explain which Python module is best for PDF extraction?
I tried the PyPDF2 package, but it gives me inconsistent results. I would also very much like a way to get the tables and the images, and to remove the headers and footers, at least fairly consistently; it doesn't need to work 100% of the time. I just need to find the right library. Thanks!

From another post that asked pretty much the same:
The answer depends on whether the question is general or specific to a single form. Your approach is reasonable for the general case, but there will be variability. If you have a PDF that is a single form or report, regenerated with different data at each iteration, consider converting it from PDF to PostScript and then see if you can parse the PostScript.
Two utilities do this: pdf2ps and pdftops. Try each. This approach works better if you know some PostScript. With some luck the needed fields may be simple text strings. Worth a try.
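For the header and footer part of the question, one library-independent approach is to drop lines that repeat at the top or bottom of most pages. The sketch below is only an illustration: it assumes you already have each page's text as a list of lines from whichever extractor you settle on, and the function name, `edge`, and `min_share` are made up for this example.

```python
import re
from collections import Counter


def _key(line):
    # Normalize digits so "Page 1" and "Page 2" count as the same line.
    return re.sub(r"\d+", "#", line.strip())


def strip_repeated_edges(pages, edge=2, min_share=0.6):
    """Remove lines that recur in the first/last `edge` lines of most pages.

    `pages` is a list of pages, each page being a list of text lines.
    """
    if len(pages) < 2:
        return [list(lines) for lines in pages]  # nothing to compare against

    counts = Counter()
    for lines in pages:
        # A set ensures each candidate is counted once per page.
        counts.update({_key(l) for l in lines[:edge] + lines[-edge:] if l.strip()})

    threshold = min_share * len(pages)
    repeated = {key for key, n in counts.items() if n >= threshold}

    cleaned = []
    for lines in pages:
        total = len(lines)
        cleaned.append([
            line for i, line in enumerate(lines)
            if not ((i < edge or i >= total - edge) and _key(line) in repeated)
        ])
    return cleaned
```

This is a heuristic, so it will occasionally remove a real line that happens to repeat near a page edge; tuning `edge` and `min_share` on a few sample documents is probably necessary.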

Related

How to convert pdf to xml /json using python code

Can anyone help me convert a PDF file to an XML file using Python code? My PDF contains:
Unstructured data
It has images
Mathematical equations
Chemical Equations
Table Data
Logos, tags, etc.
I tried using PDFMiner, but my PDF data was not converted into XML/JSON format. Are there any libraries other than PDFMiner? PyPDF2, tabula-py, PDFQuery, Camelot, PyMuPDF, pdf to docx, and pandas are all unsuitable for my requirements.
Please advise me on any other options. Thank you.
The first thing I would recommend trying is GROBID (see here for the full documentation). You can play with an online demo here to see if it fits your needs (select TEI -> Process Fulltext Document, and upload a PDF). You can also check out this from the Allen Institute (it is based on GROBID and has a handy function for converting TEI XML to JSON).
The other package which, unsurprisingly, does a good job is the Adobe PDF Extract API (see here). It is a paid service, but when you register for an account you get 1,000 document transactions for free. It's easy to use from Python, well documented, and a good way to experiment and get a feel for the difficulties of reliable data extraction from PDF.
I worked with both options to extract text, figures, tables etc. from scientific papers. Both yielded good results. The main problem with out-of-the-box solutions is that, when you work with complex formats (or badly formatted docs), erroneously identified document elements are quite common (for example a footnote or a header gets merged with the main text). Both options are based on machine learning models and, at least for GROBID, it is possible to retrain these models for your specific task (I haven't tried this so far, so I don't know how worthwhile it is).
However, if your target PDFs are all of the same (simple) format (or if you can control their format) you should be fine with either option.
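To give an idea of what the TEI-to-JSON step can look like, here is a rough sketch using only the standard library. It assumes the TEI layout GROBID typically produces (a title under `titleStmt`, and body text in `div` elements with a `head` and `p` children); the function name and the shape of the resulting dictionary are my own choices, not part of GROBID or the Allen Institute tool.

```python
import json
import xml.etree.ElementTree as ET

# TEI documents use this XML namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def _text(element):
    # Flatten an element and its children (e.g. inline markup) to plain text.
    return "".join(element.itertext()).strip() if element is not None else None


def tei_to_dict(tei_xml):
    """Convert a TEI XML document (str or bytes) to a small dictionary."""
    if isinstance(tei_xml, str):
        # ElementTree rejects str input that carries an encoding declaration.
        tei_xml = tei_xml.encode("utf-8")
    root = ET.fromstring(tei_xml)

    sections = []
    for div in root.findall(".//tei:body/tei:div", TEI_NS):
        sections.append({
            "heading": _text(div.find("tei:head", TEI_NS)),
            "paragraphs": [_text(p) for p in div.findall("tei:p", TEI_NS)],
        })
    return {
        "title": _text(root.find(".//tei:titleStmt/tei:title", TEI_NS)),
        "sections": sections,
    }


def tei_to_json(tei_xml):
    return json.dumps(tei_to_dict(tei_xml), ensure_ascii=False, indent=2)
```

Figures, tables, and formulas live in other TEI elements and are ignored here; they would need their own handling.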

Search office documents for string python

I've been looking for a fast and relatively easy way of searching (grep-ish) for user-defined strings in files of varying formats, e.g. xlsx, docx, pptx, and pdf, using Python.
My research has led me to believe that there might not be a convenient way of doing this with a single module or similar. Am I forced to use a separate module for each file type? And if so, are these appropriate?
docx
openpyxl
pptx
slate
I also looked at decompressing the files to get at the XML files containing the actual text, but it seems unwieldy. I just want to be sure that there is no simple, uniform way of handling all of these different file types.
Well, I've mostly figured it out. In the end I decided to use PowerShell combined with "itextsharp.dll" to process the files. It turned out to be simpler than using portable Python. Thanks for the answers :-)
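For anyone who does want to stay in Python: the decompression route mentioned in the question can be fairly compact for docx, xlsx, and pptx, since all three are zip archives of XML parts. The following is a minimal sketch with the standard library only; the function name is made up, the tag stripping is deliberately crude, and PDF is not covered (that still needs a separate library).

```python
import html
import re
import zipfile

_TAG = re.compile(r"<[^>]+>")


def ooxml_contains(path, needle):
    """Return the XML parts of a docx/xlsx/pptx file whose text contains `needle`.

    `path` may be a file name or a binary file-like object.
    The comparison is case-insensitive.
    """
    needle = needle.lower()
    hits = []
    with zipfile.ZipFile(path) as archive:
        for name in archive.namelist():
            if not name.endswith(".xml"):
                continue
            xml = archive.read(name).decode("utf-8", errors="replace")
            # Dropping the tags also rejoins text that Office split into runs.
            text = html.unescape(_TAG.sub("", xml))
            if needle in text.lower():
                hits.append(name)
    return hits
```

Because tags are simply removed, words on either side of a paragraph break end up glued together, so a phrase spanning two paragraphs may match unexpectedly; for a grep-like first pass that is usually acceptable.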

Finding the title of a PDF with Python

I have a PDF file, and I would like to extract its title into a string. By title I don't mean the title in the metadata, but the actual title written in the document. For example, from here I'd like to get "Official SAT® Practice Test 2014-15"
Is there any way to accomplish this?
I would take a look at PDFMiner. Essentially, you can load your PDF programmatically. Then you will need to do some kind of analysis to figure out how to extract the title. Perhaps try taking the text up to the first line break, or some other heuristic. I recommend collecting a large set of PDFs whose titles you know and running your program against them to check whether it detects the title correctly. Then you can use that code to process the PDFs whose titles you don't know. This technique is commonly referred to as using a training set.
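As a rough illustration of that kind of analysis, here is one possible heuristic: treat the first run of lines set in the largest font on the first page as the title. The sketch assumes you have already pulled `(text, font_size)` pairs for the first page out of PDFMiner's layout objects; that extraction step is not shown, and `guess_title` and `accuracy` are hypothetical helper names.

```python
def guess_title(lines, min_chars=4, tolerance=0.5):
    """Guess a title from first-page lines given as (text, font_size) pairs."""
    candidates = [
        (text.strip(), size) for text, size in lines
        if len(text.strip()) >= min_chars  # skip stray short fragments
    ]
    if not candidates:
        return None

    largest = max(size for _, size in candidates)
    parts = []
    for text, size in candidates:
        if abs(size - largest) <= tolerance:
            parts.append(text)
        elif parts:
            break  # the run of largest-font lines has ended
    return " ".join(parts)


def accuracy(labeled_samples):
    """Share of samples, given as (lines, expected_title), guessed correctly."""
    hits = sum(guess_title(lines) == expected for lines, expected in labeled_samples)
    return hits / len(labeled_samples)
```

Running `accuracy` over the set of PDFs with known titles gives a quick measure of whether the heuristic is good enough or needs refining.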

comparing two files and saving the report in any other file

I would like to compare the data of two files and store the report in another file. I tried using WinMerge by invoking cmd.exe with the subprocess module in Python 3.2. I was able to get the difference report but wasn't able to save it. Is there a way, with WinMerge or any other comparison tool (DiffMerge/KDiff3), to save the difference report using cmd.exe on Windows 7? Please help.
Though your question is quite old, I'm surprised it hasn't been answered yet. I was searching for an answer myself and, funnily enough, found your question. You have perhaps mixed several questions into one post, so I decided to answer the one in the headline, where I assume you are trying to compare human-readable file contents.
To compare two files, there is the difflib library, which is part of the Python distribution.
By the way, an example of how to build a file comparison utility can be found on Python's documentation website.
The link is here: Helpers for computing deltas
From there you can learn how to add an option for saving the deltas to, for example, a text file. Some of these examples also produce git-diff-like output, which may help you solve your problem.
This means that if you are able to run your script, other diff tools are not required. It doesn't make much sense to call other tools from Python via CMD and try to control them... :)
This website, with explanations and code examples, may also help you:
difflib – Compare sequences
I hope that helps you a bit.
EDIT: I forgot to mention that the last site contains a straightforward example of how to generate HTML output:
HTML Output
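A minimal sketch of what this can look like with difflib, writing both a unified (git-like) text report and an HTML report to disk. The function name and the choice of UTF-8 are assumptions made for the example.

```python
import difflib
from pathlib import Path


def write_diff_reports(path_a, path_b, text_report, html_report):
    """Compare two text files and save a unified diff and an HTML diff.

    Returns True if the files differ, False otherwise.
    """
    a_lines = Path(path_a).read_text(encoding="utf-8").splitlines()
    b_lines = Path(path_b).read_text(encoding="utf-8").splitlines()

    # Unified diff, similar to what `git diff` prints.
    diff_lines = list(difflib.unified_diff(
        a_lines, b_lines,
        fromfile=str(path_a), tofile=str(path_b),
        lineterm="",
    ))
    Path(text_report).write_text("\n".join(diff_lines), encoding="utf-8")

    # Side-by-side HTML table of both files.
    html_page = difflib.HtmlDiff().make_file(
        a_lines, b_lines, fromdesc=str(path_a), todesc=str(path_b)
    )
    Path(html_report).write_text(html_page, encoding="utf-8")

    return bool(diff_lines)
```

Usage would be along the lines of `write_diff_reports("old.txt", "new.txt", "report.diff", "report.html")`, where the file names are placeholders.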

What's a good document standard to use programmatically?

I'm writing a program that requires input in the form of a document, it needs to replace a few values, insert a table, and convert it to PDF. It's written in Python + Qt (PyQt). Is there any well known document standard which can be easily used programmatically? It must be cross platform, and preferably open.
I have looked into Microsoft Doc and Docx, which are binary formats and I can't edit them. Python has bindings for it, but they're only on Windows.
Open Office's ODT/ODF is XML inside a zip archive, so I can edit that one, but there are no command-line utilities or any other way to programmatically convert the file to a PDF. Open Office provides bindings, but you need to run Open Office from the command line, start a server, etc. And my clients may not have Open Office installed.
RTF is readable from Python, but I couldn't find any way/libraries to convert RTF documents to PDF.
At the moment I'm exporting from Microsoft Word to HTML, replacing the values and using PyQt to convert it to a PDF. However it loses formatting features and looks awful. I'm surprised there isn't a well known library which lets you edit a variety of document formats and convert them into other formats, am I missing something?
Update: Thanks for the advice, I'll have a look at using Latex.
Thanks,
Jackson
Have you looked into using LaTeX documents?
They are perfect to use programmatically (compiling documents? You gotta love that...), and you have several Python frameworks you can use, such as plasTeX and PyTex.
Exporting a LaTeX document to PDF is almost immediate.
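To illustrate the "replace a few values, insert a table" part without any framework, here is a small sketch that fills a LaTeX template using `string.Template` from the standard library. The template text, field names, and function names are invented for the example, and the compile step assumes a TeX distribution with `pdflatex` on the PATH.

```python
import subprocess
from string import Template

# Hypothetical template; $customer, $date and $rows are placeholders.
LATEX_TEMPLATE = Template(r"""\documentclass{article}
\begin{document}
Report for $customer, dated $date.

\begin{tabular}{lr}
Item & Value \\
\hline
$rows
\end{tabular}
\end{document}
""")

_LATEX_SPECIALS = {
    "\\": r"\textbackslash{}", "&": r"\&", "%": r"\%", "$": r"\$",
    "#": r"\#", "_": r"\_", "{": r"\{", "}": r"\}",
    "~": r"\textasciitilde{}", "^": r"\textasciicircum{}",
}


def latex_escape(value):
    # Escape character by character so nothing gets escaped twice.
    return "".join(_LATEX_SPECIALS.get(ch, ch) for ch in str(value))


def render_latex(customer, date, items):
    """Fill the template; `items` is an iterable of (name, value) pairs."""
    rows = "\n".join(
        f"{latex_escape(name)} & {latex_escape(value)} \\\\"
        for name, value in items
    )
    return LATEX_TEMPLATE.substitute(
        customer=latex_escape(customer), date=latex_escape(date), rows=rows
    )


def compile_pdf(tex_path):
    # Assumes pdflatex is installed; writes the PDF next to the current directory.
    subprocess.run(["pdflatex", "-interaction=nonstopmode", tex_path], check=True)
```

Writing the result of `render_latex(...)` to a `.tex` file and passing that file to `compile_pdf` would produce the PDF.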
Since you're already using PyQt anyway, it might be worth looking at Qt's built-in RTF processing module which looks decent. Here's the documentation on detailed content manipulation including inserting tables. Also the QPrinter module's default print-to-file format happens to be PDF.
Without knowing more about your particular needs it's hard to say whether these would do what you want, but since your application already has PyQt as a dependency, it seems silly to introduce any more without first evaluating the functionality you've already got available.
The non-GUI parts of the Qt framework are often overlooked though.
edit: included more links.
You might want to try ReportLab. The open source version can write PDFs, and the commercial version has a lot of really nice abstractions to allow output to a variety of different formats from a single input.
I don't know what kind of audience your program has, but TeX is good and I would go with it.
Another possible choice is the Excel format, parsed with xlrd.
I've used it a couple of times and it's pretty straightforward.
An Excel file is a good choice for the following reasons:
It is a well-known format that is easy to edit
You could prepare a predefined template with constraints and a table
Another option is creating XML documents, transforming them to XSL-FO, and rendering with FOP or RenderX. If you use DocBook as the primary input, there are freely available toolchains for converting it to PDF, RTF, HTML and so forth.
It is rather quirky to use and not my idea of fun, but it does deliver and can be embedded in an application, AFAICT.
Creating DocBook is very straightforward, as it has a wide range of semantic tags, table support, etc., giving you "meaningful" markup that can be reliably formatted. The XSL stylesheets are modular and allow parts to be customized or replaced to create your own look and feel.
It works well for relatively free-flowing documents with lots of text.
For fill-in-the-blanks kinds of documents, a regular reporting engine may be a better fit, or some straightforward XSL stylesheets that emit the XSL-FO directly.
