Script to search for text from PDF

Problem
On the Mac OS X platform, I would like to write a script, in either Python or Tcl, to search for text within a PDF file and extract the relevant parts. I appreciate any help.
Background
I am writing scripts to look inside a PDF to determine if it is a bill, from what company, and for what period. Based on this information, I rename the PDF and move it to an appropriate directory. For example, a file such as Statement_03948293929384.pdf might become 2012-07-15 Water Bill.pdf and be moved to my Utilities folder.
What have I done so far?
I have searched for PDF-to-plain-text tools, but have not found anything yet.
I have looked into the Tcl wiki and found an example, but could not get it to work (I searched for text that is in the PDF, but it was not found).
I am looking into pdf-parser.py by Didier Stevens
I heard of a Python package called pyPdf and will look at it next.
Update
I have found a command-line tool called pdftotext written by Glyph & Cog, LLC; built and packaged by Carsten Bluem. This tool is straightforward and it solves my problem. I am still looking for tools that can search a PDF directly, without having to convert it to a text file first.
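
A minimal sketch of how the pdftotext approach might be scripted from Python, assuming pdftotext is installed and on the PATH. The function names pdf_to_text and find_lines are hypothetical helpers invented for this illustration, not part of any library; passing "-" as the output file tells pdftotext to write the text to standard output.

```python
import re
import subprocess

def pdf_to_text(pdf_path):
    """Return the text of a PDF by running the pdftotext command-line tool."""
    # "-" as the output file asks pdftotext to write to standard output.
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

def find_lines(text, pattern):
    """Return (line_number, line) pairs for lines matching a regular expression."""
    regex = re.compile(pattern, re.IGNORECASE)
    return [
        (number, line.strip())
        for number, line in enumerate(text.splitlines(), start=1)
        if regex.search(line)
    ]

# Usage (the file name is the one from the question):
#   text = pdf_to_text("Statement_03948293929384.pdf")
#   for number, line in find_lines(text, r"water"):
#       print(number, line)
```

Keeping the search separate from the conversion means the same matching logic can be reused if the text later comes from a different extractor.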

I have successfully used PyODConverter to convert to/from PDFs (there is also a more powerful Java version). Once you have the PDF converted to text it should be trivial to do the searching. Also I believe iText should be capable of doing similar things, but I haven't tested it.

Related

Python + Linux - Excel to HTML (keeping format)

I'm looking for a way to convert excel to html while preserving formatting.
I know this is doable on Windows thanks to some underlying win32 libraries (e.g. via xlwings; see "Python - Excel to HTML (keeping format)").
But I'm looking for a solution on Linux.
I've also come by Aspose Cells but this requires a paid license or else it will add a lot of extra junk to the output that needs to be scrubbed out.
And lastly I tried the python lib xlsx2html but it does a very poor job at preserving formatting.
Are there any suggestions for a Linux based solution? I'd also be interested in tools written in other languages that can be easily wrapped around via python.
Thanks in advance!
Update:
Here is an example of a random excel sheet I converted via excel itself that I would like to reproduce. It has some colors, some border variations, some merged cells and some font sizes to see if they all work.

You can use LibreOffice to convert an Excel file to a HTML file using the command line:
# --convert-to implies --headless so it's not mandatory to specify --headless
soffice --headless --convert-to html data.xlsx
You can refer to the documentation to know more about other CLI parameters.
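
Since the question asks for something that can be wrapped from Python, here is a rough sketch of driving the same command with subprocess. It assumes soffice is on the PATH; the helper names build_convert_command and convert_to_html are made up for this example, and the --outdir option is added so the output location is predictable.

```python
import subprocess
from pathlib import Path

def build_convert_command(xlsx_path, out_dir, soffice="soffice"):
    """Build the soffice command line that converts a spreadsheet to HTML."""
    return [
        soffice, "--headless",
        "--convert-to", "html",
        "--outdir", str(out_dir),
        str(xlsx_path),
    ]

def convert_to_html(xlsx_path, out_dir, soffice="soffice"):
    """Run the conversion and return the path where the HTML file is expected."""
    subprocess.run(build_convert_command(xlsx_path, out_dir, soffice), check=True)
    # LibreOffice names the output after the input file, with the new extension.
    return Path(out_dir) / (Path(xlsx_path).stem + ".html")

# Usage:
#   html_path = convert_to_html("data.xlsx", "out")
```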

I think you should search for Excel-to-HTML tools in the JS world rather than in Python (I am not saying it is impossible in Python, but it is more common in JS); I promise you will get better results.
In my opinion, finding a JS-based solution and writing a Python wrapper around it would be more helpful, because the JS community has had to work harder than other communities to import and work with Excel files.
Another idea is to change your approach: look at how you can embed an Excel file in an HTML page with JS (or in an iframe) and then export it.
But again, I highly recommend checking JS libraries or GitHub repositories; some of them take care to preserve formatting.

How to automate SAS enterprise guide reports with Python Script?

I tried with SASPy but it's not working. I am able to open the SAS .egp file, but I am not able to run the multiple scripts within it in sequence.
import subprocess

def OpenProject(sas_exe, egp_path):
    # Launch the executable with the .egp project file as its argument.
    subprocess.call([sas_exe, egp_path])

sas_exe = r"path\path"        # placeholder path, elided in the original
egp_path = r"path\path\path"  # placeholder path, elided in the original
OpenProject(sas_exe, egp_path)

This depends a bit on exactly what the workflow is. A few side notes, then the full solution.
First: EGP is not really intended to store production processes, in my opinion. EGP should really be used for development, then production is done with .sas (text) files. EGP can directly store the nodes as .sas files; ask a new question about that if you want to know more, but it's pretty easy to figure out. Best practice is to have EGP save the code modules as .sas files, then run those - SASPy will easily do that for you.
Second: If you use SAS's built-in Git connectivity, then you can do this a bit more easily I suspect. Consider doing that if you already use Git for your other processes. Again, then you end up with a .sas file, and can directly run that via SASPy.
So: how can you do this in Python, with the assumption you do have to use the .egp itself, without too many different moving parts? The key here is the .egp format. EGP is a container file, which is actually a .zip format container that has in it, among other things, all of the SAS code you want to run, as text. Text in xml format, but still, text.
You can write a Python program that opens the .egp as a .zip file, using the zipfile library, and then use xml.etree.ElementTree to parse the project.xml file inside that project. Exactly what you do from there depends on your particular details, and is well out of scope for a Stack Overflow answer, but if you do better visually you can simply rename the .egp to .zip, open it in the unzip program of your choice, browse project.xml in your text editor, and find the nodes and the code related to those nodes.
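
As a starting point, the zipfile plus ElementTree idea could look roughly like this. It is only a sketch: the helper names are invented here, and it deliberately assumes nothing about the element names inside project.xml, so it just lists the archive members and counts the tags it finds. Locating the nodes that hold the SAS code is left to inspection of your own project.

```python
import zipfile
import xml.etree.ElementTree as ET

def list_archive_members(egp_path):
    """Return the names of all files stored inside the .egp container."""
    with zipfile.ZipFile(egp_path) as archive:
        return archive.namelist()

def read_project_xml(egp_path):
    """Parse project.xml from the .egp container and return its root element."""
    with zipfile.ZipFile(egp_path) as archive:
        with archive.open("project.xml") as handle:
            return ET.parse(handle).getroot()

def count_tags(root):
    """Count how often each element tag occurs, to get a feel for the layout."""
    counts = {}
    for element in root.iter():
        counts[element.tag] = counts.get(element.tag, 0) + 1
    return counts

# Usage (the project path is a placeholder):
#   print(list_archive_members("project.egp"))
#   print(count_tags(read_project_xml("project.egp")))
```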
You can then extract the .sas code as text, and submit it directly via SASPy, or extract it to a .sas file and then submit that however you prefer (SASPy or something else).
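
Once the code is available as text, running several programs in order through SASPy might be sketched as below. The helper submit_in_sequence is hypothetical; it only relies on the session having a submit(code) method that returns a dict with a 'LOG' entry, which is how saspy.SASsession.submit behaves. Checking the logs for errors is left out.

```python
def submit_in_sequence(session, programs):
    """Submit SAS programs one after another and collect each log.

    session  -- an object with a submit(code) method, such as saspy.SASsession
    programs -- iterable of (name, sas_code) pairs, in the order to run them
    """
    logs = []
    for name, code in programs:
        result = session.submit(code)  # dict with 'LOG' and 'LST' entries
        logs.append((name, result["LOG"]))
    return logs

# Usage with a real session (requires a configured SASPy installation):
#   import saspy
#   sas = saspy.SASsession()
#   logs = submit_in_sequence(sas, [("step1", "proc options; run;")])
#   sas.endsas()
```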
I do something similar to this for a project - I don't actually run code from it, I'm just parsing it to verify that the correct programs were synced from the EGP to production - but it would be trivial to actually submit the code from what I've written, which is about 50 lines of code total. I may write a SGF paper this year or next year on this topic, in which case I'll try and remember to submit it here - or you can head over to my github page and see if it's there (in the future!).

Compress PDFs using Python

So I have a gazillion pdfs in a folder, I want to recursively (using os.path.walk) shrink them. I see that adobe pro has a save as reduced size. Would I be able to use this / how do you suggest I do it otherwise.
Note: Yes, I would like them to stay as pdfs because I find that to be the most commonly used and installed fileviewer.

From the project's GitHub page for pdfsizeopt, which is written in Python:
"pdfsizeopt is a program for converting large PDF files to small ones. More specifically, pdfsizeopt is a free, cross-platform command-line application (for Linux, Mac OS X, Windows and Unix) and a collection of best practices to optimize the size of PDF files, with focus on PDFs created from TeX and LaTeX documents. pdfsizeopt is written in Python..."
You can probably easily adapt this to your specific needs.
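
For the recursive part, a sketch along these lines may help. It uses os.walk (os.path.walk from the question exists only in Python 2) and takes the compressor command as a parameter, so nothing about any tool's options is baked in. The helper names and the ".small.pdf" suffix are choices made for this example, and the "pdfsizeopt input output" invocation in the usage comment is an assumption to check against the project's README.

```python
import os
import subprocess

SHRUNK_SUFFIX = ".small.pdf"  # hypothetical naming scheme for the outputs

def find_pdfs(root):
    """Yield every PDF under root, skipping files this script already produced."""
    for folder, _subdirs, filenames in os.walk(root):
        for filename in sorted(filenames):
            lower = filename.lower()
            if lower.endswith(".pdf") and not lower.endswith(SHRUNK_SUFFIX):
                yield os.path.join(folder, filename)

def shrink_all(root, build_command):
    """Run build_command(input_path, output_path) for every PDF under root."""
    # Materialize the list first so new output files are not walked again.
    for pdf_path in list(find_pdfs(root)):
        base, _ext = os.path.splitext(pdf_path)
        out_path = base + SHRUNK_SUFFIX
        subprocess.run(build_command(pdf_path, out_path), check=True)

# Usage (assumed invocation, verify before relying on it):
#   shrink_all("/path/to/folder", lambda src, dst: ["pdfsizeopt", src, dst])
```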

Realize this is an old question. Thought I would suggest an alternative to pdfsizeopt, as I have experienced quality loss using it for PDFs of maps. PDFTron offers a comprehensive set of functionality. Here is a snippet modified from their web-page (see "example 1"):
import site
site.addsitedir(r"...pathToPDFTron\PDFNetWrappersWin32\PDFNetC\Lib")
from PDFNetPython import PDFDoc, Optimizer, SDFDoc
doc = PDFDoc(inPDF_Path)
doc.InitSecurityHandler()
Optimizer.Optimize(doc)
doc.Save(outPDF_Path, SDFDoc.e_linearized)
doc.Close()

What's a good document standard to use programmatically?

I'm writing a program that requires input in the form of a document, it needs to replace a few values, insert a table, and convert it to PDF. It's written in Python + Qt (PyQt). Is there any well known document standard which can be easily used programmatically? It must be cross platform, and preferably open.
I have looked into Microsoft Doc and Docx, which are binary formats and I can't edit them. Python has bindings for it, but they're only on Windows.
Open Office's ODT/ODF is XML inside a zip file, so I can edit that one, but there are no command-line utilities or any other way to programmatically convert the file to a PDF. Open Office provides bindings, but you need to run Open Office from the command line, start a server, etc. And my clients may not have Open Office installed.
RTF is readable from Python, but I couldn't find any way/libraries to convert RTF documents to PDF.
At the moment I'm exporting from Microsoft Word to HTML, replacing the values and using PyQt to convert it to a PDF. However it loses formatting features and looks awful. I'm surprised there isn't a well known library which lets you edit a variety of document formats and convert them into other formats, am I missing something?
Update: Thanks for the advice, I'll have a look at using Latex.
Thanks,
Jackson

Have you looked into using LaTeX documents?
They are perfect to use programmatically (compiling documents? You gotta love that...), and you have several Python frameworks you can use such as plasTeX and PyTex.
Exporting a LaTeX document to PDF is almost immediate.
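
To illustrate the workflow from the question (replace a few values, insert a table, convert to PDF), here is a small sketch using plain string handling rather than either framework named above. The @@NAME@@ placeholder convention and all helper names are invented for this example, and compile_pdf assumes pdflatex is installed and on the PATH.

```python
import subprocess

# Characters with special meaning in LaTeX and their escaped forms.
LATEX_SPECIALS = {
    "\\": r"\textbackslash{}",
    "&": r"\&", "%": r"\%", "$": r"\$", "#": r"\#",
    "_": r"\_", "{": r"\{", "}": r"\}",
    "~": r"\textasciitilde{}", "^": r"\textasciicircum{}",
}

def escape_latex(text):
    """Escape LaTeX special characters one character at a time."""
    return "".join(LATEX_SPECIALS.get(ch, ch) for ch in str(text))

def render_table(header, rows):
    """Build a simple left-aligned tabular environment from rows of cells."""
    lines = [r"\begin{tabular}{" + "l" * len(header) + "}"]
    lines.append(" & ".join(escape_latex(c) for c in header) + r" \\ \hline")
    for row in rows:
        lines.append(" & ".join(escape_latex(c) for c in row) + r" \\")
    lines.append(r"\end{tabular}")
    return "\n".join(lines)

def fill_template(template, values):
    """Replace each @@NAME@@ placeholder with its (already escaped) value."""
    for name, value in values.items():
        template = template.replace("@@" + name + "@@", value)
    return template

def compile_pdf(tex_path):
    """Compile a .tex file to PDF with pdflatex."""
    subprocess.run(["pdflatex", "-interaction=nonstopmode", tex_path], check=True)
```

Values are escaped before substitution, while the table is inserted as ready-made LaTeX, so the two kinds of replacement do not interfere with each other.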

Since you're already using PyQt anyway, it might be worth looking at Qt's built-in RTF processing module which looks decent. Here's the documentation on detailed content manipulation including inserting tables. Also the QPrinter module's default print-to-file format happens to be PDF.
Without knowing more about your particular needs it's hard to say if these would do what you want, but since your application already has PyQt as a dependency, seems silly to introduce any more without evaluating the functionality you've already got available.
The non-GUI parts of the Qt framework are often overlooked though.

You might want to try ReportLab. The open source version can write PDFs, and the commercial version has a lot of really nice abstractions to allow output to a variety of different formats from a single input.

I don't know what kind of audience your program has; TeX is good and I would go with it.
Another possible choice is the Excel format, parsing it with xlrd.
I've used it a couple of times and it's pretty straightforward.
An Excel file is a good choice for the following reasons:
It is a well-known format that is easy to edit
You could prepare a predefined template with constraints and tables

Creating XML documents, transforming them to XSL-FO and rendering with FOP or RenderX. If you use DocBook as the primary input, there are toolchains freely available for converting that to PDF, RTF, HTML and so forth.
It is rather quirky to use and not my idea of fun, but it does deliver and can be embedded in an application, AFAICT.
Creating DocBook is very straightforward, as it has a wide range of semantic tags, table support, etc. to give a "meaningful" markup which can be reliably formatted. The XSL stylesheets are modular and allow parts to be customized or replaced to generate your own look and feel.
It works well for relatively free-flowing documents with lots of text.
For fill-in-the-blanks kinds of documents, a regular reporting engine may be a better fit, or some straightforward XSL stylesheets spitting out the XSL-FO directly.

Using Sphinx to create context-sensitive help files in HTML

I am currently using AsciiDoc for documenting my software projects because it supports PDF and HTML help generation. I am currently running it through Cygwin so that the a2x toolchain functions properly. This works well for me but is a pain to set up on other Windows computers. I have been looking for alternative methods and recently revisited Sphinx. Noticing that it now produces HTML help files, I gave it a try and it seems to work well in the small tests I performed.
My question is: is there a way to specify map IDs for context-sensitive help in the text, so that my Windows programs can call the proper help API and the file is launched and opened to the desired location?
In AsciiDoc I am using pass::[<?dbhh topicname="_about" topicid="801"?>]. By using these constructs a context.h and alias.h are generated along with the other HTML help files (context sensitive help information).

I do not know much about AsciiDoc, but in Sphinx you can reference arbitrary locations by placing anchors where you need them. See the :ref: role.