Converting PDF to images automatically - python

So the state I'm in released a bunch of data in PDF form, but to make matters worse, most (all?) of the PDFs appear to be letters typed in Office, printed/fax, and then scanned (our government at its best eh?). At first I thought I was crazy, but then I started seeing numerous pdfs that are 'tilted', like someone didn't get them on the scanner properly. So, I figured the next best thing to getting the actual text out of them, would be to turn each page into an image.
Obviously this needs to be automated, and I'd prefer to stick with Python if possible. If Ruby or Perl have some form of implementation that's just too awesome to pass up, I can go that route. I've tried pyPDF for text extraction, that obviously didn't do me much good. I've tried swftools, but the images I'm getting from that are just shy of completely unusable. It just seems like the fonts get ruined in the conversion. I also don't even really care about the image format on the way out, just as long as they're relatively lightweight, and readable.

If the PDFs are truly scanned images, then you shouldn't convert the PDF to an image, you should extract the image from the PDF. Most likely, all of the data in the PDF is essentially one giant image, wrapped in PDF verbosity to make it readable in Acrobat.
You should try the simple expedient of simply finding the image in the PDF, and copying the bytes out: Extracting JPGs from PDFs. The code there is dead simple, and there are probably dozens of reasons it won't work on your PDF files. But if it does, you'll have a quick and painless way to get the image data out of the PDF files.

You could call e.g. pdftoppm from the command-line (or using Python's subprocess module) and then convert the resulting PPM files to the desired format using e.g. ImageMagick (again, using subprocess or some bindings if they exist).

Ghostscript is ideal for converting PDF files to images. It is reliable and has many configurable options. Its also available under the GPL license or commercial license. You can call it from the command line or use its native API. For more information:
Ghostscript Main Website
Ghostscript docs on Command line usage
Another stackoverflow thread that provides some examples of invoking Ghostscript's command line interface from Python
Ghostscript API Documentation

Here's an alternative approach to turning a .pdf file into images: Use an image printer. I've successfully used the function below to "print" pdf's to jpeg images with ImagePrinter Pro. However, there are MANY image printers out there. Pick the one you like. Some of the code may need to be altered slightly based on the image printer you pick and the standard file saving format that image printer uses.
import win32api
import os
def pdf_to_jpg(pdfPath, pages):
# print pdf using jpg printer
# 'pages' is the number of pages in the pdf
filepath = pdfPath.rsplit('/', 1)[0]
filename = pdfPath.rsplit('/', 1)[1]
#print pdf to jpg using jpg printer
tempprinter = "ImagePrinter Pro"
printer = '"%s"' % tempprinter
win32api.ShellExecute(0, "printto", filename, printer, ".", 0)
# Add time delay to ensure pdf finishes printing to file first
fileFound = False
if pages > 1:
jpgName = filename.split('.')[0] + '_' + str(pages - 1) + '.jpg'
else:
jpgName = filename.split('.')[0] + '.jpg'
jpgPath = filepath + '/' + jpgName
waitTime = 30
for i in range(waitTime):
if os.path.isfile(jpgPath):
fileFound = True
break
else:
time.sleep(1)
# print Error if the file was never found
if not fileFound:
print "ERROR: " + jpgName + " wasn't found after " + str(waitTime)\
+ " seconds"
return jpgPath
The resulting jpgPath variable tells you the path location of the last jpeg page of the pdf printed. If you need to get another page, you can easily add some logic to modify the path to get prior pages

in pdf_to_jpg(pdfPath)
6 # 'pages' is the number of pages in the pdf
7 filepath = pdfPath.rsplit('/', 1)[0]
----> 8 filename = pdfPath.rsplit('/', 1)[1]
9
10 #print pdf to jpg using jpg printer
IndexError: list index out of range

With Wand there are now excellent imagemagick bindings for Python that make this a very easy task.
Here is the code necessary for converting a single PDF file into a sequence of PNG images:
from wand.image import Image
input_path = "name_of_file.pdf"
output_name = "name_of_outfile_{index}.png"
source = Image(filename=upload.original.path, resolution=300, width=2200)
images = source.sequence
for i in range(len(images)):
Image(images[0]).save(filename=output_name.format(i))

Related

Use PDF as Background for another PDF PYTHON

I have two pdfs " file.pdf ; BACKGROUND.pdf ". i wanna use BACKGROUND.pdf as background for file.pdf with python.
i don't know where to start i'm a python developper beginner
Maybe you could convert your first PDF to an image using this :
https://www.geeksforgeeks.org/convert-pdf-to-image-using-python/
# import module
from pdf2image import convert_from_path
# Store Pdf with convert_from_path function
images = convert_from_path('example.pdf')
for i in range(len(images)):
# Save pages as images in the pdf
images[i].save('page'+ str(i) +'.jpg', 'JPEG')
Then use the image as background for the second file.
Or use another util :
How to programmatically add a background to a pdf?
qpdf --underlay "background.pdf" -- file.pdf output.pdf
Should work for most cases.
Python users can use https://github.com/pikepdf/pikepdf a wrapper around qpdf
documentation at https://pikepdf.readthedocs.io/en/latest/
especially for this case Overlays, underlays, watermarks, n-up

python ghostscript not closing output file

I'm trying to turn PDF files with one or many pages into images for each page. This is very much like the question found here. In fact, I'm trying to use the code from #Idan Yacobi in that post to accomplish this. His code looks like this:
import ghostscript
def pdf2jpeg(pdf_input_path, jpeg_output_path):
args = ["pdf2jpeg", # actual value doesn't matter
"-dNOPAUSE",
"-sDEVICE=jpeg",
"-r144",
"-sOutputFile=" + jpeg_output_path,
pdf_input_path]
ghostscript.Ghostscript(*args)
When I run the code I get the following output from python:
##### 238647312 c_void_p(238647312L)
When I look at the folder where the new .jpg image is supposed to be created, there is a file there with the new name. However, when I attempt to open the file, the image preview says "Windows Photo Viewer can't open this picture because the picture is being edited in another program."
It seems that for some reason Ghostscript opened the file and wrote to it, but didn't close it after it was done. Is there any way I can force that to happen? Or, am I missing something else?
I already tried changing the last line above to the code below to explicitly close ghostscript after it was done.
GS = ghostscript.Ghostscript(*args)
GS.exit()
I was having the same problem where the image files were kept open but when I looked into the ghostscript init.py file (found in the following directory: PythonDirectory\Lib\site-packages\ghostscript__init__.py), the exit method has a line commented.
The gs.exit(self._instance) line is commented by default but when you uncomment the line, the image files are being closed.
def exit(self):
global __instance__
if self._initialized:
print '#####', self._instance.value, __instance__
if __instance__:
gs.exit(self._instance) # uncomment this line
self._instance = None
self._initialized = False
I was having this same problem while batching a large number of pdfs, and I believe I've isolated the problem to an issue with the python bindings for Ghostscript, in that like you said, the image file is not properly closed. To bypass this, I had to go to using an os system call. so given your example, the function and call would be replaced with:
os.system("gs -dNOPAUSE -sDEVICE=jpeg -r144 -sOutputFile=" + jpeg_output_path + ' ' + pdf_input_path)
You may need to change "gs" to "gswin32c" or "gswin64c" depending on your operating system. This may not be the most elegant solution, but it fixed the problem on my end.
My work around was actually just to install an image printer and have Python print the PDF using the image printer instead, thus creating the desired jpeg image. Here's the code I used:
import win32api
def pdf_to_jpg(pdf_path):
"""
Turn pdf into jpg image(s) using jpg printer
:param pdf_path: Path of the PDF file to be converted
"""
# print pdf to jpg using jpg printer
tempprinter = "ImagePrinter Pro"
printer = '"%s"' % tempprinter
win32api.ShellExecute(0, "printto", pdf_path, printer, ".", 0)
I was having the same problem when running into a password protected PDF - ghostscript would crash and not close the PDF preventing me from deleting the PDF.
Kishan's solution was already applied for me and therefore it wouldn't help my problem.
I fixed it by importing GhostscriptError and instantiating an empty Ghostscript before a try/finally block like so:
from ghostscript import GhostscriptError
from ghostscript import Ghostscript
...
# in my decryptPDF function
GS = Ghostscript()
try:
GS = Ghostscript(*args)
finally:
GS.exit()
...
# in my function that runs decryptPDF function
try:
if PDFencrypted(append_file_path):
decryptPDF(append_file_path)
except GhostscriptError:
remove(append_file_path)
# more code to log and handle the skipped file
...
For those that stumble upon this with the same problem. I looked through the python ghostscript init file and discovered the ghostscript.cleanup() function/def.
Therefore, I was able to solve the problem by adding this simple one-liner to the end of my script [or the end of the loop].
ghostscript.cleanup()
Hope it helps someone else because it frustrated me for quite a while.

Issue writing temp images to temp pdf in pyramid with reportlabs

I am using python 3, with pyramid and reportlabs to generate dynamic pdfs.
I am having a issue writing images in to a pdf. I am using Reportlab in a web to generate a pdf with images, by my images are not stored locally, they are on a remote server. I am downloading them locally into a temp directory ( they are saving, I have checked) When i go to add the images to the pdf, they space is allocating but image is not showing up.
Here is my relevant code (simplified):
# creates pdf in memory
doc = SimpleDocTemplate(pdfName, pagesize=A4)
elements = []
for item in model['items']:
# image goes here:
if item['IMAGENAME']:
response = getImageFromRemoteServer(item['IMAGENAME'])
dir_filename = directory + item['IMAGENAME']
if response.status_code == 200:
with open(dir_filename, 'wb') as f:
for chunk in response.iter_content():
f.write(chunk)
questions.append(Image(dir_filename, width=2*inch, height=2*inch))
# create and save the pdf
doc.build(elements,canvasmaker=NumberedCanvas)
I have followed the user guide here https://www.reportlab.com/docs/reportlab-userguide.pdf and have tried the above way, plus embedded images (as the user guide says in the paragraph section) and putting the image in the table.
I also looked here: and it did not help me.
My question is really, what is the right what to download an image and put in a pdf?
EDIT: fixed code indentation
EDIT 2:
Answered, I was finally about to get the images in the PDF. I am not sure what was the trigger to get it to work. The only thing that know I change was now I am using urllib to do the request and before i was not. Here is the my working code (simplified for the question only, this is more abstracted and encapsulated in my code.):
doc = SimpleDocTemplate(pdfName, pagesize=A4)
# array of elements in the pdf
elements = []
for question in model['questions']:
# image goes here:
if question['IMAGEFILE']:
filename = question['IMAGEFILE']
dir_filename = directory + filename
url = get_url(settings, filename)
response = urllib.request.urlopen(url)
raw_data = response.read()
f = open(dir_filename, 'wb')
f.write(raw_data)
f.close()
response.close()
myImage = Image(dir_filename)
myImage.drawHeight = 2* inch
myImage.drawWidth = 2* inch
myImage.hAlign = "LEFT"
elements.append(myImage)
# create and save the pdf
doc.build(elements)
Make your code independent from where the files come from. Separate file/resource retrieval from document generation. Ensure that your toolset is working with local files. Encapsulate the code to load files in a loader class or function. The encapsulation is what matters. Noticed this again this week while looking at thumbor loader classes.
If that works, you know reportlab, PIL and your application basically work.
Then make your code work with remote files using URI like http://path/to/remote/files.
Afterwards you can switch from using your fileloader or your httploader depending on environment or use case.
Another option to go would be to make your code work with local files using URI like file://path/to/file
This way the only thing that changes when switching from local to remote is the URL. Probably you need a python library supporting this. requests library is well suited for downloading things, most probably it supports URL scheme file:// as well.
Most probably the lazy parameter was responsible that your first code sample did not render the images. Triggering reportlab PDF rendering outside of the context managers for temporary files could have lead to this behaviour.
reportlab.platypus.flowables.py (using version 3.1.8)
class Image(Flowable):
"""an image (digital picture). Formats supported by PIL/Java 1.4 (the Python/Java Imaging Library
are supported. At the present time images as flowables are always centered horozontally
in the frame. We allow for two kinds of lazyness to allow for many images in a document
which could lead to file handle starvation.
lazy=1 don't open image until required.
lazy=2 open image when required then shut it.
"""
_fixedWidth = 1
_fixedHeight = 1
def __init__(self, filename, width=None, height=None, kind='direct', mask="auto", lazy=1):
"""If size to draw at not specified, get it from the image."""
self.hAlign = 'CENTER'
self._mask = mask
fp = hasattr(filename,'read')
if fp:
self._file = filename
self.filename = repr(filename)
...
The last three lines of the code example tell you that you can pass an object that has a read method. This is exactly what a call to urllib.request.urlopen(url) returns. Using that memory buffer you create an instance of Image. No need to have write access to filesystem, no need to delete these files after PDF rendering. Applying our new knowledge to improve code readability. Since your use-case includes retrieval of remote resources using memory buffers that support python file API could be a much cleaner approach to assemble your PDF files.
from contextlib import closing
import urllib.request
doc = SimpleDocTemplate(pdfName, pagesize=A4)
# array of elements in the pdf
elements = []
for question in model['questions']:
# download image and create Image from file-like object
if question['IMAGEFILE']:
filename = question['IMAGEFILE']
image_url = get_url(settings, filename)
with closing(urllib.request.urlopen(image_url)) as image_file:
myImage = Image(image_file, width=2*inch, height=2*inch)
myImage.hAlign = "LEFT"
elements.append(myImage)
# create and save the pdf
doc.build(elements)
References
Coding with context managers

How to handle multi-page images in PythonMagick?

I want to convert some multi-pages .tif or .pdf files to individual .png images. From command line (using ImageMagick) I just do:
convert multi_page.pdf file_out.png
And I get all the pages as individual images (file_out-0.png, file_out-1.png, ...)
I would like to handle this file conversion within Python, unfortunately PIL cannot read .pdf files, so I want to use PythonMagick. I tried:
import PythonMagick
im = PythonMagick.Image('multi_page.pdf')
im.write("file_out%d.png")
or just
im.write("file_out.png")
But I only get 1 page converted to png.
Of course I could load each pages individually and convert them one by one. But there must be a way to do them all at once?
ImageMagick is not memory efficient, so if you try to read a large pdf, like 100 pages or so, the memory requirement will be huge and it might crash or seriously slow down your system. So after all reading all pages at once with PythonMagick is a bad idea, its not safe.
So for pdfs, I ended up doing it page by page, but for that I need to get the number of pages first using pyPdf, its reasonably fast:
pdf_im = pyPdf.PdfFileReader(file('multi_page.pdf', "rb"))
npage = pdf_im.getNumPages()
for p in npage:
im = PythonMagick.Image('multi_page.pdf['+ str(p) +']')
im.write('file_out-' + str(p)+ '.png')
A more complete example based on the answer by Ivo Flipse and http://p-s.co.nz/wordpress/pdf-to-png-using-pythonmagick/
This uses a higher resolution and uses PyPDF2 instead of older pyPDF.
import sys
import PyPDF2
import PythonMagick
pdffilename = sys.argv[1]
pdf_im = PyPDF2.PdfFileReader(file(pdffilename, "rb"))
npage = pdf_im.getNumPages()
print('Converting %d pages.' % npage)
for p in range(npage):
im = PythonMagick.Image()
im.density('300')
im.read(pdffilename + '[' + str(p) +']')
im.write('file_out-' + str(p)+ '.png')
I had the same problem and as a work around i used ImageMagick and did
import subprocess
params = ['convert', 'src.pdf', 'out.png']
subprocess.check_call(params)

How to get the diff of two PDF files using Python?

I need to find the difference between two PDF files. Does anybody know of any Python-related tool which has a feature that directly gives the diff of the two PDFs?
What do you mean by "difference"? A difference in the text of the PDF or some layout change (e.g. an embedded graphic was resized). The first is easy to detect, the second is almost impossible to get (PDF is an VERY complicated file format, that offers endless file formatting capabilities).
If you want to get the text diff, just run a pdf to text utility on the two PDFs and then use Python's built-in diff library to get the difference of the converted texts.
This question deals with pdf to text conversion in python: Python module for converting PDF to text.
The reliability of this method depends on the PDF Generators you are using. If you use e.g. Adobe Acrobat and some Ghostscript-based PDF-Creator to make two PDFs from the SAME word document, you might still get a diff although the source document was identical.
This is because there are dozens of ways to encode the information of the source document to a PDF and each converter uses a different approach. Often the pdf to text converter can't figure out the correct text flow, especially with complex layouts or tables.
I do not know your use case, but for regression tests of script which generates pdf using reportlab, I do diff pdfs by
Converting each page to an image using ghostsript
Diffing each page against page image of standard pdf, using PIL
e.g
im1 = Image.open(imagePath1)
im2 = Image.open(imagePath2)
imDiff = ImageChops.difference(im1, im2)
This works in my case for flagging any changes introduced due to code changes.
Met the same question on my encrypted pdf unittest, neither pdfminer nor pyPdf works well for me.
Here are two commands (pdftocairo, pdftotext) work perfect on my test. (Ubuntu Install: apt-get install poppler-utils)
You can get pdf content by:
from subprocess import Popen, PIPE
def get_formatted_content(pdf_content):
cmd = 'pdftocairo -pdf - -' # you can replace "pdftocairo -pdf" with "pdftotext" if you want to get diff info
ps = Popen(cmd, shell=True, stdin=PIPE, stdout=PIPE, stderr=PIPE)
stdout, stderr = ps.communicate(input=pdf_content)
if ps.returncode != 0:
raise OSError(ps.returncode, cmd, stderr)
return stdout
Seems pdftocairo can redraw pdf files, pdftotext can extract all text.
And then you can compare two pdf files:
c1 = get_formatted_content(open('f1.pdf').read())
c2 = get_formatted_content(open('f2.pdf').read())
print(cmp(c1, c2)) # for binary compare
# import difflib
# print(list(difflib.unified_diff(c1, c2))) # for text compare
Even though this question is quite old, my guess is that I can contribute to the topic.
We have several applications generating tons of PDFs. One of these apps is written in Python and recently I wanted to write integration tests to check if the PDF generation was working correctly.
Testing PDF generation is HARD, because the specs for PDF files are very complicated and non-deterministic. Two PDFs, generated with the same exact input data, will generate different files, so direct file comparison is discarded.
The solution: we have to go with testing the way they look like (because THAT should be deterministic!).
In our case, the PDFs are being generated with the reportlab package, but this doesn't matter from the test perspective, we just need a filename or the PDF blob (bytes) from the generator. We also need an expectation file containing a "good" PDF to compare with the one coming from the generator.
The PDFs are converted to images and then compared. This can be done in multiple ways, but we decided to use ImageMagick, because it is extremely versatile and very mature, with bindings for almost every programming language out there. For Python 3, the bindings are offered by the Wand package.
The test looks something like the following. Specific details of our implementation were removed and the example was simplified:
import os
from unittest import TestCase
from wand.image import Image
from app.generators.pdf import PdfGenerator
DIR = os.path.dirname(__file__)
class PdfGeneratorTest(TestCase):
def test_generated_pdf_should_match_expectation(self):
# `pdf` is the blob of the generated PDF
# If using reportlab, this is what you get calling `getpdfdata()`
# on a Canvas instance, after all the drawing is complete
pdf = PdfGenerator().generate()
# PDFs are vectorial, so we need to set a resolution when
# converting to an image
actual_img = Image(blob=pdf, resolution=150)
filename = os.path.join(DIR, 'expected.pdf')
# Make sure to use the same resolution as above
with Image(filename=filename, resolution=150) as expected:
diff = actual.compare(expected, metric='root_mean_square')
self.assertLess(diff[1], 0.01)
The 0.01 is as low as we can tolerate small differences. Considering that diff[1] varies from 0 to 1 using the root_mean_square metric, we are here accepting a difference up to 1% on all channels, comparing with the sample expected file.
Check this out, it can be useful: pypdf

Categories

Resources