How to get the diff of two PDF files using Python?

I need to find the difference between two PDF files. Does anybody know of any Python-related tool which has a feature that directly gives the diff of the two PDFs?

What do you mean by "difference"? A difference in the text of the PDF, or some layout change (e.g. an embedded graphic was resized)? The first is easy to detect; the second is almost impossible to get (PDF is a very complicated file format that offers endless formatting capabilities).
If you want the text diff, just run a PDF-to-text utility on the two PDFs and then use Python's built-in difflib module to get the difference between the converted texts.
This question deals with pdf to text conversion in python: Python module for converting PDF to text.
The reliability of this method depends on the PDF Generators you are using. If you use e.g. Adobe Acrobat and some Ghostscript-based PDF-Creator to make two PDFs from the SAME word document, you might still get a diff although the source document was identical.
This is because there are dozens of ways to encode the information of the source document to a PDF and each converter uses a different approach. Often the pdf to text converter can't figure out the correct text flow, especially with complex layouts or tables.
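
To make the text-diff step concrete, here is a minimal sketch using difflib from the standard library. It assumes the two PDFs have already been converted to plain text by whatever utility you prefer; the function name text_diff and the sample strings are made up for illustration.

```python
import difflib


def text_diff(old_text, new_text, old_name="a.txt", new_name="b.txt"):
    """Return a unified diff of two extracted texts as a list of lines."""
    return list(difflib.unified_diff(
        old_text.splitlines(),
        new_text.splitlines(),
        fromfile=old_name,
        tofile=new_name,
        lineterm="",
    ))


# Stand-ins for the output of a PDF-to-text tool (hypothetical content).
old = "Invoice 42\nTotal: 100 EUR\n"
new = "Invoice 42\nTotal: 120 EUR\n"

diff_lines = text_diff(old, new)
print("\n".join(diff_lines))
```

An empty list means the extracted texts are identical, which, as noted above, does not guarantee that the two PDFs look the same.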

I do not know your use case, but for regression tests of a script that generates PDFs using reportlab, I diff PDFs by:
Converting each page to an image using Ghostscript
Diffing each page image against the corresponding page image of a reference PDF, using PIL
e.g.
from PIL import Image, ImageChops
im1 = Image.open(imagePath1)
im2 = Image.open(imagePath2)
imDiff = ImageChops.difference(im1, im2)
This works in my case for flagging any changes introduced by code changes.

I ran into the same problem in a unit test for encrypted PDFs; neither pdfminer nor pyPdf worked well for me.
Two commands, pdftocairo and pdftotext, worked perfectly in my tests. (Install on Ubuntu: apt-get install poppler-utils)
You can get pdf content by:
from subprocess import Popen, PIPE
def get_formatted_content(pdf_content):
    cmd = 'pdftocairo -pdf - -'  # you can replace "pdftocairo -pdf" with "pdftotext" if you want to get diff info
    ps = Popen(cmd, shell=True, stdin=PIPE, stdout=PIPE, stderr=PIPE)
    stdout, stderr = ps.communicate(input=pdf_content)
    if ps.returncode != 0:
        raise OSError(ps.returncode, cmd, stderr)
    return stdout
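It seems pdftocairo can redraw PDF files, while pdftotext extracts all the text.
You can then compare two PDF files:
with open('f1.pdf', 'rb') as f1, open('f2.pdf', 'rb') as f2:
    c1 = get_formatted_content(f1.read())
    c2 = get_formatted_content(f2.read())
print(c1 == c2)  # binary compare (cmp() no longer exists in Python 3)
# import difflib
# t1, t2 = c1.decode('utf-8').splitlines(), c2.decode('utf-8').splitlines()
# print('\n'.join(difflib.unified_diff(t1, t2, lineterm='')))  # text compare, when using pdftotext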

Even though this question is quite old, my guess is that I can contribute to the topic.
We have several applications generating tons of PDFs. One of these apps is written in Python and recently I wanted to write integration tests to check if the PDF generation was working correctly.
Testing PDF generation is HARD, because the PDF spec is very complicated and the output is non-deterministic. Two PDFs generated from exactly the same input data will usually differ at the byte level (for example, in embedded timestamps), so direct file comparison is out.
The solution: test the way they look (because THAT should be deterministic!).
In our case, the PDFs are being generated with the reportlab package, but this doesn't matter from the test perspective, we just need a filename or the PDF blob (bytes) from the generator. We also need an expectation file containing a "good" PDF to compare with the one coming from the generator.
The PDFs are converted to images and then compared. This can be done in multiple ways, but we decided to use ImageMagick, because it is extremely versatile and very mature, with bindings for almost every programming language out there. For Python 3, the bindings are offered by the Wand package.
The test looks something like the following. Specific details of our implementation were removed and the example was simplified:
import os
from unittest import TestCase
from wand.image import Image
from app.generators.pdf import PdfGenerator
DIR = os.path.dirname(__file__)
class PdfGeneratorTest(TestCase):
    def test_generated_pdf_should_match_expectation(self):
        # `pdf` is the blob of the generated PDF
        # If using reportlab, this is what you get calling `getpdfdata()`
        # on a Canvas instance, after all the drawing is complete
        pdf = PdfGenerator().generate()
        # PDFs are vectorial, so we need to set a resolution when
        # converting to an image
        actual = Image(blob=pdf, resolution=150)
        filename = os.path.join(DIR, 'expected.pdf')
        # Make sure to use the same resolution as above
        with Image(filename=filename, resolution=150) as expected:
            # compare() returns a tuple; element 1 is the metric value
            diff = actual.compare(expected, metric='root_mean_square')
            self.assertLess(diff[1], 0.01)
The 0.01 threshold is the largest difference we are willing to tolerate. Since diff[1] ranges from 0 to 1 with the root_mean_square metric, we accept a difference of up to 1% across all channels relative to the expected sample file.
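
To get a feel for what a normalized root-mean-square value between 0 and 1 means, here is a small pure-Python sketch. It is only an illustration of the metric on flat lists of 8-bit samples; it is not ImageMagick's implementation, and the function name normalized_rmse is made up for this example.

```python
import math


def normalized_rmse(pixels_a, pixels_b, max_value=255):
    """Root-mean-square error between two sample lists, scaled to the range 0..1."""
    if len(pixels_a) != len(pixels_b):
        raise ValueError("both images must have the same number of samples")
    if not pixels_a:
        return 0.0
    squared_error = sum((a - b) ** 2 for a, b in zip(pixels_a, pixels_b))
    return math.sqrt(squared_error / len(pixels_a)) / max_value


identical = normalized_rmse([10, 20, 30], [10, 20, 30])
opposite = normalized_rmse([0, 0], [255, 255])
one_in_four = normalized_rmse([0, 0, 0, 0], [0, 0, 0, 255])
print(identical, opposite, one_in_four)
```

Identical inputs give 0, fully opposite inputs give 1, and a single fully wrong sample out of four gives 0.5. That shows how quickly a localized change pushes the value past a 0.01 threshold.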

Check this out, it can be useful: pypdf

Related

Extracting Powerpoint background images using python-pptx

I have several powerpoints that I need to shuffle through programmatically and extract images from. The images then need to be converted into OpenCV format for later processing/analysis. I have done this successfully for images in the pptx, using:
for slide in presentation.slides:
    for shape in slide.shapes:
        if 'Picture' in shape.name:
            pic_list.append(shape)
for extraction, and:
img = cv2.imdecode(np.frombuffer(page[i].image.blob, np.uint8), cv2.IMREAD_COLOR)
for python-pptx Picture to OpenCV conversion. However, I am having a lot of trouble extracting and manipulating the backgrounds in a similar fashion.
slide.background
is sufficient to extract a "_Background" object, but I have not found a good way to convert it into an OpenCV object the way I do with Pictures. Does anyone know how to do this? I am using python-pptx for extraction, but am not averse to other packages if it's not possible with that package.

After a fair bit of work I discovered how to do this -- i.e., you don't. As far as I can tell, there is no way to directly extract the backgrounds with either python-pptx or Aspose. Powerpoint -- which, as it turns out, is an archive that can be unzipped with 7zip -- keeps its backgrounds disassembled in the ppt/media (pics), ppt/slideLayouts and ppt/slideMasters (text, formatting), and they are only put together by the Powerpoint renderer. This means that to extract the backgrounds as displayed, you basically need to run Powerpoint and take pics of the slides after removing text/pictures/etc. from the foreground.
I did not need to do this, as I just needed to extract text from the backgrounds. This can be done by checking slideLayouts and slideMasters XMLs using BeautifulSoup, at the <a:t> tag. The code to do this is pretty simple:
import zipfile
with zipfile.ZipFile(pptx_path, 'r') as zip_ref:
    zip_ref.extractall(extraction_directory)
This will extract the .pptx into its component files.
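import os
from glob import glob
# Join path components instead of hard-coding backslashes, so this works on any OS
layouts = glob(os.path.join(extraction_directory, 'ppt', 'slideLayouts', '*.xml'))
masters = glob(os.path.join(extraction_directory, 'ppt', 'slideMasters', '*.xml'))
files = layouts + masters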
This gets you the paths for slide layouts/masters.
from bs4 import BeautifulSoup
text_list = []
for file in files:
    with open(file, encoding='utf-8') as f:
        data = f.read()
    bs_data = BeautifulSoup(data, "xml")
    bs_a_t = bs_data.find_all('a:t')
    for a_t in bs_a_t:
        # get_text() also copes with empty <a:t/> elements
        text_list.append(a_t.get_text())
This will get you the actual text from the XMLs.
Hopefully this will be useful to someone else in the future.
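
The same idea can be sketched with only the standard library, reading the layout and master XML straight out of the archive without unzipping it to disk. This is a sketch under assumptions: the function name background_text is made up, and the tiny in-memory archive below merely stands in for a real .pptx.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# DrawingML namespace that the "a:" prefix of <a:t> refers to.
A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"
PREFIXES = ("ppt/slideLayouts/", "ppt/slideMasters/")


def background_text(pptx_file):
    """Collect the text of every <a:t> element in slide layouts and masters."""
    texts = []
    with zipfile.ZipFile(pptx_file) as archive:
        for name in sorted(archive.namelist()):
            if name.startswith(PREFIXES) and name.endswith(".xml"):
                root = ET.fromstring(archive.read(name))
                for node in root.iter("{%s}t" % A_NS):
                    if node.text:
                        texts.append(node.text)
    return texts


# Build a minimal fake archive in memory to demonstrate the function.
TEMPLATE = '<root xmlns:a="%s"><a:t>%%s</a:t></root>' % A_NS
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as fake:
    fake.writestr("ppt/slideLayouts/slideLayout1.xml", TEMPLATE % "Confidential")
    fake.writestr("ppt/slideMasters/slideMaster1.xml", TEMPLATE % "ACME Corp")
    fake.writestr("ppt/slides/slide1.xml", TEMPLATE % "Foreground text")
buffer.seek(0)

found = background_text(buffer)
print(found)  # ['Confidential', 'ACME Corp']
```

With a real presentation you would pass the path of the .pptx file instead of the in-memory buffer.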

What is the best way to save image metadata alongside a tif?

In my work as a grad student, I capture microscope images and use python to save them as raw tif's. I would like to add metadata such as the name of the microscope I am using, the magnification level, and the imaging laser wavelength. These details are all important for how I post-process the images.
I should be able to do this with a tif, right? Since it has a header?
I was able to add to the info in a PIL image:
im.info['microscope'] = 'george'
but when I save and load that image, the info I added is gone.
I'm open to all suggestions. If I have to, I'll just save a separate .txt file with the metadata, but it would be really nice to have it embedded in the image.

Tifffile is one option for saving microscopy images with lots of metadata in python.
It doesn't have a lot of external documentation, but the docstrings are great, so you can get a lot of info just by typing help(tifffile) in Python or by looking at the source code.
You can look at the TiffWriter.save function in the source code (line 750) for the different keyword arguments you can use to write metadata.
One is to use description, which accepts a string. It will show up as the tag "ImageDescription" when you read your image.
Another is to use the extratags argument, which accepts a list of tuples. That allows you to write any tag name that exists in TIFF.TAGS(). One of the easiest ways is to write them as strings, because then you don't have to specify the length.
You can also write ImageJ metadata with ijmetadata, for which the acceptable types are listed in the source code here.
As an example, if you write the following:
import json
import numpy as np
import tifffile
im = np.random.randint(0, 255, size=(150, 100), dtype=np.uint8)
# Description
description = "This is my description"
# Extratags
metadata_tag = json.dumps({"ChannelIndex": 1, "Slice": 5})
extra_tags = [("MicroManagerMetadata", 's', 0, metadata_tag, True),
              ("ProcessingSoftware", 's', 0, "my_spaghetti_code", True)]
# ImageJ metadata. 'Info' tag needs to be a string
ijinfo = {"InitialPositionList": [{"Label": "Pos1"}, {"Label": "Pos3"}]}
ijmetadata = {"Info": json.dumps(ijinfo)}
# Write file
tifffile.imsave(
    save_name,
    im,
    ijmetadata=ijmetadata,
    description=description,
    extratags=extra_tags,
)
You can see the following tags when you read the image:
frames = tifffile.TiffFile(save_name)
page = frames.pages[0]
print(page.tags["ImageDescription"].value)
Out: This is my description
print(page.tags["MicroManagerMetadata"].value)
Out: {'ChannelIndex': 1, 'Slice': 5}
print(page.tags["ProcessingSoftware"].value)
Out: my_spaghetti_code

For internal use, try saving the metadata as JSON in the TIFF ImageDescription tag, e.g.
from __future__ import print_function, unicode_literals
import json
import numpy
import tifffile # http://www.lfd.uci.edu/~gohlke/code/tifffile.py.html
data = numpy.arange(256).reshape((16, 16)).astype('u1')
metadata = dict(microscope='george', shape=data.shape, dtype=data.dtype.str)
print(data.shape, data.dtype, metadata['microscope'])
metadata = json.dumps(metadata)
tifffile.imsave('microscope.tif', data, description=metadata)
with tifffile.TiffFile('microscope.tif') as tif:
    data = tif.asarray()
    metadata = tif[0].image_description
metadata = json.loads(metadata.decode('utf-8'))
print(data.shape, data.dtype, metadata['microscope'])
Note that JSON uses unicode strings.
To be compatible with other microscopy software, consider saving OME-TIFF files, which store defined metadata as XML in the ImageDescription tag.
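
One detail worth knowing about the JSON approach: the string read back from the tag may arrive as bytes, and JSON has no tuple type, so a shape comes back as a list. A minimal sketch of the round trip, independent of any TIFF library (the helper names encode_metadata and decode_metadata are made up for illustration):

```python
import json


def encode_metadata(microscope, shape, dtype):
    """Serialize the metadata to a JSON string suitable for a description tag."""
    return json.dumps({"microscope": microscope, "shape": list(shape), "dtype": dtype})


def decode_metadata(description):
    """Parse the JSON description back, accepting either bytes or str."""
    if isinstance(description, bytes):
        description = description.decode("utf-8")
    metadata = json.loads(description)
    # JSON arrays become Python lists; restore the tuple for the shape.
    metadata["shape"] = tuple(metadata["shape"])
    return metadata


description = encode_metadata("george", (16, 16), "|u1")
restored = decode_metadata(description.encode("utf-8"))
print(restored["microscope"], restored["shape"])
```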

I should be able to do this with a tif, right? Since it has a header?
No.
First, your premise is wrong, but that's a red herring. TIFF does have a header, but it doesn't allow you to store arbitrary metadata in it.
But TIFF is a tagged file format, a series of chunks of different types, so the header isn't important here. And you can always create your own private chunk (any ID > 32767) and store anything you want there.
The problem is, nothing but your own code will have any idea what you stored there. So, what you probably want is to store EXIF or XMP or some other standardized format for extending TIFF with metadata. But even there, EXIF or whatever you choose isn't going to have a tag for "microscope", so ultimately you're going to end up having to store something like "microscope=george\nspam=eggs\n" in some string field, and then parse it back yourself.
But the real problem is that PIL/Pillow doesn't give you an easy way to store EXIF or XMP or anything else like that.
First, Image.info isn't for arbitrary extra data. At save time, it's generally ignored.
If you look at the PIL docs for TIFF, you'll see that it reads additional data into a special attribute, Image.tag, and can save data by passing a tiffinfo keyword argument to the Image.save method. But that additional data is a mapping from TIFF tag IDs to binary hunks of data. You can get the Exif tag IDs from the undocumented PIL.ExifTags.TAGS dict (or by looking them up online yourself), but that's as much support as PIL is going to give you.
Also, note that accessing tag and using tiffinfo in the first place requires a reasonably up-to-date version of Pillow; older versions, and classic PIL, didn't support it. (Ironically, they did have partial EXIF support for JPG files, which was never finished and has been stripped out…) Also, although it doesn't seem to be documented, if you built Pillow without libtiff it seems to ignore tiffinfo.
So ultimately, what you're probably going to want to do is:
Pick a metadata format you want.
Use a different library than PIL/Pillow to read and write that metadata. (For example, you can use GExiv2 or pyexif for EXIF.)

You could try setting tags in the tag property of a TIFF image. This is an ImageFileDirectory object. See TiffImagePlugin.py.
Or, if you have libtiff installed, you can use the subprocess module to call the tiffset command to set a field in the header after you have saved the file. There are online references of available tags.
According to this page:
If one needs more than 10 private tags or so, the TIFF specification suggests that, rather then using a large amount of private tags, one should instead allocate a single private tag, define it as datatype IFD, and use it to point to a socalled 'private IFD'. In that private IFD, one can next use whatever tags one wants. These private IFD tags do not need to be properly registered with Adobe, they live in a namespace of their own, private to the particular type of IFD.
Not sure if PIL supports this, though.

Saving thumbnails as fits files

Most of my code takes a .fits file and creates small thumbnail images that are based upon certain parameters (they're images of galaxies, and all this is extraneous information . . .)
Anyways, I managed to figure out a way to save the images as a .pdf, but I don't know how to save them as .fits files instead. The solution needs to be something within the "for" loop, so that it can just save the files en masse, because there are way too many thumbnails to iterate through one by one.
The last two lines are the most relevant ones.
for i in range(0,len(ra_new)):
    ra_new2=cat['ra'][z&lmass&ra&dec][i]
    dec_new2=cat['dec'][z&lmass&ra&dec][i]
    target_pixel_x = ((ra_new2-ra_ref)/(pixel_size_x))+reference_pixel_x
    target_pixel_y = ((dec_new2-dec_ref)/(pixel_size_y))+reference_pixel_y
    value=img[target_pixel_x,target_pixel_y]>0
    ra_new3=cat['ra'][z&lmass&ra&dec&value][i]
    dec_new3=cat['dec'][z&lmass&ra&dec&value][i]
    new_target_pixel_x = ((ra_new3-ra_ref)/(pixel_size_x))+reference_pixel_x
    new_target_pixel_y = ((dec_new3-dec_ref)/(pixel_size_y))+reference_pixel_y
    fig = plt.figure(figsize=(5.,5.))
    plt.imshow(img[new_target_pixel_x-200:new_target_pixel_x+200, new_target_pixel_y-200:new_target_pixel_y+200], vmin=-0.01, vmax=0.1, cmap='Greys')
    fig.savefig(image+"PHOTO"+str(i)+'.pdf')
Any ideas SO?

For converting FITS images to thumbnails, I recommend using the mJPEG tool from the "Montage" software package, available here: http://montage.ipac.caltech.edu/docs/mJPEG.html
For example, to convert a directory of FITS images to JPEG files, and then resize them to thumbnails, I would use a shell script like this:
#!/bin/bash
# Let the shell expand the glob directly; quoting keeps paths with spaces intact
for FILE in /path/to/images/*.fits; do
    mJPEG -gray "$FILE" 5% 90% log -out "$FILE.jpg"
    convert "$FILE.jpg" -resize 64x64 "$FILE.thumbnail.jpg"
done
You can, of course, call these commands from Python instead of a shell script.
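
As a rough sketch of what calling these commands from Python could look like, the snippet below builds the same two command lines as the shell script and runs them with subprocess. The function names thumbnail_commands and make_thumbnails are made up for illustration, and it assumes mJPEG and ImageMagick's convert are on your PATH.

```python
import subprocess
from pathlib import Path


def thumbnail_commands(fits_path, size="64x64"):
    """Return the mJPEG and convert command lines for a single FITS file."""
    fits_path = Path(fits_path)
    jpg = fits_path.with_name(fits_path.name + ".jpg")
    thumb = fits_path.with_name(fits_path.name + ".thumbnail.jpg")
    return [
        ["mJPEG", "-gray", str(fits_path), "5%", "90%", "log", "-out", str(jpg)],
        ["convert", str(jpg), "-resize", size, str(thumb)],
    ]


def make_thumbnails(directory):
    """Run both commands for every .fits file in the directory."""
    for fits_path in sorted(Path(directory).glob("*.fits")):
        for command in thumbnail_commands(fits_path):
            # check=True raises CalledProcessError if a tool exits with an error
            subprocess.run(command, check=True)


commands = thumbnail_commands("/data/m31.fits")
for command in commands:
    print(" ".join(command))
```

Passing the arguments as a list avoids shell quoting issues with unusual file names.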

As noted in a comment, the astropy package (if not yet installed) will be useful:
http://astropy.readthedocs.org. You can import the required module at the beginning.
from astropy.io import fits
At the last line, you can save a thumbnail FITS file.
thumb = img[new_target_pixel_x-200:new_target_pixel_x+200,
            new_target_pixel_y-200:new_target_pixel_y+200]
fits.writeto(image+str(i).zfill(3)+'.fits', thumb)

Convert SVG to PDF (svglib + reportlab not good enough)

I'm creating some SVGs in batches and need to convert those to a PDF document for printing. I've been trying to use svglib and its svg2rlg method but I've just discovered that it's absolutely appalling at preserving the vector graphics in my document. It can barely position text correctly.
My dynamically-generated SVG is well formed and I've tested svglib on the raw input to make sure it's not a problem I'm introducing.
So what are my options past svglib and ReportLab? It either has to be free or very cheap as we're already out of budget on the project this is part of. We can't afford the 1k/year fee for ReportLab Plus.
I'm using Python but at this stage, I'm happy as long as it runs on our Ubuntu server.
Edit: Tested Prince. Better but it's still ignoring half the document.

I use Inkscape for this. In your Django view, do something like:
from subprocess import Popen
# --export-pdf is the Inkscape 0.x option; Inkscape 1.x uses --export-filename instead
x = Popen(['/usr/bin/inkscape', your_svg_input,
           '--export-pdf=%s' % your_pdf_output])
try:
    waitForResponse(x)
except OSError as e:
    return False
def waitForResponse(x):
    out, err = x.communicate()
    # any non-zero return code signals a failure
    if x.returncode != 0:
        r = "Popen returncode: " + str(x.returncode)
        raise OSError(r)
You may need to pass as parameters to inkscape all the font files you refer to in your .svg, so keep that in mind if your text does not appear correctly on the .pdf output.

CairoSVG is the one I am using:
import cairosvg
cairosvg.svg2pdf(url='image.svg', write_to='image.pdf')

rst2pdf uses reportlab for generating PDFs. It can use inkscape and pdfrw for reading PDFs.
pdfrw itself has some examples that show reading PDFs and using reportlab to output.
Addressing the comment by Martin below (I can edit this answer, but do not have the reputation to comment on a comment on it...):
reportlab knows nothing about SVG files. Some tools, like svg2rlg, attempt to recreate an SVG image into a PDF by drawing them into the reportlab canvas. But you can do this a different way with pdfrw -- if you can use another tool to convert the SVG file into a PDF image, then pdfrw can take that converted PDF, and add it as a form XObject into the PDF that you are generating with reportlab. As far as reportlab is concerned, it is really no different than placing a JPEG image.
Some tools will do terrible things to your SVG files (rasterizing them, for example). In my experience, inkscape usually does a pretty good job, and leaves them in a vector format. You can even do this headless, e.g. "inkscape my.svg -A my.pdf".
The entire reason I wrote pdfrw in the first place was for this exact use-case -- being able to reuse vector images in new PDFs created by reportlab.

Just to let you know, and for anyone hitting this in the future, I found a solution to this problem:
# I only install svg2rlg, not svglib (svg2rlg is inside svglib as well)
import svg2rlg
# Import of the canvas
from reportlab.pdfgen import canvas
# Import of the renderer (image part)
from reportlab.graphics import renderPDF
rlg = svg2rlg.svg2rlg("your_img.svg")
c = canvas.Canvas("example.pdf")
c.setTitle("my_title_we_dont_care")
# Generation of the first page
# You have a last option on this function,
# about the boundary but you can leave it as default.
renderPDF.draw(rlg, c, 80, 740 - rlg.height)
renderPDF.draw(rlg, c, 60, 540 - rlg.height)
c.showPage()
# Generation of the second page
renderPDF.draw(rlg, c, 50, 740 - rlg.height)
c.showPage()
# Save
c.save()
Play around with the position arguments (e.g. 80, 740 - rlg.height); they only control where the drawing is placed.
If the code doesn't work, have a look at the renderPDF module in the reportlab library.
reportlab also has a function that creates a PDF directly from your drawing:
renderPDF.drawToFile(rlg, "example.pdf", "title")
You can open its source and read it; it is not very complicated. The code above is derived from that function.

Converting PDF to images automatically

So the state I'm in released a bunch of data in PDF form, but to make matters worse, most (all?) of the PDFs appear to be letters typed in Office, printed/fax, and then scanned (our government at its best eh?). At first I thought I was crazy, but then I started seeing numerous pdfs that are 'tilted', like someone didn't get them on the scanner properly. So, I figured the next best thing to getting the actual text out of them, would be to turn each page into an image.
Obviously this needs to be automated, and I'd prefer to stick with Python if possible. If Ruby or Perl have some form of implementation that's just too awesome to pass up, I can go that route. I've tried pyPDF for text extraction, that obviously didn't do me much good. I've tried swftools, but the images I'm getting from that are just shy of completely unusable. It just seems like the fonts get ruined in the conversion. I also don't even really care about the image format on the way out, just as long as they're relatively lightweight, and readable.

If the PDFs are truly scanned images, then you shouldn't convert the PDF to an image, you should extract the image from the PDF. Most likely, all of the data in the PDF is essentially one giant image, wrapped in PDF verbosity to make it readable in Acrobat.
You should try the simple expedient of simply finding the image in the PDF, and copying the bytes out: Extracting JPGs from PDFs. The code there is dead simple, and there are probably dozens of reasons it won't work on your PDF files. But if it does, you'll have a quick and painless way to get the image data out of the PDF files.

You could call e.g. pdftoppm from the command-line (or using Python's subprocess module) and then convert the resulting PPM files to the desired format using e.g. ImageMagick (again, using subprocess or some bindings if they exist).
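
A minimal sketch of the subprocess route, assuming pdftoppm (from poppler-utils) is on your PATH. The function names are made up for illustration. The -r option sets the resolution in DPI, and -png or -jpeg makes pdftoppm write that format directly, which can save the separate conversion step; pass image_format=None to keep the default PPM output.

```python
import subprocess


def pdftoppm_command(pdf_path, output_prefix, dpi=150, image_format="png"):
    """Build the pdftoppm argument list for rendering every page of a PDF."""
    command = ["pdftoppm", "-r", str(dpi)]
    if image_format is not None:
        command.append("-" + image_format)  # e.g. -png or -jpeg
    command += [pdf_path, output_prefix]
    return command


def pdf_to_images(pdf_path, output_prefix, dpi=150, image_format="png"):
    """Run pdftoppm; it writes one numbered image file per page using the prefix."""
    subprocess.run(pdftoppm_command(pdf_path, output_prefix, dpi, image_format),
                   check=True)


png_command = pdftoppm_command("scan.pdf", "page")
ppm_command = pdftoppm_command("scan.pdf", "page", dpi=300, image_format=None)
print(" ".join(png_command))
```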

Ghostscript is ideal for converting PDF files to images. It is reliable and has many configurable options. It's also available under either the GPL or a commercial license. You can call it from the command line or use its native API. For more information:
Ghostscript Main Website
Ghostscript docs on Command line usage
Another stackoverflow thread that provides some examples of invoking Ghostscript's command line interface from Python
Ghostscript API Documentation
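
For the command-line route, a typical invocation can be sketched like this. It assumes the Ghostscript executable is called gs (on Windows it usually has a different name, such as gswin64c), and the function names are made up for illustration. The png16m device writes 24-bit colour PNGs, and %03d in the output name is replaced by the page number.

```python
import subprocess


def ghostscript_command(pdf_path, output_pattern="page-%03d.png",
                        dpi=150, device="png16m"):
    """Build a Ghostscript command line that renders each PDF page to an image."""
    return [
        "gs",
        "-dSAFER", "-dBATCH", "-dNOPAUSE",   # non-interactive batch mode
        "-sDEVICE=" + device,                # output image format
        "-r%d" % dpi,                        # resolution in DPI
        "-sOutputFile=" + output_pattern,    # one file per page
        pdf_path,
    ]


def render_pdf(pdf_path, **options):
    """Run Ghostscript and raise CalledProcessError if it fails."""
    subprocess.run(ghostscript_command(pdf_path, **options), check=True)


gs_command = ghostscript_command("letter.pdf", dpi=200)
print(" ".join(gs_command))
```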

Here's an alternative approach to turning a .pdf file into images: Use an image printer. I've successfully used the function below to "print" pdf's to jpeg images with ImagePrinter Pro. However, there are MANY image printers out there. Pick the one you like. Some of the code may need to be altered slightly based on the image printer you pick and the standard file saving format that image printer uses.
import os
import time
import win32api
def pdf_to_jpg(pdfPath, pages):
    # print pdf using jpg printer
    # 'pages' is the number of pages in the pdf
    filepath = pdfPath.rsplit('/', 1)[0]
    filename = pdfPath.rsplit('/', 1)[1]
    # print pdf to jpg using jpg printer
    tempprinter = "ImagePrinter Pro"
    printer = '"%s"' % tempprinter
    win32api.ShellExecute(0, "printto", filename, printer, ".", 0)
    # Poll for the output file to ensure the pdf finishes printing first
    fileFound = False
    if pages > 1:
        jpgName = filename.split('.')[0] + '_' + str(pages - 1) + '.jpg'
    else:
        jpgName = filename.split('.')[0] + '.jpg'
    jpgPath = filepath + '/' + jpgName
    waitTime = 30
    for i in range(waitTime):
        if os.path.isfile(jpgPath):
            fileFound = True
            break
        else:
            time.sleep(1)
    # print an error if the file was never found
    if not fileFound:
        print("ERROR: " + jpgName + " wasn't found after " + str(waitTime)
              + " seconds")
    return jpgPath
The resulting jpgPath variable tells you the path location of the last jpeg page of the pdf printed. If you need to get another page, you can easily add some logic to modify the path to get prior pages

in pdf_to_jpg(pdfPath)
6 # 'pages' is the number of pages in the pdf
7 filepath = pdfPath.rsplit('/', 1)[0]
----> 8 filename = pdfPath.rsplit('/', 1)[1]
9
10 #print pdf to jpg using jpg printer
IndexError: list index out of range
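
This IndexError is what pdfPath.rsplit('/', 1)[1] produces when the path contains no '/' at all, for example a bare file name or a Windows path written with backslashes. One possible way around it, sketched here with a made-up helper name, is to let os.path.split do the splitting and fall back to the current directory:

```python
import os


def split_pdf_path(pdf_path):
    """Split a path into (directory, file name) without assuming a '/' is present."""
    directory, filename = os.path.split(pdf_path)
    return directory or ".", filename


# A bare file name yields a one-element list, so index 1 does not exist.
parts = "report.pdf".rsplit("/", 1)
print(parts)

bare = split_pdf_path("report.pdf")
nested = split_pdf_path("scans/report.pdf")
print(bare, nested)
```

On Windows, os.path.split also understands backslash separators, which the manual rsplit('/') does not.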

With Wand there are now excellent imagemagick bindings for Python that make this a very easy task.
Here is the code necessary for converting a single PDF file into a sequence of PNG images:
from wand.image import Image
input_path = "name_of_file.pdf"
output_name = "name_of_outfile_{index}.png"
source = Image(filename=input_path, resolution=300, width=2200)
images = source.sequence
for i in range(len(images)):
    # save page i (not always the first page) under its own numbered file name
    Image(images[i]).save(filename=output_name.format(index=i))
