Comparing PDF files with varying degrees of strictness - python

I have two folders, each containing ca. 100 PDF files produced by different runs of the same PDF generation program. After making changes to this program, the resulting PDFs should stay visually identical: nothing should break the layout, the fonts, any potential graphs and so on. This is why I would like to check for visual equality while ignoring any metadata that may have changed because the program ran at different times.
My first approach was based on this post and attempted to compare the hashes of each file:
h1 = hashlib.sha1()
h2 = hashlib.sha1()
with open(fileName1, "rb") as file:
    chunk = 0
    while chunk != b'':
        chunk = file.read(1024)
        h1.update(chunk)
with open(fileName2, "rb") as file:
    chunk = 0
    while chunk != b'':
        chunk = file.read(1024)
        h2.update(chunk)
return (h1.hexdigest() == h2.hexdigest())
This always returns "False". I assume this is due to time-dependent metadata, which is why I would like to ignore it. I've already found a way to set the modification and creation dates to None:
pdf1 = pdfrw.PdfReader(fileName1)
pdf1.Info.ModDate = pdf1.Info.CreationDate = None
pdfrw.PdfWriter().write(fileName1, pdf1)
pdf2 = pdfrw.PdfReader(fileName2)
pdf2.Info.ModDate = pdf2.Info.CreationDate = None
pdfrw.PdfWriter().write(fileName2, pdf2)
Looping through all the files in each folder and running the second method before the first curiously results in a return value of "True" for some pairs and "False" for others.
Thanks to the kind help of @jorj-mckie (see his answer below), I have the following methods checking for xref equality:
doc1 = fitz.open(fileName1)
xrefs1 = doc1.xref_length()  # length of cross reference table 1
doc2 = fitz.open(fileName2)
xrefs2 = doc2.xref_length()  # length of cross reference table 2
if xrefs1 != xrefs2:
    print("Files are not equal")
    return False
for xref in range(1, xrefs1):  # loop over objects, index 0 must be skipped
    # compare the PDF object definition sources
    if doc1.xref_object(xref) != doc2.xref_object(xref):
        print(f"Files differ at xref {xref}.")
        return False
    if doc1.xref_is_stream(xref):  # compare binary streams
        stream1 = doc1.xref_stream_raw(xref)  # read binary stream
        try:
            stream2 = doc2.xref_stream_raw(xref)  # read binary stream
        except Exception:  # stream extraction from doc2 did not work
            print(f"stream discrepancy at xref {xref}")
            return False
        if stream1 != stream2:
            print(f"stream discrepancy at xref {xref}")
            return False
return True
and xref equality without metadata:
doc1 = fitz.open(fileName1)
xrefs1 = doc1.xref_length()  # length of cross reference table 1
doc2 = fitz.open(fileName2)
xrefs2 = doc2.xref_length()  # length of cross reference table 2
info1 = doc1.xref_get_key(-1, "Info")  # extract the info object
info2 = doc2.xref_get_key(-1, "Info")
if info1 != info2:
    print("Unequal info objects")
    return False
if info1[0] == "xref":  # is there metadata at all?
    info_xref1 = int(info1[1].split()[0])  # xref of info object in doc1
    info_xref2 = int(info2[1].split()[0])  # xref of info object in doc2
else:
    info_xref1 = 0
for xref in range(1, xrefs1):  # loop over objects, index 0 must be skipped
    # compare the PDF object definition sources, skipping the info object
    if xref != info_xref1:
        if doc1.xref_object(xref) != doc2.xref_object(xref):
            print(f"Files differ at xref {xref}.")
            return False
    if doc1.xref_is_stream(xref):  # compare binary streams
        stream1 = doc1.xref_stream_raw(xref)  # read binary stream
        try:
            stream2 = doc2.xref_stream_raw(xref)  # read binary stream
        except Exception:  # stream extraction from doc2 did not work
            print(f"stream discrepancy at xref {xref}")
            return False
        if stream1 != stream2:
            print(f"stream discrepancy at xref {xref}")
            return False
return True
If I run the last two functions on my PDF files, whose timestamps have already been set to None (see above), I end up with some equality checks returning "True" and others returning "False".
I'm using the reportlab library to generate the PDFs. Do I just have to live with the fact that some PDFs will always have a different internal structure, resulting in different hashes even if the files look exactly the same? I would be very happy to learn that this is not the case and there is indeed a way to check for equality without actually having to export all pages to images first.

I think you should use PyMuPDF for PDF handling - it has all batteries included for your task (and many more!).
First thing to clarify:
What type of equality are you looking for? Requiring only that the number of pages be equal and that the pages look the same pairwise is very different from requiring that all objects and streams be identical, with the exception of the PDF /ID.
Both comparison types are possible with PyMuPDF. To do the latter comparison, loop through both object number tables and compare them pairwise:
import sys
import fitz  # import package PyMuPDF

doc1 = fitz.open("file1.pdf")
xrefs1 = doc1.xref_length()  # length of cross reference table 1
doc2 = fitz.open("file2.pdf")
xrefs2 = doc2.xref_length()  # length of cross reference table 2
if xrefs1 != xrefs2:
    sys.exit("Files are not equal")  # quick exit
for xref in range(1, xrefs1):  # loop over objects, index 0 must be skipped
    # compare the PDF object definition sources
    if doc1.xref_object(xref) != doc2.xref_object(xref):
        sys.exit(f"Files differ at xref {xref}.")
    if doc1.xref_is_stream(xref):  # compare binary streams
        stream1 = doc1.xref_stream_raw(xref)  # read binary stream
        try:
            stream2 = doc2.xref_stream_raw(xref)  # read binary stream
        except Exception:  # stream extraction from doc2 did not work
            sys.exit(f"stream discrepancy at xref {xref}")
        if stream1 != stream2:
            sys.exit(f"stream discrepancy at xref {xref}")
sys.exit("Files are equal!")
This still is a rather strict equality check: For example, if any date or time in the document metadata has changed, you would report inequality even if the rest is equal.
But there is help: Determine the xref of the metadata and exclude it from the above loop:
info1 = doc1.xref_get_key(-1, "Info")  # extract the info object
info2 = doc2.xref_get_key(-1, "Info")
if info1 != info2:
    sys.exit("Unequal info objects")
if info1[0] == "xref":  # is there metadata at all?
    info_xref1 = int(info1[1].split()[0])  # xref of info object doc1
    info_xref2 = int(info2[1].split()[0])  # xref of info object doc2
    # make another equality check here
    # in the above loop, skip if xref == info_xref1
else:
    info_xref1 = 0  # 0 is never an xref number, so it can safely be used in the loop
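For the former type of equality, where pages only need to look the same pairwise, you can render each page to an image and compare the raw pixels. A minimal sketch (the function name and the default rendering resolution are my choices; pass a matrix or dpi to get_pixmap for stricter checks):

import fitz  # PyMuPDF

def pages_look_equal(fileName1, fileName2):
    doc1 = fitz.open(fileName1)
    doc2 = fitz.open(fileName2)
    if doc1.page_count != doc2.page_count:
        return False
    for page1, page2 in zip(doc1, doc2):
        pix1 = page1.get_pixmap()  # render the page to a raster image
        pix2 = page2.get_pixmap()
        if pix1.samples != pix2.samples:  # compare the raw pixel bytes
            return False
    return True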

Command-line and GUI PDF diff tools have been around a long time. Many, like the cross-platform diff-pdf (https://github.com/vslavik/diff-pdf), ship both as a CLI and as a GUI executable, so you get the best of both worlds.
By default, its only output is its return code, which is 0 if there are no differences and 1 if the two PDFs differ. If given the --output-diff option, it produces a PDF file with the differences visually highlighted.
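Since diff-pdf communicates through its exit code, a small Python wrapper can batch-compare the paired files from your two folders. A sketch, assuming diff-pdf is on the PATH and with placeholder folder names:

import subprocess
from pathlib import Path

folder_a, folder_b = Path("run_a"), Path("run_b")
for pdf_a in sorted(folder_a.glob("*.pdf")):
    pdf_b = folder_b / pdf_a.name
    result = subprocess.run(["diff-pdf", str(pdf_a), str(pdf_b)])
    if result.returncode != 0:  # 0 means no differences found
        print(f"{pdf_a.name} differs")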
Other tools built more specifically for cross-platform Python tend to treat text and graphics separately: for text differences you could try https://github.com/JoshData/pdf-diff, and for graphical ones there is https://github.com/bgeron/diff-pdf-visually.
So, by way of example with the dual-purpose diff-pdf above: you can quickly walk a folder and collect a true/false report by comparing blindly in pairs, then do the final one-by-one visual comparisons by shelling out to:
diff-pdf --view a.pdf b.pdf
Note this describes version 0.4, but 0.5 is available.
Sadly, if all 100 files are similar under a simple compare, then all of them need text testing, so you need a fast binary test batch file to run approx. 4,950 (99x100/2) fast tests:
test 1.pdf 2.pdf report
test 1.pdf 3.pdf report
...
test 1.pdf 100.pdf report
test 2.pdf 3.pdf report
test 2.pdf 4.pdf report
...
test 98.pdf 99.pdf report
test 98.pdf 100.pdf report
test 99.pdf 100.pdf report
Then filter out the ones reported as similar and visually inspect the much smaller number remaining that were reported as not matching.
So if 49 = 30 = 1 and 60 = 45 = 25 = 2, but no others match, then only 1 and 2 need a closer look. There will likely be more, of course, and you can get a second opinion on those too.
If you know a likely page that changes, you can exclusively test images of, say, the 3rd page, which has a date or other identifying feature.


Is there a way to resize all the pages of a PDF to one size in Python?

Essentially, I'm looking to resize all of the PDF pages in a document to be the same size as the first page (or any set dimensions, i.e. A4). This is because the mismatch causes issues for mapping coordinates on a frontend UI I am developing. The result I am hoping for is that if, for example, I have a PDF document with a landscape page, it will be mapped onto an A4 page and take up half the new page. Could anyone point me to any resources or code that might help me do this kind of thing?
Disclaimer: I am the author of borb, the library used in this answer.
Second disclaimer: it's doable, but not easy.
You can use borb to read the PDF. That is the easy part.
import typing
from borb.pdf.document.document import Document
from borb.pdf.pdf import PDF

def main():
    # read the Document
    doc: typing.Optional[Document] = None
    with open("output.pdf", "rb") as in_file_handle:
        doc = PDF.loads(in_file_handle)
    # check whether we have read a Document
    assert doc is not None

if __name__ == "__main__":
    main()
Now that you have a representation of the Document, you need to obtain the size of the first Page.
from decimal import Decimal
# PageInfo is borb's page-information type; the exact import path may
# differ between borb versions

pi: PageInfo = doc.get_page(0).get_page_info()
w: Decimal = pi.get_width() or Decimal(0)
h: Decimal = pi.get_height() or Decimal(0)
Now, in every Page (except the first one) you need to update the content stream. The content stream is a sequence of PostScript-like operators that actually renders the content of the PDF.
Luckily for you, there is an operator that changes the entire coordinate system of the Page you are working on. This concept is called the transformation matrix.
Every operation will first change its x/y coordinates by applying this 3x3 transformation matrix.
Conversely, by modifying that matrix you are able to scale/translate/rotate all the content inside the Page.
The matrix has this form:
[[ a b 0 ]
[ c d 0 ]
[ e f 1 ]]
The third column is always [0 0 1], so it is not needed.
The cm operator takes 6 arguments (the remaining values) and concatenates the corresponding matrix with the current transformation matrix. (Tm is the analogous operator for the text matrix inside a text object; to transform all page content, cm is the one you want.)
So you'd need to do something like this:
import zlib
from borb.io.read.types import Name, Decimal as bDecimal  # borb's low-level types

content_stream = page["Contents"]
# prepend the matrix operator (replace a..f with real numbers) to the
# decoded bytes, then re-compress and fix up the stream length
content_stream[Name("DecodedBytes")] = b"a b c d e f cm\n" + content_stream["DecodedBytes"]
content_stream[Name("Bytes")] = zlib.compress(content_stream["DecodedBytes"], 9)
content_stream[Name("Length")] = bDecimal(len(content_stream["Bytes"]))
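As for the six values themselves: to scale each page uniformly onto the first page's size (w and h from earlier), something along these lines should work. This is a sketch; page stands for the Page being processed, and the uniform-scale choice with zero translation is my assumption:

from decimal import Decimal

pw: Decimal = page.get_page_info().get_width() or Decimal(1)
ph: Decimal = page.get_page_info().get_height() or Decimal(1)
scale = min(w / pw, h / ph)  # uniform scale, so a landscape page fits without cropping
# pure scaling: a = d = scale, b = c = e = f = 0
matrix_prefix: bytes = ("%f 0 0 %f 0 0 cm\n" % (scale, scale)).encode("latin1")

You may also want nonzero e and f to center the scaled content, and you will need to update the page's MediaBox to the target size.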

XLRD: Successfully extracted 2 lists out of 2 sheets, but list comparison won't work

OK, so I have two xlsx sheets; both sheets have, in their second column (index 1), a list of SIM card numbers. I have successfully printed the contents of both columns to my PowerShell terminal as two lists, along with the quantity of elements in each list, after extracting that data using xlrd.
The first sheet (theirSheet) has 454 entries, the second (ourSheet) has 361. I need to find the 93 that don't exist in the second sheet and put them into (unpaidSims). I could do this manually, of course, but I would like to automate this task for the future, when I inevitably need to do it again, so I am trying to write this Python script.
Considering Python agrees that I have a list of 454 entries and a list of 361 entries, I figured I just needed a list comparison. I researched that on Stack Overflow and tried three different solutions, but each time the script produces the third list (unpaidSims) with 454 entries... meaning it hasn't removed the entries that are duplicated in the smaller list. Please advise.
from os.path import join, dirname, abspath
import xlrd
theirBookFileName = join(dirname(dirname(abspath(__file__))), 'pycel', 'theirBook.xlsx')
ourBookFileName = join(dirname(dirname(abspath(__file__))), 'pycel', 'ourBook.xlsx')
theirBook = xlrd.open_workbook(theirBookFileName)
ourBook = xlrd.open_workbook(ourBookFileName)
theirSheet = theirBook.sheet_by_index(0)
ourSheet = ourBook.sheet_by_index(0)
theirSimColumn = theirSheet.col(1)
ourSimColumn = ourSheet.col(1)
numColsTheirSheet = theirSheet.ncols
numRowsTheirSheet = theirSheet.nrows
numColsOurSheet = ourSheet.ncols
numRowsOurSheet = ourSheet.nrows
# First Attempt at the comparison, but fails and returns 454 entries from the bigger list
unpaidSims = [d for d in theirSimColumn if d not in ourSimColumn]
print unpaidSims
lengthOfUnpaidSims = len(unpaidSims)
print lengthOfUnpaidSims
print "\nWe are expecting 93 entries in this new list"
# Second Attempt at the comparison, but fails and returns 454 entries from the bigger list
s = set(ourSimColumn)
unpaidSims = [x for x in theirSimColumn if x not in s]
print unpaidSims
lengthOfUnpaidSims = len(unpaidSims)
print lengthOfUnpaidSims
# Third Attempt at the comparison, but fails and returns 454 entries from the bigger list
unpaidSims = tuple(set(theirSimColumn) - set(ourSimColumn))
print unpaidSims
lengthOfUnpaidSims = len(unpaidSims)
print lengthOfUnpaidSims
According to the xlrd Documentation, the col method returns "a sequence of the Cell objects in the given column".
It doesn't mention anything about comparison of Cell objects. Digging into the source, it appears that they didn't code any comparison methods into the class. As such, the Python documentation states that the objects will be compared by "object identity". In other words, the comparison will be False unless they are the exact same instance of the Cell class, even if the values they contain are identical.
You need to compare the values of the Cells instead. For example:
unpaidSims = set(sim.value for sim in theirSimColumn) - set(sim.value for sim in ourSimColumn)
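For completeness, a sketch of the whole comparison on cell values rather than Cell objects (the variable names are mine); the list comprehension keeps the order, and any duplicates, from theirSheet:

their_values = [cell.value for cell in theirSimColumn]
our_values = set(cell.value for cell in ourSimColumn)
unpaidSims = [v for v in their_values if v not in our_values]
print len(unpaidSims)  # expecting 93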

Reading a binary file using np.fromfile()

I have a binary file that has numerous sections. Each section has its own pattern (i.e. the placement of integers, floats, and strings).
The pattern of each section is known. However, the number of times that pattern occurs within a section is unknown. Each record is bracketed by a pair of identical integers that give the record's size in bytes. The section name sits between two record-length markers of 8 and 8. Also, within each section there are multiple records (whose layouts are known).
Header
---------------------
Known header pattern
---------------------
8 Section One 8
---------------------
Section One pattern repeating i times
---------------------
8 Section Two 8
---------------------
Section Two pattern repeating j times
---------------------
8 Section Three 8
---------------------
Section Three pattern repeating k times
---------------------
Here was my approach:
Loop through and read each record using f.read(record_length); if the record is 8 bytes, convert it to a string: this will be the section name.
Then I call np.fromfile(file, dtype=section_pattern, count=n) for each section.
The issue I am having is twofold:
How do I determine n for each section without doing a first pass read?
Reading each record to find a section name seems rather inefficient. Is there a more efficient way to accomplish this?
The section names are always between two integer record variables: 8 and 8.
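One way to avoid a full first-pass read of every record: locate the start of the next section first and derive n from the byte distance, since each occurrence of the pattern has a fixed item size. A sketch; find_next_section is a hypothetical helper that scans forward for the next 8-byte name record and returns its file offset:

import numpy as np

dt = np.dtype(section_pattern)
section_start = f.tell()
next_section_start = find_next_section(f)  # hypothetical helper, not in the question's code
n = (next_section_start - section_start) // dt.itemsize  # whole occurrences of the pattern
f.seek(section_start)
nparr = np.fromfile(f, dtype=dt, count=n)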
Here is a sample code; note that in this case I do not have to specify count, since the OES section is the last section:
import os
import numpy as np

# unpack_int / unpack_string are the asker's helpers for decoding a
# 4-byte integer and a raw byte string, respectively
record_num = 0
with open('m13.op2', "rb") as f:
    filesize = os.fstat(f.fileno()).st_size
    f.seek(108, 1)  # skip header
    while True:
        rec_len_1 = unpack_int(f.read(4))
        record_bytes = f.read(rec_len_1)
        rec_len_2 = unpack_int(f.read(4))
        record_num = record_num + 1
        if rec_len_1 == 8:
            tablename = unpack_string(record_bytes).strip()
            if tablename == 'OES':
                OES = [
                    # Top keys
                    ('1', 'i4', 1), ('op2key7', 'i4', 1), ('2', 'i4', 1),
                    ('3', 'i4', 1), ('op2key8', 'i4', 1), ('4', 'i4', 1),
                    ('5', 'i4', 1), ('op2key9', 'i4', 1), ('6', 'i4', 1),
                    # Record 2 -- IDENT
                    ('7', 'i4', 1), ('IDENT', 'i4', 1), ('8', 'i4', 1),
                    ('9', 'i4', 1),
                    ('acode', 'i4', 1),
                    ('tcode', 'i4', 1),
                    ('element_type', 'i4', 1),
                    ('subcase', 'i4', 1),
                    ('LSDVMN', 'i4', 1),      # Load set number
                    ('UNDEF(2)', 'i4', 2),    # Undefined
                    ('LOADSET', 'i4', 1),     # Load set number, zero, or random code identification number
                    ('FCODE', 'i4', 1),       # Format code
                    ('NUMWDE(C)', 'i4', 1),   # Number of words per entry in DATA record
                    ('SCODE(C)', 'i4', 1),    # Stress/strain code
                    ('UNDEF(11)', 'i4', 11),  # Undefined
                    ('THERMAL(C)', 'i4', 1),  # =1 for heat transfer, 0 otherwise
                    ('UNDEF(27)', 'i4', 27),  # Undefined
                    ('TITLE(32)', 'S1', 32*4),    # Title
                    ('SUBTITL(32)', 'S1', 32*4),  # Subtitle
                    ('LABEL(32)', 'S1', 32*4),    # Label
                    ('10', 'i4', 1),
                    # Record 3 -- Data
                    ('11', 'i4', 1), ('KEY1', 'i4', 1), ('12', 'i4', 1),
                    ('13', 'i4', 1), ('KEY2', 'i4', 1), ('14', 'i4', 1),
                    ('15', 'i4', 1), ('KEY3', 'i4', 1), ('16', 'i4', 1),
                    ('17', 'i4', 1), ('KEY4', 'i4', 1), ('18', 'i4', 1),
                    ('19', 'i4', 1),
                    ('EKEY', 'i4', 1),  # Element key = 10*EID + device code; EID = (element key) // 10
                    ('FD1', 'f4', 1),
                    ('EX1', 'f4', 1),
                    ('EY1', 'f4', 1),
                    ('EXY1', 'f4', 1),
                    ('EA1', 'f4', 1),
                    ('EMJRP1', 'f4', 1),
                    ('EMNRP1', 'f4', 1),
                    ('EMAX1', 'f4', 1),
                    ('FD2', 'f4', 1),
                    ('EX2', 'f4', 1),
                    ('EY2', 'f4', 1),
                    ('EXY2', 'f4', 1),
                    ('EA2', 'f4', 1),
                    ('EMJRP2', 'f4', 1),
                    ('EMNRP2', 'f4', 1),
                    ('EMAX2', 'f4', 1),
                    ('20', 'i4', 1)]
                nparr = np.fromfile(f, dtype=OES)
        if f.tell() == filesize:
            break

Instructables open source code: Python IndexError: list index out of range

I've seen this error on several other questions but couldn't find the answer.
I'm a complete stranger to Python, but I'm following the instructions from a site and I keep getting this error once I try to run the script:
IndexError: list index out of range
Here's the script:
##//txt to stl conversion - 3d printable record
##//by Amanda Ghassaei
##//Dec 2012
##//http://www.instructables.com/id/3D-Printed-Record/
##
##/*
## * This program is free software; you can redistribute it and/or modify
## * it under the terms of the GNU General Public License as published by
## * the Free Software Foundation; either version 3 of the License, or
## * (at your option) any later version.
##*/
import wave
import math
import struct
bitDepth = 8#target bitDepth
frate = 44100#target frame rate
fileName = "bill.wav"#file to be imported (change this)
#read file and get data
w = wave.open(fileName, 'r')
numframes = w.getnframes()
frame = w.readframes(numframes)#w.getnframes()
frameInt = map(ord, list(frame))#turn into array
#separate left and right channels and merge bytes
frameOneChannel = [0]*numframes#initialize list of one channel of wave
for i in range(numframes):
    frameOneChannel[i] = frameInt[4*i+1]*2**8+frameInt[4*i]  # separate channels and store one channel in the new list
    if frameOneChannel[i] > 2**15:
        frameOneChannel[i] = (frameOneChannel[i]-2**16)
    elif frameOneChannel[i] == 2**15:
        frameOneChannel[i] = 0
    else:
        frameOneChannel[i] = frameOneChannel[i]
#convert to string
audioStr = ''
for i in range(numframes):
    audioStr += str(frameOneChannel[i])
    audioStr += ","  # separate elements with comma
fileName = fileName[:-3]#remove .wav extension
text_file = open(fileName+"txt", "w")
text_file.write("%s"%audioStr)
text_file.close()
Thanks a lot,
Leart
Leart, check these, they may help:
Is your input file in the correct format? As I see it, you need to produce that file beforehand before you can use it in this program. Post that file here as well.
Check that your bit depth and frame rate are correct. The code assumes 16-bit stereo, i.e. 4 bytes per frame; a mono or 8-bit file contains fewer bytes than 4*numframes, which produces exactly this IndexError.
Just for debugging purposes (if the code is otherwise correct this may not produce correct results, but it is good for testing): you are accessing frameInt[4*i+1], with index i multiplied by 4 and then incremented by 1, eventually going beyond the end of frameInt.
Add an if to check the size before accessing the array element in frameInt:
if len(frameInt) > 4*i+1:
Add that statement right after the first occurrence of for i in range(numframes): and just before frameOneChannel[i] = frameInt[4*i+1]*2**8+frameInt[4*i], as in the guarded sketch below. (Watch your tab spacing.)
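Put together, a guarded version of the loop might look like this; a sketch that stops early instead of crashing, so you can at least see how many frames were usable:

for i in range(numframes):
    if len(frameInt) <= 4*i + 1:  # not enough bytes left: the file is
        break                     # probably not 16-bit stereo
    frameOneChannel[i] = frameInt[4*i+1]*2**8+frameInt[4*i]
    if frameOneChannel[i] > 2**15:
        frameOneChannel[i] = frameOneChannel[i]-2**16
    elif frameOneChannel[i] == 2**15:
        frameOneChannel[i] = 0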

python similar string removal from multiple files

I have crawled txt files from different websites; now I need to glue them into one file. Many lines are similar to one another across the various websites, and I want to remove the repetitions.
Here is what I have tried:
import difflib

sourcename = 'xiaoshanwujzw'
destname = 'bindresult'
sourcefile = open('%s.txt' % sourcename)
sourcelines = sourcefile.readlines()
sourcefile.close()
for sourceline in sourcelines:
    destfile = open('%s.txt' % destname, 'a+')
    destlines = destfile.readlines()
    similar = False
    for destline in destlines:
        ratio = difflib.SequenceMatcher(None, destline, sourceline).ratio()
        if ratio > 0.8:
            print destline
            print sourceline
            similar = True
    if not similar:
        destfile.write(sourceline)
    destfile.close()
I run it once per source file, writing line by line to the same destination file. The result is that even if I run it multiple times for the same source file, its lines are always appended to the destination file again.
EDIT:
I have tried the code from the answer. It's still very slow.
Even if I minimize the IO, I still need to do O(n^2) comparisons, especially with 1000+ lines; I average 10,000 lines per file.
Any other ways to remove the duplicates?
Here is a short version that does minimal IO and cleans up after itself.
import difflib

sourcename = 'xiaoshanwujzw'
destname = 'bindresult'

with open('%s.txt' % destname, 'a+') as destfile:
    # we read in the file so that on subsequent runs of this script, we
    # won't duplicate the lines (seek to the start first, since 'a+'
    # positions the file at the end)
    destfile.seek(0)
    known_lines = set(destfile.readlines())
    with open('%s.txt' % sourcename) as sourcefile:
        for line in sourcefile:
            similar = False
            for known in known_lines:
                ratio = difflib.SequenceMatcher(None, line, known).ratio()
                if ratio > 0.8:
                    print ratio
                    print line
                    print known
                    similar = True
                    break
            if not similar:
                destfile.write(line)
                known_lines.add(line)
Instead of reading the known lines each time from the file, we save them to a set, which we use for comparison against. The set is essentially a mirror of the contents of 'destfile'.
A note on complexity
By its very nature, this problem has O(n^2) complexity. Because you're looking for similarity with known strings, rather than identical strings, you have to look at every previously seen string. If you were looking to remove exact duplicates, rather than fuzzy matches, you could use a simple lookup in a set, with complexity O(1), making your entire solution have O(n) complexity.
There might be a way to reduce the fundamental complexity by using lossy compression on the strings so that two similar strings compress to the same result. This is however both out of scope for a stack overflow answer, and beyond my expertise. It is an active research area so you might have some luck digging through the literature.
You could also reduce the time taken by ratio() by using the less accurate alternatives quick_ratio() and real_quick_ratio().
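For contrast, here is a minimal sketch of the O(n) exact-duplicate case described above (the file names follow the question); it only applies if "similar" can be relaxed to "identical":

seen = set()
with open('bindresult.txt', 'w') as destfile:
    with open('xiaoshanwujzw.txt') as sourcefile:
        for line in sourcefile:
            if line not in seen:  # a plain set lookup instead of fuzzy matching
                seen.add(line)
                destfile.write(line)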
Your code works fine for me. It prints destline and sourceline to stdout when lines are similar (in the example I used, exactly the same), but it only wrote unique lines to the file once. You might need to set your ratio threshold lower for your specific "similarity" needs.
Basically what you need to do is check every line in the source file to see if it has a potential match against every line of the destination file.
##xiaoshanwujzw.txt
##-----------------
##radically different thing
##this is data
##and more data

##bindresult.txt
##--------------
##a website line
##this is data
##and more data

from difflib import SequenceMatcher

sourcefile = open('xiaoshanwujzw.txt', 'r')
sourcelines = sourcefile.readlines()
sourcefile.close()
destfile = open('bindresult.txt', 'a+')
destlines = destfile.readlines()

has_matches = {k: False for k in sourcelines}
for d_line in destlines:
    for s_line in sourcelines:
        if SequenceMatcher(None, d_line, s_line).ratio() > 0.8:
            has_matches[s_line] = True
            break

for k in has_matches:
    if has_matches[k] == False:
        destfile.write(k)
destfile.close()
This will add the line "radically different thing" to the destination file.
