How to compare two image files' contents in Python?

I want to compare two image files (.png): basically read two .png files and assert that their contents are equal.
I have tried the below:
def read_file_contents(file1, file2):
with open(file1, 'r', errors='ignore') as f:
contents1 = f.readlines()
f.close()
with open(file1, 'r', errors='ignore') as f:
contents2 = f.readlines()
f.close()
return {contents1, contents2}
Then, to assert that both contents are equal, I use
assert contents1 == contents2
but this gives me an AssertionError. Could someone help me with this? Thanks.

There are multiple ways to accomplish this task using various Python libraries, including numpy & math, imagehash and pillow.
Here is one way (which I modified to only compare 2 images).
# This module is used to load images
from PIL import Image
# This module contains a number of arithmetical image operations
from PIL import ImageChops

def image_pixel_differences(base_image, compare_image):
    """
    Calculates the bounding box of the non-zero regions in the image.
    :param base_image: target image to find
    :param compare_image: set of images containing the target image
    :return: The bounding box is returned as a 4-tuple defining the
             left, upper, right, and lower pixel coordinate. If the image
             is completely empty, this method returns None.
    """
    # Returns the absolute value of the pixel-by-pixel
    # difference between two images.
    diff = ImageChops.difference(base_image, compare_image)
    if diff.getbbox():
        return False
    else:
        return True
base_image = Image.open('image01.jpeg')
compare_image = Image.open('image02.jpeg')
results = image_pixel_differences(base_image, compare_image)
I have additional examples, so please let me know if this one does not work for you.
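One of those alternatives, for instance: a hedged sketch using the imagehash library mentioned above (pip install imagehash; the file names are assumptions). Unlike the exact pixel diff, perceptual hashes tolerate small re-encoding differences:
from PIL import Image
import imagehash

hash1 = imagehash.average_hash(Image.open('image01.jpeg'))
hash2 = imagehash.average_hash(Image.open('image02.jpeg'))

# Subtracting two hashes gives a Hamming distance; 0 means the
# perceptual hashes are identical, small values mean "very similar".
print(hash1 - hash2)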

If you just want an exact match, you can compare the bytes directly:
def validate_file_contents(file1, file2):
    # 'rb' opens in binary mode; the errors argument only applies to
    # text mode, so it is dropped here.
    with open(file1, 'rb') as f1, open(file2, 'rb') as f2:
        contents1 = f1.read()
        contents2 = f2.read()
    return contents1 == contents2
You could use an assert if you want, but personally I'd check the True/False condition instead.
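For example, checking the boolean directly (file names assumed):
if validate_file_contents('image01.png', 'image02.png'):
    print('Files are byte-identical')
else:
    print('Files differ')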
You also had a few errors in your code:
The content within the with block is not indented.
In a with block you don't need to close() the files.
You are returning a set literal, {contents1, contents2}. Since readlines() returns lists, which are unhashable, this actually raises a TypeError; even with hashable values, two equal contents would collapse into a single item. You probably wanted to return (contents1, contents2) as a tuple.
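Both effects are easy to demonstrate in a REPL:
>>> {'same', 'same'}   # equal items collapse to one
{'same'}
>>> {['a'], ['b']}     # lists (what readlines() returns) are unhashable
Traceback (most recent call last):
  ...
TypeError: unhashable type: 'list'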

I don't think selenium is the right tag here, but whatever.
Images can be, and are, represented as a bunch of pixels (basically numbers) arranged in a way that makes them what they are.
The idea is to take those numbers, with their arrangement, from both pictures and calculate the distance between them. There are multiple ways to do so, such as MSE (mean squared error).
For the code itself and a further explanation, please check out the link below.
https://www.pyimagesearch.com/2014/09/15/python-compare-two-images/
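If you don't want to click through, here is a minimal MSE sketch (assuming numpy and Pillow are available, and that both images share the same dimensions):
import numpy as np
from PIL import Image

def mse(path_a, path_b):
    a = np.asarray(Image.open(path_a), dtype=np.float64)
    b = np.asarray(Image.open(path_b), dtype=np.float64)
    # Mean of the squared per-pixel differences: 0.0 means the pixel
    # data is identical; larger values mean more different images.
    return np.mean((a - b) ** 2)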
Good luck buddy! (:

Related

How to iterate through CSV rows, apply a function to those values, and append to a new column?

I have a Python script which calculates tree heights based on distance and angle from the ground; however, despite the script running with no errors, my heights column is left empty. Also, I don't want to use pandas, and I would like to keep to the 'with open' method if possible, before anyone suggests going about it a different way. Any help would be great, thanks. It seems that the whole script runs fine and does everything I need it to until the "for row in csvread:" block.
This is my current script:
#!/usr/bin/env python3
# Import any modules needed
import sys
import csv
import math
import os
import itertools

# Extract command line arguments, remove file extension and attach to output_filename
input_filename1 = sys.argv[1]
input_filename2 = os.path.splitext(input_filename1)[0]
filenames = (input_filename2, "treeheights.csv")
output_filename = "".join(filenames)

def TreeHeight(degrees, distance):
    """
    This function calculates the heights of trees given distance
    of each tree from its base and angle to its top, using the
    trigonometric formula.
    """
    radians = math.radians(degrees)
    height = distance * math.tan(radians)
    print("Tree height is:", height)
    return height

def main(argv):
    with open(input_filename1, 'r') as f:
        with open(output_filename, 'w') as g:
            csvread = csv.reader(f)
            print(csvread)
            csvwrite = csv.writer(g)
            header = csvread.__next__()
            header.append("Height.m")
            csvwrite.writerow(header)
            # Populating the output csv with the input data
            csvwrite.writerows(itertools.islice(csvread, 0, 121))
            for row in csvread:
                height = TreeHeight(csvread[:,2], csvread[:,1])
                row.append(height)
                csvwrite.writerow(row)
    return 0

if __name__ == "__main__":
    status = main(sys.argv)
    sys.exit(status)
Looking at your code, I think you're mostly there, but are a little confused on reading/writing rows:
# Populating the output csv with the input data
csvwrite.writerows(itertools.islice(csvread, 0, 121))
for row in csvread:
    height = TreeHeight(csvread[:,2], csvread[:,1])
    row.append(height)
    csvwrite.writerow(row)
It looks like you're reading rows 1 through 121 and writing them to your new file. Then, you're trying to iterate over your CSV reader in a second pass, compute the height, tack that computed value onto the end of the row, and write it to your CSV: a completely separate second pass.
If that's true, then you need to understand that CSV reader and writer are not designed to work "left-to-right" like that: read-write these columns, then read-write these columns... nope.
They both work "top-down", processing rows.
I propose, to get this working, iterating every row in one loop, and for every row:
read the values you need from the row to compute the height
get the computed height
append the newly computed value to the original row
write the row
...
header = next(csvread)
header.append("Height.m")
csvwrite.writerow(header)
for row in csvread:
    degrees = float(row[1])   # second column for degrees?
    distance = float(row[0])  # first column for distance?
    height = TreeHeight(degrees, distance)
    row.append(height)
    csvwrite.writerow(row)
Some changes I made:
I replaced header = csvread.__next__() with header = next(csvread). Calling things that start with _ or __ is generally discouraged, at least in the standard library. next(<iterator>) is the built-in function that allows you to properly and safely advance through <iterator>.
Added float() conversion to textual values as read from CSV
Also, as far as I can tell, the ,2/,1 is incorrect syntax for subscripting/slice notation. You didn't get any errors because the reader was already exhausted by the islice() call, so your program never actually stepped into the for row in csvread: loop.
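A quick illustration of that exhaustion, using a plain iterator in place of the CSV reader:
>>> import itertools
>>> reader = iter([['r1'], ['r2'], ['r3']])
>>> list(itertools.islice(reader, 0, 121))   # consumes everything available
[['r1'], ['r2'], ['r3']]
>>> [row for row in reader]                  # nothing left to iterate
[]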

Operations between elements in the same list of lists generated from imported .dat files

I'm writing a program that takes .dat files from a directory one at a time, verifies some condition, and, if verification is okay, copies the files to another directory.
The code below shows how I import the files and create a list of lists. I'm having trouble with the verification step: I tried a for loop, but when I set the if condition, operating on elements of the list of lists seems impossible.
In particular, I need the difference between consecutive elements matrix[i][3] and matrix[i+1][3] to be less than 5.
for filename in glob.glob(os.path.join(folder_path, '*.dat')):
    with open(filename, 'r') as f:
        matrix = []
        data = f.readlines()
        for raw_line in data:
            split_line1 = raw_line.replace(":", ";")
            split_line2 = split_line1.replace("\n", "")
            split_line3 = split_line2.strip().split(";")
            matrix.append(split_line3)
Hello and welcome to Stack Overflow.
You did not provide a sample of your data files. After looking at your code, I assume your data looks like this:
9;9;7;5;0;9;5;8;4;2
9;1;1;5;1;3;4;1;8;7
2;8;4;5;5;2;1;4;6;4
6;4;1;5;5;8;1;4;6;1
0;1;0;5;7;1;7;4;1;9
4;9;6;5;3;2;6;2;9;6
8;0;6;0;8;9;3;1;6;6
A few general remarks:
For parsing a csv file, use the csv module. It is easy to use and less error-prone than writing your own parser.
If you do a lot of data-processing and matrix calculations, you want to have a look at the pandas and numpy libraries. Processing matrices line by line in plain Python is slower by some orders of magnitude.
I understand your description of the verification step as follows:
A matrix matches if all consecutive elements
matrix[i][3] and matrix[i+1][3] differ by less than 5.
My suggested code looks like this:
import csv
from glob import glob
from pathlib import Path

def read_matrix(fn):
    with open(fn) as f:
        c = csv.reader(f, delimiter=";")
        m = [[float(c) for c in row] for row in c]
    return m

def verify_condition(matrix):
    col = 3
    pairs_of_consecutive_rows = zip(matrix[:-1], matrix[1:])
    for row_i, row_j in pairs_of_consecutive_rows:
        if abs(row_i[col] - row_j[col]) >= 5:
            return False
    return True

if __name__ == '__main__':
    folder_path = Path("../data")
    for filename in glob(str(folder_path / '*.dat')):
        print(f"processing {filename}")
        matrix = read_matrix(filename)
        matches = verify_condition(matrix)
        if matches:
            print("match")
            # copy_file(filename, target_folder)
I am not going into detail about the function read_matrix. Just note that I convert the strings to float with the statement float(c) in order to be able to do numerical calculations later on.
I iterate over all consecutive rows by iterating over matrix[:-1] and matrix[1:] at the same time using zip. See the effect of zip in this example:
>>> list(zip("ABC", "XYZ"))
[('A', 'X'), ('B', 'Y'), ('C', 'Z')]
And the effect of the [:-1] and [1:] indices here:
>>> "ABC"[:-1], "ABC"[1:]
('AB', 'BC')
When verify_condition finds the first two consecutive rows whose values differ by at least 5, it returns False.
I am confident that this code should help you going on.
PS: I could not resist using the pathlib library because I really prefer to see code like folder / subfolder / "filename.txt" instead of path.join(folder, subfolder, "filename.txt") in my scripts.
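Regarding the commented-out copy_file(filename, target_folder) placeholder in the code above: a minimal sketch using the standard shutil module (the target folder name is an assumption):
import shutil
from pathlib import Path

def copy_file(filename, target_folder):
    # Create the target folder if needed, then copy the file into it,
    # keeping its original name.
    target = Path(target_folder)
    target.mkdir(parents=True, exist_ok=True)
    shutil.copy(filename, target / Path(filename).name)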

Does Python have a standard PTS reader or parser?

I have the following file:
version: 1
n_points: 68
{
55.866278 286.258077
54.784191 315.123248
62.148364 348.908294
83.264019 377.625584
102.690421 403.808995
125.495327 438.438668
140.698598 471.379089
158.435748 501.785631
184.471278 511.002579
225.857960 504.171628
264.555990 477.159805
298.168768 447.523374
332.502678 411.220089
350.641672 372.839985
355.004106 324.781552
349.265206 270.707703
338.314674 224.205227
33.431075 238.262266
42.204378 227.503948
53.939564 227.904931
68.298209 232.202002
82.271511 239.951519
129.480996 229.905585
157.960824 211.545631
189.465597 204.068108
220.288164 208.206246
249.905282 218.863196
110.089281 266.422557
108.368067 298.896910
105.018473 331.956957
102.889410 363.542719
101.713553 379.256535
114.636047 383.331785
129.543556 384.250352
140.033133 375.640569
152.523364 366.956846
60.326871 270.980865
67.198221 257.376350
92.335775 259.211865
102.394658 274.137548
86.227917 277.162353
68.397650 277.343621
165.340638 263.379230
173.385917 246.412765
198.024842 240.895985
223.488685 247.333206
207.218336 260.967007
184.619159 265.379884
122.903148 418.405102
114.539655 407.643816
123.642553 404.120397
136.821841 407.806210
149.926926 403.069590
196.680098 399.302500
221.946232 394.444167
203.262878 417.808844
164.318232 440.472370
145.915650 444.015386
136.436942 442.897031
125.273506 429.073840
124.666341 420.331816
130.710965 421.709666
141.438004 423.161457
155.870784 418.844649
213.410389 396.978046
155.870784 418.844649
141.438004 423.161457
130.710965 421.709666
}
The file extension is .pts.
Is there some standard reader for this file?
The code I have (downloaded from some github) which tries to read it is
landmark = np.loadtxt(image_landmarks_path)
which fails on
{ValueError}could not convert string to float: 'version:'
which makes sense.
I can't change the file, and I wonder: do I have to write my own parser, or is this some standard format?
It appears to be a 2D point cloud file, I think it's called the Landmark PTS format, the closest Python reference I could find is for a 3D-morphable face model-fitting library issue, which references a sample file that matches yours. Most .pts point cloud tools expect to work with 3D files so may not work out of the box with this one.
So no, there doesn't appear to be a standard reader for this; the closest I came to a library that reads the format is this GitHub repository, but it has a drawback: it reads all data into memory before manually parsing it into Python float values.
However, the format is very simple (as the referenced issue notes), and so you can read the data just using numpy.loadtxt(); the simplistic approach is to just name all those non-data lines as comments:
def read_pts(filename):
    return np.loadtxt(filename, comments=("version:", "n_points:", "{", "}"))
or, if you are not sure about the validity of a bunch of such files and you'd want to ensure you only read valid files, then you could pre-process the file to read the header (including number of points and version validation, allowing for comments and image size info):
from pathlib import Path
from typing import Union

import numpy as np

def read_pts(filename: Union[str, bytes, Path]) -> np.ndarray:
    """Read a .PTS landmarks file into a numpy array"""
    with open(filename, 'rb') as f:
        # process the PTS header for n_rows and version information
        rows = version = None
        for line in f:
            if line.startswith(b"//"):  # comment line, skip
                continue
            header, _, value = line.strip().partition(b':')
            if not value:
                if header != b'{':
                    raise ValueError("Not a valid pts file")
                if version != 1:
                    raise ValueError(f"Not a supported PTS version: {version}")
                break
            try:
                if header == b"n_points":
                    rows = int(value)
                elif header == b"version":
                    version = float(value)  # version: 1 or version: 1.0
                elif not header.startswith(b"image_size_"):
                    # returning the image_size_* data is left as an exercise
                    # for the reader.
                    raise ValueError
            except ValueError:
                raise ValueError("Not a valid pts file")

        # if there was no n_points line, make sure the closing } line
        # is not going to trip up the numpy reader by marking it as a comment
        points = np.loadtxt(f, max_rows=rows, comments="}")
        if rows is not None and len(points) < rows:
            raise ValueError(f"Failed to load all {rows} points")
        return points
That function is as production-ready as I can make it, apart from providing a full test suite.
This uses the n_points: line to tell np.loadtxt() how many rows to read, and moves the file position forward to just past the { opener. It'll also exit with a ValueError if there is no version: 1 line present or if there is anything other than version: 1 and n_points: <int> in the header.
Both produce a 68x2 matrix of float64 values but should be able to work with any dimension of points.
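For example, with the sample file from the question (the file name here is an assumption):
>>> pts = read_pts('landmarks.pts')
>>> pts.shape
(68, 2)
>>> pts.dtype
dtype('float64')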
Circling back to that EOS library reference, their demo code to read the data hand-parses the lines, also by reading all lines into memory first. I also found this Facebook Research PTS dataset loading code (for .pts files with 3 values per line), which is just as manual.

How do I read a text file of numbers into an array of arrays

In python, using the OpenCV library, I need to create some polylines. The example code for the polylines method shows:
cv2.polylines(img,[pts],True,(0,255,255))
I have all the 'pts' laid out in a text file in the format:
x1,y1,x2,y2,x3,y3,x4,y4
x1,y1,x2,y2,x3,y3,x4,y4
x1,y1,x2,y2,x3,y3,x4,y4
How can I read this file and provide the data to the [pts] variable in the method call?
I've tried the np.array(csv.reader(...)) method as well as a few others I've found examples of. I can successfully read the file, but it's not in the format the polylines method wants. (I am a newbie when it comes to python, if this was C++ or Java, it wouldn't be a problem).
I would try to use numpy to read the csv as an array.
from numpy import genfromtxt

p = genfromtxt('myfile.csv', delimiter=',')
cv2.polylines(img, p, True, (0, 255, 255))
You may have to pass a dtype argument to genfromtxt if you need to coerce the data to a specific format.
https://docs.scipy.org/doc/numpy/reference/generated/numpy.genfromtxt.html
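For instance, a hedged sketch of that coercion; OpenCV drawing functions want integer point coordinates ('myfile.csv' as above):
import numpy as np
from numpy import genfromtxt

# dtype=np.int32 makes genfromtxt parse directly into the integer
# format that cv2.polylines expects for point arrays.
p = genfromtxt('myfile.csv', delimiter=',', dtype=np.int32)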
In case you know it is a fixed number of items in each row:
import csv

with open('myfile.csv') as csvfile:
    rows = csv.reader(csvfile)
    res = list(zip(*rows))
    print(res)
I know it's not pretty and there is probably a MUCH BETTER way to do this, but it works. That being said, if someone could show me a better way, it would be much appreciated.
pointlist = []
f = open(args["slots"])
data = f.read().split()
for row in data:
    tmp = []
    col = row.split(";")
    for points in col:
        xy = points.split(",")
        tmp += [[int(pt) for pt in xy]]
    pointlist += [tmp]
slots = np.asarray(pointlist)
You might need to draw each polyline individually (to expand on @Chris's answer):
import numpy as np
from numpy import genfromtxt

lines = genfromtxt('myfile.csv', delimiter=',')
for line in lines:
    # cv2.polylines expects a list of integer point arrays, so reshape
    # each row into (x, y) pairs and cast to int32.
    pts = line.reshape((-1, 2)).astype(np.int32)
    cv2.polylines(img, [pts], True, (0, 255, 255))

Python similar string removal from multiple files

I have crawled txt files from different websites; now I need to glue them into one file. Many lines from the various websites are similar to each other, and I want to remove the repetitions.
Here is what I have tried:
import difflib

sourcename = 'xiaoshanwujzw'
destname = 'bindresult'

sourcefile = open('%s.txt' % sourcename)
sourcelines = sourcefile.readlines()
sourcefile.close()

for sourceline in sourcelines:
    destfile = open('%s.txt' % destname, 'a+')
    destlines = destfile.readlines()
    similar = False
    for destline in destlines:
        ratio = difflib.SequenceMatcher(None, destline, sourceline).ratio()
        if ratio > 0.8:
            print(destline)
            print(sourceline)
            similar = True
    if not similar:
        destfile.write(sourceline)
    destfile.close()
I will run it for every source, and write line by line to the same file. The result is that even if I run it for the same file multiple times, the lines are always appended to the destination file.
EDIT:
I have tried the code from the answer. It's still very slow.
Even if I minimize the IO, I still need to do O(n^2) comparisons, especially with 1000+ lines. I have on average 10,000 lines per file.
Any other ways to remove the duplicates?
Here is a short version that does minimal IO and cleans up after itself.
import difflib

sourcename = 'xiaoshanwujzw'
destname = 'bindresult'

# 'a+' (rather than 'w+') keeps existing content across runs; seek back
# to the start so the existing lines can be read in.
with open('%s.txt' % destname, 'a+') as destfile:
    destfile.seek(0)
    # we read in the file so that on subsequent runs of this script, we
    # won't duplicate the lines.
    known_lines = set(destfile.readlines())
    with open('%s.txt' % sourcename) as sourcefile:
        for line in sourcefile:
            similar = False
            for known in known_lines:
                ratio = difflib.SequenceMatcher(None, line, known).ratio()
                if ratio > 0.8:
                    print(ratio)
                    print(line)
                    print(known)
                    similar = True
                    break
            if not similar:
                destfile.write(line)
                known_lines.add(line)
Instead of reading the known lines each time from the file, we save them to a set, which we use for comparison against. The set is essentially a mirror of the contents of 'destfile'.
A note on complexity
By its very nature, this problem has O(n^2) complexity. Because you're looking for similarity with known strings, rather than identical strings, you have to look at every previously seen string. If you were looking to remove exact duplicates rather than fuzzy matches, you could use a simple lookup in a set, with complexity O(1), making your entire solution O(n).
There might be a way to reduce the fundamental complexity by using lossy compression on the strings so that two similar strings compress to the same result. This is however both out of scope for a stack overflow answer, and beyond my expertise. It is an active research area so you might have some luck digging through the literature.
You could also reduce the time taken by ratio() by using the less accurate alternatives quick_ratio() and real_quick_ratio().
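A hedged sketch of that idea: quick_ratio() and real_quick_ratio() are documented upper bounds on ratio(), so if even the optimistic estimate falls below the threshold, the expensive exact computation can be skipped:
from difflib import SequenceMatcher

def is_similar(a, b, threshold=0.8):
    sm = SequenceMatcher(None, a, b)
    # Cheapest upper bound first, then the tighter one, then the real ratio.
    if sm.real_quick_ratio() < threshold:
        return False
    if sm.quick_ratio() < threshold:
        return False
    return sm.ratio() >= threshold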
Your code works fine for me. It prints destline and sourceline to stdout when lines are similar (in the example I used, exactly the same), but it only wrote unique lines to the file once. You might need to set your ratio threshold lower for your specific "similarity" needs.
Basically what you need to do is check every line in the source file to see if it has a potential match against every line of the destination file.
##xiaoshanwujzw.txt
##-----------------
##radically different thing
##this is data
##and more data

##bindresult.txt
##--------------
##a website line
##this is data
##and more data

from difflib import SequenceMatcher

sourcefile = open('xiaoshanwujzw.txt', 'r')
sourcelines = sourcefile.readlines()
sourcefile.close()

destfile = open('bindresult.txt', 'a+')
destlines = destfile.readlines()

has_matches = {k: False for k in sourcelines}

for d_line in destlines:
    for s_line in sourcelines:
        if SequenceMatcher(None, d_line, s_line).ratio() > 0.8:
            has_matches[s_line] = True
            break

for k in has_matches:
    if not has_matches[k]:
        destfile.write(k)
destfile.close()
This will add the line "radically different thing" to the destination file.
