Python Image hashing - python

I'm trying to compute a hash of an image in Python. I have it working, up to a point.
However, I have this issue:
Image1 and image2 end up with the same hash even though they are different images. I need a form of hashing that is more accurate and precise.
The hash for the images is: faf0761493939381
I am currently using PIL (from PIL import Image) and the imagehash package, calling imagehash.average_hash.
Code here:
import os
from PIL import Image
import imagehash

def checkImage():
    for filename in os.listdir('images//'):
        hashedImage = imagehash.average_hash(Image.open('images//' + filename))
        print(filename, hashedImage)
    for filename in os.listdir('checkimage//'):
        check_image = imagehash.average_hash(Image.open('checkimage//' + filename))
        print(filename, check_image)
    if check_image == hashedImage:
        print("Same image")
    else:
        print("Not the same image")
    print(hashedImage, check_image)

checkImage()

Try using hashlib. Just open the file and perform a hash.
import hashlib

# Simple solution
with open("image.extension", "rb") as f:
    hash = hashlib.sha256(f.read()).hexdigest()

# General-purpose solution that can process large files
def file_hash(file_path):
    # https://stackoverflow.com/questions/22058048/hashing-a-file-in-python
    sha256 = hashlib.sha256()
    with open(file_path, "rb") as f:
        while True:
            data = f.read(65536)  # arbitrary number to reduce RAM usage
            if not data:
                break
            sha256.update(data)
    return sha256.hexdigest()
Thanks to Antonín Hoskovec for pointing out that the file should be opened in binary mode (rb), not plain read mode (r)!

By default, imagehash checks if image files are nearly identical. The files you are comparing are more similar than they are not. If you want a more or less unique way of fingerprinting files you can use a different approach, such as employing a cryptographic hashing algorithm:
import hashlib

def get_hash(img_path):
    # This function will return the `md5` checksum for any input image.
    with open(img_path, "rb") as f:
        img_hash = hashlib.md5()
        while chunk := f.read(8192):
            img_hash.update(chunk)
    return img_hash.hexdigest()
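To illustrate why two different images can share an average hash while their cryptographic digests differ, here is a toy sketch. It is not imagehash's implementation: it skips the resize and grayscale steps and works on tiny hand-written pixel grids, and the names average_hash_bits, hamming, img_a and img_b are made up for this example.

```python
import hashlib

def average_hash_bits(pixels):
    """Toy average hash: one bit per pixel, set when the pixel is brighter than the mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return "".join("1" if p > mean else "0" for p in flat)

def hamming(a, b):
    """Number of positions at which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))

# Two different 2x2 grayscale "images" with the same bright/dark layout
img_a = [[10, 200], [220, 30]]
img_b = [[12, 190], [230, 25]]

hash_a = average_hash_bits(img_a)
hash_b = average_hash_bits(img_b)

# A cryptographic digest of the raw pixel bytes changes with any byte
sha_a = hashlib.sha256(bytes(p for row in img_a for p in row)).hexdigest()
sha_b = hashlib.sha256(bytes(p for row in img_b for p in row)).hexdigest()

print(hash_a, hash_b, hamming(hash_a, hash_b))
print(sha_a == sha_b)
```

With imagehash itself, subtracting two hashes gives their Hamming distance, and a larger hash_size or a different hash function such as phash may separate images that average_hash maps to the same value.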

Related

recommended way to download images in python requests

I see that there are two ways to download images using python-requests.
Using PIL, as stated in the docs (https://requests.readthedocs.io/en/master/user/quickstart/#binary-response-content):
from PIL import Image
from io import BytesIO
i = Image.open(BytesIO(r.content))
Using streamed response content:
r = requests.get(url, stream=True)
with open(image_name, 'wb') as f:
    for chunk in r.iter_content():
        f.write(chunk)
Which is the recommended way to download images, though? Both have their merits, I suppose, and I was wondering what the optimal approach is.
I like the minimalist way. There is no single right way; it depends on the task you want to perform and the constraints you have.
import requests
with open('file.png', 'wb') as f:
    f.write(requests.get(url).content)
# Changing png to jpg in the file name raises no error, but the bytes
# are written unchanged: no image format conversion takes place.
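For large files the streamed variant from the question may be preferable, because .content loads the whole body into memory. Note that iter_content() defaults to a chunk size of 1 byte, so passing an explicit chunk_size is usually worthwhile. A minimal sketch, assuming the response was obtained with requests.get(url, stream=True); the helper name save_response_stream and the 8192-byte chunk size are choices made for this example:

```python
def save_response_stream(response, path, chunk_size=8192):
    """Write a streamed response body to `path` chunk by chunk.

    `response` is assumed to behave like requests.Response, i.e. to
    provide iter_content(chunk_size=...). Returns the bytes written.
    """
    written = 0
    with open(path, "wb") as f:
        for chunk in response.iter_content(chunk_size=chunk_size):
            if chunk:  # skip empty chunks
                f.write(chunk)
                written += len(chunk)
    return written

# Possible usage (needs the requests package and network access):
# import requests
# r = requests.get(url, stream=True)
# r.raise_for_status()
# save_response_stream(r, "image.png")
```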
I used the lines of code below in a function to save images.
# Import the required standard-library modules
import pathlib, urllib.request, os, uuid
# URL of the image you want to download
image_url = "https://example.com/image.png"
# Use uuid to generate a new, unique name for the image
filename = str(uuid.uuid4())
# Take the file extension from the image's original name
file_ext = pathlib.Path(image_url).suffix
# Join the new image name to the extension
picture_filename = filename + file_ext
# Using pathlib, specify where the image is to be saved
downloads_path = str(pathlib.Path.home() / "Downloads")
# Form a full image path by joining the directory to the
# image's new name
picture_path = os.path.join(downloads_path, picture_filename)
# Using "urlretrieve()" from urllib.request, save the image
urllib.request.urlretrieve(image_url, picture_path)
# urlretrieve() is called here with 2 arguments:
# 1. The URL of the image to be downloaded
# 2. The path to save the image to; a relative path is
# resolved against your current working directory
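One caveat with taking the suffix straight from the URL string: if the URL carries a query string, it ends up inside the suffix. A possible workaround, sketched with the standard library only, is to parse the URL first and take the suffix from its path component. The helper name unique_filename_for and the example URL are made up for illustration.

```python
import pathlib
import uuid
from urllib.parse import urlparse

def unique_filename_for(url):
    """Build a random file name that keeps the extension from the URL's path."""
    url_path = urlparse(url).path                      # e.g. "/image.png"
    extension = pathlib.PurePosixPath(url_path).suffix
    return str(uuid.uuid4()) + extension

example_url = "https://example.com/image.png?size=large"

# Suffix taken from the raw URL string includes the query string
naive_suffix = pathlib.PurePosixPath(example_url).suffix

# Suffix taken from the parsed path does not
name = unique_filename_for(example_url)

print(naive_suffix)
print(name)
```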

Calculate CRC32, MD5 and SHA1 of zip content without decompression in Python

I need to calculate the CRC32, MD5 and SHA1 of the content of zip files without decompressing them.
So far I have found out how to calculate these for the zip files themselves, e.g.:
CRC32:
import zlib

zip_name = "test.zip"

def Crc32Hasher(file_path):
    buf_size = 65536
    crc32 = 0
    with open(file_path, 'rb') as f:
        while True:
            data = f.read(buf_size)
            if not data:
                break
            crc32 = zlib.crc32(data, crc32)
    return format(crc32 & 0xFFFFFFFF, '08x')

print(Crc32Hasher(zip_name))
SHA1: (MD5 similarly)
import hashlib

zip_name = "test.zip"

def Sha1Hasher(file_path):
    buf_size = 65536
    sha1 = hashlib.sha1()
    with open(file_path, 'rb') as f:
        while True:
            data = f.read(buf_size)
            if not data:
                break
            sha1.update(data)
    return sha1.hexdigest()  # hexdigest() already returns a string

print(Sha1Hasher(zip_name))
For the content of the zip file, I can read the CRC32 from the zip directly, without needing to calculate it, as follows:
Read CRC32 of zip content:
import zipfile

zip_name = "test.zip"
if zip_name.lower().endswith('.zip'):
    with zipfile.ZipFile(zip_name, "r") as z:
        for info in z.infolist():
            print(info.filename,
                  format(info.CRC & 0xFFFFFFFF, '08x'))
But I couldn't figure out how to calculate the SHA1 (or MD5) of the content of zip files without decompressing them first.
Is that somehow possible?
It is not possible. You can get the CRC because it was precalculated when the archive was created (it is used for integrity checks). Any other checksum or hash has to be calculated from scratch, which requires at least streaming the archive content, i.e. unpacking it.
UPD: Possible implementations
libarchive: extra dependencies, supports many archive formats
import hashlib
import libarchive.public as libarchive

with libarchive.file_reader(fname) as archive:
    for entry in archive:
        md5 = hashlib.md5()
        for block in entry.get_blocks():
            md5.update(block)
        print(str(entry), md5.hexdigest())
Native zipfile: no dependencies, zip only
import hashlib
import zipfile

archive = zipfile.ZipFile(fname)
blocksize = 1024**2  # 1M chunks
for fname in archive.namelist():
    entry = archive.open(fname)
    md5 = hashlib.md5()
    while True:
        block = entry.read(blocksize)
        if not block:
            break
        md5.update(block)
    print(fname, md5.hexdigest())
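The zipfile approach can be wrapped into a small function that returns the stored CRC32 next to a digest computed by streaming each member. This is a sketch under assumptions: the function name zip_member_digests, its parameters, and the in-memory demo archive are invented for illustration. Each member is still decompressed block by block, but nothing is extracted to disk.

```python
import hashlib
import io
import zipfile

def zip_member_digests(zip_source, algorithm="sha1", block_size=1024 * 1024):
    """Return {member name: (stored CRC32 hex, computed digest hex)}."""
    results = {}
    with zipfile.ZipFile(zip_source) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            hasher = hashlib.new(algorithm)
            # archive.open() decompresses on the fly as blocks are read
            with archive.open(info) as member:
                for block in iter(lambda: member.read(block_size), b""):
                    hasher.update(block)
            crc_hex = format(info.CRC & 0xFFFFFFFF, "08x")
            results[info.filename] = (crc_hex, hasher.hexdigest())
    return results

# Demo on an archive built in memory
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("hello.txt", b"hello world")
buffer.seek(0)

digests = zip_member_digests(buffer)
for member_name, (crc_hex, sha1_hex) in digests.items():
    print(member_name, crc_hex, sha1_hex)
```

Passing algorithm="md5" gives the MD5 variant; a path such as "test.zip" can be passed instead of the in-memory buffer.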

Python: Generating a MD5 Hash of a file using Hashlib

I am trying to generate hashes of files using hashlib inside Tkinter modules.
My goal:
Step 1:- Button (clicked), opens up a browser (click file you want a hash of).
Step 2:- Once file is chosen, choose output file (.txt) where the hash will be 'printed'.
Step 3:- Repeat and have no clashes.
from tkinter.filedialog import askopenfilename
import hashlib
def hashing():
    hash = askopenfilename(title="Select file for Hashing")
    savename = askopenfilename(title="Select output")
    outputhash = open(savename, "w")
    hash1 = open(hash, "r")
    h = hashlib.md5()
    print(h.hexdigest(), file=outputhash)
    love.flush()
It 'works' to some extent: it allows an input file and an output file to be selected, and it prints the hash into the output file.
HOWEVER, whichever file I choose, I get the same hash every time.
I'm new to Python and it's really stumping me.
Thanks in advance.
Thanks for all your comments.
I figured the problem and this is my new code:
from tkinter.filedialog import askopenfilename
import hashlib
def hashing():
    hash = askopenfilename(title="Select file for Hashing")
    savename = askopenfilename(title="Select output")
    outputhash = open(savename, "w")
    curfile = open(hash, "rb")
    hasher = hashlib.md5()
    buf = curfile.read()
    hasher.update(buf)
    print(hasher.hexdigest(), file=outputhash)
    outputhash.flush()
This code works. You guys rock. :)
In your case the hasher is never fed any data, so you compute the digest of the empty string, and you probably get:
d41d8cd98f00b204e9800998ecf8427e
I used this method to digest, which is better for big files:
md5 = hashlib.md5()
with open(File, "rb") as f:
    # In binary mode read() returns bytes, so the sentinel must be b""
    for block in iter(lambda: f.read(128), b""):
        md5.update(block)
print(md5.hexdigest())
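Both points can be checked with a short, self-contained sketch: a hasher that is never updated yields the digest of empty input, and a chunked read over a file opened in binary mode needs the bytes sentinel b"" (a "" sentinel never equals the b"" that read() returns at end of file, so the loop would not stop). The names file_md5, EMPTY_MD5 and the temporary file are introduced here purely for illustration.

```python
import hashlib
import os
import tempfile

def file_md5(path, block_size=65536):
    """MD5 of a file, read in blocks so large files do not fill memory."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            md5.update(block)
    return md5.hexdigest()

# Digest of a hasher that was never updated, i.e. of empty input
EMPTY_MD5 = hashlib.md5().hexdigest()

# Hash a throwaway file to show that real content gives another digest
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"some file content")
    tmp_path = tmp.name
try:
    digest = file_md5(tmp_path)
finally:
    os.remove(tmp_path)

print(EMPTY_MD5)
print(digest)
```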
A very simple way:
from hashlib import md5
f = open("file.txt", "rb")  # binary mode: md5() needs bytes, not str
data = f.read()
f.close()
Hash = md5(data).hexdigest()
out = open("out.txt", "w")
out.write(Hash)
out.close()

Get Binary Representation of PIL Image Without Saving

I am writing an application that uses images intensively. It is composed of two parts. The client part is written in Python. It does some preprocessing on images and sends them over TCP to a Node.js server.
After preprocessing, the Image object looks like this:
window = img.crop((x, y, width + x, height + y))
window = window.resize((48, 48), Image.LANCZOS)  # was Image.ANTIALIAS, removed in Pillow 10
To send that over a socket, I have to have it in binary format. What I am doing now is:
window.save("window.jpg")
infile = open("window.jpg","rb")
encodedWindow = base64.b64encode(infile.read())
#Then send encodedWindow
This is a huge overhead, though, since I am saving the image to the hard disk first, then loading it again to obtain the binary format. This is causing my application to be extremely slow.
I read the documentation of PIL Image, but found nothing useful there.
According to the documentation (at effbot.org):
"You can use a file object instead of a filename. In this case, you must always specify the format. The file object must implement the seek, tell, and write methods, and be opened in binary mode."
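This means you can pass an in-memory file object (StringIO.StringIO in Python 2, io.BytesIO in Python 3), write to it, and get the data without ever hitting the disk.
Like this:
import base64
from io import BytesIO  # StringIO.StringIO() in Python 2
s = BytesIO()
window.save(s, "JPEG")  # the format name is "JPEG"; "jpg" is not recognized
encodedWindow = base64.b64encode(s.getvalue())
Use BytesIO: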
from io import BytesIO
from PIL import Image

photo = Image.open('photo.jpg')
s = BytesIO()
photo.save(s, 'jpeg')
data = s.getvalue()
with open('photo2.jpg', mode='wb') as f:
    f.write(data)
It's about the difference between an in-memory file-like object and a BufferedReader object.
Here is my experiment in Jupyter (Python 3.8.10):
from PIL import Image as PILImage, ImageOps as PILImageOps
from IPython.display import display, Image
from io import BytesIO
import base64
import requests
url = "https://learn.microsoft.com/en-us/archive/msdn-magazine/2018/april/images/mt846470.0418_mccaffreytrun_figure2_hires(en-us,msdn.10).png"
print("get computer-readable bytes from the url")
img_bytes = requests.get(url).content
print(type(img_bytes))
display(Image(img_bytes))
print("convert to in-memory file-like object")
in_memory_file_like_object = BytesIO(img_bytes)
print(type(in_memory_file_like_object))
print("convert to an PIL Image object for manipulating")
pil_img = PILImage.open(in_memory_file_like_object)
print("let's rotate it, and it remains a PIL Image object")
pil_img.show()
rotated_img = pil_img.rotate(45)
print(type(rotated_img))
print("let's create an in-memory file-like object and save the PIL Image object into it")
in_memory_file_like_object = BytesIO()
rotated_img.save(in_memory_file_like_object, 'png')
print(type(in_memory_file_like_object))
print("get computer-readable bytes")
img_bytes = in_memory_file_like_object.getvalue()
print(type(img_bytes))
display(Image(img_bytes))
print('convert to base64 to be transmitted over channels that do not preserve all 8-bits of data, such as email')
# https://stackoverflow.com/a/8909233/3552975
base_64 = base64.b64encode(img_bytes)
print(type(base_64))
# https://stackoverflow.com/a/45928164/3552975
assert base64.b64encode(base64.b64decode(base_64)) == base_64
In short, you can save a PIL Image object into an in-memory file-like object with rotated_img.save(in_memory_file_like_object, 'png') as shown above, and then convert the contents of that in-memory object into base64.
from io import BytesIO
b = BytesIO()
img.save(b, format="png")
b.seek(0)
data = b.read()
del b
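The seek(0) before read() matters because a BytesIO keeps a current position, which sits at the end after writing, whereas getvalue() returns the whole buffer regardless of position. A small sketch with plain bytes standing in for the saved image (the byte string below is a made-up placeholder, not real image data):

```python
from io import BytesIO

buffer = BytesIO()
# Stand-in for img.save(buffer, format="png")
buffer.write(b"\x89PNG placeholder image bytes")

at_end = buffer.read()      # position is at the end after writing
whole = buffer.getvalue()   # entire contents, position is ignored

buffer.seek(0)              # rewind to the start
after_seek = buffer.read()  # now read() returns everything

print(at_end)
print(whole == after_seek)
```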

In Memory Image to Zipfile

I wrote this code:
with Image.open(objective.picture.read()) as image:
    image_file = BytesIO()
    exifdata = image.info['exif']
    image.save(image_file, 'JPEG', quality=50, exif=exifdata)
    zf.writestr(zipped_filename, image_file)
It is supposed to open the image stored in my model (this is in a Django application). I want to reduce the quality of the image file before adding it to the zipfile (zf), so I decided to work with BytesIO to avoid writing a useless file to disk. However, I'm getting an error here. It says:
embedded NUL character
Could someone help me out with this? I don't understand what's going on.
Well, I was kind of dumb. objective.picture.read() returns a byte string (a really long one), while Image.open() expects a file name or a file object. So instead of Image.open() I should have used ImageFile.Parser() and fed that byte string to the parser, so that it returns an Image I can work with. Here is the code:
from PIL import ImageFile
from io import BytesIO
p = ImageFile.Parser()
p.feed(objective.picture.read())
image = p.close()
image_file = BytesIO()
exifdata = image.info['exif']
image.save(image_file, 'JPEG', quality=50, exif=exifdata)
# Here zf is a zipfile writer
zf.writestr(zipped_filename, image_file.getvalue())
The close() call is what actually returns the image parsed from the byte string.
Here is the doc: the ImageFile documentation
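The other change in the working code is the argument to writestr(): it takes str or bytes data, so the buffer's contents (getvalue()) are passed rather than the BytesIO object itself. A minimal standard-library sketch, with placeholder bytes standing in for the re-encoded JPEG and a made-up member name:

```python
import io
import zipfile

# Stand-in for image.save(image_file, 'JPEG', quality=50, exif=exifdata)
image_file = io.BytesIO()
image_file.write(b"placeholder for re-encoded JPEG bytes")

# Build the archive in memory and store the buffer's contents
archive_bytes = io.BytesIO()
with zipfile.ZipFile(archive_bytes, "w") as zf:
    zf.writestr("picture.jpg", image_file.getvalue())

# Read the member back to confirm what was stored
archive_bytes.seek(0)
with zipfile.ZipFile(archive_bytes) as zf:
    names = zf.namelist()
    restored = zf.read("picture.jpg")

print(names, len(restored))
```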
