Python hash of identical files comes out different - python

I've written a simple Python script to copy a file from one place to another. (It's for class, so that's why I'm not using something simpler like shutil.) I have a check at the end that compares the hashes of the two files, and it consistently tells me they're different, even though the copying is successful - both are text files that say "hello world".
Here is my code:
import os

def validity_checker(address1, dest_name):
    try:
        src = open(address1, 'rb')
        dest = open(dest_name, 'wb+')
    except IOError:
        return False
    return True

def copaste(address1, address2):
    # concatenate address2 into filename
    file_ending = address1.split('\\').pop()
    dest_name = address2 + '\\' + file_ending
    # copy file after calling checker
    if validity_checker(address1, dest_name):
        src = open(address1, 'rb')
        dest = open(dest_name, 'wb+')
        contents = src.read()
        dest.write(contents)
        src.close()
        dest.close()
    else:
        print("File name bad. No action taken")
    print src
    print dest
    print(hash(src))  # hash the file not the string
    print(hash(dest))
    return
And the output:
<closed file 'C:\\Users\\user\\Downloads\\hello.txt', mode 'rb' at 0x04B7D1D8>
<closed file 'C:\\Users\\user\\Downloads\\dest\\hello.txt', mode 'wb+' at 0x04C2B860>
-2042961099
4991878
Plus the file is copied.
I'm fairly sure the hash is checking the file itself, not the string. Is it maybe something to do with metadata? Any help would be greatly appreciated.

You are using the Python-specific hash() function, which calculates a hash for use on dictionary keys and set contents.
For file objects, the hash() is based on the object identity; you can't base it on anything else, because two distinct file objects are never equal. The fileobject.__eq__ method returns True only if both objects are one and the same in memory (so is would be true too). The file contents, the name of the file, the mode and all the other object attributes play no role in the hash value produced.
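A quick way to see this identity-based hashing in action (a hypothetical demo; it assumes a hello.txt in the working directory):
f1 = open('hello.txt', 'rb')
f2 = open('hello.txt', 'rb')
print(hash(f1) == hash(f2))  # False: two distinct objects, two hashes
print(f1 == f2)              # False as well, despite identical contents
f1.close()
f2.close()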
From the function documentation:
Return the hash value of the object (if it has one). Hash values are integers. They are used to quickly compare dictionary keys during a dictionary lookup.
If you need to validate that the file copy contains the same data, you need to hash the *file contents* using a cryptographic hash function, which is something completely different. Use the hashlib module; for your use case the simple and fast MD5 algorithm will do:
import hashlib

for closed_file in (src, dest):
    with open(closed_file.name, 'rb') as reopened:  # opened in binary mode!
        print(reopened.name)
        print(hashlib.md5(reopened.read()).hexdigest())
If the binary contents of the two files are exactly the same, their cryptographic hashes will also be the same.

You are getting the Python hash of the file object, not of the contents of the file. As a minimum you should do:
print(hash(open(address1, 'rb').read()))
print(hash(open(dest_name, 'rb').read()))
But since this still risks collisions, you should do as Martijn suggests and use a hashlib function.
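For instance, a hashlib-based check could look like this (a sketch; md5_file is a hypothetical helper name, and reading in chunks keeps memory use bounded for large files):
import hashlib

def md5_file(path, chunk_size=8192):
    digest = hashlib.md5()
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)
    return digest.hexdigest()

print(md5_file(address1) == md5_file(dest_name))  # True when the copies match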

Related

Python--best way to use "open" command with in-memory str

I have a library I need to call that takes a local file path as input and runs open(local_path, 'rb'). However, I don't have a local file -- I have an in-memory text string. Right now I am writing that to a temp file and passing that, but it seems wasteful. Is there a better way to do this, given that I need to be able to run open(local_path, 'rb') on it?
Current code:
text = "Some text"
temp = tempfile.TemporaryFile(delete=False)
temp.write(bytes(text, 'UTF-8'))
temp.seek(0)
temp.close()
#call external lib here, passing in temp.name as the local_path input
Later, inside the lib I need to use (I can't edit this):
with open(local_path, 'rb') as content_file:
    file_content = content_file.read()
Since the function you call in turn calls open() with the passed parameter, you must give it a str or a PathLike. This means you basically need a file which exists in the file system. You won't be able to pass an in-memory object like I was originally thinking.
Original answer:
I suggest looking at the io package. Specifically, StringIO provides a file-like wrapper on an in-memory string object. If you need binary, then try BytesIO.
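If the consumer had accepted a file-like object rather than a path, a minimal BytesIO sketch would look like this:
import io

buf = io.BytesIO(b"Some text")  # in-memory buffer with a file interface
print(buf.read())               # b'Some text'
buf.seek(0)                     # rewind, just like a real file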

Python hash from file not as expected

My problem is the following:
I want to create a little tool in Python that creates hash values for entered text or from files. I've created all necessary things, GUI, option to select between hash functions, everything is fine.
But when I was testing the program, I realized that the hashes generated from files aren't the same as the ones given by most download pages. I was confused, and downloaded some other hashing tools; they all gave me the same hash as provided on the several websites, but my tool always gives some other output.
The odd thing is that the hashes generated from "plain text" are identical in my tool and in all the others.
The app uses wxPython, but I've extracted the function I use for creating hashes from files:
import os, hashlib

path = "C:\file.txt"  # Given from some open file dialog, valid file
text = ""
if os.path.isfile(path):
    text_file = open(path, "r")
    text = text_file.read()
    text_file.close()
print hashlib.new("md5", text).hexdigest()  # Could be any hash function
Quite simple, but it doesn't work as expected.
It seems to work if there's no newline (\n) in the file. But how do I make it work with newlines? Practically every file has more than one line.
It is a problem of quoting the backslash character: in "C:\file.txt", the \f is read as the form-feed escape character, not as a backslash followed by f; see https://docs.python.org/2/reference/lexical_analysis.html#literals. Use two backslashes to specify the file name. I would also recommend reading the file in binary mode: in text mode on Windows, \r\n line endings are translated to \n on read, which changes the bytes being hashed and explains why only files containing newlines produced unexpected hashes. As a precaution, print the length of the variable text to make sure the file was read.
import os, hashlib

path = "C:\\file.txt"  # Given from some open file dialog, valid file
text = ""
if os.path.isfile(path):
    text_file = open(path, "rb")
    text = text_file.read()
    text_file.close()
print len(text)
print hashlib.new("md5", text).hexdigest()  # Could be any hash function
Try creating the md5 object once and updating it with the file contents incrementally, as below (wrapped in a hypothetical md5_for_file helper so the digest can be returned):
import hashlib

def md5_for_file(filepath):
    md5 = hashlib.new('md5')
    with open(filepath, 'rb') as f:
        for line in f:
            md5.update(line)
    return md5.hexdigest()
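Hypothetical usage of the helper above:
print(md5_for_file("C:\\file.txt"))  # prints the 32-character hex digest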

Python - mechanism to identify compressed file type and uncompress

A compressed file can be classified into the below logical groups:
a. the operating system you are working on (*ix, Win, etc.)
b. the compression algorithm itself (.zip, .Z, .bz2, .rar, .gzip), at least from a standard list of the most commonly used compressed formats
c. the tarball mechanism, where I suppose there is no compression; it acts more like a concatenation.
Now, if we start addressing the above set of compressed files:
a. Option (a) would be taken care of by Python, since it is a platform-independent language.
b. Options (b) and (c) seem to have a problem.
What do I need
How do I identify the file type (compression type) and then UN-compress them?
Like:
fileType = getFileType(fileName)
switch(fileType):
    case .rar: unrar....
    case .zip: unzip....
    etc
So the fundamental question is: how do we identify the compression algorithm from the file contents (assuming the extension is not provided or is incorrect)? Is there any specific way to do it in Python?
This page has a list of "magic" file signatures. Grab the ones you need and put them in a dict like below. Then we need a function that matches the dict keys with the start of the file. I've written a suggestion, though it can be optimized by preprocessing the magic_dict into e.g. one giant compiled regexp.
magic_dict = {
    "\x1f\x8b\x08": "gz",
    "\x42\x5a\x68": "bz2",
    "\x50\x4b\x03\x04": "zip"
}

max_len = max(len(x) for x in magic_dict)

def file_type(filename):
    with open(filename) as f:
        file_start = f.read(max_len)
    for magic, filetype in magic_dict.items():
        if file_start.startswith(magic):
            return filetype
    return "no match"
This solution should be cross-platform and is of course not dependent on the file name extension, but it may give false positives for files with random content that just happens to start with some specific magic bytes.
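The preprocessing into one giant compiled regexp mentioned above could look roughly like this (a sketch reusing magic_dict and max_len from the snippet; longer signatures are tried first so a short magic cannot shadow a longer one; under Python 3 you would use the byte-string dict from the answer further down and a b"|" separator):
import re

signatures = sorted(magic_dict, key=len, reverse=True)
magic_re = re.compile("|".join(re.escape(sig) for sig in signatures))

def file_type_re(filename):
    with open(filename, 'rb') as f:
        file_start = f.read(max_len)
    match = magic_re.match(file_start)  # match() is anchored at offset 0
    return magic_dict[match.group()] if match else "no match"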
Based on lazyr's answer and my comment, here is what I mean:
import zipfile
import bz2
import gzip

class CompressedFile(object):
    magic = None
    file_type = None
    mime_type = None
    proper_extension = None

    def __init__(self, f):
        # f is an open file or file-like object
        self.f = f
        self.accessor = self.open()

    @classmethod
    def is_magic(cls, data):
        return data.startswith(cls.magic)

    def open(self):
        return None

class ZIPFile(CompressedFile):
    magic = '\x50\x4b\x03\x04'
    file_type = 'zip'
    mime_type = 'compressed/zip'

    def open(self):
        return zipfile.ZipFile(self.f)

class BZ2File(CompressedFile):
    magic = '\x42\x5a\x68'
    file_type = 'bz2'
    mime_type = 'compressed/bz2'

    def open(self):
        return bz2.BZ2File(self.f)

class GZFile(CompressedFile):
    magic = '\x1f\x8b\x08'
    file_type = 'gz'
    mime_type = 'compressed/gz'

    def open(self):
        return gzip.GzipFile(fileobj=self.f)

# factory function to create a suitable instance for accessing files
def get_compressed_file(filename):
    with open(filename, 'rb') as f:
        start_of_file = f.read(1024)
        f.seek(0)
        for cls in (ZIPFile, BZ2File, GZFile):
            if cls.is_magic(start_of_file):
                return cls(f)
        return None

filename = 'test.zip'
cf = get_compressed_file(filename)
if cf is not None:
    print filename, 'is a', cf.mime_type, 'file'
    print cf.accessor
You can now access the compressed data through cf.accessor. All the modules provide similar methods like read(), write(), etc. to do this.
This is a complex question that depends on a number of factors: the most important being how portable your solution needs to be.
The basics behind finding the file type given a file is to find an identifying header in the file, usually something called a "magic sequence" or signature header, which identifies that a file is of a certain type. Its name or extension is usually not used if it can be avoided. For some files, Python has this built in. For example, to deal with .tar files, you can use the tarfile module, which has a convenient is_tarfile method. There is a similar module named zipfile. These modules will also let you extract files in pure Python.
For example:
import tarfile
import zipfile

f = open('myfile', 'rb')
if zipfile.is_zipfile(f):
    zip = zipfile.ZipFile(f)
    zip.extractall('/dest/dir')
elif tarfile.is_tarfile(f):
    ...
If your solution is Linux or OSX only, there is also the file command which will do a lot of the work for you. You can also use the built-in tools to uncompress the files. If you are just doing a simple script, this method is simpler and will give you better performance.
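For that Linux/OSX route, a sketch of shelling out to file (flag spellings vary a little between implementations; -b/--brief and --mime-type are the common ones):
import subprocess

def mime_type(path):
    # e.g. application/zip for a zip archive (bytes under Python 3)
    return subprocess.check_output(
        ['file', '--brief', '--mime-type', path]).strip()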
The accepted solution looks great, but it doesn't work with Python 3; here are the modifications that made it work -- using binary I/O instead of strings:
magic_dict = {
    b"\x1f\x8b\x08": "gz",
    b"\x42\x5a\x68": "bz2",
    b"\x50\x4b\x03\x04": "zip"
}

''' SKIP '''

with open(filename, "rb") as f:
    ''' The rest is the same '''
"a" is completely false.
"b" can be easily interpreted badly, as ".zip" doesn't mean the file is actually a zip file. It could be a JPEG with zip extension (for confusing purposes, if you want).
You actually need to check if the data inside the file matches the data expected to have by it's extension.
Also have a look at magic byte.
If the exercise is to identify files just to label them, you have lots of answers. If you want to uncompress the archive, why don't you just try, and catch the exceptions/errors? For example:
>>> tarfile.is_tarfile('lala.txt')
False
>>> zipfile.is_zipfile('lala.txt')
False
>>> with bz2.BZ2File('startup.bat','r') as f:
... f.read()
...
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
IOError: invalid data stream
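Put together, that try-and-see approach could look roughly like this (a sketch; further formats such as rar would be chained on in the same way):
import bz2, tarfile, zipfile

def extract_any(path, dest):
    if tarfile.is_tarfile(path):
        tarfile.open(path).extractall(dest)
    elif zipfile.is_zipfile(path):
        zipfile.ZipFile(path).extractall(dest)
    else:
        try:
            return bz2.BZ2File(path, 'r').read()  # raises IOError if not bz2, as above
        except IOError:
            raise ValueError('%s is not an archive I recognize' % path)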
2019 update:
I was looking for a solution to detect whether a .csv file was gzipped or not. The answer @Lauritz gave was throwing errors for me; I imagine it's just because the way files are read has changed in the past 7 years.
This library worked perfectly for me!
https://pypi.org/project/filetype/
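Usage is along these lines, per the project's documentation (the sample path is hypothetical):
import filetype  # pip install filetype

kind = filetype.guess('dataset.csv.gz')
if kind is None:
    print('no recognized signature (a plain csv lands here)')
else:
    print(kind.extension, kind.mime)  # e.g. gz application/gzip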

how to compare values in an existing dictionary and update the dictionary back to a file?

I am making a utility of sorts with a dictionary. What I am trying to achieve is this:
for each XML file that I parse, the existing dictionary is loaded from a file (output.dict) and compared/updated for the current key, then stored back along with the existing values. I tried with has_key() and got an AttributeError; it does not work.
Since I am trying one file at a time, it creates multiple dictionaries and I am unable to compare. This is where I am stuck.
def createUpdateDictionary(servicename, xmlfile):
    dictionary = {}
    if path.isfile == 'output.dict':
        dictionary.update (eval(open('output.dict'),'r'))
    for event, element in etree.iterparse(xmlfile):
        dictionary.setdefault(servicename, []).append(element.tag)
    f = open('output.dict', 'a')
    write_dict = str(dictionary2)
    f.write(write_dict)
    f.close()
(here the servicename is nothing but a '.'-split of xmlfile, which forms the key, and the values are nothing but the elements' tag names)
def createUpdateDictionary(servicename, xmlfile):
    dictionary = {}
    if path.isfile == 'output.dict':
        dictionary.update (eval(open('output.dict'),'r'))
There is a typo, as the 'r' argument belongs to open(), not eval(). Furthermore, you cannot evaluate a file object as returned by open(), you have to read() the contents first.
f = open('output.dict', 'a')
write_dict = str(dictionary2)
f.write(write_dict)
f.close()
Here, you are appending the string representation to the file. The string representation is not guaranteed to represent the dictionary completely. It is meant to be readable by humans to allow inspection, not to persist the data.
Moreover, since you are using 'a' to append the data, you are storing multiple copies of the updated dictionary in the file. Your file might look like:
{}{"foo": []}{"foo": [], "bar":[]}
This is clearly not what you want; you won't even be able to eval() it later (syntax error!).
Since eval() will execute arbitrary Python code, it is considered evil and you really should not use it for object serialization. Either use pickle, which is the standard way of serialization in Python, or use json, which is a human-readable standard format supported by other languages as well.
import json

def createUpdateDictionary(servicename, xmlfile):
    with open('output.dict', 'r') as fp:
        dictionary = json.load(fp)

    # ... process XML, update dictionary ...

    with open('output.dict', 'w') as fp:
        json.dump(dictionary, fp)

Python MD5 Hash comparison in Python 3.2

I am trying to validate two files downloaded from a server. The first contains data and the second file contains the MD5 hash checksum.
I created a function that returns a hexdigest from the data file like so:
import hashlib

def md5(fileName):
    """Compute md5 hash of the specified file"""
    try:
        fileHandle = open(fileName, "rb")
    except IOError:
        print("Unable to open the file in readmode:", fileName)
        return
    m5Hash = hashlib.md5()
    while True:
        data = fileHandle.read(8192)
        if not data:
            break
        m5Hash.update(data)
    fileHandle.close()
    return m5Hash.hexdigest()
I compare the files using the following:
file = "/Volumes/Mac/dataFile.tbz"
fileHash = md5(file)
hashFile = "/Volumes/Mac/hashFile.tbz.md5"
fileHandle = open(hashFile, "rb")
fileHandleData = fileHandle.read()
if fileHash == fileHandleData:
print ("Good")
else:
print ("Bad")
The file comparison fails so I printed out both fileHash and fileHandleData and I get the following:
b'MD5 (hashFile.tbz) = b60d684ab4a2570253961c2c2ad7b14c\n'
b60d684ab4a2570253961c2c2ad7b14c
From the output above, the hash values are identical. So why does the hash comparison fail? I am new to Python and am using Python 3.2. Any suggestions?
Thanks.
The comparison fails for the same reason this is false:
a = "data"
b = b"blah (blah) - data"
print(a == b)
The format of that .md5 file is strange, but if it is always in that format, a simple way to test would be:
if fileHandleData.rstrip().endswith(fileHash.encode()):
Because you have fileHash as a (Unicode) string, you have to encode it to bytes to compare. You may want to specify an encoding rather than use the current default string encoding.
If that exact format is always expected, it would be more robust to use a regex to extract the hash value and possibly check the filename.
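A sketch of that regex variant, assuming the MD5 (name) = hexdigest format shown in the question:
import re

match = re.search(b"=\\s*([0-9a-f]{32})\\s*$", fileHandleData)
if match and match.group(1) == fileHash.encode():
    print("Good")
else:
    print("Bad")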
Or, more flexibly, you could test substring presence:
if fileHash.encode() in fileHandleData:
You are comparing a hash value to the contents of the fileHandle. You need to get rid of the MD5 (hashFile.tbz) = part as well as the trailing newline, so try:
if fileHash.encode() == fileHandleData.rsplit(b' ', 1)[-1].rstrip():
    print("Good")
else:
    print("Bad")
Keep in mind that in Python 3, fileHandleData is a bytes object while fileHash is a (Unicode) string, so, as Fred Nurk correctly added, you need to encode/decode one of them and use bytes separators (b' ') when splitting.
The hash values are identical, but the strings are not. You need to get the hex value of the digest, and you need to parse the hash out of the file. Once you have done those you can compare them for equality.
Try "fileHash.strip("\n")...then compare the two. That should fix the problem.
