I am trying to validate two files downloaded from a server. The first contains data and the second file contains the MD5 hash checksum.
I created a function that returns a hexdigest from the data file like so:
import hashlib

def md5(fileName):
    """Compute md5 hash of the specified file"""
    try:
        fileHandle = open(fileName, "rb")
    except IOError:
        print("Unable to open the file in read mode:", fileName)
        return
    m5Hash = hashlib.md5()
    while True:
        data = fileHandle.read(8192)
        if not data:
            break
        m5Hash.update(data)
    fileHandle.close()
    return m5Hash.hexdigest()
I compare the files using the following:
file = "/Volumes/Mac/dataFile.tbz"
fileHash = md5(file)
hashFile = "/Volumes/Mac/hashFile.tbz.md5"
fileHandle = open(hashFile, "rb")
fileHandleData = fileHandle.read()
if fileHash == fileHandleData:
    print("Good")
else:
    print("Bad")
The file comparison fails so I printed out both fileHash and fileHandleData and I get the following:
[0] b'MD5 (hashFile.tbz) = b60d684ab4a2570253961c2c2ad7b14c\n'
[0] b60d684ab4a2570253961c2c2ad7b14c
From the output above the hash values are identical. Why does the hash comparison fail? I am new to python and am using python 3.2. Any suggestions?
Thanks.
The comparison fails for the same reason this is false:
a = "data"
b = b"blah (blah) - data"
print(a == b)
The format of that .md5 file is strange, but if it is always in that format, a simple way to test would be:
if fileHandleData.rstrip().endswith(fileHash.encode()):
Because you have fileHash as a (Unicode) string, you have to encode it to bytes to compare. You may want to specify an encoding rather than use the current default string encoding.
If that exact format is always expected, it would be more robust to use a regex to extract the hash value and possibly check the filename.
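For example, a minimal sketch of that (hedged: it assumes the .md5 file always has the exact MD5 (name) = hash layout shown in your output):

import re

# Assumed layout: MD5 (dataFile.tbz) = b60d684ab4a2570253961c2c2ad7b14c
match = re.match(br'MD5 \((?P<name>.+)\) = (?P<hash>[0-9a-f]{32})\s*$', fileHandleData)
if match and match.group('hash') == fileHash.encode('ascii'):
    print("Good")
else:
    print("Bad")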
Or, more flexibly, you could test substring presence:
if fileHash.encode() in fileHandleData:
You are comparing a hash value to the contents of the fileHandle. You need to get rid of the MD5 (hashFile.tbz) = part as well as the trailing newline, so try:
if fileHash == fileHandleData.rsplit(' ', 1)[-1].rstrip():
    print("Good")
else:
    print("Bad")
Keep in mind that in Python 3 you cannot mix bytes and strings: fileHandleData is bytes, so the split and strip arguments must be bytes too (b' '), and then, as Fred Nurk correctly added, you need to encode fileHash or decode fileHandleData (a (Unicode) string and a byte buffer, respectively) before comparing.
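Put together, a corrected sketch using the variables from the question:

# fileHandleData is bytes: split and strip with bytes arguments,
# then decode the extracted hash before comparing it to the str fileHash
extracted = fileHandleData.rsplit(b' ', 1)[-1].rstrip()
if fileHash == extracted.decode('ascii'):
    print("Good")
else:
    print("Bad")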
The hash values are identical, but the strings are not. You need to get the hex value of the digest, and you need to parse the hash out of the file. Once you have done those you can compare them for equality.
Try "fileHash.strip("\n")...then compare the two. That should fix the problem.
Related
I have some text file with these lines:
Zip=false
Run=false
Copy=true
FileName=c:\test\test.doc
Now I need to load this text file, change some values, and write it back to the same text file.
So I load it to a dictionary, change values on the dictionary and write back.
The problem is that the backslashes in the FileName path are being duplicated, and in the new file I get FileName=c:\\test\\test.doc.
Here is the dictionary creation:
def create_dictionary(filename):
    try:
        file = open(filename, 'r')
    except:
        print("Error " + filename + " not found or path is incorrect")
    else:
        contents = file.read().splitlines()
        properties_dict = {}
        for line in contents:
            if not line.startswith("#") and line.strip():
                # print(line)
                x, y = line.split("=")
                properties_dict[x] = [y]
        return properties_dict
Here is the code that writes back to the file:

# Update the properties file with updated dictionary
fo = open(properties_file, "w")
for k, v in dic.items():
    print(str(k), str(v))
    fo.write(str(k) + '=' + str(v).strip("[]'") + '\n')
fo.close()
This seems to be working:
import os

def create_dictionary(file_name):
    try:
        properties_dict = {}
        with open(file_name, "r") as file:
            contents = file.read().splitlines()
            for line in contents:
                if not line.startswith("#") and line.strip():
                    property_name, property_value = line.split("=")
                    properties_dict[property_name] = property_value
        return properties_dict
    except FileNotFoundError:
        print(f"Error {file_name} not found or path is incorrect")

def dict_to_file(properties_dict, file_name):
    try:
        file_dirname = os.path.dirname(file_name)
        if not os.path.exists(file_dirname):
            os.makedirs(file_dirname)
    except FileNotFoundError:  # in case the file is in the same directory and "./" was not added to the path
        pass
    with open(file_name, "w") as file:
        for property_name, property_value in properties_dict.items():
            file.write(f"{property_name}={property_value}\n")

properties_dict = create_dictionary("./asd.txt")
dict_to_file(properties_dict, "./bsd.txt")
Since there was a request for more explanation, I am editing this post.
Actually, the critical part is not file.write(f"...") as @pktl2k pointed out. The critical part is changing properties_dict[x] = [y] to properties_dict[x] = y.
In Python strings, special characters are escaped with a backslash ( \ ). The FileName parameter in your file contains that very character (FileName=c:\test\test.doc), so when you read this file, Python stores the value as the string:
"c:\\test\\test.doc"
Which is totally normal. And when you want to write this string back to a file, you will get your desired output (no double backslashes). However, in your code you do not have this value as a string: you have a list holding this value as a string. When you call the str built-in function on a list (which, by the way, is a built-in class), the list class's __repr__ function is called (actually __str__ is called, but for lists __str__ delegates to __repr__ as far as I know; let's not go too deep into the details of these functions - see this link if you want to learn more about it). In this process, the whole list is converted to a string with all of its elements rendered as they would appear in source code, backslash escapes included. Then you strip some characters from this representation using strip("[]'"), and this is the actual cause of your problem.
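A short demonstration of the difference, using the value from your file:

value = "c:\\test\\test.doc"       # the plain string, as Python stores it
print(value)                       # c:\test\test.doc
print(str([value]))                # ['c:\\test\\test.doc'] - repr escapes each backslash
print(str([value]).strip("[]'"))   # c:\\test\\test.doc - the doubled backslashes survive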
Now, why did I write everything from the beginning rather than only the important part, as @pktl2k kindly asked? The reason is that, in the create_dictionary function, the author forgot to close the file using file.close(). This is a common problem, and it is exactly why the with open(...) syntax exists. I wanted to emphasize that it is better to use with open(...) whenever you manipulate the contents of a file. I could have written this as a small note as well, but I think it is better this way (so it is a personal preference).
I've written a simple Python script to copy a file from one place to another. (It's for class, so that's why I'm not using something simpler like shutil.) I have a check at the end that compares the hash of the two files, and it consistently tells me they're different, even though the copying is successful - both are text files that say "hello world".
Here is my code:
import os
def validity_checker(address1, dest_name):
    try:
        src = open(address1, 'rb')
        dest = open(dest_name, 'wb+')
    except IOError:
        return False
    return True

def copaste(address1, address2):
    # concatenate address2 into filename
    file_ending = address1.split('\\').pop()
    dest_name = address2 + '\\' + file_ending
    # copy file after calling checker
    if validity_checker(address1, dest_name):
        src = open(address1, 'rb')
        dest = open(dest_name, 'wb+')
        contents = src.read()
        dest.write(contents)
        src.close()
        dest.close()
    else:
        print("File name bad. No action taken")
    print src
    print dest
    print(hash(src))  # hash the file not the string
    print(hash(dest))
    return
And the output:
<closed file 'C:\\Users\\user\\Downloads\\hello.txt', mode 'rb' at 0x04B7D1D8>
<closed file 'C:\\Users\\user\\Downloads\\dest\\hello.txt', mode 'wb+' at 0x04C2B860>
-2042961099
4991878
Plus the file is copied.
I'm fairly sure the hash is checking the file itself, not the string. Is it maybe something to do with metadata? Any help would be greatly appreciated.
You are using the Python-specific hash() function, which calculates a hash for use on dictionary keys and set contents.
For file objects, the hash() is based on the object identity; you can't base it on anything else because two distinct file objects are never equal, the fileobject.__eq__ method returns True only if both objects are one and the same in memory (so is would be true too). The file contents, the name of the file, the mode or any of the other object attributes play no role in the hash value produced.
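A quick demonstration of that identity-based hashing (assuming a hello.txt like yours exists in the working directory):

# Two distinct handles to the very same file still hash differently,
# because hash() for file objects derives from object identity.
f1 = open('hello.txt', 'rb')
f2 = open('hello.txt', 'rb')
print(f1 is f2)              # False - two separate objects
print(hash(f1) == hash(f2))  # False - identity-based hashes differ
f1.close()
f2.close()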
From the function documentation:
Return the hash value of the object (if it has one). Hash values are integers. They are used to quickly compare dictionary keys during a dictionary lookup.
If you need to validate that the file copy contains the same data, you need to hash the *file contents* using a cryptographic hash function, which is something completely different. Use the hashlib module; for your use case the simple and fast MD5 algorithm will do:
import hashlib

for closed_file in (src, dest):
    with open(closed_file.name, 'rb') as reopened:  # opened in binary mode!
        print(reopened.name)
        print(hashlib.md5(reopened.read()).hexdigest())
If the binary contents of the two files are exactly the same, then their cryptographic hashes will also be the same.
You are getting the Python hash of the file object, not the contents of the file. As a minimum you should

print(hash(open(address1, 'rb').read()))
print(hash(open(dest_name, 'rb').read()))
But since this still risks collisions, you should do as Martijn suggests and use a hashlib function.
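For instance, a minimal sketch of such a content check, reusing the address1 and dest_name variables from the question:

import hashlib

def md5_of(path):
    # hash the file *contents*, not the file object
    with open(path, 'rb') as f:
        return hashlib.md5(f.read()).hexdigest()

print(md5_of(address1) == md5_of(dest_name))  # True when the copies match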
I'm trying to read a file in python as binary.
I'm interested in four bytes at a time, however I seem to be stuck in the infamous while loop:
with open(filename, "rb") as file:
while file:
file.read(4)
print "EOF"
I've been trying this for the past hour and I never reach the end of the file, even in tiny text files. I did a test = file.read(4) followed by print test, only to see that it prints "".
How can I make sure it stops? My first idea was to make an if statement saying if file.read(4) (in a variable) == ""{4} or something, but that sequence might actually appear in a file, right? So it could potentially stop in the middle of it.
Is the only other option to calculate the size of the file beforehand?
At the end of the file, file.read(...) returns an empty bytes object (or an empty string, depending on your Python version).
Check the return value of the file.read; break if it's empty:
with open(filename, "rb") as file:
while True: # --> replaced `file` with `True` to be clear
data = file.read(4)
if not data: # empty => EOF
# OR if len(data) < 4: if you don't want last incomplete chunk
break
# process data
file is an _io.BufferedReader object, not None, so it is never treated as False.
You should instead check whether the return value of file.read(4) is an empty bytes object (which is treated as False).
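As a side note, a compact alternative sketch using iter() with a sentinel does the same EOF handling implicitly (filename as in your code):

with open(filename, "rb") as f:
    # iter(callable, sentinel) keeps calling f.read(4) until it returns b"" (EOF)
    for chunk in iter(lambda: f.read(4), b""):
        pass  # process chunk here
print("EOF")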
This is my first question here, I'm new to python and trying to figure some things out to set up an automatic 3D model processing chain that relies on data being stored in JSON files moving from one server to another.
The problem is that I need to store absolute paths to files that are being processed, but these absolute paths should be modified in the original JSON files upon the first time that they are processed.
Basically the JSON file comes in like this:
{
    "normaldir": "D:\\Outgoing\\1621_1\\",
    "projectdir": "D:\\Outgoing\\1622_2\\"
}
And I would like to rename the file paths to
{
    "normaldir": "X:\\Incoming\\1621_1\\",
    "projectdir": "X:\\Incoming\\1622_2\\"
}
What I've been trying to do is replace the first part of the path using this code, but it isn't working:
def processscan(scanfile):
    configfile = MonitorDirectory + scanfile
    with open(configfile, 'r+') as file:
        content = file.read()
        file.seek(0)
        content.replace("D:\\Outgoing\\", "X:\\Incoming\\")
        file.write(content)
However this was not working at all, so I tried parsing the JSON properly and replacing the key, using code from here:
def processscan(scanfile):
    configfile = MonitorDirectory + scanfile
    with open(configfile, 'r+') as settingsData:
        settings = json.load(settingsData)
        settings['normaldir'] = 'X:\\Incoming\\1621_1\\'
        settings['projectdir'] = 'X:\\Incoming\\1622_2\\'
        settingsData.seek(0)  # rewind to beginning of file
        settingsData.write(json.dumps(settings, indent=2, sort_keys=True))  # write the updated version
        settingsData.truncate()  # truncate the remainder of the data in the file
This works perfectly, however I'm replacing the whole path, so it won't really work for every JSON file that I need to process. What I would really like to do is take a JSON key corresponding to a file path, keep the last 8 characters, and replace the rest of the path with a new string, but I can't figure out how to do this using json in Python; as far as I can tell I can't edit part of a key's value.
Does anyone have a workaround for this?
Thanks!
Your replace logic failed because you need to reassign content to the new string; str.replace is not an in-place operation, it creates a new string:
content = content.replace("D:\\Outgoing\\", "X:\\Incoming\\")
With the json approach, just do a replace too, using the current value:
settings['normaldir'] = settings['normaldir'].replace("D:\\Outgoing\\", "X:\\Incoming\\")
You would also want to truncate() before you write, or just reopen the file with w and dump/write the new value. If you really wanted to just keep the last 8 chars and prepend a string:
settings['normaldir'] = "X:\\Incoming\\" + settings['normaldir'][-8:]
Python comes with a json library.
With this library, you can read and write JSON files (or JSON strings).
Parsed data is converted to Python objects and vice versa.
To use the json library, simply import it:
import json
Say your data is stored in the input_data.json file.
input_data_path = "input_data.json"
You read the file like this:
import io
with io.open(input_data_path, mode="rb") as fd:
obj = json.load(fd)
or, alternatively:
with io.open(input_data_path, mode="rb") as fd:
content = fd.read()
obj = json.loads(content)
Your data is automatically converted into Python objects, here you get a dict:
print(repr(obj))
# {u'projectdir': u'D:\\Outgoing\\1622_2\\',
# u'normaldir': u'D:\\Outgoing\\1621_1\\'}
Note: I'm using Python 2.7, so you get unicode strings prefixed with "u", like u'projectdir'.
It's now easy to change the values for normaldir and projectdir:
obj["normaldir"] = "X:\\Incoming\\1621_1\\"
obj["projectdir"] = "X:\\Incoming\\1622_2\\"
Since obj is a dict, you can also use the update method like this:
obj.update({'normaldir': "X:\\Incoming\\1621_1\\",
            'projectdir': "X:\\Incoming\\1622_2\\"})
That way, you use a syntax similar to JSON.
Finally, you can write your Python object back to JSON file:
output_data_path = "output_data.json"
with io.open(output_data_path, mode="wb") as fd:
json.dump(obj, fd)
or, alternatively with indentation:
content = json.dumps(obj, indent=True)
with io.open(output_data_path, mode="wb") as fd:
    fd.write(content)
Remarks: reading/writing JSON objects is faster with a buffer (the content variable).
.replace returns a new string and doesn't change the original. But you should not treat JSON files as normal text files, so you can combine parsing the JSON with the replace:
import json

def processscan(scanfile):
    configfile = MonitorDirectory + scanfile
    with open(configfile, 'r') as settingsData:
        settings = json.load(settingsData)
    settings = {k: v.replace("D:\\Outgoing\\", "X:\\Incoming\\")
                for k, v in settings.items()}
    with open(configfile, 'w') as settingsData:
        json.dump(settings, settingsData)
I have the following Python script, which double-hashes a hex value:
import hashlib

linestring = open('block_header.txt', 'r').read()

header_hex = linestring.encode("hex")  # Problem!!!
print header_hex

header_bin = header_hex.decode('hex')
hash = hashlib.sha256(hashlib.sha256(header_bin).digest()).digest()
hash.encode('hex_codec')
print hash[::-1].encode('hex_codec')
My text file "block_header.txt" (hex) looks like this:
0100000081cd02ab7e569e8bcd9317e2fe99f2de44d49ab2b8851ba4a308000000000000e320b6c2fffc8d750423db8b1eb942ae710e951ed797f7affc8892b0f1fc122bc7f5d74df2b9441a42a14695
Unfortunately, the result from printing the variable header_hex looks like this (not like the txt file):
303130303030303038316364303261623765353639653862636439333137653266653939663264653434643439616232623838353162613461333038303030303030303030303030653332306236633266666663386437353034323364623862316562393432616537313065393531656437393766376166666338383932623066316663313232626337663564373464663262393434316134326131343639350a
I think the problem is in this line:
header_hex = linestring.encode("hex")
If I remove the .encode("hex") part, then I get the error
unhandled TypeError "Odd-length string"
Can anyone give me a hint what might be wrong?
Thank you a lot :)
You're doing too much encoding/decoding.
Like others mentioned, if your input data is hex, then it's a good idea to strip leading / trailing whitespace with strip().
Then, you can use decode('hex') to turn the hex ASCII into binary. After performing whatever hashing you want, you'll have the binary digest.
If you want to be able to "see" that digest, you can turn it back into hex with encode('hex').
The following code works on your input file with any kind of whitespace added at the beginning or end.
import hashlib

def multi_sha256(data, iterations):
    for i in xrange(iterations):
        data = hashlib.sha256(data).digest()
    return data

with open('block_header.txt', 'r') as f:
    hdr = f.read().strip().decode('hex')

_hash = multi_sha256(hdr, 2)

# Print the hash (in hex)
print 'Hash (hex):', _hash.encode('hex')

# Save the hash to a hex file
open('block_header_hash.hex', 'w').write(_hash.encode('hex'))

# Save the hash to a binary file
open('block_header_hash.bin', 'wb').write(_hash)
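If you are on Python 3, str.encode('hex') and str.decode('hex') no longer exist; a minimal equivalent sketch uses bytes.fromhex and bytes.hex instead:

import hashlib

def multi_sha256(data, iterations):
    for _ in range(iterations):
        data = hashlib.sha256(data).digest()
    return data

with open('block_header.txt', 'r') as f:
    hdr = bytes.fromhex(f.read().strip())  # hex ASCII -> binary

digest = multi_sha256(hdr, 2)
print('Hash (hex):', digest.hex())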