Python hashlib module producing strange results


I'm using the hashlib module to test a hypothesis about hash algorithms and I'm getting strange results. I check my results with the Windows fciv program. The workflow I'm using is this:
Gather the file and algorithm from the user.
Print out the original filename and hashed file using that algorithm.
Test the results with fciv in Windows.
Add a few bytes or a space character to the file.
Print out the new hashed file using the chosen algorithm.
Test the results with the updated file in fciv.
The problem is this:
When I use a .txt file, both my program and fciv report a different hash after the change, as I expected. This works perfectly.
Here is the output:
Original Filename: example_docs\testDocument.txt
Original md5 Hash: 62bef8046d4bcbdc46ac81f5e4202fe7
Updated md5 Hash: 78a96b792cf2ea160db5e4823f4bf0c5
However, when I use an .mp4 video file, fciv shows a different hash, but my program does not.
Here is the output:
Original Filename: example_docs\testVideo.mp4
Original md5 Hash: 9a7dcb986e2e756dda60e851a0b03916
Updated md5 Hash: 9a7dcb986e2e756dda60e851a0b03916
It doesn't matter how many times I run my program, the hash remains the same in the output from my program, but fciv displays different results.
Here is my code snippet:
    def getHash(filename, algorithm):
        h = hashlib.new(algorithm)
        h.update(filename)
        return h.hexdigest()

    print "Original Filename: {file}".format(file=args.file)
    with open(args.file, "a+") as inFile:
        h = getHash(inFile.read(), args.algorithm)
    print "Original {hashname} Hash: {hashed_file}".format(hashname=args.algorithm, hashed_file=h)
    with open(args.file, "a+") as inFile:
        inFile.write(b'\x07\x08\x07') # Also worked with inFile.write(" ")
    with open(args.file, "a+") as inFile:
        h = getHash(inFile.read(), args.algorithm)
    print "Updated {hashname} Hash: {hashed_file}".format(hashname=args.algorithm, hashed_file=h)
where args.algorithm is md5 and args.file is the user-provided filename.

Always open your files in binary mode, for example with ab+. Without the b, Python on Windows opens the file in text mode, which is not safe for binary data such as an .mp4 file.
That said, I wonder why you are using ab+ rather than rb+ if you intend to read the entire file: with ab+ the file pointer starts out at the end of the file, whereas with rb+ it starts out at the beginning.
See https://stackoverflow.com/a/23566951 for a nice list of the file modes.
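To illustrate the binary-mode advice, here is a minimal sketch rather than the asker's actual program. The helper names file_hash and append_bytes, the chunk size, and the throwaway demo file are all assumptions made for illustration. It reads with rb and appends with ab, so the digest should change after the append regardless of the file type.

```python
import hashlib
import os
import tempfile


def file_hash(path, algorithm="md5", chunk_size=65536):
    """Hash a file's raw bytes (hypothetical helper, not from the question)."""
    h = hashlib.new(algorithm)
    # "rb" returns the bytes exactly as stored, with no newline translation.
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def append_bytes(path, data):
    """Append raw bytes to the end of a file (hypothetical helper)."""
    with open(path, "ab") as f:
        f.write(data)


if __name__ == "__main__":
    # Demo on a throwaway file that contains bytes text mode may mishandle.
    fd, demo_path = tempfile.mkstemp()
    os.close(fd)
    try:
        append_bytes(demo_path, b"\x00\x1a\r\nbinary payload")
        before = file_hash(demo_path)
        append_bytes(demo_path, b"\x07\x08\x07")
        after = file_hash(demo_path)
        print("Original md5 Hash: " + before)
        print("Updated md5 Hash: " + after)
    finally:
        os.remove(demo_path)
```

Reading in fixed-size chunks keeps memory use flat for large video files. Reading the whole file with a single read() call would give the same digest.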

Related

'read binary' on files vs. 'join' on strings in Python 2.7

I'm re-writing an older script which generates a lot of temporary files for saving and exchanging information/data between functions. I want to keep them as variables, to avoid the overhead of generating files.
My problem: I encountered a function in which two files are merged on a binary level using this code:
    with open(first_file, "ab") as file1, open(second_file, "rb") as file2:
        file1.write(file2.read())
I would like to do the same, using strings and the '.join' function like this:
first_file = ''.join([first_file, second_file])
My question: is the .join function equivalent to 'read binary'? Or does the 'read binary' mode even apply to .join?
The data I'm working on is binary, so the simple 'read' command would potentially alter the contents.
So far I found this info in the official Python documentation:
Python on Windows makes a distinction between text and binary files;
the end-of-line characters in text files are automatically altered
slightly when data is read or written. This behind-the-scenes
modification to file data is fine for ASCII text files, but it’ll
corrupt binary data like that in JPEG or EXE files.

Making a small test:
a.txt contains 'Hello', 'b.txt' contains 'World'.
    with open('a.txt', "ab") as file1, open('b.txt', "rb") as file2:
        file1.write(file2.read())
Now a.txt contains 'HelloWorld'.
Checking with the other snippet, after changing back a.txt to "Hello":
    with open('a.txt', "rb") as file1, open('b.txt', "rb") as file2:
        first_file = file1.read()
        second_file = file2.read()
    first_file = b''.join([first_file, second_file])
    with open('a.txt', 'wb') as fp:
        fp.write(first_file)
Now the content of a.txt is again 'HelloWorld', so the two methods are equivalent (with respect to the result at least).
Obviously, though, the first method is more compact.

Read-binary is somewhat similar to using r"somestring" to indicate raw strings - the underlying file is binary, you're just telling Python to skip trying to decode the binary data into ASCII or UTF-8 or what-have-you characters.
So, the mode doesn't really apply here.
Since join operates on strings, you'd need to open file A, read it as a string, then do the same for B, whereas the original code just needs to read B and seek to the end of file A to start writing. So, you're not really getting much mileage out of doing a str.join, and you're actually using more memory.
If you want to optimize, make a loop that reads B line by line and writes each line to A as it goes - that way you hold just one line's worth of data in memory at a time rather than loading the whole B file at once.
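A hedged sketch of that incremental copy, with one swapped-in detail: it reads fixed-size chunks instead of lines, because binary data may contain no newline bytes at all, in which case a single "line" could be the entire file. The function name append_file and the chunk size are assumptions for illustration; shutil.copyfileobj is the standard-library call doing the loop.

```python
import shutil


def append_file(dst_path, src_path, chunk_size=64 * 1024):
    """Append the bytes of src_path to dst_path, one chunk at a time.

    Hypothetical helper: both files are opened in binary mode so the
    data is copied byte for byte.
    """
    with open(dst_path, "ab") as dst, open(src_path, "rb") as src:
        # copyfileobj reads at most chunk_size bytes per iteration.
        shutil.copyfileobj(src, dst, chunk_size)
```

If the data already lives in variables as byte strings, b''.join([first, second]) or first + second gives the same bytes as the file-based merge. File modes play no part once the data is in memory.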

Python is reading past the end of the file. Is this a security risk? [duplicate]

Started Python a week ago and I have some questions to ask about reading and writing to the same files. I've gone through some tutorials online but I am still confused about it. I can understand simple read and write files.
openFile = open("filepath", "r")
readFile = openFile.read()
print readFile
openFile = open("filepath", "a")
appendFile = openFile.write("\nTest 123")
openFile.close()
But, if I try the following I get a bunch of unknown text in the text file I am writing to. Can anyone explain why I am getting such errors and why I cannot use the same openFile object the way shown below.
# I get an error when I use the codes below:
openFile = open("filepath", "r+")
writeFile = openFile.write("Test abc")
readFile = openFile.read()
print readFile
openFile.close()
I will try to clarify my problem. In the example above, openFile is the object used to open the file. I have no problems if I want to write to it the first time. But if I want to use the same openFile to read the file or append something to it, either nothing happens or an error is given. I have to declare the same or a different open file object before I can perform another read/write action on the same file.
#I have no problems if I do this:
openFile = open("filepath", "r+")
writeFile = openFile.write("Test abc")
openFile2 = open("filepath", "r+")
readFile = openFile2.read()
print readFile
openFile.close()
I will be grateful if anyone can tell me what I did wrong here, or whether it is just a Python thing. I am using Python 2.7. Thanks!

Updated Response:
This seems like a bug specific to Windows - http://bugs.python.org/issue1521491.
Quoting from the workaround explained at http://mail.python.org/pipermail/python-bugs-list/2005-August/029886.html
the effect of mixing reads with writes on a file open for update is
entirely undefined unless a file-positioning operation occurs between
them (for example, a seek()). I can't guess what
you expect to happen, but seems most likely that what you
intend could be obtained reliably by inserting
fp.seek(fp.tell())
between read() and your write().
My original response demonstrates how reading/writing on the same file opened for appending works. It is apparently not true if you are using Windows.
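As a minimal sketch of the quoted workaround, the function below puts a file-positioning call between the write and the read on the same file object. The function name write_then_read_rest is an assumption for illustration and is not taken from the bug report.

```python
def write_then_read_rest(path, text):
    """Overwrite the start of an existing file, then read what follows.

    Hypothetical helper illustrating the quoted seek(tell()) workaround.
    """
    with open(path, "r+") as fp:
        fp.write(text)
        # Positioning call between the write and the read, as suggested
        # in the quoted message; it leaves the pointer where it already is.
        fp.seek(fp.tell())
        return fp.read()
```

Replacing fp.seek(fp.tell()) with fp.seek(0) would instead return the whole file, including the text that was just written.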
Original Response:
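In 'r+' mode, the write method writes the string to the file at the current position of the file pointer. In your case, it will write the string "Test abc" at the start of the file, overwriting whatever was there. In the example below the file is read first, so the pointer is already at the end when write is called: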
>>> f=open("a","r+")
>>> f.read()
'Test abc\nfasdfafasdfa\nsdfgsd\n'
>>> f.write("foooooooooooooo")
>>> f.close()
>>> f=open("a","r+")
>>> f.read()
'Test abc\nfasdfafasdfa\nsdfgsd\nfoooooooooooooo'
The string "foooooooooooooo" got appended at the end of the file since the pointer was already at the end of the file.
Are you on a system that differentiates between binary and text files? You might want to use 'rb+' as a mode in that case.
Append 'b' to the mode to open the file in binary mode, on systems
that differentiate between binary and text files; on systems that
don’t have this distinction, adding the 'b' has no effect.
http://docs.python.org/2/library/functions.html#open

Every open file has an implicit pointer which indicates where data will be read and written. Normally this defaults to the start of the file, but if you use a mode of a (append) then it defaults to the end of the file. It's also worth noting that the w mode will truncate your file (i.e. delete all the contents) even if you add + to the mode.
Whenever you read or write N characters, the read/write pointer will move forward that amount within the file. I find it helps to think of this like an old cassette tape, if you remember those. So, if you executed the following code:
fd = open("testfile.txt", "w+")
fd.write("This is a test file.\n")
fd.close()
fd = open("testfile.txt", "r+")
print fd.read(4)
fd.write(" IS")
fd.close()
... It should end up printing This and then leaving the file content as This IS a test file.. This is because the initial read(4) returns the first 4 characters of the file, because the pointer is at the start of the file. It leaves the pointer at the space character just after This, so the following write(" IS") overwrites the next three characters with a space (the same as is already there) followed by IS, replacing the existing is.
You can use the seek() method of the file to jump to a specific point. After the example above, if you executed the following:
fd = open("testfile.txt", "r+")
fd.seek(10)
fd.write("TEST")
fd.close()
... Then you'll find that the file now contains This IS a TEST file..
All this applies on Unix systems, and you can test those examples to make sure. However, I've had problems mixing read() and write() on Windows systems. For example, when I execute that first example on my Windows machine then it correctly prints This, but when I check the file afterwards the write() has been completely ignored. However, the second example (using seek()) seems to work fine on Windows.
In summary, if you want to read/write from the middle of a file in Windows I'd suggest always using an explicit seek() instead of relying on the position of the read/write pointer. If you're doing only reads or only writes then it's pretty safe.
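A short sketch of that advice, under a few assumptions: the helper name read_head_then_overwrite is made up for illustration, and the file is opened in binary mode (rb+) so that the offset passed to seek() is a plain byte count.

```python
def read_head_then_overwrite(path, count, replacement):
    """Read the first `count` bytes, then overwrite what follows them.

    Hypothetical helper: the explicit seek() between the read and the
    write avoids relying on the implicit pointer position.
    """
    with open(path, "rb+") as fd:
        head = fd.read(count)
        fd.seek(count)  # explicit positioning before switching to writing
        fd.write(replacement)
    return head
```

Called as read_head_then_overwrite("testfile.txt", 4, b" IS") on the file from the first example, it is meant to reproduce the read(4) and write(" IS") steps, with the seek made explicit.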
One final point - if you're specifying paths on Windows as literal strings, remember to escape your backslashes:
fd = open("C:\\Users\\johndoe\\Desktop\\testfile.txt", "r+")
Or you can use raw strings by putting an r at the start:
fd = open(r"C:\Users\johndoe\Desktop\testfile.txt", "r+")
Or the most portable option is to use os.path.join():
fd = open(os.path.join("C:\\", "Users", "johndoe", "Desktop", "testfile.txt"), "r+")
You can find more information about file IO in the official Python docs.
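The three spellings can be compared directly. This is only a sketch: it uses ntpath, the standard-library module that implements Windows path rules (it is what os.path refers to on Windows), so the comparison behaves the same on any platform. The last lines show what goes wrong when a backslash is left unescaped; the path "C:\file.txt" is an assumed example.

```python
import ntpath  # Windows flavour of os.path, importable on every platform

escaped = "C:\\Users\\johndoe\\Desktop\\testfile.txt"
raw = r"C:\Users\johndoe\Desktop\testfile.txt"
joined = ntpath.join("C:\\", "Users", "johndoe", "Desktop", "testfile.txt")

# All three spellings describe the same string.
print(escaped == raw)
print(raw == joined)

# Without escaping, "\f" is read as a single form-feed character,
# so the string no longer names the intended file.
unescaped = "C:\file.txt"
print(len(unescaped))
print(len(r"C:\file.txt"))
```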

Reading and Writing happens where the current file pointer is and it advances with each read/write.
In your particular case, writing to openFile causes the file pointer to point to the end of the file. Trying to read from there results in EOF.
You need to reset the file pointer to the beginning of the file with seek(0) before reading from it.

You can read, modify and save to the same file in Python, but you actually have to replace the whole content of the file, and call the following before writing the updated content:
# set the pointer to the beginning of the file in order to rewrite the content
edit_file.seek(0)
I needed a function to go through all subdirectories of folder and edit content of the files based on some criteria, if it helps:
    import io
    import os

    # folder_path and condition are expected to be defined by the caller
    new_file_content = ""
    for directories, subdirectories, files in os.walk(folder_path):
        for file_name in files:
            file_path = os.path.join(directories, file_name)
            # open file for reading and writing
            with io.open(file_path, "r+", encoding="utf-8") as edit_file:
                for current_line in edit_file:
                    if condition in current_line:
                        # update current line
                        current_line = current_line.replace('john', 'jack')
                    new_file_content += current_line
                # set the pointer to the beginning of the file in order to rewrite the content
                edit_file.seek(0)
                # delete actual file content
                edit_file.truncate()
                # rewrite updated file content
                edit_file.write(new_file_content)
                # empties new content in order to set for next iteration
                new_file_content = ""
            # the with block closes the file, so no explicit close() is needed

Python hash from file not as expected

My problem is the following:
I want to create a little tool in Python that creates hash values for entered text or from files. I've created all necessary things, GUI, option to select between hash functions, everything is fine.
But when I was testing the program, I realized that the hashes generated from files aren't the same as the ones given by most download pages. I was confused and downloaded some other hashing tools; they all gave me the same hash as provided on several websites, but my tool always gives a different output.
The odd thing is that the hashes generated from "plain text" are identical in my tool and in all the others.
The app uses wxPython, but I've extracted my hash function for hash creation from files:
    import os, hashlib
    path = "C:\file.txt" # Given from some open file dialog, valid file
    text = ""
    if os.path.isfile(path):
        text_file = open(path, "r")
        text = text_file.read()
        text_file.close()
    print hashlib.new("md5", text).hexdigest() # Could be any hash function
Quite simple, but doesn't work as expected.
It seems to work if there's no newline (\n) in the file.
But how can I make it work with newlines? Practically every file has more than one line.

It is a problem of quoting the backslash character, see https://docs.python.org/2/reference/lexical_analysis.html#literals. Use two backslashes to specify the file name. I would also recommend reading the file in binary mode. As a precaution, print the length of variable text to make sure the file was read.
    import os, hashlib
    path = "C:\\file.txt" # Given from some open file dialog, valid file
    text = ""
    if os.path.isfile(path):
        text_file = open(path, "rb")
        text = text_file.read()
        text_file.close()
    print len(text)
    print hashlib.new("md5", text).hexdigest() # Could be any hash function
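To show why binary mode matters for the digest, here is a small sketch that needs no file at all. It assumes, as the Python documentation describes for Windows, that a text-mode read hands back "\n" where the file on disk stores "\r\n". The sample bytes and the helper name md5_hex are made up for illustration.

```python
import hashlib


def md5_hex(data):
    """Return the md5 hex digest of a byte string (hypothetical helper)."""
    return hashlib.md5(data).hexdigest()


# Bytes as they would sit on disk in a file with Windows line endings.
on_disk = b"first line\r\nsecond line\r\n"
# Assumption: this mimics what a text-mode read on Windows would return.
as_text_mode = on_disk.replace(b"\r\n", b"\n")

print(len(on_disk))
print(len(as_text_mode))
print(md5_hex(on_disk) == md5_hex(as_text_mode))
```

The differing lengths are what the len(text) check above would reveal, and other hashing tools compute their digest over the on-disk bytes.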

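Try splitting the md5 object creation from the text update, as below:
    import hashlib

    def md5_of_file(filepath):
        # wrapped in a function so that the return statement is valid
        md5 = hashlib.new('md5')
        # binary mode, so the bytes are hashed exactly as stored
        with open(filepath, 'rb') as f:
            for line in f:
                md5.update(line)
        return md5.hexdigest()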

Python2.7 file.obj.write(str) not writing to file

New Python 2.7 user here. I've searched for similar queries and I can't quite see what I'm doing wrong. I have a short script that reads through all the files in a directory and, reading each one in turn, writes them to a single master file.
My code is below; I can see two things going wrong at this time (although I get no error messages): 1 - it only appears to open the first file in the list it's supposed to be looping through (so my guess is I have an error in how I use glob?), and 2 - although the print(str) statements display to the console with no problems, the output file never gets written to.
I've double checked that the file exists, it is empty and I'm passing the correct path & filename in when I call the function.
Any help is much appreciated.
    #!/usr/bin/env python
    import glob
    import sys
    filestoberecognised=sys.argv[1]
    outputfile=sys.argv[2]
    filecontents=glob.glob(filestoberecognised)
    with open(outputfile,'w+') as f:
        for i, row in enumerate(filecontents):
            print(row) # this correctly prints to console
            f.write(row+'\n') # this should write the filename of the filestoberecognised to the outputfile
            with open(row,'r') as labfile:
                for j, line in enumerate(labfile): # this should write words in label file
                    f.write('%s'%(line))
                    print('%s'%(line))
                labfile.close() # ensures each file looped through is closed
            f.write('\n.\n')
        f.flush()
        f.close()
