I have a backup hard drive that I know has duplicate files scattered around and I decided it would be a fun project to write a little python script to find them and remove them. I wrote the following code just to traverse the drive and calculate the md5 sum of each file and compare it to what I am going to call my "first encounter" list. If the md5 sum does not yet exist, then add it to the list. If the sum does already exist, delete the current file.
import sys
import os
import hashlib

def checkFile(fileHashMap, file):
    fReader = open(file)
    fileData = fReader.read();
    fReader.close()
    fileHash = hashlib.md5(fileData).hexdigest()
    del fileData

    if fileHash in fileHashMap:
        ### Duplicate file.
        fileHashMap[fileHash].append(file)
        return True
    else:
        fileHashMap[fileHash] = [file]
        return False

def main(argv):
    fileHashMap = {}
    fileCount = 0
    for curDir, subDirs, files in os.walk(argv[1]):
        print(curDir)
        for file in files:
            fileCount += 1
            print("------------: " + str(fileCount))
            print(curDir + file)
            checkFile(fileHashMap, curDir + file)

if __name__ == "__main__":
    main(sys.argv)
The script processes about 10 GB worth of files and then throws MemoryError on the line 'fileData = fReader.read()'. I thought that since I am closing the fReader and marking the fileData for deletion after I have calculated the md5 sum, I wouldn't run into this. How can I calculate the md5 sums without running into this memory error?
Edit: I was requested to remove the dictionary and look at the memory usage to see if there may be a leak in hashlib. Here was the code I ran.
import sys
import os
import hashlib

def checkFile(file):
    fReader = open(file)
    fileData = fReader.read();
    fReader.close()
    fileHash = hashlib.md5(fileData).hexdigest()
    del fileData

def main(argv):
    for curDir, subDirs, files in os.walk(argv[1]):
        print(curDir)
        for file in files:
            print("------: " + str(curDir + file))
            checkFile(curDir + file)

if __name__ == "__main__":
    main(sys.argv)
and I still get the memory crash.
Your problem is that you read each file in its entirety: some of the files are too big for your system to load all at once into memory, so the read throws the error.
As you can see in the Official Python Documentation, the MemoryError is:
Raised when an operation runs out of memory but the situation may
still be rescued (by deleting some objects). The associated value is a
string indicating what kind of (internal) operation ran out of memory.
Note that because of the underlying memory management architecture
(C’s malloc() function), the interpreter may not always be able to
completely recover from this situation; it nevertheless raises an
exception so that a stack traceback can be printed, in case a run-away
program was the cause.
For your purpose, you can still use hashlib.md5().
In that case, you'll have to read chunks of 4096 bytes sequentially and feed them to the MD5 object:
def md5(fname):
    hash = hashlib.md5()
    # Open in binary mode so read() returns bytes; compare against b"" accordingly.
    with open(fname, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash.update(chunk)
    return hash.hexdigest()
Not a solution to your memory problem, but an optimization that might avoid it:
small files: calculate md5 sum, remove duplicates
big files: remember size and path
at the end, only calculate md5sums of files of same size when there is more than one file
Python's collections.defaultdict might be useful for this.
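Here is a rough sketch of that idea (untested; the 1 MB small-file threshold, the md5_of helper, and the find_duplicates name are my own assumptions, not code from the question):

import os
import hashlib
from collections import defaultdict

SMALL_FILE_LIMIT = 1 * 1024 * 1024  # hypothetical cutoff: hash "small" files immediately

def md5_of(path, blocksize=64 * 1024):
    # Chunked MD5 so even the big files never need to fit in memory.
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(blocksize), b''):
            h.update(block)
    return h.hexdigest()

def find_duplicates(root):
    seen_hashes = {}                 # md5 -> first path seen (small files)
    big_by_size = defaultdict(list)  # size -> paths of big files
    duplicates = []
    for cur_dir, _sub_dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(cur_dir, name)
            size = os.path.getsize(path)
            if size <= SMALL_FILE_LIMIT:
                digest = md5_of(path)
                if digest in seen_hashes:
                    duplicates.append(path)
                else:
                    seen_hashes[digest] = path
            else:
                big_by_size[size].append(path)
    # Only hash big files whose size occurs more than once.
    for paths in big_by_size.values():
        if len(paths) > 1:
            hashes = {}
            for path in paths:
                digest = md5_of(path)
                if digest in hashes:
                    duplicates.append(path)
                else:
                    hashes[digest] = path
    return duplicates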
How about calling the openssl command from Python? It works the same way on both Windows and Linux:
$ openssl md5 "file"
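If you go that route, a minimal sketch for calling it from Python could look like this (assuming Python 3.7+ for capture_output and that openssl is on your PATH; the openssl_md5 name is my own):

import subprocess

def openssl_md5(path):
    # Run `openssl md5 <file>` and parse the hex digest from output that
    # typically looks like: MD5(somefile)= d41d8cd98f00b204e9800998ecf8427e
    result = subprocess.run(['openssl', 'md5', path],
                            capture_output=True, text=True, check=True)
    return result.stdout.strip().rsplit('= ', 1)[-1]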
Related
I'm trying to write a program somewhat similar to old virus scanners, where I walk from the root directory of a system and find the md5 checksum for each file, and then compare it to a checksum inputted by the user. I'm running into an issue where while the script walks through the file system and reads then converts the data into the md5 checksum, certain files will basically stall the program indefinitely with no error message.
I have a try/except to check whether the file is readable before I try reading the file and getting the md5, and every time the program stalls, it'll be on the same files, and it will get stuck on the f.read() function. My code is below and I appreciate any help. I'm using the try/except the way that the os.access() documentation suggests.
def md5Check():
    flist = []
    md5list = []
    badlist = []
    for path, dirname, fname in os.walk(rootdir):
        for name in fname:
            filetype = os.access(os.path.join(path, name), os.R_OK)
            item = os.path.join(path, name)
            try:
                ft = open(item, 'rb')
                ft.close()
                if filetype == True:
                    with open(item, 'rb') as f:
                        md5obj = hashlib.md5()
                        fdata = f.read()
                        md5obj.update(fdata)
                        md5list.append(md5obj.hexdigest())
                        print(f'try: {item}')
            except (PermissionError, FileNotFoundError, OSError, IOError):
                badlist.append(item)
                print(f'except:{item}')
Also, keep in mind that the functionality for comparing a user-entered checksum is not yet coded in, as I can't even get a full array of md5 checksums to compare to, since it stalls before walking the whole filesystem and converting the files to md5.
I've also tried using the os.access() method with os.R_OK, but the documentation says that it's an insecure way to check for readability, so I opted for the simple try/except with open() instead.
Lastly, this program should run the same on Windows/Linux/macOS (and everything so far does, including this persistent issue), so any OS-specific solutions won't work for this particular project.
Any help is appreciated!
I think the primary cause of your problem comes from using os.access; i.e. calling os.access("/dev/null", ...) is what is causing your program to hang.
To avoid attempting to hash a symlink, a device file, or some other unreadable filesystem entry, you should check each item while traversing to see whether the target is in fact a regular file.
...
for name in fname:
    fullname = os.path.join(path, name)
    if os.path.isfile(fullname):
        try:
            with open(fullname, 'rb') as f:
                md5obj = hashlib.md5(f.read())
            md5list.append(md5obj.hexdigest())
            print(f'try: {fullname}')
        except (PermissionError, FileNotFoundError, OSError, IOError):
            badlist.append(fullname)
            print(f'except:{fullname}')
If that method doesn't work for you, another option is to use pathlib, which is also cross-platform and has an OOP approach to dealing with the filesystem. I have tested this and confirmed it does return False on files such as /dev/null.
from pathlib import Path

for name in fname:
    fullname = Path(path) / name
    if fullname.is_file():
        try:
            md5obj = hashlib.md5(fullname.read_bytes())
            md5list.append(md5obj.hexdigest())
            ...
...
I need to run a Python script to do the following, which I am doing manually right now for testing:
cat /dev/pts/5
And then i need to echo this back to /dev/pts/6
echo <DATA_RECEIVED_FROM_5> > /dev/pts/6
I can't seem to get Python to actually read what is coming in from /dev/pts/5, save it to a list, and then output it one by one to /dev/pts/6 using echo.
#!/bin/python
import sys
import subprocess

seq = []
count = 1
while True:
    term = subprocess.call(['cat','/dev/pts/5'])
    seq.append(term)
    if len(seq) == count:
        for i in seq:
            subprocess.call(['echo',i,'/dev/pts/6'])
        seq = []
        count = count + 1
I'm not sure I understand your problem and desired outcome, but to generate a list of filenames within /dev/pts/5 and just save it as a .txt file in /dev/pts/6, you should use the os module that comes standard with Python. You can accomplish this task as such:
import os

list_of_files = []
for dirpath, dirnames, filenames in os.walk('/dev/pts/5'):
    list_of_files.append([dirpath, dirnames, filenames])

with open('/dev/pts/6/output.txt', 'w+') as file:
    for file_info in list_of_files:
        file.write("{} -> {} -> {}".format(file_info[0], file_info[1], file_info[2]))
The output from this will likely be a bit much, but you can just apply some logic to filter out what you're looking for.
os.walk() documentation
update
To read the data from an arbitrary file and write it to an arbitrary file (no extensions) is (if I understand correctly) pretty easy to do in Python:
with open('/dev/pts/5', 'rb') as file:  # use 'rb' to handle arbitrary data formats
    data = file.read()

with open('/dev/pts/6', 'wb+') as file:
    # 'wb+' will handle arbitrary data and make the file if it doesn't exist.
    # If it does exist it will be overwritten!! To append instead use 'ab+'.
    file.write(data)
third time is the charm
Following the example here it looks like you need to run:
term = subprocess.run(['cat','/dev/pts/5'], capture_output=True)
print(term.stdout)
Where the important bit is capture_output=True and then you have to access the .stdout of the CompletedProcess object!
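Putting that together with the original goal, a sketch along these lines might work (untested against real ptys; it assumes /dev/pts/5 eventually reaches EOF and that /dev/pts/6 is writable):

import subprocess

# Capture whatever `cat` reads from the first pty.
term = subprocess.run(['cat', '/dev/pts/5'], capture_output=True)

# Write each captured line back out to the second pty. Opening the device
# directly replaces the shell redirection (`> /dev/pts/6`) that
# subprocess.call(['echo', ...]) cannot do by itself.
with open('/dev/pts/6', 'wb') as out:
    for line in term.stdout.splitlines():
        out.write(line + b'\n')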
I use the following simple Python script to compress a large text file (say, 10GB) on an EC2 m3.large instance. However, I always got a MemoryError:
import gzip

with open('test_large.csv', 'rb') as f_in:
    with gzip.open('test_out.csv.gz', 'wb') as f_out:
        f_out.writelines(f_in)
        # or the following:
        # for line in f_in:
        #     f_out.write(line)
The traceback I got is:
Traceback (most recent call last):
File "test.py", line 8, in <module>
f_out.writelines(f_in)
MemoryError
I have read some discussion about this issue, but I'm still not quite clear on how to handle it. Can someone give me a more understandable answer about how to deal with this problem?
The problem here has nothing to do with gzip, and everything to do with reading line by line from a 10GB file with no newlines in it:
As an additional note, the file I used to test the Python gzip functionality is generated by fallocate -l 10G bigfile_file.
That gives you a 10GB sparse file made entirely of 0 bytes. Meaning there are no newline bytes. Meaning the first line is 10GB long. Meaning it will take 10GB to read the first line. (Or possibly even 20 or 40GB, if you're using pre-3.3 Python and trying to read it as Unicode.)
If you want to copy binary data, don't copy line by line. Whether it's a normal file, a GzipFile that's decompressing for you on the fly, a socket.makefile(), or anything else, you will have the same problem.
The solution is to copy chunk by chunk. Or just use copyfileobj, which does that for you automatically.
import gzip
import shutil

with open('test_large.csv', 'rb') as f_in:
    with gzip.open('test_out.csv.gz', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)
By default, copyfileobj uses a chunk size optimized to be often very good and never very bad. In this case, you might actually want a smaller size, or a larger one; it's hard to predict which a priori.* So, test it by using timeit with different bufsize arguments (say, powers of 4 from 1KB to 8MB) to copyfileobj. But the default 16KB will probably be good enough unless you're doing a lot of this.
* If the buffer size is too big, you may end up alternating long chunks of I/O and long chunks of processing. If it's too small, you may end up needing multiple reads to fill a single gzip block.
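A sketch of such a benchmark (the file names are placeholders from the question; pick whatever candidate sizes you like):

import gzip
import shutil
import timeit

def compress(bufsize):
    # One full compression pass using the given copy buffer size.
    with open('test_large.csv', 'rb') as f_in:
        with gzip.open('test_out.csv.gz', 'wb') as f_out:
            shutil.copyfileobj(f_in, f_out, bufsize)

for bufsize in (1024, 16 * 1024, 256 * 1024, 4 * 1024 * 1024):
    elapsed = timeit.timeit(lambda: compress(bufsize), number=1)
    print("bufsize=%d: %.2fs" % (bufsize, elapsed))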
That's odd. I would expect this error if you tried to compress a large binary file that didn't contain many newlines, since such a file could contain a "line" that was too big for your RAM, but it shouldn't happen on a line-structured .csv file.
But anyway, it's not very efficient to compress files line by line. Even though the OS buffers disk I/O, it's generally much faster to read and write larger blocks of data, e.g. 64 kB.
I have 2GB of RAM on this machine, and I just successfully used the program below to compress a 2.8GB tar archive.
#! /usr/bin/env python
import gzip
import sys

blocksize = 1 << 16    # 64kB

def gzipfile(iname, oname, level):
    with open(iname, 'rb') as f_in:
        f_out = gzip.open(oname, 'wb', level)
        while True:
            block = f_in.read(blocksize)
            if block == '':
                break
            f_out.write(block)
        f_out.close()
    return

def main():
    if len(sys.argv) < 3:
        print "gzip compress in_file to out_file"
        print "Usage:\n%s in_file out_file [compression_level]" % sys.argv[0]
        exit(1)
    iname = sys.argv[1]
    oname = sys.argv[2]
    level = int(sys.argv[3]) if len(sys.argv) > 3 else 6
    gzipfile(iname, oname, level)

if __name__ == '__main__':
    main()
I'm running Python 2.6.6 and gzip.open() doesn't support with.
As Andrew Bay notes in the comments, if block == '': won't work correctly in Python 3, since block contains bytes, not a string, and an empty bytes object doesn't compare as equal to an empty text string. We could check the block length, or compare to b'' (which will also work in Python 2.6+), but the simple way is if not block:.
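For reference, a Python 3 flavoured sketch of the same loop (gzip.open supports the with statement in current versions, and `if not block:` handles the bytes sentinel):

import gzip

def gzipfile(iname, oname, level=6, blocksize=1 << 16):
    with open(iname, 'rb') as f_in, gzip.open(oname, 'wb', level) as f_out:
        while True:
            block = f_in.read(blocksize)
            if not block:
                break
            f_out.write(block)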
It is weird to get a memory error even when reading a file line by line. I suppose it is because you have very little available memory and very large lines. You should then use binary reads:
import gzip

# adapt the size value: small values will take more time, large values could cause memory errors
size = 8096
with open('test_large.csv', 'rb') as f_in:
    with gzip.open('test_out.csv.gz', 'wb') as f_out:
        while True:
            data = f_in.read(size)
            if not data:
                break
            f_out.write(data)
I want Python to read to the EOF so I can get an appropriate hash, whether it is SHA-1 or MD5. Please help. Here is what I have so far:
import hashlib
inputFile = raw_input("Enter the name of the file:")
openedFile = open(inputFile)
readFile = openedFile.read()
md5Hash = hashlib.md5(readFile)
md5Hashed = md5Hash.hexdigest()
sha1Hash = hashlib.sha1(readFile)
sha1Hashed = sha1Hash.hexdigest()
print "File Name: %s" % inputFile
print "MD5: %r" % md5Hashed
print "SHA1: %r" % sha1Hashed
TL;DR use buffers to not use tons of memory.
We get to the crux of your problem, I believe, when we consider the memory implications of working with very large files. We don't want this bad boy to churn through 2 gigs of ram for a 2 gigabyte file so, as pasztorpisti points out, we gotta deal with those bigger files in chunks!
import sys
import hashlib

# BUF_SIZE is totally arbitrary, change for your app!
BUF_SIZE = 65536  # lets read stuff in 64kb chunks!

md5 = hashlib.md5()
sha1 = hashlib.sha1()

with open(sys.argv[1], 'rb') as f:
    while True:
        data = f.read(BUF_SIZE)
        if not data:
            break
        md5.update(data)
        sha1.update(data)

print("MD5: {0}".format(md5.hexdigest()))
print("SHA1: {0}".format(sha1.hexdigest()))
What we've done is we're updating our hashes of this bad boy in 64kb chunks as we go along with hashlib's handy dandy update method. This way we use a lot less memory than the 2gb it would take to hash the guy all at once!
You can test this with:
$ mkfile 2g bigfile
$ python hashes.py bigfile
MD5: a981130cf2b7e09f4686dc273cf7187e
SHA1: 91d50642dd930e9542c39d36f0516d45f4e1af0d
$ md5 bigfile
MD5 (bigfile) = a981130cf2b7e09f4686dc273cf7187e
$ shasum bigfile
91d50642dd930e9542c39d36f0516d45f4e1af0d bigfile
Also all of this is outlined in the linked question on the right hand side: Get MD5 hash of big files in Python
Addendum!
In general when writing Python it helps to get into the habit of following PEP 8. For example, in Python variables are typically underscore_separated, not camelCased. But that's just style and no one really cares about those things except people who have to read bad style... which might be you reading this code years from now.
For the correct and efficient computation of the hash value of a file (in Python 3):
Open the file in binary mode (i.e. add 'b' to the filemode) to avoid character encoding and line-ending conversion issues.
Don't read the complete file into memory, since that is a waste of memory. Instead, sequentially read it block by block and update the hash for each block.
Eliminate double buffering, i.e. don't use buffered IO, because we already use an optimal block size.
Use readinto() to avoid buffer churning.
Example:
import hashlib

def sha256sum(filename):
    h = hashlib.sha256()
    b = bytearray(128*1024)
    mv = memoryview(b)
    with open(filename, 'rb', buffering=0) as f:
        while n := f.readinto(mv):
            h.update(mv[:n])
    return h.hexdigest()
Note that the while loop uses an assignment expression which isn't available in Python versions older than 3.8.
With older Python 3 versions you can use an equivalent variation:
import hashlib

def sha256sum(filename):
    h = hashlib.sha256()
    b = bytearray(128*1024)
    mv = memoryview(b)
    with open(filename, 'rb', buffering=0) as f:
        for n in iter(lambda: f.readinto(mv), 0):
            h.update(mv[:n])
    return h.hexdigest()
I would propose simply:
def get_digest(file_path):
    h = hashlib.sha256()
    with open(file_path, 'rb') as file:
        while True:
            # Reading is buffered, so we can read smaller chunks.
            chunk = file.read(h.block_size)
            if not chunk:
                break
            h.update(chunk)
    return h.hexdigest()
All other answers here seem to complicate things too much. Python already buffers when reading (ideally, or you configure that buffering if you have more information about the underlying storage), so it is better to read in the chunk size the hash function finds ideal, which makes computing the hash faster or at least less CPU intensive. So instead of disabling buffering and trying to emulate it yourself, use Python's buffering and control what you should be controlling: the block size the consumer of your data, the hash function, finds ideal.
Here is a Python 3, POSIX solution (not Windows!) that uses mmap to map the object into memory.
import hashlib
import mmap

def sha256sum(filename):
    h = hashlib.sha256()
    with open(filename, 'rb') as f:
        with mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ) as mm:
            h.update(mm)
    return h.hexdigest()
I have programmed a module which is able to hash big files with different algorithms.
pip3 install py_essentials
Use the module like this:
from py_essentials import hashing as hs
hash = hs.fileChecksum("path/to/the/file.txt", "sha256")
You do not need to define a function with 5-20 lines of code to do this! Save your time by using the pathlib and hashlib libraries. py_essentials is another solution, but third-party packages are *****.
from pathlib import Path
import hashlib
filepath = '/path/to/file'
filebytes = Path(filepath).read_bytes()
filehash_sha1 = hashlib.sha1(filebytes)
filehash_md5 = hashlib.md5(filebytes)
print(f'MD5: {filehash_md5.hexdigest()}')
print(f'SHA1: {filehash_sha1.hexdigest()}')
I used a few variables here to show the steps, you know how to avoid it.
What do you think about the below function?
from pathlib import Path
import hashlib

def compute_filehash(filepath: str, hashtype: str) -> str:
    """Computes the requested hash for the given file.

    Args:
        filepath: The path to the file to compute the hash for.
        hashtype: The hash type to compute.
            Available hash types:
                md5, sha1, sha224, sha256, sha384, sha512, sha3_224,
                sha3_256, sha3_384, sha3_512, shake_128, shake_256

    Returns:
        A string that represents the hash.

    Raises:
        ValueError: If the hash type is not supported.
    """
    if hashtype not in ['md5', 'sha1', 'sha224', 'sha256', 'sha384',
                        'sha512', 'sha3_224', 'sha3_256', 'sha3_384',
                        'sha3_512', 'shake_128', 'shake_256']:
        raise ValueError(f'Hash type {hashtype} is not supported.')
    return getattr(hashlib, hashtype)(
        Path(filepath).read_bytes()).hexdigest()
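A hypothetical call (the path is a placeholder) would look like this; note that read_bytes() still loads the whole file into memory, so it suits smaller files:

print(compute_filehash('/path/to/file', 'sha256'))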
FWIW, I prefer this version, which has the same memory and performance characteristics as maxschlepzig's answer but is more readable IMO:
import hashlib

def sha256sum(filename, bufsize=128 * 1024):
    h = hashlib.sha256()
    buffer = bytearray(bufsize)
    # using a memoryview so that we can slice the buffer without copying it
    buffer_view = memoryview(buffer)
    with open(filename, 'rb', buffering=0) as f:
        while True:
            n = f.readinto(buffer_view)
            if not n:
                break
            h.update(buffer_view[:n])
    return h.hexdigest()
Starting with Python 3.11, you can use hashlib.file_digest(), which takes responsibility for reading the file:
import hashlib

with open(inputFile, "rb") as f:
    digest = hashlib.file_digest(f, "sha256")
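The returned object behaves like any other hashlib hash object, so the usual methods apply:

print(digest.hexdigest())  # e.g. 'e3b0c442...' for an empty file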
import hashlib

user = input("Enter ")
h = hashlib.md5(user.encode())
h2 = h.hexdigest()

with open("encrypted.txt", "w") as e:
    print(h2, file=e)

with open("encrypted.txt", "r") as e:
    p = e.readline().strip()

print(p)
I'm trying to use the Python GZIP module to simply uncompress several .gz files in a directory. Note that I do not want to read the files, only uncompress them. After searching this site for a while, I have this code segment, but it does not work:
import gzip
import glob
import os

for file in glob.glob(PATH_TO_FILE + "/*.gz"):
    #print file
    if os.path.isdir(file) == False:
        shutil.copy(file, FILE_DIR)
        # uncompress the file
        inF = gzip.open(file, 'rb')
        s = inF.read()
        inF.close()
The .gz files are in the correct location, and I can print the full path + filename with the print command, but the gzip module isn't getting executed properly. What am I missing?
If you get no error, the gzip module probably is being executed properly, and the file is already getting decompressed.
The precise definition of "decompressed" varies by context:
I do not want to read the files, only uncompress them
The gzip module doesn't work as a desktop archiving program like 7-zip - you can't "uncompress" a file without "reading" it. Note that "reading" (in programming) usually just means "storing (temporarily) in the computer RAM", not "opening the file in the GUI".
What you probably mean by "uncompress" (as in a desktop archiving program) is more precisely described (in programming) as "read an in-memory stream/buffer from a compressed file, and write it to a new file (and possibly delete the compressed file afterwards)".
inF = gzip.open(file, 'rb')
s = inF.read()
inF.close()
With these lines, you're just reading the stream. If you expect a new "uncompressed" file to be created, you just need to write the buffer to a new file:
with open(out_filename, 'wb') as out_file:
    out_file.write(s)
If you're dealing with very large files (larger than the amount of your RAM), you'll need to adopt a different approach. But that is the topic for another question.
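For completeness, a sketch of that different approach, streaming the decompressed data to disk in chunks instead of holding it all in memory (file names are placeholders):

import gzip
import shutil

with gzip.open('big_input.gz', 'rb') as f_in:
    with open('big_output', 'wb') as f_out:
        # copyfileobj reads and writes in fixed-size chunks, so memory use stays small.
        shutil.copyfileobj(f_in, f_out)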
You're decompressing the file into the s variable and doing nothing with it. You should stop searching Stack Overflow and read at least the Python tutorial. Seriously.
Anyway, there are several things wrong with your code:
you need to STORE the unzipped data in s into some file.
there's no need to copy the actual *.gz files, because in your code you're unpacking the original gzip file and not the copy.
you're using file, which is a built-in name, as a variable. This is not an error, just a very bad practice.
This should probably do what you wanted:
import gzip
import glob
import os
import os.path

for gzip_path in glob.glob(PATH_TO_FILE + "/*.gz"):
    if os.path.isdir(gzip_path) == False:
        inF = gzip.open(gzip_path, 'rb')
        # uncompress the gzip_path INTO THE 's' variable
        s = inF.read()
        inF.close()

        # get gzip filename (without directories)
        gzip_fname = os.path.basename(gzip_path)
        # get original filename (remove 3 characters from the end: ".gz")
        fname = gzip_fname[:-3]
        uncompressed_path = os.path.join(FILE_DIR, fname)

        # store uncompressed file data from 's' variable
        open(uncompressed_path, 'w').write(s)
You should use with to open files and, of course, store the result of reading the compressed file. See gzip documentation:
import gzip
import glob
import os
import os.path

for gzip_path in glob.glob("%s/*.gz" % PATH_TO_FILE):
    if not os.path.isdir(gzip_path):
        with gzip.open(gzip_path, 'rb') as in_file:
            s = in_file.read()

        # Now store the uncompressed data
        path_to_store = gzip_path[:-3]  # remove the '.gz' from the filename

        # store uncompressed file data from 's' variable
        with open(path_to_store, 'w') as f:
            f.write(s)
Depending on what exactly you want to do, you might want to have a look at tarfile and its 'r:gz' option for opening files.
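A minimal sketch of that, assuming a gzip-compressed tar archive named archive.tar.gz:

import tarfile

with tarfile.open('archive.tar.gz', 'r:gz') as tar:
    tar.extractall(path='output_dir')  # extracts every member into output_dir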
I was able to resolve this issue by using the subprocess module:
for file in glob.glob(PATH_TO_FILE + "/*.gz"):
    if os.path.isdir(file) == False:
        shutil.copy(file, FILE_DIR)
        # uncompress the file
        subprocess.call(["gunzip", FILE_DIR + "/" + os.path.basename(file)])
Since my goal was to simply uncompress the archive, the above code accomplishes this. The archived files are located in a central location, and are copied to a working area, uncompressed, and used in a test case. The gzip module was too complicated for what I was trying to accomplish.
Thanks for everyone's help. It is much appreciated!
I think there is a much simpler solution than the others presented, given that the OP only wanted to extract all the files in a directory:
import glob
from setuptools import archive_util

for fn in glob.glob('*.gz'):
    archive_util.unpack_archive(fn, '.')