I am using a SHA1 hash to identify identical files within directories and then delete the duplicates. Whenever I run my script, it seems that an appropriate hash is created for most of my files. However, I have several very different files that are coming up with the same hash. I know it is virtually impossible for collisions like this to occur, so I am wondering if there is something wrong in my code below:
hashmap = {}
for path, dirs, files in os.walk(maindirectory):
    for filename in files:
        fullname = os.path.join(path, filename)
        with open(fullname) as f:
            d = f.read()
        h = hashlib.sha1(d).hexdigest()
        filelist = hashmap.setdefault(h, [])
        filelist.append(fullname)

# delete records in dictionary that have only 1 item (meaning no duplicate)
for k, v in hashmap.items():
    if len(v) == 1:
        del hashmap[k]
When I return my dictionary, hashmap, there are a few predictable duplicates and then there is one hash with about 40 unique files. Any suggestions, tips, or explanations would be most appreciated!
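One thing worth checking for anyone hitting the same symptom: the file is opened in text mode, and on Windows text mode stops reading at the first Ctrl-Z (0x1A) byte, so many binary files can hash identically to their truncated prefix. A minimal sketch of the same grouping logic with binary-mode reads (assuming Python 3, where sha1 also requires bytes):

```python
import hashlib
import os

def find_duplicates(maindirectory):
    """Group files under maindirectory by the SHA-1 of their contents;
    return only the groups containing more than one file."""
    hashmap = {}
    for path, dirs, files in os.walk(maindirectory):
        for filename in files:
            fullname = os.path.join(path, filename)
            # 'rb' avoids newline translation and the early-EOF-on-Ctrl-Z
            # behaviour of Windows text mode
            with open(fullname, 'rb') as f:
                h = hashlib.sha1(f.read()).hexdigest()
            hashmap.setdefault(h, []).append(fullname)
    # keep only hashes shared by two or more files
    return {k: v for k, v in hashmap.items() if len(v) > 1}
```

This also builds the filtered dictionary directly instead of deleting keys while iterating, which raises a RuntimeError in Python 3.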
I am trying to create a dictionary as follows:
key(file_size, checksum) = filename.
I want the two values together to make up a single compound key. These values are derived from the actual file in question. If the key is matched, I have a duplicate file, not just a duplicate file name.
It would be easy to determine duplicates if there were a single key:filename mapping. But not all duplicate files will have the same filename, either by path or by actual filename. So far, no Python website has been able to supply an answer. Although one did have this format, I haven't found it again.
I have tried various combinations of brackets and commas with little effect.
A simple example that finds duplicate files in subdirs using a dictionary like you suggest:
from pathlib import Path
import hashlib

found = {}
for fn in Path('.').glob('**/*'):
    if fn.is_file():
        with open(fn, 'rb') as f:
            key = (fn.stat().st_size, hashlib.md5(f.read()).hexdigest())
        if key in found:
            print(f'File {fn} is a duplicate of {found[key]}')
        else:
            found[key] = fn
Specifically, key is a tuple of two values: the file size (an integer) and the hex string of an MD5 hash (i.e. a sort of checksum, as you suggest).
Note that this bit of code reads every file it encounters in its entirety, so I suggest pointing it at a small or moderate file collection. Computing a full checksum or hash for each file isn't a very fast way to find duplicates.
You should consider looking at more basic attributes first (size, hash for the first few bytes, etc.) and only doing a full hash if you have candidates for being duplicates.
Also, the odds of two files having the same MD5 hash but somehow different sizes are astronomically small, so adding the file size to the key here is more or less pointless if you hash the whole file.
(Edit) This is a somewhat better way to achieve the same, a lot faster:
from pathlib import Path
import hashlib
import filecmp

found = {}
for fn in Path('.').glob('**/*'):
    if fn.is_file():
        with open(fn, 'rb') as f:
            key = (fn.stat().st_size, hashlib.md5(f.read(1024)).hexdigest())
        if key in found:
            if filecmp.cmp(fn, found[key], shallow=False):
                print(f'File {fn} is a duplicate of {found[key]}')
        else:
            found[key] = fn
Hey guys, I was able to get the hashes to work individually for the files I input. I was wondering how to add the file name to the list I have set up, and the hash to the dictionary. I feel like it is a pretty simple fix; I am just stuck. We have to use a file path on our machines that has been established already. The folder I am using has 3 or 4 files in it. I am just trying to figure out how to add each of the hashes to the lists. Thanks!
from __future__ import print_function
'''
Week Two Assignment 2 - File Hashing
'''
'''
Complete the script below to do the following:
1) Add your name, date, assignment number to the top of this script
2) Using the os library and the os.walk() method
   a) Create a list of all files
   b) Create an empty dictionary named fileHashes
   c) Iterate through the list of files and
      - calculate the md5 hash of each file
      - create a dictionary entry where:
        key   = md5 hash
        value = filepath
   d) Iterate through the dictionary
      - print out each key, value pair
3) Submit
   NamingConvention: lastNameFirstInitial_Assignment_.ext
   for example: hosmerC_WK1_script.py
                hosmerC_WK1_screenshot.jpg
   A) Screenshot of the results in WingIDE
   B) Your Script
'''
import os
import hashlib
import sys

directory = "."
fileList = []
fileHashes = {}

# Pseudo Constants
SCRIPT_NAME = "Script: ASSIGNMENT NAME"
SCRIPT_AUTHOR = "Author: NAME"
SCRIPT_DATE = "Date: 25 January 2021"

print(SCRIPT_NAME)
print(SCRIPT_AUTHOR)
print(SCRIPT_DATE)

for root, dirs, files in os.walk(directory):
    # Walk the path from top to bottom.
    # For each file obtain the filename
    for fileName in files:
        path = os.path.join(root, fileName)
        fullPath = os.path.abspath(path)
print(files)

''' Determine which version of Python '''
if sys.version_info[0] < 3:
    PYTHON_2 = True
else:
    PYTHON_2 = False

def HashFile(filePath):
    '''
    function takes one input, a valid filePath,
    and returns the hexdigest of the file
    or an error
    '''
    try:
        with open(filePath, 'rb') as fileToHash:
            fileContents = fileToHash.read()
            hashObj = hashlib.md5()
            hashObj.update(fileContents)
            digest = hashObj.hexdigest()
            return digest
    except Exception as err:
        return str(err)

print()

if PYTHON_2:
    fileName = raw_input("Enter file to hash: ")
else:
    fileName = input("Enter file to hash: ")

hexDigest = HashFile(fileName)
print(hexDigest)
Well, you've done most of the work in the assignment, so kudos to you for that. You just need to refine a few things and use your functions together.
For item "a) Create a list of all files": In the for fileName in files: loop, add the line fileList.append(fullPath). (See list.append() for more info.)
Indent it so it's part of the for loop.
Btw, the print(files) line you have is outside the loop so it will only print the files of the last folder that was os.walk'ed.
Change that to print(fileList)
For "c) Iterate through the list of files and...":
Iterate through the fileList and call the HashFile() function for each file. The return value is the key for your dictionary and the filepath is the value:
for filepath in fileList:
    filehash = HashFile(filepath)
    fileHashes[filehash] = filepath
The one-line version of that, using a dictionary comprehension:
fileHashes = {HashFile(filepath): filepath for filepath in fileList}
For "d) Iterate through the dictionary": I think you'll be able to manage that on your own. (See dict.items() for more info.)
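For reference, iterating over key/value pairs with dict.items() looks like this on a toy dictionary (the entry below is a made-up sample, not output from the assignment):

```python
# hypothetical sample entry; real code would use the fileHashes built above
fileHashes = {'9e107d9d372bb6826bd81d3542a419d6': '/tmp/fox.txt'}

for key, value in fileHashes.items():
    print(key, '->', value)
```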
Other notes:
A.) In the except block for calculating hashes, returning a string of the error - was that pre-written for you in the assignment or did you write it? If you wrote it, consider removing it, because then that error message becomes the hash of the file since you're returning it. Possibly just print the error; unless you were instructed otherwise.
B.) The "input filename" part is not needed - you'll be calculating the hash of all the files in the "current" directory where your script is executed (as shown above).
C.) Btw, since MD5 collisions are a known weakness, you may want to store a list of file paths as each dictionary value instead of a single path. Maybe it's part of the test. Or it can be safely ignored.
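A sketch of that list-per-key idea, using setdefault so two files with the same digest both survive (the digest and paths below are made up for illustration):

```python
fileHashes = {}

def add_file(digest, filepath):
    """Append filepath to the list of files sharing this digest,
    creating the list on first sight of the digest."""
    fileHashes.setdefault(digest, []).append(filepath)

add_file('abc123', '/tmp/a.txt')
add_file('abc123', '/tmp/b.txt')  # same digest: appended, not overwritten
```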
I have the following question.
Let say I have a list of text files in a folder:
D:/Users/Roger/A
And another list of text files in another folder:
D:/Users/Roger/Reports
(The lists are the complete path to them), and they are ordered alphabetically so they match 1:1 for example:
FOLDER_A = ["D:/Users/Roger/A/a.txt", "D:/Users/Roger/A/b.txt"]
FOLDER_B = ["D:/Users/Roger/B/a.txt", "D:/Users/Roger/B/b.txt"]
I made a dictionary using both lists
Dict = {}
for i in range(len(FOLDER_A)):
    Dict[FOLDER_A[i]] = FOLDER_B[i]
sorted(Dict.items())
Later on, I copied the information of a.txt in folder A to the a.txt file in folder B, doing a for loop, that iterated between the key and value of the dictionary.
My question: Is there a way to do this using some kind of recursion, instead of iterating through the key/value pairs of a dictionary, in Python 2.7?
Thank you very much!
There is a recursive form; as with all iterative algorithms, there is an alternate recursive version. However, the recursive version is rarely used because of the likelihood of a stack overflow when the list is longer than the available stack depth.
Recursive algorithms can be very expressive, but to me, the organisation of the data is asking to be iterated over.
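Purely for illustration, the recursive form might look like the sketch below (names are made up, and the per-pair file copy is elided to a list append):

```python
def visit_pairs(pairs, handled=None):
    """Recursively walk a list of (src, dst) pairs, one call per pair.
    The iterative for-loop is the better choice in practice."""
    if handled is None:
        handled = []
    if not pairs:                    # base case: nothing left to process
        return handled
    src, dst = pairs[0]
    handled.append((src, dst))       # the real per-pair work (a file copy) would go here
    return visit_pairs(pairs[1:], handled)
```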
btw your dict can be created with a dictionary comprehension:
Dict = {FOLDER_A[i]: FOLDER_B[i] for i in range(len(FOLDER_A))}
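Since the two lists line up 1:1, zip() avoids the indexing altogether (paths below are the question's examples):

```python
FOLDER_A = ["D:/Users/Roger/A/a.txt", "D:/Users/Roger/A/b.txt"]
FOLDER_B = ["D:/Users/Roger/B/a.txt", "D:/Users/Roger/B/b.txt"]

# pair each source path with the destination path at the same position
Dict = dict(zip(FOLDER_A, FOLDER_B))
```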
You don't need any recursion here. You can iterate through the files in the first folder, compare the names, and then copy the information from the file in the first folder to the file in the second folder, something like this:
import os

folder1 = "folder1"
folder2 = "folder2"
for root, dirs, files in os.walk(folder1):
    for file in files:
        if file in os.listdir(folder2):
            file_path = os.path.join(folder1, file)
            file1_path = os.path.join(folder2, file)
            if os.path.isfile(file1_path):
                with open(file_path) as f:
                    lines = f.readlines()
                with open(file1_path, 'a') as f1:
                    f1.writelines(lines)
I'm trying to have the bottom part of the code iterate over some files. These files should be corresponding and are differentiated by a number, so the counter is to change the number part of the file.
The file names are generated by looking through the given files and selecting files containing certain things in the title, then having them ordered using the count.
This code works independently, in its own (lonely) folder, and prints the correct files in the correct order. However, when I use this in my main code, where file_1 and file_2 are referenced (the decoder and encoder parts of the code), I get the error in the title. There is no way there is any typo or that the files don't exist, because Python made these things itself based on existing file names.
import os

count = 201
while 205 > count:
    indir = 'absolute_path/models'
    for root, dirs, filenames in os.walk(indir):
        for f in filenames:
            if 'test-decoder' in f:
                if f.endswith(".model"):
                    if str(count) in f:
                        file_1 = f
                        print(file_1)
    indir = 'absolute_path/models'
    for root, dirs, filenames in os.walk(indir):
        for f in filenames:
            if 'test-encoder' in f:
                if f.endswith(".model"):
                    if str(count) in f:
                        file_2 = f
                        print(file_2)
    decoder1.load_state_dict(torch.load(open(file_1, 'rb')))
    encoder1.load_state_dict(torch.load(open(file_2, 'rb')))
    print(getBlueScore(encoder1, decoder1, pairs, src, tgt))
    print_every = 10
    print(file_1 + file_2)
    count = count + 1
I then need to use these files two by two.
It's very possible that you are running into issues with variable scoping, but without being able to see your entire code it's hard to know for sure.
If you know what the model files should be called, might I suggest this code:
for i in range(201, 205):
    e = 'absolute_path/models/test_encoder_%d.model' % i
    d = 'absolute_path/models/test_decoder_%d.model' % i
    if os.path.exists(e) and os.path.exists(d):
        decoder1.load_state_dict(torch.load(open(d, 'rb')))
        encoder1.load_state_dict(torch.load(open(e, 'rb')))
Instead of relying on the existence of strings in a path name which could lead to errors this would force only those files you want to open to be opened. Also it gets rid of any possible scoping issues.
We could clean it up a bit more but you get the idea.
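One possible cleanup along those lines, as a sketch: a generator that yields only the encoder/decoder pairs that both exist, following the same assumed naming scheme (the directory and the torch loading calls from the question are left to the caller):

```python
from pathlib import Path

def model_pairs(models_dir, start=201, stop=205):
    """Yield (encoder_path, decoder_path) tuples for each index where
    both model files exist under models_dir."""
    models = Path(models_dir)
    for i in range(start, stop):
        e = models / f'test_encoder_{i}.model'
        d = models / f'test_decoder_{i}.model'
        if e.exists() and d.exists():
            yield e, d
```

The caller then does the `torch.load(...)` / `load_state_dict(...)` calls on each yielded pair, so path construction and model loading stay separate.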
Disclosure: I am new to python. I am trying to load a dictionary with files using the hash value as my key and the file path as my value. I added a counter to ensure the dictionary was properly loaded. After running the code below, I have 78 files (Counter) but only 47 for my dictionary length. Why did it not load all 78 files? Any help is greatly appreciated!
for dirname, dirnames, filenames in os.walk('.'):
    for subdirname in dirnames:
        os.path.join(dirname, subdirname)
    for filename in filenames:
        m1 = hashlib.md5(filename)
        hValue = m1.hexdigest()
        pValue = os.path.join(dirname, filename)
        myDict[(hValue)]=pValue
        counter +=1

print len(myDict), "Dict Length"
print counter, "counter"
You call os.path.join but don't keep the value, so your first nested for loop is useless. I'm not sure what it was meant to do.
You don't need to create an md5 hash of the filename, just use the filename as the key for the dict.
You are probably missing entries because you have files with the same name in different directories. Use os.path.join(dirname, filename) as the key for the dict.
Update: you're hashing the filename. To hash the contents:
m1 = hashlib.md5(open(os.path.join(dirname, filename), 'rb').read())
The dictionary keys need to be unique (or you will just overwrite the value corresponding to the key), and uniqueness isn't guaranteed by your method.
Since you're just hashing the filenames, if your filenames aren't unique, your hashes won't be either. Try hashing the full path.
Disclaimer: this is my first answer in stackoverflow :)
Hi @Jarid F,
I tried writing a complete program so that you can run and see for yourself. Here's the code:
import os
import hashlib

myDict = {}
counter = 0
for dirname, dirnames, filenames in os.walk('.'):
    for filename in filenames:
        # get the complete file location so that it's unique
        filename_with_path = os.path.join(dirname, filename)
        m1 = hashlib.md5(filename_with_path)
        # to hash the content of the file:
        # m1 = hashlib.md5(open(filename_with_path).read())
        hValue = m1.hexdigest()
        myDict[hValue] = filename_with_path
        counter += 1

print len(myDict), "Dict Length"
print counter, "counter"
assert counter == len(myDict)
To add to a few points which @Ned Batchelder has provided:
The line myDict[(hValue)]=pValue is actually the same as myDict[hValue] = pValue but I recommend not adding the () in. That will cause confusion later when you start working with tuples
Hashing the content of the file may not be what you want, since two different files with the same content (like 2 empty files) will yield the same hash value. I guess that defeats the purpose you're trying to achieve here. Instead, if I may suggest, you could take hash(hash(file_location) + hash(file_content) + some_secret_key) to make the hash key better. (Please pardon my caution in adding the secret key as an extra security measure.)
Good luck with your code & welcome to python!