Python Dictionary: Multi-keyword key with single identical value

I am trying to create a dictionary as follows:
key(file_size, checksum) = filename.
I want the two keywords together to make up a key, taking both values into account. These keys are derived from the actual file in question. If the key is matched, I have a duplicate file, not just a duplicate file name.
It would be easy to determine duplicates if there were a single key:filename pair. But not all files will have the same filename, either by path or by actual filename. So far, no Python website has been able to supply an answer, although one did have this format; I haven't found it again.
I have tried various combinations of brackets and commas with little effect.

A simple example that finds duplicate files in subdirs using a dictionary like you suggest:
from pathlib import Path
import hashlib

found = {}
for fn in Path('.').glob('**/*'):
    if fn.is_file():
        with open(fn, 'rb') as f:
            key = (fn.stat().st_size, hashlib.md5(f.read()).hexdigest())
        if key in found:
            print(f'File {fn} is a duplicate of {found[key]}')
        else:
            found[key] = fn
Specifically, key is a tuple of two values: the file size (an integer) and the hex digest of an MD5 hash (a string, i.e. a sort of checksum, as you suggest).
Note that this bit of code reads every file it encounters in its entirety, so I suggest pointing it at a small or moderate file collection. Computing a full checksum or hash for each file isn't a very fast way to find duplicates.
You should consider looking at more basic attributes first (size, hash for the first few bytes, etc.) and only doing a full hash if you have candidates for being duplicates.
Also, the odds of two files having the same MD5 hash but somehow different sizes are astronomically small, so adding the file size to the key here is more or less pointless if you hash the whole file.
(Edit) This is a somewhat better way to achieve the same, a lot faster: it hashes only the first 1024 bytes of each file as a cheap pre-filter, then confirms suspected duplicates with a full filecmp comparison:
from pathlib import Path
import hashlib
import filecmp

found = {}
for fn in Path('.').glob('**/*'):
    if fn.is_file():
        with open(fn, 'rb') as f:
            key = (fn.stat().st_size, hashlib.md5(f.read(1024)).hexdigest())
        if key in found:
            if filecmp.cmp(fn, found[key], shallow=False):
                print(f'File {fn} is a duplicate of {found[key]}')
        else:
            found[key] = fn

Related

Adding hashes and files names to lists and dictionaries

Hey guys, I was able to get the hashes to work individually for the files I input. I was wondering how to add the file name to the list I have set up and the hash to the other dictionary. I feel like it is a pretty simple fix; I am just stuck. We have to use a file path on our machines that has already been established. The folder I am using has 3 or 4 files in it. I am just trying to figure out how to add each of the hashes to the lists. Thanks!
from __future__ import print_function
'''
Week Two Assignment 2 - File Hashing
'''
'''
Complete the script below to do the following:
1) Add your name, date, assignment number to the top of this script
2) Using the os library and the os.walk() method
   a) Create a list of all files
   b) Create an empty dictionary named fileHashes
   c) Iterate through the list of files and
      - calculate the md5 hash of each file
      - create a dictionary entry where:
        key   = md5 hash
        value = filepath
   d) Iterate through the dictionary
      - print out each key, value pair
3) Submit
   NamingConvention: lastNameFirstInitial_Assignment_.ext
   for example: hosmerC_WK1_script.py
                hosmerC_WK1_screenshot.jpg
   A) Screenshot of the results in WingIDE
   B) Your Script
'''
import os
import hashlib
import sys

directory = "."
fileList = []
fileHashes = {}

# Pseudo Constants
SCRIPT_NAME = "Script: ASSIGNMENT NAME"
SCRIPT_AUTHOR = "Author: NAME"
SCRIPT_DATE = "Date: 25 January 2021"

print(SCRIPT_NAME)
print(SCRIPT_AUTHOR)
print(SCRIPT_DATE)

for root, dirs, files in os.walk(directory):
    # Walk the path from top to bottom.
    # For each file obtain the filename
    for fileName in files:
        path = os.path.join(root, fileName)
        fullPath = os.path.abspath(path)
print(files)

''' Determine which version of Python '''
if sys.version_info[0] < 3:
    PYTHON_2 = True
else:
    PYTHON_2 = False

def HashFile(filePath):
    '''
    function takes one input a valid filePath
    returns the hexdigest of the file
    or error
    '''
    try:
        with open(filePath, 'rb') as fileToHash:
            fileContents = fileToHash.read()
            hashObj = hashlib.md5()
            hashObj.update(fileContents)
            digest = hashObj.hexdigest()
            return digest
    except Exception as err:
        return str(err)

print()

if PYTHON_2:
    fileName = raw_input("Enter file to hash: ")
else:
    fileName = input("Enter file to hash: ")

hexDigest = HashFile(fileName)
print(hexDigest)
Well, you've done most of the work in the assignment, so kudos to you for that. You just need to refine a few things and use your functions together.
For item "a) Create a list of all files": in the for fileName in files: loop, add the line fileList.append(fullPath), indented so it's part of the for loop. (See list.append() for more info.)
Btw, the print(files) line you have is outside the loop, so it will only print the files of the last folder that was os.walk'ed. Change that to print(fileList); a sketch of the corrected loop follows.
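A minimal sketch of the corrected collection loop, using the names already in your script:

import os

fileList = []
for root, dirs, files in os.walk("."):
    # For each file, build its absolute path and remember it.
    for fileName in files:
        fullPath = os.path.abspath(os.path.join(root, fileName))
        fileList.append(fullPath)  # a) collect every file
print(fileList)  # runs once, after the walk completes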
For "c) Iterate through the list of files and...":
Iterate through the fileList and call the HashFile() function for each file. The return value is the key for your dictionary and the filepath is the value:
for filepath in fileList:
    filehash = HashFile(filepath)
    fileHashes[filehash] = filepath
The one-line version of that, using a dictionary comprehension:
fileHashes = {HashFile(filepath): filepath for filepath in fileList}
For "d) Iterate through the dictionary": I think you'll be able to manage that on your own. (See dict.items() for more info.)
Other notes:
A.) In the except block for calculating hashes, you return a string of the error. Was that pre-written for you in the assignment, or did you write it? If you wrote it, consider removing it, because that error message then becomes the hash of the file since you're returning it. Possibly just print the error, unless you were instructed otherwise.
B.) The "input filename" part is not needed; you'll be calculating the hash of all the files in the "current" directory where your script is executed (as shown above).
C.) Btw, since MD5 collisions are a known flaw, you may want to store each hash's files as a list of values, so one entry can't silently overwrite another. Maybe it's part of the test, or it can be safely ignored.

Compare one directory at time 1 to same directory at time 2

My goal: compare the content of one directory (including sub-directories and files) at time 1 to the content of the same directory at time 2 (e.g. 6 months later). "Content" means: number and names of the subdirectories, plus number, names, and sizes of files. The main intended outcome is being sure that no files were destroyed or corrupted in the meantime.
I did not find any existing tool, although I was wondering whether https://github.com/njanakiev/folderstats could help.
Would you have any suggestion of modules or anything to start well? If you heard about an existing tool for this, I would also be interested.
Thanks.
Here's some code that should help to get you started. It defines a function that will build a data structure of nested dictionaries corresponding to the contents of the starting root directory and everything below it in the filesystem. Each item dictionary whose 'type' key has the value 'file' will also have a 'stat' key that can contain whatever file metadata you want or need, such as time of creation, last modification time, length in bytes, etc.
You can use it to obtain "before" and "after" snapshots of the directory you're tracking and use them for comparison purposes. I've left the latter (the comparing) out since I'm not sure exactly what you're interested in.
Note that when I actually went about implementing this, I found it simpler to write a recursive function than to use os.walk(), as I suggested in a comment.
The following implements a version of the function and prints out the data structure of nested dictionaries it returns.
import os
from pathlib import PurePath

def path_to_dict(path):
    result = {}
    result['full_path'] = PurePath(path).as_posix()
    if os.path.isdir(path):
        result['type'] = 'dir'
        result['items'] = {filename: path_to_dict(os.path.join(path, filename))
                           for filename in os.listdir(path)}
    else:
        result['type'] = 'file'
        result['stat'] = os.stat(path)  # Preserve any needed metadata.
    return result
root = './folder' # Change as desired.
before = path_to_dict(root)
# Pretty-print data structure created.
from pprint import pprint
pprint(before, sort_dicts=False)
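As a sketch of the comparison step left out above: take a second snapshot later and recursively diff the two nested dictionaries. diff_trees is a hypothetical helper, and it assumes 'stat' holds an os.stat() result:

def diff_trees(before, after, prefix=''):
    # Report items that disappeared, appeared, or changed type or size.
    if before['type'] != after['type']:
        print(prefix, 'changed type')
    elif before['type'] == 'dir':
        before_items, after_items = before['items'], after['items']
        for name in before_items.keys() - after_items.keys():
            print(prefix + '/' + name, 'was removed')
        for name in after_items.keys() - before_items.keys():
            print(prefix + '/' + name, 'was added')
        for name in before_items.keys() & after_items.keys():
            diff_trees(before_items[name], after_items[name], prefix + '/' + name)
    elif before['stat'].st_size != after['stat'].st_size:
        print(prefix, 'changed size')

# Six months later: after = path_to_dict(root), then:
# diff_trees(before, after, root)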

How to sort file names in a particular order using python

Is there a simple way to sort files in a directory in python? The files I have in mind come in an ordering such as
file_01_001
file_01_005
...
file_02_002
file_02_006
...
file_03_003
file_03_007
...
file_04_004
file_04_008
What I want is something like
file_01_001
file_02_002
file_03_003
file_04_004
file_01_005
file_02_006
...
I am currently opening them using glob for the directory as follows:
for filename in glob(path):
    with open(filename, 'rb') as thefile:
        #Do stuff to each file
So, while the program performs the desired tasks, it's giving incorrect data if I do more than one file at a time, due to the ordering of the files. Any ideas?
As mentioned, files in a directory are not inherently sorted in a particular way. Thus, we usually 1) grab the file names 2) sort the file names by desired property 3) process the files in the sorted order.
You can get the file names in the directory as follows. Suppose the directory is "~/home" then
import os
file_list = os.listdir(os.path.expanduser("~/home"))
# (expanduser is needed because os.listdir does not expand "~")
To sort file names:
# Grab the last 4 characters of the file name:
def last_4chars(x):
    return x[-4:]

sorted(file_list, key=last_4chars)
So it looks as follows:
In [4]: sorted(file_list, key=last_4chars)
Out[4]:
['file_01_001',
 'file_02_002',
 'file_03_003',
 'file_04_004',
 'file_01_005',
 'file_02_006',
 'file_03_007',
 'file_04_008']
To read in and process them in sorted order, do:
home_dir = os.path.expanduser("~/home")
file_list = os.listdir(home_dir)
for filename in sorted(file_list, key=last_4chars):
    with open(os.path.join(home_dir, filename), 'rb') as thefile:
        #Do stuff to each file
A much better solution is to use Tcl's "lsort -dictionary":
from tkinter import Tcl
Tcl().call('lsort', '-dict', file_list)
Tcl dictionary sorting will treat numbers correctly, and you will get results similar to the ones a file manager uses for sorting files.
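For example (in my testing, tkinter hands the Tcl list result back as a tuple of strings):

from tkinter import Tcl

file_list = ['file_01_005', 'file_02_002', 'file_01_001']
for filename in Tcl().call('lsort', '-dict', file_list):
    print(filename)  # file_01_001, file_01_005, file_02_002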

Creating Unique Names

I'm creating a corpus from a repository. I download the text from the repository in pdf, convert these to text files, and save them. However, I'm trying to find a good way to name these files.
To get the filenames I do this: (the records generator is an object from the Sickle package that I use to get access to all the records in the repository)
for record in records:
    record_data = []  # data is stored in record_data
    for name, metadata in record.metadata.items():
        for i, value in enumerate(metadata):
            if value:
                record_data.append(value)
    file_path = ''
    fulltext = ''
    for data in record_data:
        if 'Fulltext' in data:
            fulltext = data.replace('Fulltext ', '')
            file_path = '/' + os.path.basename(data) + '.txt'
    print fulltext
    print file_path
The output of the print statements on the last two lines:
https://www.duo.uio.no/bitstream/handle/10852/34910/1/Bertelsen-Master.pdf
/Bertelsen-Master.pdf.txt
https://www.duo.uio.no/bitstream/handle/10852/34912/1/thesis-output.pdf
/thesis-output.pdf.txt
https://www.duo.uio.no/bitstream/handle/10852/9976/1/gartmann.pdf
/gartmann.pdf.txt
https://www.duo.uio.no/bitstream/handle/10852/34174/1/thesis-mariusno.pdf
/thesis-mariusno.pdf.txt
https://www.duo.uio.no/bitstream/handle/10852/9285/1/thesis2.pdf
/thesis2.pdf.txt
https://www.duo.uio.no/bitstream/handle/10852/9360/1/OMyhre.pdf
As you can see I add a .txt to the end of the original filename and want to use that name to save the file. However, a lot of the files have the same filename, like thesis.pdf. One way I thought about solving this was to add a few random numbers to the name, or have a number that gets incremented on each record and use that, like this: thesis.pdf.124.txt (adding 124 to the name).
But that does not look very good, and the repository is huge, so in the end I would have quite large numbers appended to each filename. Any smart suggestions on how I can solve this?
I have seen suggestions like using the time module. I was thinking maybe I can use regex or another technique to extract part of the name (so every name is equally long) and then create a method that adds a string to each file based on the URL of the file, which should be unique.
One thing you could do is to compute a unique hash of the files, e.g. with MD5 or SHA1 (or any other), cf. this article. For a large number of files this can become quite slow, though.
But you don't really seem to touch the files in this piece of code. For generating some unique id, you could use uuid and put this somewhere in the name.
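As a sketch of the uuid idea: uuid5 can derive a stable ID from the file's URL (which, as you note, should be unique), so the same record always gets the same name. The sample URL is taken from your output above:

import os
import uuid

fulltext = 'https://www.duo.uio.no/bitstream/handle/10852/9285/1/thesis2.pdf'
# uuid5 is deterministic: the same URL always yields the same ID.
unique_id = uuid.uuid5(uuid.NAMESPACE_URL, fulltext)
file_path = '/%s.%s.txt' % (os.path.basename(fulltext), unique_id)
print(file_path)  # e.g. /thesis2.pdf.<36-char uuid>.txt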

Incomplete loading of a dictionary

Disclosure: I am new to python. I am trying to load a dictionary with files using the hash value as my key and the file path as my value. I added a counter to ensure the dictionary was properly loaded. After running the code below, I have 78 files (Counter) but only 47 for my dictionary length. Why did it not load all 78 files? Any help is greatly appreciated!
for dirname, dirnames, filenames in os.walk('.'):
    for subdirname in dirnames:
        os.path.join(dirname, subdirname)
    for filename in filenames:
        m1 = hashlib.md5(filename)
        hValue = m1.hexdigest()
        pValue = os.path.join(dirname, filename)
        myDict[(hValue)] = pValue
        counter += 1
print len(myDict), "Dict Length"
print counter, "counter"
You call os.path.join but don't keep the value, so your first nested for loop is useless. I'm not sure what it was meant to do.
You don't need to create an md5 hash of the filename, just use the filename as the key for the dict.
You are probably missing entries because you have files with the same name in different directories. Use os.path.join(dirname, filename) as the key for the dict.
Update: you're hashing the filename. To hash the contents:
m1 = hashlib.md5(open(filename).read())
The dictionary keys need to be unique (or you will just overwrite the value corresponding to the key), and uniqueness isn't guaranteed by your method.
Since you're just hashing the filenames, if your filenames aren't unique, your hashes won't be either. Try hashing the full path.
Disclaimer: this is my first answer in stackoverflow :)
Hi @Jarid F,
I tried writing a complete program so that you can run and see for yourself. Here's the code:
import os
import hashlib

myDict = {}
counter = 0
for dirname, dirnames, filenames in os.walk('.'):
    for filename in filenames:
        # Get the complete file location so that it's unique
        filename_with_path = os.path.join(dirname, filename)
        m1 = hashlib.md5(filename_with_path)
        # To hash the content of the file instead:
        # m1 = hashlib.md5(open(filename_with_path).read())
        hValue = m1.hexdigest()
        myDict[hValue] = filename_with_path
        counter += 1

print len(myDict), "Dict Length"
print counter, "counter"
assert counter == len(myDict)
To add to a few points which @Ned Batchelder has provided:
The line myDict[(hValue)]=pValue is actually the same as myDict[hValue] = pValue, but I recommend not adding the parentheses; they will cause confusion later when you start working with tuples.
Hashing the content of the file may not be what you want, since two different files with the same content (like two empty files) will yield the same hash value. I guess that defeats the purpose you're trying to achieve here. If I may suggest, you could instead take hash(hash(file_location) + hash(file_content) + some_secret_key) to make the hash key better. (Please pardon my caution in adding the secret key as an extra security measure.) A sketch follows below.
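A hedged sketch of that combined key; combined_key is a hypothetical name and some_secret_key is a made-up placeholder:

import hashlib

SOME_SECRET_KEY = 'some_secret_key'  # made-up placeholder

def combined_key(file_location):
    # hash(hash(path) + hash(content) + secret), as suggested above.
    path_hash = hashlib.md5(file_location.encode('utf-8')).hexdigest()
    with open(file_location, 'rb') as f:
        content_hash = hashlib.md5(f.read()).hexdigest()
    combined = path_hash + content_hash + SOME_SECRET_KEY
    return hashlib.md5(combined.encode('utf-8')).hexdigest()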
Good luck with your code & welcome to python!
