Disclosure: I am new to Python. I am trying to load a dictionary with files, using the hash value as my key and the file path as my value. I added a counter to ensure the dictionary was properly loaded. After running the code below, the counter shows 78 files but my dictionary length is only 47. Why did it not load all 78 files? Any help is greatly appreciated!
for dirname, dirnames, filenames in os.walk('.'):
    for subdirname in dirnames:
        os.path.join(dirname, subdirname)
    for filename in filenames:
        m1 = hashlib.md5(filename)
        hValue = m1.hexdigest()
        pValue = os.path.join(dirname, filename)
        myDict[(hValue)] = pValue
        counter += 1

print len(myDict), "Dict Length"
print counter, "counter"
You call os.path.join but don't keep the value, so your first nested for loop is useless. I'm not sure what it was meant to do.
You don't need to create an md5 hash of the filename, just use the filename as the key for the dict.
You are probably missing entries because you have files with the same name in different directories. Use os.path.join(dirname, filename) as the key for the dict.
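For example, a minimal sketch of your loop with the full path as the key (here I've made the bare filename the value, just for illustration; the value can be whatever you need):

import os

myDict = {}
counter = 0
for dirname, dirnames, filenames in os.walk('.'):
    for filename in filenames:
        pValue = os.path.join(dirname, filename)  # unique across directories
        myDict[pValue] = filename                 # full path as the key
        counter += 1
# len(myDict) and counter will now agree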
Update: you're hashing the filename. To hash the contents:
m1 = hashlib.md5(open(os.path.join(dirname, filename), 'rb').read())
The dictionary keys need to be unique (or you will just overwrite the value corresponding to the key), and uniqueness isn't guaranteed by your method.
Since you're just hashing the filenames, if your filenames aren't unique, your hashes won't be either. Try hashing the full path.
Disclaimer: this is my first answer on Stack Overflow :)
Hi @Jarid F,
I tried writing a complete program so that you can run it and see for yourself. Here's the code:
import os
import hashlib

myDict = {}
counter = 0
for dirname, dirnames, filenames in os.walk('.'):
    for filename in filenames:
        # get the complete file location so that it's unique
        filename_with_path = os.path.join(dirname, filename)
        m1 = hashlib.md5(filename_with_path)
        # to hash the content of the file:
        # m1 = hashlib.md5(open(filename_with_path).read())
        hValue = m1.hexdigest()
        myDict[hValue] = filename_with_path
        counter += 1

print len(myDict), "Dict Length"
print counter, "counter"
assert counter == len(myDict)
To add a few points to what @Ned Batchelder has provided:
The line myDict[(hValue)]=pValue is actually the same as myDict[hValue] = pValue, but I recommend not adding the extra parentheses; they will cause confusion later when you start working with tuples.
Hashing the content of the file may not be what you want, since two different files with the same content (like two empty files) will yield the same hash value. I guess that defeats the purpose you're trying to achieve here. Instead, if I may suggest, you could take hash(hash(file_location) + hash(file_content) + some_secret_key) to make the hash key better. (Please pardon my caution in adding the secret key as an extra security measure.)
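A rough sketch of that idea (the secret key is just a made-up placeholder, and the helper name is mine):

import hashlib

SOME_SECRET_KEY = 'replace-me'  # hypothetical extra ingredient, per the suggestion above

def combined_hash(file_location):
    # hash the path and the contents separately, then hash the combination
    path_hash = hashlib.md5(file_location.encode()).hexdigest()
    with open(file_location, 'rb') as f:
        content_hash = hashlib.md5(f.read()).hexdigest()
    return hashlib.md5((path_hash + content_hash + SOME_SECRET_KEY).encode()).hexdigest()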
Good luck with your code & welcome to python!
I am trying to create a dictionary as follows:
key( file_size checksum ) = filename.
I want the two keywords together to make up a key, taking both values into account. These keys are derived from the actual file in question. If a key is matched, I have a duplicate file, not just a duplicate file name.
It would be easy to determine duplicates if there were a single key:filename pair. But not all duplicate files will share a filename, either by path or by actual name. So far, no Python website has been able to supply an answer, although one did have this format; I haven't found it again.
I have tried various combinations of brackets and commas with little effect.
A simple example that finds duplicate files in subdirs using a dictionary like you suggest:
from pathlib import Path
import hashlib

found = {}
for fn in Path('.').glob('**/*'):
    if fn.is_file():
        with open(fn, 'rb') as f:
            key = (fn.stat().st_size, hashlib.md5(f.read()).hexdigest())
        if key in found:
            print(f'File {fn} is a duplicate of {found[key]}')
        else:
            found[key] = fn
Specifically, key is a tuple whose first element is the file size (an integer) and whose second element is the hex digest of an MD5 hash (i.e. a sort of checksum, as you suggest).
Note that this bit of code reads every file it encounters in its entirety, so I suggest pointing it at a small or moderate file collection. Computing a full checksum or hash for each file isn't a very fast way to find duplicates.
You should consider looking at more basic attributes first (size, hash for the first few bytes, etc.) and only doing a full hash if you have candidates for being duplicates.
Also, the odds of two files having the same MD5 hash but somehow different sizes are astronomically small, so adding the file size to the key here is more or less pointless if you hash the whole file.
(Edit) This is a somewhat better way to achieve the same, a lot faster:
from pathlib import Path
import hashlib
import filecmp

found = {}
for fn in Path('.').glob('**/*'):
    if fn.is_file():
        with open(fn, 'rb') as f:
            key = (fn.stat().st_size, hashlib.md5(f.read(1024)).hexdigest())
        if key in found:
            if filecmp.cmp(fn, found[key], shallow=False):
                print(f'File {fn} is a duplicate of {found[key]}')
        else:
            found[key] = fn
I have this:
directory = os.path.join("/home","path")
for root,dirs,files in os.walk(directory):
for file in files:
if file.endswith(".csv"):
f=open(file)
f.close()
and 'files' contains about 300 csv files like:
['graph_2020-08-04_2020-08-17.csv',
'graph_2020-04-11_2020-04-24.csv',
'graph_2021-02-05_2021-02-18.csv',
...]
I basically want to add a name to each of these files, so that I have file1, file2, file3 ... for all of them. So if I call file1, it contains graph_2020-08-04_2020-08-17.csv, for example. This is what I have:
for i in files:
    file[i] = files[i]
But it returns
TypeError: list indices must be integers or slices, not str
What am I doing wrong in my approach?
files is a list with strings in it, not integers. So when you say for i in files, you are telling it that i is each string in turn. Then when you try to do file[i], it gives an error because i is a string, not an int. So instead of saying for i in files, you could say for i in range(len(files)) or something like that.
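For example (a quick sketch; note that plain Python lists use len(), they don't have a .size attribute):

for i in range(len(files)):   # i is now an integer index
    print(i, files[i])        # e.g. 0 graph_2020-08-04_2020-08-17.csv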
You can use the builtin function exec() to execute a string as Python code. An example of how to do this:
file_number = 1  # A counter to keep track of the number of files you have found
directory = os.path.join("/home", "path")
for root, dirs, files in os.walk(directory):
    for file in files:
        if file.endswith(".csv"):
            # repr() quotes the filename so the generated statement is valid Python
            exec("file" + str(file_number) + " = " + repr(file))
            file_number += 1

print(file1)  # Example usage
I think this is what you wanted to achieve.
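As a side note, variables generated with exec() are usually harder to work with than a plain dictionary; a minimal alternative sketch (the csv_files name is made up):

import os

csv_files = {}  # maps "file1", "file2", ... to the filenames
file_number = 1
directory = os.path.join("/home", "path")
for root, dirs, files in os.walk(directory):
    for file in files:
        if file.endswith(".csv"):
            csv_files["file" + str(file_number)] = file
            file_number += 1

print(csv_files["file1"])  # example usage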
Just go with the enumerate(iterable,start) function:
file = [None] * len(files)  # pre-size the list so index assignment works
for i, file_name in enumerate(files):  # optionally pass a start position; it defaults to 0, which is fine here
    file[i] = file_name
Do remember to initialize file as a separate list first (as in the sketch above); in your code file is a loop variable (a string), so you can't assign to it by index and the program will throw an error. Do learn about the enumerate() function, it saves a lot of work... Happy coding! :)
What I understood from your question is that you have a list with n elements and you want to assign each element to n different variables. I can't understand why you would need this, but:
@wolfenstein11x is right: with that for loop you get each element, not an index. However, let's say you fix that. With the code you wrote you would be assigning each element of files into a list named file (assuming it has at least n elements).
If what I understood is right and you really want n different variables, one for each element of the files list (still, I don't know why you would need it), you might take a look at exec.
Edit after reply:
You can read each file in a loop and do whatever you want with the content:
directory = os.path.join("/home","path")
for root,dirs,files in os.walk(directory):
for file in files:
with open(file, "r") as conetnt:
print(conetnt.read())
Use for i in range(len(files)) instead of for i in files.
Hey guys, I was able to get the hashes to work individually for the files I input. I was wondering how to add the file name to the list I have set up and the hash to the other dictionary. I feel like it is a pretty simple fix, I am just stuck. We have to use a file path on our machines that has been established already. The folder I am using has 3 or 4 files in it. I am just trying to figure out how to add each of the hashes to the lists. Thanks!
from __future__ import print_function
'''
Week Two Assignment 2 - File Hashing
'''
'''
Complete the script below to do the following:
1) Add your name, date, assignment number to the top of this script
2) Using the os library and the os.walk() method
   a) Create a list of all files
   b) Create an empty dictionary named fileHashes
   c) Iterate through the list of files and
      - calculate the md5 hash of each file
      - create a dictionary entry where:
        key   = md5 hash
        value = filepath
   d) Iterate through the dictionary
      - print out each key, value pair
3) Submit
   NamingConvention: lastNameFirstInitial_Assignment_.ext
   for example: hosmerC_WK1_script.py
                hosmerC_WK1_screenshot.jpg
   A) Screenshot of the results in WingIDE
   B) Your Script
'''
import os
import hashlib
import sys

directory = "."
fileList = []
fileHashes = {}

# Psuedo Constants
SCRIPT_NAME = "Script: ASSIGNMENT NAME"
SCRIPT_AUTHOR = "Author: NAME"
SCRIPT_DATE = "Date: 25 January 2021"

print(SCRIPT_NAME)
print(SCRIPT_AUTHOR)
print(SCRIPT_DATE)

for root, dirs, files in os.walk(directory):
    # Walk the path from top to bottom.
    # For each file obtain the filename
    for fileName in files:
        path = os.path.join(root, fileName)
        fullPath = os.path.abspath(path)

print(files)

''' Determine which version of Python '''
if sys.version_info[0] < 3:
    PYTHON_2 = True
else:
    PYTHON_2 = False

def HashFile(filePath):
    '''
    function takes one input a valid filePath
    returns the hexdigest of the file
    or error
    '''
    try:
        with open(filePath, 'rb') as fileToHash:
            fileContents = fileToHash.read()
            hashObj = hashlib.md5()
            hashObj.update(fileContents)
            digest = hashObj.hexdigest()
            return digest
    except Exception as err:
        return str(err)

print()

if PYTHON_2:
    fileName = raw_input("Enter file to hash: ")
else:
    fileName = input("Enter file to hash: ")

hexDigest = HashFile(fileName)
print(hexDigest)
Well, you've done most of the work in the assignment, so kudos to you for that. You just need to refine a few things and use your functions together.
For item "a) Create a list of all files": In the for fileName in files: loop, add the line fileList.append(fullPath). (See list.append() for more info.)
Indent it so it's part of the for loop.
Btw, the print(files) line you have is outside the loop so it will only print the files of the last folder that was os.walk'ed.
Change that to print(fileList)
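Put together, that part of your script might look something like this (a sketch based on your existing loop):

for root, dirs, files in os.walk(directory):
    # Walk the path from top to bottom.
    # For each file obtain the filename
    for fileName in files:
        path = os.path.join(root, fileName)
        fullPath = os.path.abspath(path)
        fileList.append(fullPath)

print(fileList)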
For "c) Iterate through the list of files and...":
Iterate through the fileList and call the HashFile() function for each file. The return value is the key for your dictionary and the filepath is the value:
for filepath in fileList:
    filehash = HashFile(filepath)
    fileHashes[filehash] = filepath
The one-line version of that, using a dictionary comprehension:
fileHashes = {HashFile(filepath): filepath for filepath in fileList}
For "d) Iterate through the dictionary": I think you'll be able to manage that on your own. (See dict.items() for more info.)
Other notes:
A.) In the except block for calculating hashes, you return a string of the error - was that pre-written for you in the assignment or did you write it? If you wrote it, consider removing it, because then that error message becomes the "hash" of the file since you're returning it. Possibly just print the error, unless you were instructed otherwise.
B.) The "input filename" part is not needed - you'll be calculating the hash of the all files in the "current" directory where your script is executed (as shown above).
C.) Btw, since finding collisions is a known flaw of md5, you may want to store the files as a list of values under each hash key rather than a single value. Maybe it's part of the test, or it can be safely ignored.
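If you do go that route, a sketch of storing a list of paths per hash (reusing the HashFile() function and fileList from your script) could be:

fileHashes = {}
for filepath in fileList:
    filehash = HashFile(filepath)
    # collect every file that produced this hash instead of overwriting
    fileHashes.setdefault(filehash, []).append(filepath)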
I'm new to python and get stuck by a problem I encountered while studying loops and folder navigation.
The task is simple: loop through a folder and count all '.txt' files.
I believe there may be some modules to tackle this task easily and I would appreciate it if you can share them. But since this is just a random question I encountered while learning python, it would be nice if this can be solved using the tools I just acquired, like for/while loops.
I used for and while clauses to loop through a folder. However, I'm unable to loop through a folder entirely.
Here is the code I used:
import os

count = 0  # set count default
path = 'E:\\'  # set path

while os.path.isdir(path):
    for file in os.listdir(path):  # loop through the folder
        print(file)  # print text to keep track the process
        if file.endswith('.txt'):
            count += 1
            print('+1')
        elif os.path.isdir(os.path.join(path, file)):  # if it is a subfolder
            print(os.path.join(path, file))
            path = os.path.join(path, file)
            print('is dir')
            break
        else:
            path = os.path.join(path, file)
Since the number of files and subfolders in a folder is unknown, I think a while loop is appropriate here. However, my code has many errors or pitfalls I don't know how to fix. for example, if multiple subfolders exist, this code will only loop the first subfolder and ignore the rest.
Your problem is that you quickly end up trying to look at non-existent files. Imagine a directory structure where a non-directory named A (E:\A) is seen first, then a file b (E:\b).
On your first loop, you get A, detect it does not end in .txt and that it is not a directory, so you change path to E:\A.
On your second iteration, you get b (meaning E:\b), but all your tests (aside from the .txt extension test) and operations concatenate it with the new path, so you test relative to E:\A\b, not E:\b.
Similarly, if E:\A is a directory, you break the inner loop immediately, so even if E:\c.txt exists, if it occurs after A in the iteration order, you never even see it.
Directory tree traversal code must involve a stack of some sort, either explicitly (by appending and poping from a list of directories for eventual processing), or implicitly (via recursion, which uses the call stack to achieve the same purpose).
In any event, your specific case should really just be handled with os.walk:
for root, dirs, files in os.walk(path):
    print(root)  # print text to keep track the process
    count += sum(1 for f in files if f.endswith('txt'))
    # This second line matches your existing behavior, but might not be intended
    # Remove it if directories ending in .txt should not be included in the count
    count += sum(1 for d in dirs if d.endswith('txt'))
Just for illustration, the explicit stack approach to your code would be something like:
import os

count = 0  # set count default
paths = ['E:\\']  # Make stack of paths to process

while paths:
    # paths.pop() gets top of directory stack to process
    # os.scandir is easier and more efficient than os.listdir,
    # though it must be closed (but with statement does this for us)
    with os.scandir(paths.pop()) as entries:
        for entry in entries:  # loop through the folder
            print(entry.name)  # print text to keep track the process
            if entry.name.endswith('.txt'):
                count += 1
                print('+1')
            elif entry.is_dir():  # if it is a subfolder
                print(entry.path, 'is dir')
                # Add to paths stack to get to it eventually
                paths.append(entry.path)
You probably want to apply recursion to this problem. In short, you will need a function to handle directories that will call itself when it encounters a sub-directory.
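A minimal recursive sketch of that idea (the count_txt name is just for illustration):

import os

def count_txt(folder):
    # count .txt files in folder and all of its sub-folders
    total = 0
    for name in os.listdir(folder):
        full = os.path.join(folder, name)
        if os.path.isdir(full):
            total += count_txt(full)   # recurse into the sub-directory
        elif name.endswith('.txt'):
            total += 1
    return total

print(count_txt('E:\\'))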
This might be more than you need, but it will let you list all the .txt files within the directory, and you can also add criteria to search within the files as well. Here is the function:
def file_search(root, extension, search, search_type):
    import pandas as pd
    import os

    col1 = []
    col2 = []
    rootdir = root
    for subdir, dirs, files in os.walk(rootdir):
        for file in files:
            if "." + extension in file.lower():
                try:
                    with open(os.path.join(subdir, file)) as f:
                        contents = f.read()
                    if search_type == 'any':
                        if any(word.lower() in contents.lower() for word in search):
                            col1.append(subdir)
                            col2.append(file)
                    elif search_type == 'all':
                        if all(word.lower() in contents.lower() for word in search):
                            col1.append(subdir)
                            col2.append(file)
                except:
                    pass
    df = pd.DataFrame({'Folder': col1,
                       'File': col2})[['Folder', 'File']]
    return df
Here is an example of how to use the function:
search_df = file_search(root = r'E:\\',
search=['foo','bar'], #words to search for
extension = 'txt', #could change this to 'csv' or 'sql' etc.
search_type = 'all') #use any or all
search_df
The analysis of your code has already been covered quite well in @ShadowRanger's answer.
I will try to address this part of your question:
there may be some modules to tackle this task easily
For these kind of tasks, there actually exists the glob module, which implements Unix style pathname pattern expansion.
To count the number of .txt files in a directory and all its subdirectories, one may simply use the following:
import os
from glob import iglob, glob
dirpath = '.' # for example
# getting all matching elements in a list and computing its length
len(glob(os.path.join(dirpath, '**/*.txt'), recursive=True))
# 772
# or iterating through all matching elements and summing 1 each time a new item is found
# (this approach is more memory-efficient)
sum(1 for _ in iglob(os.path.join(dirpath, '**/*.txt'), recursive=True))
# 772
Basically glob.iglob() is the iterator version of glob.glob().
For nested directories it's easier to use functions like os.walk.
Take this for example:
subfiles = []
for dirpath, subdirs, files in os.walk(path):
    for x in files:
        if x.endswith(".txt"):
            subfiles.append(os.path.join(dirpath, x))
and it will return a list of all the .txt files.
Otherwise you'll need to use recursion for tasks like this.
I am using a SHA1 hash to identify identical files within directories and then delete the duplicates. Whenever I run my script, it seems that an appropriate hash is created for most of my files. However, I have several files that are very different that are coming up with the same hash. I know it is virtually impossible for collisions like this to occur, so I am wondering if there is something wrong in my code below:
hashmap = {}
for path, dirs, files in os.walk(maindirectory):
    for filename in files:
        fullname = os.path.join(path, filename)
        with open(fullname) as f:
            d = f.read()
        h = hashlib.sha1(d).hexdigest()
        filelist = hashmap.setdefault(h, [])
        filelist.append(fullname)

# delete records in dictionary that have only 1 item (meaning no duplicate)
for k, v in hashmap.items():
    if len(v) == 1:
        del hashmap[k]
When I return my dictionary, hashmap, there are a few predictable duplicates and then there is one hash with about 40 unique files. Any suggestions, tips, or explanations would be most appreciated!