How to loop through folders thoroughly in Python?

I'm new to Python and got stuck on a problem I encountered while studying loops and folder navigation.
The task is simple: loop through a folder and count all '.txt' files.
I believe there may be some modules to tackle this task easily and I would appreciate it if you can share them. But since this is just a random question I encountered while learning python, it would be nice if this can be solved using the tools I just acquired, like for/while loops.
I used for and while clauses to loop through a folder. However, I'm unable to loop through a folder entirely.
Here is the code I used:
import os
count=0 # set count default
path = 'E:\\' # set path
while os.path.isdir(path):
    for file in os.listdir(path): # loop through the folder
        print(file) # print text to keep track the process
        if file.endswith('.txt'):
            count+=1
            print('+1') #
        elif os.path.isdir(os.path.join(path,file)): #if it is a subfolder
            print(os.path.join(path,file))
            path=os.path.join(path,file)
            print('is dir')
            break
        else:
            path=os.path.join(path,file)
Since the number of files and subfolders in a folder is unknown, I think a while loop is appropriate here. However, my code has many errors or pitfalls that I don't know how to fix. For example, if multiple subfolders exist, this code will only loop through the first subfolder and ignore the rest.

Your problem is that you quickly end up trying to look at non-existent files. Imagine a directory structure where a non-directory named A (E:\A) is seen first, then a file b (E:\b).
On your first loop, you get A, detect that it does not end in .txt and that it is not a directory, so the else branch changes path to E:\A.
On your second iteration, you get b (meaning E:\b), but all your tests (aside from the .txt extension test) and operations concatenate it with the new path, so you test relative to E:\A\b, not E:\b.
Similarly, if E:\A is a directory, you break the inner loop immediately, so even if E:\c.txt exists, if it occurs after A in the iteration order, you never even see it.
Directory tree traversal code must involve a stack of some sort, either explicitly (by appending and popping from a list of directories for eventual processing), or implicitly (via recursion, which uses the call stack to achieve the same purpose).
In any event, your specific case should really just be handled with os.walk:
for root, dirs, files in os.walk(path):
    print(root) # print text to keep track the process
    count += sum(1 for f in files if f.endswith('.txt'))
    # This second line matches your existing behavior, but might not be intended
    # Remove it if directories ending in .txt should not be included in the count
    count += sum(1 for d in dirs if d.endswith('.txt'))
Just for illustration, the explicit stack approach to your code would be something like:
import os
count = 0 # set count default
paths = ['E:\\'] # Make stack of paths to process
while paths:
    # paths.pop() gets top of directory stack to process
    # os.scandir is easier and more efficient than os.listdir,
    # though it must be closed (but with statement does this for us)
    with os.scandir(paths.pop()) as entries:
        for entry in entries: # loop through the folder
            print(entry.name) # print text to keep track the process
            if entry.name.endswith('.txt'):
                count += 1
                print('+1')
            elif entry.is_dir(): #if it is a subfolder
                print(entry.path, 'is dir')
                # Add to paths stack to get to it eventually
                paths.append(entry.path)

You probably want to apply recursion to this problem. In short, you will need a function to handle directories that will call itself when it encounters a sub-directory.
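A minimal sketch of that recursive idea, using only os.listdir, a for loop, and a function that calls itself. The name count_txt_files is hypothetical, not something from the question. Unlike the question's code, this sketch counts only regular files, so a directory whose name ends in .txt is descended into rather than counted.

```python
import os

def count_txt_files(folder):
    """Count files ending in '.txt' in folder and in all of its subfolders."""
    count = 0
    for name in os.listdir(folder):
        full_path = os.path.join(folder, name)
        if os.path.isdir(full_path):
            # A subfolder: let the same function handle it and add its result.
            count += count_txt_files(full_path)
        elif name.endswith('.txt'):
            count += 1
    return count
```

Calling count_txt_files('E:\\') would walk the whole drive. Note that os.path.isdir follows symbolic links, so a link that points back to a parent folder could make this sketch recurse without end; os.walk does not follow such links by default.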

This might be more than you need, but it will allow you to list all the files within the directory that are .txt files but you can also add criteria to the search within the files as well. Here is the function:
def file_search(root,extension,search,search_type):
    import pandas as pd
    import os
    col1 = []
    col2 = []
    rootdir = root
    for subdir, dirs, files in os.walk(rootdir):
        for file in files:
            # match on the actual extension, not on any occurrence of ".txt" in the name
            if file.lower().endswith("." + extension.lower()):
                try:
                    with open(os.path.join(subdir, file)) as f:
                        contents = f.read()
                    if search_type == 'any':
                        if any(word.lower() in contents.lower() for word in search):
                            col1.append(subdir)
                            col2.append(file)
                    elif search_type == 'all':
                        if all(word.lower() in contents.lower() for word in search):
                            col1.append(subdir)
                            col2.append(file)
                except (OSError, UnicodeDecodeError):
                    # skip files that cannot be opened or decoded as text
                    pass
    df = pd.DataFrame({'Folder':col1,
                       'File':col2})[['Folder','File']]
    return df
Here is an example of how to use the function:
search_df = file_search(root = 'E:\\',
                        search=['foo','bar'], #words to search for
                        extension = 'txt', #could change this to 'csv' or 'sql' etc.
                        search_type = 'all') #use any or all
search_df

The analysis of your code has already been addressed quite well by @ShadowRanger's answer.
I will try to address this part of your question:
there may be some modules to tackle this task easily
For this kind of task, there actually exists the glob module, which implements Unix-style pathname pattern expansion.
To count the number of .txt files in a directory and all its subdirectories, one may simply use the following:
import os
from glob import iglob, glob
dirpath = '.' # for example
# getting all matching elements in a list and computing its length
len(glob(os.path.join(dirpath, '**/*.txt'), recursive=True))
# 772
# or iterating through all matching elements and summing 1 each time a new item is found
# (this approach is more memory-efficient)
sum(1 for _ in iglob(os.path.join(dirpath, '**/*.txt'), recursive=True))
# 772
Basically glob.iglob() is the iterator version of glob.glob().

For nested directories, it's easier to use a function like os.walk. Take this for example:
subfiles = []
for dirpath, subdirs, files in os.walk(path):
    for x in files:
        if x.endswith(".txt"):
            subfiles.append(os.path.join(dirpath, x))
After the loop, subfiles is a list of all .txt files. Otherwise, you'll need to use recursion for a task like this.

Related

Simple Python program that checks in each subfolder how many files there are and which extensions the file contains

I am writing a simple python script that looks in the subfolders of the selected subfolder for files and summarizes which extensions are used and how many.
I am not really familiar with os.walk and I am really stuck with the "for file in files" section
`
for file in files:
total_file_count += 1
# Get the file extension
extension = file.split(".")[-1]
# If the extension is not in the dictionary, add it
if extension not in file_counts[subfolder]:
file_counts[subfolder][extension] = 1
# If the extension is already in the dictionary, increase the count by 1
else:
file_counts[subfolder][extension] += 1
`
I thought a for loop was the best option for summarizing the files and extensions, but it only handles the last subfolder and outputs the files that are in that last folder.
Does anybody have a fix or a different approach for it?
FULL CODE:
`
import os
# Set file path using / {End with /}
root_path="C:/Users/me/Documents/"
# Initialize variables to keep track of file counts
total_file_count=0
file_counts = {}
# Iterate through all subfolders and files using os.walk
for root, dirs, files in os.walk(root_path):
# Get currenty subfolder name
subfolder = root.split("/")[-1]
print(subfolder)
# Initialize a count for each file type
file_counts[subfolder] = {}
# Iterate through all files in the subfolder
for file in files:
total_file_count += 1
# Get the file extension
extension = file.split(".")[-1]
# If the extension is not in the dictionary, add it
if extension not in file_counts[subfolder]:
file_counts[subfolder][extension] = 1
# If the extension is already in the dictionary, increase the count by 1
else:
file_counts[subfolder][extension] += 1
# Print total file count
print(f"There are a total of {total_file_count} files.")
# Print the file counts for each subfolder
for subfolder, counts in file_counts.items():
print(f"In the {subfolder} subfolder:")
for extension, count in counts.items():
print(f"There are {count} .{extension} files")
`
Thank you in advance :)
If I understand correctly, you want to count the extensions in ALL subfolders of the given folder, but are only getting one folder. If that is indeed the problem, then the issue is this loop
for root, dirs, files in os.walk(root_path):
# Get currenty subfolder name
subfolder = root.split("/")[-1]
print(subfolder)
You are iterating through os.walk, but you keep overwriting the subfolder variable. So while it will print out every subfolder, it will only remember the LAST subfolder it encounters, leading to the code reporting only one subfolder.
Solution 1: Fix the loop
If you want to stick with os.walk, you just need to fix the loop. First things first - define files as a real variable. Don't rely on using the temporary variable from the loop. You actually already have this: file_counts!
Then, you need some way to save the files. I see that you want to split this up by subfolder, so what we can do is use file_counts to map each subfolder to a list of files (you are trying to do this, but are fundamentally misunderstanding some Python code; see my note below about this).
So now, we have a dictionary mapping each subfolder to a list of files! We would just need to iterate through this and count the extensions. The final code looks something like this:
total_file_count = 0
file_counts = {}
extension_counts = {}
# Iterate through all subfolders and files using os.walk
for root, dirs, files in os.walk(root_path):
    subfolder = root.split("/")[-1]
    file_counts[subfolder] = files
    extension_counts[subfolder] = {}
# Iterate through all subfolders, and then through all files
for subfolder in file_counts:
    for file in file_counts[subfolder]:
        total_file_count += 1
        extension = file.split(".")[-1]
        if extension not in extension_counts[subfolder]:
            extension_counts[subfolder][extension] = 1
        else:
            extension_counts[subfolder][extension] += 1
Solution 2: Use glob
Instead of os.walk, you can use the glob module, which will return a list of all files and directories matching a pattern wherever you search. It is a powerful tool that uses wildcard matching, and you can read about it in the glob module documentation.
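As a rough sketch of how the glob route could look (the helper name count_extensions is made up for illustration, and this version counts over the whole tree rather than per subfolder):

```python
import glob
import os
from collections import Counter

def count_extensions(root_path):
    """Map each extension (without the dot) to the number of files under root_path."""
    counts = Counter()
    # '**' together with recursive=True matches root_path and every folder below it.
    pattern = os.path.join(glob.escape(root_path), '**', '*')
    for path in glob.iglob(pattern, recursive=True):
        if os.path.isfile(path):
            # splitext returns ('name', '.ext'); drop the leading dot.
            counts[os.path.splitext(path)[1].lstrip('.')] += 1
    return counts
```

One caveat: glob's wildcards skip names that start with a dot, so hidden files and folders are not counted here, whereas os.walk would include them. Files without an extension end up under the empty-string key.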
Note
In your code, you write
# Initialize a count for each file type
file_counts[subfolder] = {}
Which feels like a MATLAB coding scheme. First, subfolder is a variable, and not a vector, so this would only initialize a count for a single file type (and even if it was a list, you get an unhashable type error). Second, this seems to stem from the idea that continuously assigning a variable in a loop builds a list instead of overwriting, which is not true. If you want to do that, you need to initialize an empty list, and use .append().
Note 2: Electric Boogaloo
There are two big ways to make this code good, and here are hints
Look into default dictionaries. They will make your code less redundant
Do you REALLY need to save the numbers and THEN count? What if you counted directly?
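Putting both hints together might look like the sketch below. The function name count_extensions_by_folder is hypothetical, and keying the result by the full folder path (instead of the last path component) is a choice made here so that two subfolders with the same name do not overwrite each other.

```python
import os
from collections import Counter, defaultdict

def count_extensions_by_folder(root_path):
    """Return {folder_path: Counter({extension: count})} for folders that contain files."""
    counts = defaultdict(Counter)
    for root, dirs, files in os.walk(root_path):
        for name in files:
            # Count directly while walking; no intermediate list of files is kept.
            extension = os.path.splitext(name)[1]
            counts[root][extension] += 1
    return counts
```

The defaultdict creates a fresh Counter the first time a folder is seen, and the Counter starts every extension at zero, so neither "is it already in the dictionary?" check is needed. Using os.path.splitext also avoids the hard-coded "/" split, which does not match the backslash separators that os.walk produces on Windows.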
Rather than using os.walk you could use the rglob and glob methods of Path object. E.g.,
from pathlib import Path
root_path = "C:/Users/me/Documents/"
# get a list of all the directories within root (and recursively within those subdirectories)
dirs = [d for d in Path(root_path).rglob("*") if d.is_dir()]
dirs.append(Path(root_path)) # append root directory
# loop through all directories
for curdir in dirs:
    # get suffixes (i.e., extensions) of all files in the directory
    suffixes = set(s.suffix for s in curdir.glob("*") if s.is_file())
    print(f"In the {curdir}:")
    # loop through the suffixes
    for suffix in suffixes:
        # get all the files in the current directory with that extension
        suffiles = [f for f in curdir.glob(f"*{suffix}") if f.is_file() and f.suffix == suffix]
        print(f"There are {len(suffiles)} {suffix} files")

Given a filename, go to the next file in a directory

I am writing a method that takes a filename and a path to a directory and returns the next available filename in the directory or None if there are no files with names that would sort after the file.
There are plenty of questions about how to list all the files in a directory or iterate over them, but I am not sure if the best solution to finding a single next filename is to use the list that one of the previous answers generated and then find the location of the current file in the list and choose the next element (or None if we're already on the last one).
EDIT: here's my current file-picking code. It's reused from a different part of the project, where it is used to pick a random image from a potentially nested series of directories.
# picks a file from a directory
# if the file is also a directory, pick a file from the new directory
# this might choke up if it encounters a directory only containing invalid files
def pickNestedFile(directory, bad_files):
file=None
while file is None or file in bad_files:
file=random.choice(os.listdir(directory))
#file=directory+file # use the full path name
print "Trying "+file
if os.path.isdir(os.path.join(directory, file))==True:
print "It's a directory!"
return pickNestedFile(directory+"/"+file, bad_files)
else:
return directory+"/"+file
The program I am using this in now is to take a folder of chatlogs, pick a random log, starting position, and length. These will then be processed into a MOTD-like series of (typically) short log snippets. What I need the next-file picking ability for is when the length is unusually long or the starting line is at the end of the file, so that it continues at the top of the next file (a.k.a. wrap around midnight).
I am open to the idea of using a different method to choose the file, since the above method does not discretely give a separate filename and directory and I'd have to go use a listdir and match to get an index anyway.
You should probably consider rewriting your program to not have to use this. But this would be how you could do it:
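import os
def nextFile(filename, directory):
    # os.listdir returns names in arbitrary order, so sort them first
    fileList = sorted(os.listdir(directory))
    # index() raises ValueError if filename is not in the directory
    nextIndex = fileList.index(filename) + 1
    if nextIndex == len(fileList):
        return None
    return fileList[nextIndex]
print(nextFile("mail","test"))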
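If the current file may have been deleted or renamed in the meantime, a binary search over the sorted names avoids the lookup failure. This is only a sketch; next_file_sorted is a made-up name, and it treats subdirectories like any other entry.

```python
import bisect
import os

def next_file_sorted(filename, directory):
    """Return the name that sorts immediately after filename, or None."""
    names = sorted(os.listdir(directory))
    # bisect_right finds the first position whose name sorts after filename,
    # whether or not filename itself is still present in the directory.
    index = bisect.bisect_right(names, filename)
    return names[index] if index < len(names) else None
```

For chat logs named by date this gives the "wrap around midnight" behavior only if the names sort chronologically, which is an assumption about the naming scheme.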
I tweaked the accepted answer to allow new files to be added to the directory on the fly and for it to work if a file is deleted or changed or doesn't exist. There are better ways to work with filenames/paths, but the example below keeps it simple. Maybe it's helpful:
import os
def next_file_in_dir(directory, current_file=None):
    file_list = os.listdir(directory)
    next_index = 0
    if current_file in file_list:
        next_index = file_list.index(current_file) + 1
    if next_index >= len(file_list):
        next_index = 0
    return file_list[next_index]
file_name = None
directory = "videos"
user_advanced_to_next = True
while user_advanced_to_next:
    file_name = next_file_in_dir(directory=directory, current_file=file_name )
    user_advanced_to_next = play_video("{}/{}".format(directory, file_name))
finish_and_clean_up()

Python: how to discern if a path is within another path?

I need to know if pathA is a subset of, or is contained within pathB.
I'm making a little script that will walk some old volumes and find duplicate files. My general approach (and even if it's a bad one for its inefficiency, it's just for me and it works, so I'm OK with the brute-forceness of it) has been:
Map all the files to a log
Create a hash for all the files in the log
Sort the hash list for duplicates
Move the duplicates somewhere for inspection prior to deletion
I want to be able to exclude certain directories, though (ie. System files). This is what I've written:
#self.search_dir = top level directory to be searched for duplicates
#self.mfl = master_file_list, being built by this func, a list of all files in search_dir
#self.no_crawl_list = list of files and directories to be excluded from the search
def build_master_file_list(self):
for root, directories, files in os.walk(self.search_dir):
files = [f for f in files if not f[0] == '.']
directories[:] = [d for d in directories if not d[0] == '.']
for filename in files:
filepath = os.path.join(root, filename)
if [root, filepath] in self.no_crawl_list:
pass
else:
self.mfl.write(filepath + "\n")
self.mfl.close()
But I'm pretty sure this isn't going to do what I'd intended. My goal is to have all subdirectories of anything in self.no_crawl_list excluded as well, such that:
if
/path/to/excluded_dir is added to self.no_crawl_list
then paths like /path/to/excluded_dir/sub_dir/implicitly_excluded_file.txt
will be skipped as well. I think my code is currently being entirely literal about what to skip. Short of exploding the path parts and comparing them to every possible combination in self.no_crawl_list, however, I don't know how to do this. 'Lil help? :)
As per the assistance of Lukas Graf in the comments above, I was able to build this and it works like a charm:
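def is_subpath(self, path, of_paths):
    if isinstance(of_paths, basestring): of_paths = [of_paths]
    abs_of_paths = [os.path.abspath(of_path) for of_path in of_paths]
    return any(os.path.abspath(path).startswith(subpath) for subpath in abs_of_paths)
Also, this currently doesn't account for symlinks and assumes a UNIX filesystem; see the comments on the original question for advice on extending this. Note that basestring exists only in Python 2 (use str in Python 3), and that the plain startswith test also matches sibling paths that merely share a prefix, such as /path/to/excluded_dir2 for /path/to/excluded_dir.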
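A possible Python 3 variant compares whole path components with os.path.commonpath instead of raw string prefixes. This is a sketch rather than the author's method: it is written as a plain function (no self), and like the original it does not resolve symlinks.

```python
import os

def is_subpath(path, of_paths):
    """Return True if path equals, or lies inside, any directory in of_paths."""
    if isinstance(of_paths, str):
        of_paths = [of_paths]
    path = os.path.abspath(path)
    for parent in of_paths:
        parent = os.path.abspath(parent)
        try:
            # commonpath works on path components, so 'dir2' never matches 'dir'.
            if os.path.commonpath([path, parent]) == parent:
                return True
        except ValueError:
            # Raised on Windows when the two paths are on different drives.
            continue
    return False
```

Inside the os.walk loop, the same check could be used to filter the directories list in place, so that excluded folders are never descended into at all.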

Need 'if os.havefiles' like function for subfolder search in python

I need to os.walk from my parent path (tutu) through all of its subfolders. The deepest subfolders contain the files that I need to process with my code. In every deepest folder that has files, the file layout is the same: one *.adf.txt file, one *.idf.txt file, one *.sdrf.txt file, and one or more *.dat files.
My problem is that I don't know how to use the os module to iterate sequentially from my parent folder through all subfolders. I need a function that, for the current subfolder in os.walk, if that subfolder is empty, continue to the sub-subfolder inside that subfolder, if it exists. If it exists, then verify whether that file layout is present (this is no problem), and if it is, apply the code (no problem either). If not, and if that folder has no more subfolders, return to the parent folder and os.walk to the next subfolder, and so on for all subfolders of my parent folder (tutu). To summarize, I need something like the function below (written in a Python/imaginary code hybrid):
for all folders in tutu:
if os.havefiles in os.walk(current_path):#the 'havefiles' doesn't exist, i think...
for filename in os.walk(current_path):
if 'adf' in filename:
etc...
#my code
elif:
while true:
go deep
else:
os.chdir(parent_folder)
Do you think it is best to define a function and call it in my code to do the job?
This is the code that I've tried to use, without success, of course:
import csv
import os
import fnmatch
abs_path=os.path.abspath('.')
for dirname, subdirs, filenames in os.walk('.'):
# print path to all subdirectories first.
for subdirname in subdirs:
print os.path.join(dirname, subdirname), 'os.path.join(dirname, subdirname)'
current_path= os.path.join(dirname, subdirname)
os.chdir(current_path)
for filename in os.walk(current_path):
print filename, 'f in os.walk'
if os.path.isdir(filename)==True:
break
elif os.path.isfile(filename)==True:
print filename, 'file'
#code here
Thanks in advance...
I need a function that, for the current subfolder in os.walk, if that subfolder is empty, continue to the sub-subfolder inside that subfolder, if it exists.
This doesn't make any sense. If a folder is empty, it doesn't have any subfolders.
Maybe you mean that if it has no regular files, then recurse into its subfolders, but if it has any, don't recurse, and instead check the layout?
To do that, all you need is something like this:
import collections
import os
for dirname, subdirs, filenames in os.walk('.'):
    if filenames:
        # can't use os.path.splitext, because that will give us .txt instead of .adf.txt
        # partition drops the first dot, so put it back to get keys like '.adf.txt'
        extensions = collections.Counter('.' + filename.partition('.')[-1]
                                         for filename in filenames)
        if (extensions['.adf.txt'] == 1 and extensions['.idf.txt'] == 1 and
                extensions['.sdrf.txt'] == 1 and extensions['.dat'] >= 1 and
                len(extensions) == 4):
            # got a match, do what you want
            pass
        # Whether this is a match or not, prune the walk.
        del subdirs[:]
I'm assuming here that you only want to find directories that have exactly the specified files, and no others. To remove that last restriction, just remove the len(extensions) == 4 part.
There's no need to explicitly iterate over subdirs or anything, or recursively call os.walk from inside os.walk. The whole point of walk is that it's already recursively visiting every subdirectory it finds, except when you explicitly tell it not to (by pruning the list it gives you).
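To see the pruning on its own, here is a small sketch that reports the first folder on each branch that contains any files and does not look below it. The generator name first_folders_with_files is invented for this illustration.

```python
import os

def first_folders_with_files(top):
    """Yield (folder, sorted filenames) for the first folder on each branch that has files."""
    for dirname, subdirs, filenames in os.walk(top):
        if filenames:
            # Emptying the list in place tells os.walk not to descend any further here.
            subdirs[:] = []
            yield dirname, sorted(filenames)
```

The in-place assignment matters: rebinding the name (subdirs = []) would leave the list that os.walk holds untouched, and the walk would continue into the subfolders. Pruning this way only works with the default top-down walk.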
os.walk will automatically "dig down" recursively, so you don't need to recurse the tree yourself.
I think this should be the basic form of your code:
import csv
import os
import fnmatch
directoriesToMatch = [list here...]
filenamesToMatch = [list here...]
abs_path=os.path.abspath('.')
for dirname, subdirs, filenames in os.walk('.'):
    if len(set(directoriesToMatch).difference(subdirs))==0: # all dirs are there
        if len(set(filenamesToMatch).difference(filenames))==0: # all files are there
            if <any other filename/directory checking code>:
                # processing code here ...
And according to the python documentation, if you for whatever reason don't want to continue recursing, just delete entries from subdirs:
http://docs.python.org/2/library/os.html
If you instead want to check that there are NO sub-directories where you find your files to process, you could also change the dirs check to:
if len(subdirs)==0: # check that this directory has no subdirectories
I'm not sure I quite understand the question, so I hope this helps!
Edit:
Ok, so if you need to check there are no files instead, just use:
if len(filenames)==0:
But as I stated above, it would probably be better to just look FOR specific files instead of checking for empty directories.

Moving specific files in subdirectories into a directory - python

I'm rather new to Python, but I have been attempting to learn the basics.
Anyway, I have several zip files that, once extracted (a painfully slow process, by the way), produce several hundred subdirectories with 2-3 files in each. What I want to do now is take all the files ending with 'dem.tif' and place them in a separate directory (move, not copy).
I may have jumped into the deep end here, but the code I've written runs without error, so it must not be finding the files (which do exist!), as it gives me the else statement. Here is the code I've created:
import os
src = 'O:\DATA\ASTER GDEM\Original\North America\UTM Zone 14\USA\Extracted' # input
dst = 'O:\DATA\ASTER GDEM\Original\North America\UTM Zone 14\USA\Analyses' # desired location
def move():
for (dirpath, dirs, files) in os.walk(src):
if files.endswith('dem.tif'):
shutil.move(os.path.join(src,files),dst)
print ('Moving ', + files, + ' to ', + dst)
else:
print 'No Such File Exists'
First, welcome to the community, and python! You might want to change your user name, especially if you frequent here. :)
I suggest the following (stolen from Mr. Beazley):
# genfind.py
#
# A function that generates files that match a given filename pattern
import os
import shutil
import fnmatch
def gen_find(filepat,top):
    for path, dirlist, filelist in os.walk(top):
        for name in fnmatch.filter(filelist,filepat):
            yield os.path.join(path,name)
# Example use
if __name__ == '__main__':
    # raw strings keep the backslashes literal (a plain '\U' is a syntax error in Python 3)
    src = r'O:\DATA\ASTER GDEM\Original\North America\UTM Zone 14\USA\Extracted' # input
    dst = r'O:\DATA\ASTER GDEM\Original\North America\UTM Zone 14\USA\Analyses' # desired location
    filesToMove = gen_find("*dem.tif",src)
    for name in filesToMove:
        shutil.move(name, dst)
I think you've mixed up the way you should be using os.walk().
for dirpath, dirs, files in os.walk(src):
    print(dirpath)
    print(dirs)
    print(files)
    for filename in files:
        if filename.endswith('dem.tif'):
            shutil.move(...)
        else:
            ...
Update: the questioner has clarified below that he / she is actually calling the move function, which was the first point in my answer.
There are a few other things to consider:
You've got the order of elements returned in each tuple from os.walk wrong, I'm afraid - check the documentation for that function.
Assuming you've fixed that, also bear in mind that you need to iterate over files, and you need to os.path.join each of those to the directory that os.walk yields for them, rather than to src.
The above would be obvious, hopefully, if you print out the values returned by os.walk and comment out the rest of the code in that loop.
With code that does potentially destructive operations like moving files, I would always first try some code that just prints out the parameters to shutil.move until you're sure that it's right.
Any particular reason you need to do it in Python? Would a simple shell command not be simpler? If you're on a Unix-like system, or have access to Cygwin on Windows:
find src_dir -name "*dem.tif" -exec mv {} dst_dir \;
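The "print first, move later" advice could be sketched in Python roughly as follows. The function name move_matching and its dry_run flag are made up for this illustration, and the sketch assumes dst is an existing directory.

```python
import os
import shutil

def move_matching(src, dst, suffix='dem.tif', dry_run=True):
    """Move files under src whose names end with suffix into the directory dst.

    With dry_run=True nothing is moved; the planned moves are printed and returned.
    """
    planned = []
    for dirpath, dirs, files in os.walk(src):
        for filename in files:
            if filename.endswith(suffix):
                # Join with dirpath (where the file actually is), not with src.
                planned.append((os.path.join(dirpath, filename), dst))
    for source, target in planned:
        if dry_run:
            print('Would move', source, 'to', target)
        else:
            shutil.move(source, target)
    return planned
```

Run it once with the default dry_run=True, check the printed paths, and only then call it again with dry_run=False. If two subdirectories contain files with the same name, shutil.move raises an error on the second one because the destination already exists, so that case would need its own handling.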
