I am new to python. I have the following piece of code which works well by retrieving selected directories into a list for me. But because there are quite a lot of sub-directories and files, the code is rather slow, compared to the Perl code which I have upgraded it from.
import re
import os

foundarr = []
allpaths = ["X:\\Storage", "Y:\\Storage"]
for path in allpaths:
    for root, dirs, files in os.walk(path):
        for dir in dirs:
            if re.match(r"[DILMPY]\d{8}", dir):
                foundarr.append(os.path.join(root, dir))
                break
My question: Is there a way to recurse through ONLY a selected level of directories using os.walk? Or to somehow prune the ones I do not want to recurse through? I added the break in the for loop assuming it would stop after finding my selected dir and move on, but I don't think this helps, as it still has to go through thousands of sub-directories and files.
In the Perl code a simple $File::Find::prune = 1 if /[DILMPY]\d{8}$/; prevents File::Find from descending into the rest of the sub-directories and files.
If the depth is fixed, using glob is a good idea. As per this SO post, you can set the depth of traversal using glob.
import glob
import os.path
depth2 = glob.glob('*/*')
depth2 = [f for f in depth2 if os.path.isdir(f)]  # keep directories only
This will list all subdirectories with a depth of 2.
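For the pruning part of the question, os.walk can do something close to $File::Find::prune on its own: in the default top-down mode, removing names from dirs in place stops the walk from descending into them. The following is a minimal sketch, not a drop-in replacement. find_matching_dirs is a hypothetical helper name, and using re.fullmatch assumes the wanted names consist of exactly one letter followed by eight digits.

```python
import os
import re

# Assumption: a wanted directory name is exactly one of DILMPY plus eight digits.
PATTERN = re.compile(r"[DILMPY]\d{8}")

def find_matching_dirs(top):
    """Return matching directories under top without descending into them."""
    found = []
    for root, dirs, files in os.walk(top):  # topdown=True is the default
        matched = [d for d in dirs if PATTERN.fullmatch(d)]
        found.extend(os.path.join(root, d) for d in matched)
        # Slice assignment edits the list os.walk itself holds, so the
        # matched directories are never entered.
        dirs[:] = [d for d in dirs if d not in matched]
    return found
```

Calling find_matching_dirs once per entry in allpaths and concatenating the results would give the same kind of list as foundarr, with the difference that nothing below a matched directory is ever listed.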
Related
I'm new to Python and got stuck on a problem I encountered while studying loops and folder navigation.
The task is simple: loop through a folder and count all '.txt' files.
I believe there may be some modules to tackle this task easily and I would appreciate it if you can share them. But since this is just a random question I encountered while learning python, it would be nice if this can be solved using the tools I just acquired, like for/while loops.
I used for and while clauses to loop through a folder. However, I'm unable to loop through a folder entirely.
Here is the code I used:
import os
count=0 # set count default
path = 'E:\\' # set path
while os.path.isdir(path):
    for file in os.listdir(path): # loop through the folder
        print(file) # print text to keep track the process
        if file.endswith('.txt'):
            count+=1
            print('+1') #
        elif os.path.isdir(os.path.join(path,file)): #if it is a subfolder
            print(os.path.join(path,file))
            path=os.path.join(path,file)
            print('is dir')
            break
        else:
            path=os.path.join(path,file)
Since the number of files and subfolders in a folder is unknown, I think a while loop is appropriate here. However, my code has many errors or pitfalls I don't know how to fix. For example, if multiple subfolders exist, this code will only loop through the first subfolder and ignore the rest.
Your problem is that you quickly end up trying to look at non-existent files. Imagine a directory structure where a non-directory named A (E:\A) is seen first, then a file b (E:\b).
On your first loop, you get A, detect it does not end in .txt, and that it is not a directory, so you change path to E:\A.
On your second iteration, you get b (meaning E:\b), but all your tests (aside from the .txt extension test) and operations concatenate it with the new path, so you test relative to E:\A\b, not E:\b.
Similarly, if E:\A is a directory, you break the inner loop immediately, so even if E:\c.txt exists, if it occurs after A in the iteration order, you never even see it.
Directory tree traversal code must involve a stack of some sort, either explicitly (by appending to and popping from a list of directories for eventual processing), or implicitly (via recursion, which uses the call stack to achieve the same purpose).
In any event, your specific case should really just be handled with os.walk:
for root, dirs, files in os.walk(path):
    print(root)  # print text to keep track of the process
    count += sum(1 for f in files if f.endswith('.txt'))
    # This second line matches your existing behavior, but might not be intended
    # Remove it if directories ending in .txt should not be included in the count
    count += sum(1 for d in dirs if d.endswith('.txt'))
Just for illustration, the explicit stack approach to your code would be something like:
import os
count = 0  # set count default
paths = ['E:\\']  # Make stack of paths to process
while paths:
    # paths.pop() gets top of directory stack to process
    # os.scandir is easier and more efficient than os.listdir,
    # though it must be closed (but with statement does this for us)
    with os.scandir(paths.pop()) as entries:
        for entry in entries:  # loop through the folder
            print(entry.name)  # print text to keep track the process
            if entry.name.endswith('.txt'):
                count += 1
                print('+1')
            elif entry.is_dir():  # if it is a subfolder
                print(entry.path, 'is dir')
                # Add to paths stack to get to it eventually
                paths.append(entry.path)
You probably want to apply recursion to this problem. In short, you will need a function to handle directories that will call itself when it encounters a sub-directory.
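A minimal sketch of that idea follows. count_txt is a hypothetical function name, and the sketch assumes only regular entries ending in .txt should be counted, so a directory whose name ends in .txt is entered rather than counted.

```python
import os

def count_txt(path):
    """Count entries ending in .txt under path, recursing into sub-directories."""
    count = 0
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                # The function calls itself for each sub-directory; the call
                # stack remembers where to resume in the parent afterwards.
                count += count_txt(entry.path)
            elif entry.name.endswith('.txt'):
                count += 1
    return count
```

Calling count_txt('E:\\') would then return the total for the whole drive. Very deep trees can hit Python's recursion limit, which is one reason an explicit stack or os.walk is often preferred.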
This might be more than you need, but it will allow you to list all the files within the directory that are .txt files but you can also add criteria to the search within the files as well. Here is the function:
import os
import pandas as pd

def file_search(root, extension, search, search_type):
    col1 = []  # folders of matching files
    col2 = []  # names of matching files
    for subdir, dirs, files in os.walk(root):
        for file in files:
            if file.lower().endswith("." + extension):
                try:
                    with open(os.path.join(subdir, file)) as f:
                        contents = f.read().lower()
                except (OSError, UnicodeDecodeError):
                    continue  # skip files that cannot be read as text
                if search_type == 'any':
                    hit = any(word.lower() in contents for word in search)
                elif search_type == 'all':
                    hit = all(word.lower() in contents for word in search)
                else:
                    hit = False
                if hit:
                    col1.append(subdir)
                    col2.append(file)
    return pd.DataFrame({'Folder': col1, 'File': col2})[['Folder', 'File']]
Here is an example of how to use the function:
search_df = file_search(root='E:\\',
                        search=['foo', 'bar'],  # words to search for
                        extension='txt',        # could change this to 'csv' or 'sql' etc.
                        search_type='all')      # use any or all
search_df
The analysis of your code has already been addressed quite well by @ShadowRanger's answer.
I will try to address this part of your question:
there may be some modules to tackle this task easily
For this kind of task, there actually exists the glob module, which implements Unix style pathname pattern expansion.
To count the number of .txt files in a directory and all its subdirectories, one may simply use the following:
import os
from glob import iglob, glob
dirpath = '.' # for example
# getting all matching elements in a list and computing its length
len(glob(os.path.join(dirpath, '**/*.txt'), recursive=True))
# 772
# or iterating through all matching elements and summing 1 each time a new item is found
# (this approach is more memory-efficient)
sum(1 for _ in iglob(os.path.join(dirpath, '**/*.txt'), recursive=True))
# 772
Basically glob.iglob() is the iterator version of glob.glob().
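As a related alternative (a different module from the one used above, named here plainly), pathlib offers Path.rglob for the same kind of recursive search. The sketch below is hedged: count_txt_files is a hypothetical name, and the result may differ from the glob version for names starting with a dot, since the two modules do not treat those the same way.

```python
from pathlib import Path

def count_txt_files(dirpath):
    """Count regular files ending in .txt under dirpath, at any depth."""
    # rglob('*.txt') behaves like glob('**/*.txt'); is_file() drops any
    # directory that happens to be named something.txt.
    return sum(1 for p in Path(dirpath).rglob('*.txt') if p.is_file())
```

Like iglob, rglob yields matches lazily, so the generator expression does not build the full list in memory.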
For nested directories, it's easier to use a function like os.walk.
Take this for example:
import os
subfiles = []
for dirpath, subdirs, files in os.walk(path):
    for x in files:
        if x.endswith(".txt"):
            subfiles.append(os.path.join(dirpath, x))
This builds a list of all .txt files.
Otherwise, you'll need to use recursion for a task like this.
I'm trying to do the following, in this order:
Use os.walk() to go down each directory.
Each directory has subfolders, but I'm only interested in the first subfolder. So the directory looks like:
/home/RawData/SubFolder1/SubFolder2
For example. I want, in RawData2, to have folders that stop at the SubFolder1 level.
The thing is, it seems like os.walk() goes down through ALL of the RawData folder, and I'm not certain how to make it stop.
The below is what I have so far - I've tried a number of other combinations of substituting variable dirs for root, or files, but that doesn't seem to get me what I want.
import os
for root, dirs, files in os.walk("/home/RawData"):
    os.chdir("/home/RawData2/")
    make_path("/home/RawData2/"+str(dirs))
I suggest you use glob instead.
As the help on glob describes:
glob(pathname)
Return a list of paths matching a pathname pattern.
The pattern may contain simple shell-style wildcards a la
fnmatch. However, unlike fnmatch, filenames starting with a
dot are special cases that are not matched by '*' and '?'
patterns.
So, your pattern is every first level directory, which I think would be something like this:
/root_path/*/sub_folder1/sub_folder2
So, you start at your root, get everything in that first level, and then look for sub_folder1/sub_folder2. I think that works.
To put it all together:
from glob import glob
dirs = glob('/root_path/*/sub_folder1/sub_folder2')
# Then iterate for each path
for i in dirs:
    print(i)
Beware: Documentation for os.walk says:
don’t change the current working directory between resumptions of walk(). walk() never changes the current directory, and assumes that its caller doesn’t either
so you should avoid os.chdir("/home/RawData2/") in the walk loop.
You can easily ask walk not to recurse by using topdown=True and clearing dirs:
for root, dirs, files in os.walk("/home/RawData", True):
    for rep in dirs:
        make_path(os.path.join("/home/RawData2/", rep))
        # add processing here
    del dirs[:]  # tell walk not to recurse into any subdirectory
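Since make_path is not shown in the question, here is a self-contained sketch of the same approach that substitutes os.makedirs for it. mirror_first_level is a hypothetical name, and the sketch assumes the goal is one empty directory in the destination per first-level subfolder of the source.

```python
import os

def mirror_first_level(src, dst):
    """Create in dst one empty directory per immediate sub-directory of src."""
    created = []
    for root, dirs, files in os.walk(src):  # top-down by default
        for name in dirs:
            target = os.path.join(dst, name)
            os.makedirs(target, exist_ok=True)  # stands in for make_path
            created.append(target)
        del dirs[:]  # empty the list in place so walk stops after the top level
    return created
```

With the paths from the question this would be called as mirror_first_level("/home/RawData", "/home/RawData2"). Because dirs is emptied on the first iteration, the loop body runs exactly once.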
I'm looking for an effective way to visit every folder (including subfolders) in my directory list. I then need to run some processing on each folder (size, number of folders and files, etc.).
I know that I have 2 options for that:
- Recursion (my current implementation, code below)
- Generating a list of all folders at the start of the program and invoking my function in a loop
I know that my current implementation is not perfect. Can somebody take a look at it and suggest improvements? In addition, can somebody show me how to generate a list of all folders, including subfolders (I'm assuming with the os.path library)?
My current code that analyses a folder (using recursion):
def analyse_folder(path, resultlist=[]):
    # This is trick to check are we in last directory
    subfolders = fsprocess.get_subdirs(path)
    for subfolder in subfolders:
        analyse_folder(subfolder, resultlist)
        files, dirs = fsprocess.get_numbers(subfolder)
        size = fsprocess.get_folder_size(subfolder)
        resultlist = add_result([subfolder, size, files, dirs], resultlist)
    return resultlist
This is the code that gets the list of subfolders inside a folder:
def get_subdirs(rootpath, ignorelist=[]):
    # We are starting with empty list
    subdirs = []
    # Generate main list
    for path in os.listdir(rootpath):
        # We are only interested in dirs and things not from ignore list
        if not os.path.isfile(os.path.join(rootpath, path)) and path not in ignorelist:
            subdirs.append(os.path.join(rootpath, path))
    # We are giving back list of subdirectories
    return subdirs
And this is a simple function to add it to the result list:
def add_result(result, main_list):
    main_list.append(result)
    return main_list
So if anyone can:
1) Tell me whether my approach is good
2) Provide me code to generate a list of all directories in a given folder (for example everything under C:\users)
Thank you
Try os.walk:
import os
for (root, dirs, files) in os.walk(somefolder):
    # root is the place you're listing
    # dirs is a list of directories directly under root
    # files is a list of files directly under root
    print(root)
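As a rough sketch of how that loop could replace the recursive analyse_folder, the function below collects one row per directory in the same [path, size, files, dirs] shape the question uses. analyse_tree is a hypothetical name. The sketch assumes size means the bytes of the files directly inside each folder, which may differ from what the unseen fsprocess.get_folder_size computes, and it includes the starting folder itself.

```python
import os

def analyse_tree(top):
    """Return [path, size_in_bytes, n_files, n_dirs] for top and every folder below it."""
    results = []
    for root, dirs, files in os.walk(top):
        # Assumption: only files directly inside root count towards its size.
        size = sum(os.path.getsize(os.path.join(root, f)) for f in files)
        results.append([root, size, len(files), len(dirs)])
    return results
```

The first element of each row, taken on its own, is the requested list of all directories: [row[0] for row in analyse_tree(r'C:\users')].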
I need to os.walk from my parent path (tutu) through all subfolders. In each branch, the deepest subfolders hold the files that I need to process with my code. For all the deepest folders that have files, the file 'layout' is the same: one file *.adf.txt, one file *.idf.txt, one file *.sdrf.txt and one or more files *.dat.
My problem is that I don't know how to use the os module to iterate, from my parent folder, through all subfolders sequentially. I need a function that, for the current subfolder in os.walk, if that subfolder is empty, continue to the sub-subfolder inside that subfolder, if it exists. If it exists, it should verify that the file layout is present (this is no problem...) and, if it is, apply the code (no problem either). If not, and if that folder has no more sub-folders, it should return to the parent folder and os.walk to the next subfolder, and so on for all subfolders in my parent folder (tutu). To summarize, I need a function like the one below (written in a Python/imaginary code hybrid):
for all folders in tutu:
    if os.havefiles in os.walk(current_path):  # 'havefiles' doesn't exist, I think...
        for filename in os.walk(current_path):
            if 'adf' in filename:
                etc...
                #my code
    elif:
        while true:
            go deep
    else:
        os.chdir(parent_folder)
Do you think it would be best to define a function and call it from my code to do the job?
This is the code that I've tried to use, without success, of course:
import csv
import os
import fnmatch
abs_path=os.path.abspath('.')
for dirname, subdirs, filenames in os.walk('.'):
    # print path to all subdirectories first.
    for subdirname in subdirs:
        print os.path.join(dirname, subdirname), 'os.path.join(dirname, subdirname)'
        current_path= os.path.join(dirname, subdirname)
        os.chdir(current_path)
        for filename in os.walk(current_path):
            print filename, 'f in os.walk'
            if os.path.isdir(filename)==True:
                break
            elif os.path.isfile(filename)==True:
                print filename, 'file'
                #code here
Thanks in advance...
I need a function that, for the current subfolder in os.walk, if that subfolder is empty, continue to the sub-subfolder inside that subfolder, if it exists.
This doesn't make any sense. If a folder is empty, it doesn't have any subfolders.
Maybe you mean that if it has no regular files, then recurse into its subfolders, but if it has any, don't recurse, and instead check the layout?
To do that, all you need is something like this:
import collections
import os

for dirname, subdirs, filenames in os.walk('.'):
    if filenames:
        # can't use os.path.splitext, because that will give us .txt instead of .adf.txt;
        # partition keeps everything after the first dot, e.g. 'adf.txt'
        extensions = collections.Counter(filename.partition('.')[-1]
                                         for filename in filenames)
        if (extensions['adf.txt'] == 1 and extensions['idf.txt'] == 1 and
                extensions['sdrf.txt'] == 1 and extensions['dat'] >= 1 and
                len(extensions) == 4):
            # got a match, do what you want
            print(dirname)
        # Whether this is a match or not, prune the walk.
        del subdirs[:]
I'm assuming here that you only want to find directories that have exactly the specified files, and no others. To remove that last restriction, just remove the len(extensions) == 4 part.
There's no need to explicitly iterate over subdirs or anything, or recursively call os.walk from inside os.walk. The whole point of walk is that it's already recursively visiting every subdirectory it finds, except when you explicitly tell it not to (by pruning the list it gives you).
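To make the pruning behaviour easy to try out, here is a self-contained sketch that wraps the same idea in a function and returns the matching directories. find_layout_dirs and REQUIRED_ONCE are hypothetical names, and the sketch assumes that the part of each file name before the first dot contains no further dots.

```python
import collections
import os

# Extensions that must appear exactly once (text after the first dot).
REQUIRED_ONCE = ('adf.txt', 'idf.txt', 'sdrf.txt')

def find_layout_dirs(top):
    """Return directories whose files match the layout described in the question."""
    matches = []
    for dirname, subdirs, filenames in os.walk(top):
        if not filenames:
            continue  # no regular files here, so keep descending
        extensions = collections.Counter(
            name.partition('.')[-1] for name in filenames)
        if (all(extensions[ext] == 1 for ext in REQUIRED_ONCE)
                and extensions['dat'] >= 1
                and len(extensions) == 4):
            matches.append(dirname)
        del subdirs[:]  # files were found, so do not descend any further
    return matches
```

Looking up a missing key in a Counter returns 0 without adding the key, so the len(extensions) == 4 check is not disturbed by the lookups before it.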
os.walk will automatically "dig down" recursively, so you don't need to recurse the tree yourself.
I think this should be the basic form of your code:
import csv
import os
import fnmatch
directoriesToMatch = [list here...]
filenamesToMatch = [list here...]
abs_path=os.path.abspath('.')
for dirname, subdirs, filenames in os.walk('.'):
    if len(set(directoriesToMatch).difference(subdirs))==0: # all dirs are there
        if len(set(filenamesToMatch).difference(filenames))==0: # all files are there
            if <any other filename/directory checking code>:
                # processing code here ...
And according to the python documentation, if you for whatever reason don't want to continue recursing, just delete entries from subdirs:
http://docs.python.org/2/library/os.html
If you instead want to check that there are NO sub-directories where you find your files to process, you could also change the dirs check to:
if len(subdirs)==0: # check that this directory has no sub-directories
I'm not sure I quite understand the question, so I hope this helps!
Edit:
Ok, so if you need to check there are no files instead, just use:
if len(filenames)==0:
But as I stated above, it would probably be better to just look FOR specific files instead of checking for empty directories.
I have a large structure consisting only of folders (thousands of them); however, I am only interested in keeping the folders in the top three levels and deleting the rest. I am looking for a recursive Python script to do this. Any help is much appreciated.
Untested, but it will probably look something like this with os.walk():
import os
import shutil
BASE = '.'
for root, dirs, files in os.walk(BASE):
    n = 0
    head = root
    while head and head != BASE:
        head, _ = os.path.split(head)
        n += 1
    if n == 3:
        for dir in dirs:
            shutil.rmtree(os.path.join(root, dir))
        del dirs[:] # clear dirs so os.walk() doesn't look for subdirectories
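A variant of the same idea, sketched as a function so it can be tried on a throwaway copy first. prune_below_depth and keep_levels are hypothetical names, and the depth is computed with os.path.relpath rather than by repeatedly splitting the path. Since this deletes data irreversibly, treat it as an illustration rather than a tested tool.

```python
import os
import shutil

def prune_below_depth(base, keep_levels=3):
    """Delete every directory more than keep_levels levels below base."""
    for root, dirs, files in os.walk(base):
        rel = os.path.relpath(root, base)
        # base itself is depth 0; each path separator adds one level.
        depth = 0 if rel == os.curdir else rel.count(os.sep) + 1
        if depth == keep_levels:
            for name in dirs:
                shutil.rmtree(os.path.join(root, name))
            del dirs[:]  # nothing left below, so stop os.walk from descending
```

For a tree a/b/c/d/e under base, the default keeps a, a/b and a/b/c and removes a/b/c/d with everything inside it.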
The right way to do this is with os.walk, but here's a cheap answer:
>>> import os
>>> os.system('rm -rf */*/*/*/*')
>>> os.system('rmdir */*/*/*')
This will remove all files at least four levels in, and then try to remove all directories rooted at least three levels in. Since the previous command will have removed their contents, the rmdir will succeed (and complain about all non-directory leaves it finds).