Making os.walk work in a non-standard way - python

I'm trying to do the following, in this order:
Use os.walk() to go down each directory.
Each directory has subfolders, but I'm only interested in the first level of subfolders. For example, the directory tree looks like:
/home/RawData/SubFolder1/SubFolder2
In RawData2, I want to create folders that stop at the SubFolder1 level.
The problem is that os.walk() descends through ALL of the RawData folder, and I'm not sure how to make it stop.
The below is what I have so far - I've tried a number of other combinations of substituting variable dirs for root, or files, but that doesn't seem to get me what I want.
import os
for root, dirs, files in os.walk("/home/RawData"):
    os.chdir("/home/RawData2/")
    make_path("/home/RawData2/"+str(dirs))

I suggest you use glob instead.
As the help on glob describes:
glob(pathname)
Return a list of paths matching a pathname pattern.
The pattern may contain simple shell-style wildcards a la
fnmatch. However, unlike fnmatch, filenames starting with a
dot are special cases that are not matched by '*' and '?'
patterns.
So, your pattern is every first level directory, which I think would be something like this:
/root_path/*/sub_folder1/sub_folder2
So, you start at your root, get everything in that first level, and then look for sub_folder1/sub_folder2. I think that works.
To put it all together:
from glob import glob
dirs = glob('/root_path/*/sub_folder1/sub_folder2')
# Then iterate for each path
for i in dirs:
    print(i)

Beware: Documentation for os.walk says:
don’t change the current working directory between resumptions of walk(). walk() never changes the current directory, and assumes that its caller doesn’t either
so you should avoid os.chdir("/home/RawData2/") in the walk loop.
You can easily ask walk not to recurse by using topdown=True and clearing dirs:
import os
for root, dirs, files in os.walk("/home/RawData", topdown=True):
    for rep in dirs:
        # make_path is the asker's own helper
        make_path(os.path.join("/home/RawData2/", rep))
        # add processing here
    del dirs[:]  # tell walk not to recurse into any subdirectory
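For the original goal (recreating only the first level of subfolders under another directory), a minimal sketch could take just the first tuple that os.walk() yields. The function name mirror_first_level and the use of os.makedirs in place of the asker's make_path helper are assumptions made for illustration, not part of the answers above.

```python
import os


def mirror_first_level(src, dst):
    """Create an empty folder in dst for each immediate subfolder of src."""
    # The first tuple from os.walk() describes src itself; the default covers
    # the case where src does not exist and walk() yields nothing.
    _, dirs, _ = next(os.walk(src), (src, [], []))
    created = []
    for name in sorted(dirs):
        target = os.path.join(dst, name)
        os.makedirs(target, exist_ok=True)  # creates missing parents as well
        created.append(target)
    return created
```

Called as mirror_first_level("/home/RawData", "/home/RawData2"), this would create /home/RawData2/SubFolder1 but nothing below it, without changing the current working directory.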

Related

Getting the absolute paths of all files in a folder, without traversing the subfolders

Let
my_dir = "/raid/user/my_dir"
be a folder on my filesystem, which is not the current folder (i.e., it's not the result of os.getcwd()). I want to retrieve the absolute paths of all files at the first level of hierarchy in my_dir (i.e., the absolute paths of all files which are in my_dir, but not in a subfolder of my_dir) as a list of strings absolute_paths. I need it, in order to later delete those files with os.remove().
This is nearly the same use case as
Get absolute paths of all files in a directory
but the difference is that I don't want to traverse the folder hierarchy: I only need the files at the first level of hierarchy (at depth 0? not sure about terminology here).
It's easy to adapt that solution: Call os.walk() just once, and don't let it continue:
root, dirs, files = next(os.walk(my_dir, topdown=True))
files = [ os.path.join(root, f) for f in files ]
print(files)
You can use the os.path module and a list comprehension.
import os
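# os.listdir returns bare names, so join them with my_dir; otherwise they are resolved against the current directory
absolute_paths = [os.path.join(my_dir, f) for f in os.listdir(my_dir) if os.path.isfile(os.path.join(my_dir, f))]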
You can use os.scandir, which returns an iterator of os.DirEntry objects that offer a variety of options, including the ability to distinguish files from directories.
with os.scandir(somePath) as it:
    paths = [entry.path for entry in it if entry.is_file()]
print(paths)
If you want to list directories as well, you can, of course, remove the condition from the list comprehension.
The documentation also has this note under listdir:
See also The scandir() function returns directory entries along with file attribute information, giving better performance for many common use cases.

Recurse through selected level of subdirectories

I am new to python. I have the following piece of code which works well by retrieving selected directories into a list for me. But because there are quite a lot of sub-directories and files, the code is rather slow, compared to the Perl code which I have upgraded it from.
import re
import os
foundarr = []
allpaths = ["X:\\Storage", "Y:\\Storage"]
for path in allpaths:
    for root, dirs, files in os.walk(path):
        for dir in dirs:
            if re.match(r"[DILMPY]\d{8}", dir):
                foundarr.append(os.path.join(root, dir))
                break
My question: Is there a way to recurse through ONLY a selected level of directories using os.walk? Or somehow prune the ones I do not want to recurse through? I added the break in the for loop assuming it would break out after finding my selected dir and move on, but I don't think this helps, as it still has to go through thousands of sub-directories and files.
In the Perl code a simple $File::Find::prune = 1 if /[DILMPY]\d{8}$/; prevents File::Find from recursing through the rest of the sub-directories and files.
If the depth is fixed, using glob is a good idea. As per this SO post, you can set the depth of traversal using glob.
import glob
import os.path
depth2 = glob.glob('*/*')
depth2 = [f for f in depth2 if os.path.isdir(f)]
This will list all subdirectories at a depth of 2.
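If the depth is not fixed, something close to Perl's prune can be sketched with os.walk itself by editing the dirs list in place, which the os.walk documentation allows when topdown is true (the default). The function name find_matching_dirs is an assumption for illustration; the pattern is the one from the question.

```python
import os
import re

# Pattern taken from the question; re.match anchors it at the start of the name.
PATTERN = re.compile(r"[DILMPY]\d{8}")


def find_matching_dirs(top):
    """Collect matching directories without descending into them."""
    found = []
    for root, dirs, files in os.walk(top):
        matches = [d for d in dirs if PATTERN.match(d)]
        found.extend(os.path.join(root, d) for d in matches)
        # Slice assignment edits the list that os.walk() itself holds, so the
        # matched directories are never entered.
        dirs[:] = [d for d in dirs if d not in matches]
    return found
```

Unlike the break in the question, which only leaves the inner for loop, this stops os.walk from listing anything below a matched directory.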

Need 'if os.havefiles' like function for subfolder search in python

I need to os.walk from my parent path (tutu) through all subfolders. The deepest subfolders contain the files that I need to process with my code. In every deepest folder that has files, the file 'layout' is the same: one file *.adf.txt, one file *.idf.txt, one file *.sdrf.txt and one or more *.dat files, as the pictures show.
My problem is that I don't know how to use the os module to iterate, from my parent folder, through all subfolders sequentially. I need a function that, for the current subfolder in os.walk, if that subfolder is empty, continue to the sub-subfolder inside that subfolder, if it exists. If it exists, then verify whether that file layout is present (this is no problem...), and if it is, apply the code (no problem either). If not, and if that folder has no more subfolders, return to the parent folder and os.walk to the next subfolder, and so on for all subfolders of my parent folder (tutu). To summarize, I need some function like the one below (written in a Python/imaginary code hybrid):
for all folders in tutu:
    if os.havefiles in os.walk(current_path):  # 'havefiles' doesn't exist, I think...
        for filename in os.walk(current_path):
            if 'adf' in filename:
                etc...
                #my code
    elif:
        while true:
            go deep
    else:
        os.chdir(parent_folder)
Do you think it would be best to define a function and call it in my code to do the job?
This is the code that I've tried to use, without success, of course:
import csv
import os
import fnmatch
abs_path=os.path.abspath('.')
for dirname, subdirs, filenames in os.walk('.'):
    # print path to all subdirectories first.
    for subdirname in subdirs:
        print(os.path.join(dirname, subdirname), 'os.path.join(dirname, subdirname)')
        current_path= os.path.join(dirname, subdirname)
        os.chdir(current_path)
        for filename in os.walk(current_path):
            print(filename, 'f in os.walk')
            if os.path.isdir(filename)==True:
                break
            elif os.path.isfile(filename)==True:
                print(filename, 'file')
                #code here
Thanks in advance...
I need a function that, for the current subfolder in os.walk, if that subfolder is empty, continue to the sub-subfolder inside that subfolder, if it exists.
This doesn't make any sense. If a folder is empty, it doesn't have any subfolders.
Maybe you mean that if it has no regular files, then recurse into its subfolders, but if it has any, don't recurse, and instead check the layout?
To do that, all you need is something like this:
import collections
import os
for dirname, subdirs, filenames in os.walk('.'):
    if filenames:
        # can't use os.path.splitext, because that will give us .txt instead of .adf.txt
        extensions = collections.Counter(filename.partition('.')[-1]
                                         for filename in filenames)
        # partition('.') drops the leading dot, so the keys are 'adf.txt', not '.adf.txt'
        if (extensions['adf.txt'] == 1 and extensions['idf.txt'] == 1 and
                extensions['sdrf.txt'] == 1 and extensions['dat'] >= 1 and
                len(extensions) == 4):
            pass  # got a match, do what you want
        # Whether this is a match or not, prune the walk.
        del subdirs[:]
I'm assuming here that you only want to find directories that have exactly the specified files, and no others. To remove that last restriction, just remove the len(extensions) == 4 part.
There's no need to explicitly iterate over subdirs or anything, or recursively call os.walk from inside os.walk. The whole point of walk is that it's already recursively visiting every subdirectory it finds, except when you explicitly tell it not to (by pruning the list it gives you).
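Since the question already imports fnmatch, the layout check could also be sketched with shell-style patterns, which keeps working when a base name itself contains dots. The helper name has_layout is hypothetical, and this version deliberately does not enforce the "no other files" restriction.

```python
import fnmatch


def has_layout(filenames):
    """Return True if the names contain exactly one .adf.txt, one .idf.txt,
    one .sdrf.txt and at least one .dat file (other files are ignored)."""
    def count(pattern):
        return len(fnmatch.filter(filenames, pattern))

    return (count("*.adf.txt") == 1 and count("*.idf.txt") == 1
            and count("*.sdrf.txt") == 1 and count("*.dat") >= 1)
```

Inside the walk loop this would be called as has_layout(filenames) in place of the Counter comparison.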
os.walk will automatically "dig down" recursively, so you don't need to recurse the tree yourself.
I think this should be the basic form of your code:
import csv
import os
import fnmatch
directoriesToMatch = [list here...]
filenamesToMatch = [list here...]
abs_path=os.path.abspath('.')
for dirname, subdirs, filenames in os.walk('.'):
    if len(set(directoriesToMatch).difference(subdirs))==0: # all dirs are there
        if len(set(filenamesToMatch).difference(filenames))==0: # all files are there
            if <any other filename/directory checking code>:
                # processing code here ...
And according to the python documentation, if you for whatever reason don't want to continue recursing, just delete entries from subdirs:
http://docs.python.org/2/library/os.html
If you instead want to check that there are NO sub-directories where you find your files to process, you could also change the dirs check to:
if len(subdirs)==0: # check that this directory has no sub-directories
I'm not sure I quite understand the question, so I hope this helps!
Edit:
Ok, so if you need to check there are no files instead, just use:
if len(filenames)==0:
But as I stated above, it would probably be better to just look FOR specific files instead of checking for empty directories.

os.walk iteration not walking in Python

I'm using os.walk() to check a directory for redundant files and list them out. The pseudo-code looks something like this:
def checkPath(path):
    do the for dirname, dirnames, filenames in os.walk(path) thing here...
pathList = ["path1", "path2"]
for each in pathList:
    checkPath(each)
This works fine on the first run through: I get everything as expected. But on the next os.walk, on the second path, it just skips right through... there's nothing in dirname, dirnames, filenames. I added some print statements to check things out, and it's entering the function, but not doing anything for the os.walk() part.
Before making the os.walk() part a function to see if that would fix the problem, it was in a for loop inline with the main body. When I tried (just for fun) cleaning up the dirname, dirnames, filenames variables with del, the cleanup on the second path reported that the variable dirname did not exist...
So it looks like, whether within a function or not, the successive iterations of os.walk() aren't populating...
Ideas?
Thanks!
To add some working code as an example, here is something like it. It doesn't really matter what it's doing; I'm just trying to get os.walk to walk multiple paths:
import os
def checkPath(path):
    for dirname, dirnames, filenames in os.walk(path):
        for filename in filenames:
            print(filename)
pathList = ["c:\temp\folder1", "c:\temp\folder2"]
for path in pathList:
    checkPath(path)
print("done")
It could be done this way (was trying to see if calling os.walk in a different way, like one of the other commenters suggested, might help), or it can be done inline, whatever works obviously...
thanks again all,
Your code works for me, if I use actual paths on my system which refer to non-empty directories.
I suspect you might have an issue with the line...
pathList = ["c:\temp\folder1", "c:\temp\folder2"]
...since both \t and \f are valid escape sequences.
Try...
pathList = ["c:\\temp\\folder1", "c:\\temp\\folder2"]
...and if that's not the problem, then it would help to cite the actual code you're using.
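The effect of those escape sequences can be checked directly. The short sketch below only illustrates Python's string-literal rules; the variable names are made up for the example.

```python
# In a normal string literal, \t becomes a tab and \f a form feed,
# so the path no longer contains the backslashes that were typed.
broken = "c:\temp\folder1"

# Doubling the backslashes or using a raw string keeps them literal.
escaped = "c:\\temp\\folder1"
raw = r"c:\temp\folder1"

print(repr(broken))
print(escaped == raw)
```

os.walk() does not raise an error for a path that does not exist (its onerror argument defaults to None), so a mangled path simply produces no iterations, which matches the symptom described in the question.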
os.walk returns a generator :-) http://wiki.python.org/moin/Generators
There are a few workarounds:
use a list:
    ll = list(os.walk(path))
call os.walk() each time
use itertools.chain
The code you posted should not have this problem (you call os.walk each time), but it makes me really think about generator exhaustion. So post your code as you wrote it [0]
[0] for example, do you have some sort of predefined argument in your function?
Here is a working example
import os
def checkPath(list_path):
    for path in list_path:
        # dirpath avoids shadowing the outer loop variable path
        for (dirpath, dirs, files) in os.walk(path):
            print(len(files))
checkPath(["F:/","F:/"])
See doc:
Generate the file names in a directory tree by walking the tree either
top-down or bottom-up. For each directory in the tree rooted at
directory top (including top itself), it yields a 3-tuple (dirpath,
dirnames, filenames).
EDIT:
As mentioned in the other answers, os.walk() returns a generator. A generator can be iterated through only once. It is not a structure that stores values; it generates the values on the fly, as they are requested. That's why a second loop over the same os.walk() object gives no more results. You can either call os.walk() each time you need it, or store its results in a list.
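The exhaustion behaviour can be seen in a small sketch that walks a throwaway directory. The temporary directory and the variable names are assumptions made only for the demonstration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as top:
    os.mkdir(os.path.join(top, "sub"))

    walker = os.walk(top)  # one generator object
    first_pass = [dirpath for dirpath, _, _ in walker]
    second_pass = [dirpath for dirpath, _, _ in walker]  # already exhausted

    # Calling os.walk() again creates a fresh generator.
    fresh_pass = [dirpath for dirpath, _, _ in os.walk(top)]

print(len(first_pass), len(second_pass), len(fresh_pass))
```

Reusing the same generator object gives an empty second pass, while a new os.walk() call, as in the question's checkPath function, starts over.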

How do I prevent Python's os.walk from walking across mount points?

In Unix all disks are exposed as paths in the main filesystem, so os.walk('/') would traverse, for example, /media/cdrom as well as the primary hard disk, and that is undesirable for some applications.
How do I get an os.walk that stays on a single device?
Related:
Is there a way to determine if a subdirectory is in the same filesystem from python when using os.walk?
From os.walk docs:
When topdown is true, the caller can
modify the dirnames list in-place
(perhaps using del or slice
assignment), and walk() will only
recurse into the subdirectories whose
names remain in dirnames; this can be
used to prune the search
So something like this should work:
for root, dirnames, filenames in os.walk(...):
    dirnames[:] = [
        dir for dir in dirnames
        if not os.path.ismount(os.path.join(root, dir))]
    ...
I think os.path.ismount might work for you. Your code might look something like this:
import os
import os.path
for root, dirs, files in os.walk('/'):
    # Handle files.
    dirs[:] = filter(lambda dir: not os.path.ismount(os.path.join(root, dir)),
                     dirs)
You may also find this answer helpful in building your solution.
*Thanks for the comments on filtering dirs correctly.
os.walk() can't tell (as far as I know) that it is browsing a different drive. You will need to check that yourself.
Try using os.stat(), or checking that the root variable from os.walk() is not /media
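The os.stat() idea could be sketched as follows: compare the st_dev field of each subdirectory with that of the starting directory and prune the ones that differ, similar in spirit to find -xdev. The function name walk_one_device is hypothetical, and mounts that report the same device number as their parent (some bind mounts, for example) would not be detected by this check.

```python
import os


def walk_one_device(top):
    """Yield (root, dirs, files) like os.walk(), skipping other devices."""
    top_dev = os.lstat(top).st_dev
    for root, dirs, files in os.walk(top):
        # Keep only subdirectories that live on the same device as top;
        # slice assignment prunes the list os.walk() will recurse into.
        dirs[:] = [d for d in dirs
                   if os.lstat(os.path.join(root, d)).st_dev == top_dev]
        yield root, dirs, files
```

It is used like os.walk() itself, for example for root, dirs, files in walk_one_device('/'): ...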
