I'm using os.walk() to check a directory for redundant files and list them out. The pseudo-code looks something like this:
def checkPath(path):
    # do the "for dirname, dirnames, filenames in os.walk(path)" thing here...

pathList = ["path1", "path2"]
for each in pathList:
    checkPath(each)
So this works fine the first run through, and I get everything as expected, but on the next os.walk on the second path it just skips right through; there's nothing in dirname, dirnames, or filenames. I added some print statements to check things out: it's entering the function, but not doing anything for the os.walk() part.

Before making the os.walk() part a function to see if that would fix the problem, it was in a for loop inline with the main body. When I tried (just for fun) cleaning up the dirname, dirnames, and filenames variables with del, the cleanup on the second path raised an error saying that the variable dirname did not exist...

So it looks like, whether within a function or not, the successive iterations of os.walk() aren't populating anything...
ideas?
Thanks!
To add some working code as an example, something like this. It doesn't really matter what it's doing; I'm just trying to get os.walk to walk multiple paths:
import os

def checkPath(path):
    for dirname, dirnames, filenames in os.walk(path):
        for filename in filenames:
            print(filename)

pathList = ["c:\temp\folder1", "c:\temp\folder2"]
for path in pathList:
    checkPath(path)
print("done")
It could be done this way (I was trying to see if calling os.walk in a different way, as one of the other commenters suggested, might help), or it can be done inline; whatever works, obviously.

Thanks again, all.
Your code works for me, if I use actual paths on my system which refer to non-empty directories.
I suspect you might have an issue with the line...
pathList = ["c:\temp\folder1", "c:\temp\folder2"]
...since both \t and \f are valid escape sequences.
Try...
pathList = ["c:\\temp\\folder1", "c:\\temp\\folder2"]
...and if that's not the problem, then it would help to cite the actual code you're using.
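A quick demonstration of the escape problem, with throwaway strings rather than real paths (raw strings are a third fix, shown alongside doubled backslashes):

```python
# "\t" and "\f" are escape sequences, so the first literal silently
# contains a tab and a form-feed character instead of the path you typed.
broken = "c:\temp\folder1"
escaped = "c:\\temp\\folder1"
raw = r"c:\temp\folder1"       # raw string: backslashes taken literally

print("\t" in broken)          # True: a tab snuck into the "path"
print(escaped == raw)          # True: both spell the intended path
```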
os.walk returns a generator :-) http://wiki.python.org/moin/Generators
There are a few workarounds:

- store the results in a list: ll = list(os.walk(path))
- call os.walk() again each time you need to iterate
- use itertools.chain to combine several walks
The code you posted should not have this problem (you call os.walk each time), but it makes me really think about generator exhaustion. So post your code exactly as you wrote it. [0]
[0] for example, do you have some sort of predefined argument in your function?
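A minimal sketch of the exhaustion scenario, using a throwaway temporary directory: if you store the result of os.walk() and loop over it twice, the second loop sees nothing.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "a.txt"), "w").close()
    walker = os.walk(tmp)
    first = list(walker)    # consumes the generator
    second = list(walker)   # already exhausted
    print(len(first), len(second))   # 1 0
```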
Here is a working example:

import os

def checkPath(path_list):
    for path in path_list:
        for (root, dirs, files) in os.walk(path):
            print(len(files))

checkPath(["F:/", "F:/"])
See doc:
Generate the file names in a directory tree by walking the tree either top-down or bottom-up. For each directory in the tree rooted at directory top (including top itself), it yields a 3-tuple (dirpath, dirnames, filenames).
EDIT:
As mentioned in the other answers, os.walk() returns a generator. A generator can be iterated over only once: it is not a structure that stores values, but one that generates them on the fly as it is consumed. That's why your second loop over os.walk() yields no more results. You can call os.walk() again each time you need it, or store the output of os.walk() in an iterable such as a list.
os.walk() yields a 3-tuple (dirpath, dirnames, filenames) (see documentation).
If I want to use certain parts of dirpath, is there a more robust way than splitting on \?
In a path like "C:\top_folder\project\somethingelse" I currently do dirpath.split('\\')[2] # project but if the folder structure changes, I'd need to change that line and I wouldn't always know if the folder structure changes until my scripts break.
Is there a way to dissect dirpath to be more specific, similar to os.path.basename, just more generically?
Pathlib does this in a very nice way:
https://docs.python.org/3/library/pathlib.html#pathlib.PurePath.parts
https://docs.python.org/3/library/pathlib.html#pathlib.PurePath.parent
It's considered the modern way to work with Python and paths.
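A small sketch with pathlib, reusing the example path from the question; PureWindowsPath is used here so the snippet behaves the same on any OS:

```python
from pathlib import PureWindowsPath

# .parts and .parent replace manual splitting on a hard-coded separator.
p = PureWindowsPath(r"C:\top_folder\project\somethingelse")
print(p.parts)        # ('C:\\', 'top_folder', 'project', 'somethingelse')
print(p.parts[2])     # project
print(p.parent.name)  # project
```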
If I got your question right, you want to get the parent directory of "somethingelse" from dirpath. You could simply do

dirpath.split('\\')[-2]

It will not be affected even if the directory structure changes earlier in the path, since the negative index counts from the end.
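A quick check of that claim, using the example path from the question plus a hypothetical variant with an extra folder inserted near the root:

```python
# The negative index counts from the end of the path, so it keeps
# working when folders are added closer to the root.
dirpath = "C:\\top_folder\\project\\somethingelse"
print(dirpath.split("\\")[-2])   # project

deeper = "C:\\new\\top_folder\\project\\somethingelse"
print(deeper.split("\\")[-2])    # project
```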
I'm trying to do the following, in this order:
Use os.walk() to go down each directory.
Each directory has subfolders, but I'm only interested in the first subfolder. So the directory looks like:
/home/RawData/SubFolder1/SubFolder2
For example, I want RawData2 to contain folders that stop at the SubFolder1 level.
The thing is, it seems like os.walk() goes down through ALL of the RawData folder, and I'm not certain how to make it stop.
The below is what I have so far - I've tried a number of other combinations of substituting variable dirs for root, or files, but that doesn't seem to get me what I want.
import os

for root, dirs, files in os.walk("/home/RawData"):
    os.chdir("/home/RawData2/")
    make_path("/home/RawData2/" + str(dirs))
I suggest you use glob instead.
As the help on glob describes:
glob(pathname)
    Return a list of paths matching a pathname pattern.

    The pattern may contain simple shell-style wildcards a la fnmatch.
    However, unlike fnmatch, filenames starting with a dot are special
    cases that are not matched by '*' and '?' patterns.
So, your pattern is every first level directory, which I think would be something like this:
/root_path/*/sub_folder1/sub_folder2
So, you start at your root, get everything in that first level, and then look for sub_folder1/sub_folder2. I think that works.
To put it all together:
from glob import glob

dirs = glob('/root_path/*/sub_folder1/sub_folder2')

# Then iterate over each path
for i in dirs:
    print(i)
Beware: Documentation for os.walk says:
don’t change the current working directory between resumptions of walk(). walk() never changes the current directory, and assumes that its caller doesn’t either
so you should avoid os.chdir("/home/RawData2/") in the walk loop.
You can easily ask walk not to recurse by using topdown=True and clearing dirs:

for root, dirs, files in os.walk("/home/RawData", topdown=True):
    for rep in dirs:
        make_path(os.path.join("/home/RawData2/", rep))
        # add processing here
    del dirs[:]  # tell walk not to recurse into any subdirectory
I need to os.walk from my parent path (tutu) through all its subfolders. For each one, the deepest subfolders contain the files that I need to process with my code. For every deepest folder that has files, the file 'layout' is the same: one file *.adf.txt, one file *.idf.txt, one file *.sdrf.txt, and one or more files *.dat, as the pictures show.

My problem is that I don't know how to use the os module to iterate, from my parent folder, through all subfolders sequentially. I need a function that, for the current subfolder in os.walk, if that subfolder is empty, continues into the sub-subfolder inside it, if one exists. If it exists, it should verify whether the file layout is present (this is no problem...), and if it is, apply the code (no problem either). If not, and if that folder has no more subfolders, it should return to the parent folder and os.walk to the next subfolder, and so on for all subfolders in my parent folder (tutu). To sum up, I need some function like the one below (written in a Python/imaginary-code hybrid):
for all folders in tutu:
    if os.havefiles in os.walk(current_path):  # the 'havefiles' doesn't exist, I think...
        for filename in os.walk(current_path):
            if 'adf' in filename:
                etc...
                # my code
    elif:
        while true:
            go deep
    else:
        os.chdir(parent_folder)
Do you think it's best to define a function to call from my code to do the job?

This is the code that I've tried to use, without success, of course:
import csv
import os
import fnmatch

abs_path = os.path.abspath('.')
for dirname, subdirs, filenames in os.walk('.'):
    # print path to all subdirectories first.
    for subdirname in subdirs:
        print(os.path.join(dirname, subdirname), 'os.path.join(dirname, subdirname)')
        current_path = os.path.join(dirname, subdirname)
        os.chdir(current_path)
        for filename in os.walk(current_path):
            print(filename, 'f in os.walk')
            if os.path.isdir(filename) == True:
                break
            elif os.path.isfile(filename) == True:
                print(filename, 'file')
                # code here
Thanks in advance...
I need a function that, for the current subfolder in os.walk, if that subfolder is empty, continue to the sub-subfolder inside that subfolder, if it exists.
This doesn't make any sense. If a folder is empty, it doesn't have any subfolders.
Maybe you mean that if it has no regular files, then recurse into its subfolders, but if it has any, don't recurse, and instead check the layout?
To do that, all you need is something like this:
import collections
import os

for dirname, subdirs, filenames in os.walk('.'):
    if filenames:
        # can't use os.path.splitext, because that would give us .txt instead of .adf.txt
        extensions = collections.Counter('.' + filename.partition('.')[-1]
                                         for filename in filenames)
        if (extensions['.adf.txt'] == 1 and extensions['.idf.txt'] == 1 and
                extensions['.sdrf.txt'] == 1 and extensions['.dat'] >= 1 and
                len(extensions) == 4):
            pass  # got a match, do what you want
        # Whether this is a match or not, prune the walk.
        del subdirs[:]
I'm assuming here that you only want to find directories that have exactly the specified files, and no others. To remove that last restriction, just remove the len(extensions) == 4 part.
There's no need to explicitly iterate over subdirs or anything, or recursively call os.walk from inside os.walk. The whole point of walk is that it's already recursively visiting every subdirectory it finds, except when you explicitly tell it not to (by pruning the list it gives you).
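A minimal sketch of the pruning behavior, using a throwaway temporary directory: clearing the dirnames list in-place stops os.walk() from descending, so only the top directory is visited.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "a", "b"))   # two nested levels below tmp
    visited = []
    for dirpath, dirnames, filenames in os.walk(tmp):
        visited.append(dirpath)
        del dirnames[:]   # prune: walk will not recurse below dirpath
    print(len(visited))   # 1
```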
os.walk will automatically "dig down" recursively, so you don't need to recurse the tree yourself.
I think this should be the basic form of your code:
import csv
import os
import fnmatch

directoriesToMatch = [list here...]
filenamesToMatch = [list here...]

abs_path = os.path.abspath('.')
for dirname, subdirs, filenames in os.walk('.'):
    if len(set(directoriesToMatch).difference(subdirs)) == 0:  # all dirs are there
        if len(set(filenamesToMatch).difference(filenames)) == 0:  # all files are there
            if <any other filename/directory checking code>:
                # processing code here ...
And according to the python documentation, if you for whatever reason don't want to continue recursing, just delete entries from subdirs:
http://docs.python.org/2/library/os.html
If you instead want to check that there are NO sub-directories where you find your files to process, you could also change the dirs check to:
if len(subdirs)==0: # check that this is an empty directory
I'm not sure I quite understand the question, so I hope this helps!
Edit:
Ok, so if you need to check there are no files instead, just use:
if len(filenames)==0:
But as I stated above, it would probably be better to just look FOR specific files instead of checking for empty directories.
What is the simplest way to get the full recursive list of files inside a folder with python? I know about os.walk(), but it seems overkill for just getting the unfiltered list of all files. Is it really the only option?
There's nothing preventing you from creating your own function:
import os

def listfiles(folder):
    for root, folders, files in os.walk(folder):
        for filename in folders + files:
            yield os.path.join(root, filename)

You can use it like so:

for filename in listfiles('/etc/'):
    print(filename)
os.walk() is not overkill by any means. It can generate your list of files and directories in a jiffy:
files = [os.path.join(dirpath, name)
         for dirpath, dirs, filenames in os.walk('.')
         for name in dirs + filenames]

You can turn this into a generator, to only process one path at a time and save on memory.
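A sketch of that generator variant (the helper name iter_files is made up for illustration); it yields one path at a time instead of building the whole list in memory:

```python
import os
import tempfile

def iter_files(top):
    # Same comprehension as above, but as a generator expression.
    return (os.path.join(dirpath, name)
            for dirpath, dirs, files in os.walk(top)
            for name in dirs + files)

# Demonstrate on a throwaway directory with a single file.
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "a.txt"), "w").close()
    print(list(iter_files(tmp)))
```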
You could also use the find program itself from Python by using sh
import sh
text_files = sh.find(".", "-iname", "*.txt")
Alternatively, you can recurse manually with isdir()/isfile() and listdir(), or you could use subprocess.check_output() and call find .. Basically, os.walk() is the highest-level option; a semi-manual solution based on listdir() is slightly lower level; and if for some reason you want the same output find . would give you, you can make a system call with subprocess.
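A sketch of the manual listdir()/isdir() recursion mentioned above (the helper name list_all is made up for illustration; sorted() just makes the order deterministic):

```python
import os
import tempfile

def list_all(folder):
    # Recurse by hand: list each entry, then descend into directories.
    result = []
    for name in sorted(os.listdir(folder)):
        full = os.path.join(folder, name)
        result.append(full)
        if os.path.isdir(full):
            result.extend(list_all(full))
    return result

# Demonstrate on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "sub"))
    open(os.path.join(tmp, "sub", "f.txt"), "w").close()
    print(list_all(tmp))
```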
pathlib.Path.rglob is pretty simple. It lists the entire directory tree
(The argument is a filepath search pattern. "*" means list everything)
import pathlib

for path in pathlib.Path("directory_to_list/").rglob("*"):
    print(path)
os.walk() is hard to use, just kick it and use pathlib instead.
Here is a Python function mimicking the similar list.files function in the R language:

import pathlib

def list_files(path, pattern, full_names=False, recursive=True):
    if recursive:
        files = pathlib.Path(path).rglob(pattern)
    else:
        files = pathlib.Path(path).glob(pattern)
    if full_names:
        files = [str(f) for f in files]
    else:
        files = [f.name for f in files]
    return files
import os

path = "path/to/your/dir"
for (path, dirs, files) in os.walk(path):
    print(files)
Is this overkill, or am I missing something?
In Unix all disks are exposed as paths in the main filesystem, so os.walk('/') would traverse, for example, /media/cdrom as well as the primary hard disk, and that is undesirable for some applications.
How do I get an os.walk that stays on a single device?
Related:
Is there a way to determine if a subdirectory is in the same filesystem from python when using os.walk?
From os.walk docs:
When topdown is true, the caller can modify the dirnames list in-place (perhaps using del or slice assignment), and walk() will only recurse into the subdirectories whose names remain in dirnames; this can be used to prune the search
So something like this should work:
for root, dirnames, filenames in os.walk(...):
    dirnames[:] = [
        dir for dir in dirnames
        if not os.path.ismount(os.path.join(root, dir))]
    ...
I think os.path.ismount might work for you. Your code might look something like this:
import os
import os.path

for root, dirs, files in os.walk('/'):
    # Handle files.
    dirs[:] = filter(lambda dir: not os.path.ismount(os.path.join(root, dir)),
                     dirs)
You may also find this answer helpful in building your solution.
*Thanks for the comments on filtering dirs correctly.
os.walk() can't tell (as far as I know) that it is browsing a different drive. You will need to check that yourself.
Try using os.stat(), or checking that the root variable from os.walk() does not start with /media.
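A sketch of the os.stat() approach (the helper name walk_single_device is made up): compare each subdirectory's st_dev with the starting directory's device and prune those that differ. The temporary directory below only shows the mechanics, since everything in it lives on one device.

```python
import os
import tempfile

def walk_single_device(top):
    top_dev = os.stat(top).st_dev
    for root, dirs, files in os.walk(top):
        # Prune subdirectories that sit on a different device (mount).
        dirs[:] = [d for d in dirs
                   if os.stat(os.path.join(root, d)).st_dev == top_dev]
        yield root, dirs, files

with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "sub"))
    roots = [root for root, _, _ in walk_single_device(tmp)]
    print(len(roots))   # 2: tmp and tmp/sub are on the same device
```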