More generic os.path.basename for os.walk()? - python

os.walk() yields a 3-tuple (dirpath, dirnames, filenames) (see documentation).
If I want to use certain parts of dirpath, is there a more robust way than splitting on \?
In a path like "C:\top_folder\project\somethingelse" I currently do dirpath.split('\\')[2] # project, but if the folder structure changes I would need to change that line, and I would not always know the structure had changed until my scripts break.
Is there a way to dissect dirpath to be more specific, similar to os.path.basename, just more generically?

pathlib does this in a very nice way:
https://docs.python.org/3/library/pathlib.html#pathlib.PurePath.parts
https://docs.python.org/3/library/pathlib.html#pathlib.PurePath.parent
It's considered the modern way to work with paths in Python.
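As a minimal sketch of how .parts and .parent could apply to the path from the question: PureWindowsPath is used here only so the Windows-style path parses the same way on any operating system; with a real dirpath from os.walk() you would normally wrap it in pathlib.Path instead.

```python
from pathlib import PureWindowsPath

# PureWindowsPath parses Windows-style paths on any operating system.
dirpath = PureWindowsPath(r"C:\top_folder\project\somethingelse")

print(dirpath.parts)        # tuple of path components, drive first
print(dirpath.name)         # last component, like os.path.basename
print(dirpath.parent.name)  # the component just above the last one
```

Indexing from the end (dirpath.parent.name, or dirpath.parts[-2]) keeps working when folders are added above the project directory, which a fixed index counted from the start does not.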

If I understood your question correctly, you want the parent directory in dirpath, the one that contains "somethingelse". You could simply do
dirpath.split('\\')[-2]
This is unaffected by changes to the directory structure above that point, although it still assumes backslash separators.

Related

Get absolute file path list and ignore dot directories/files python

How can I get the absolute paths of all files within a specified directory while ignoring dot (.) directories and dot (.) files?
I have the solution below, which yields the full path of every file in the directory recursively.
I am looking for the fastest way to list files with their full paths while leaving out dot directories and dot files
(the directory may contain 100 to 500 million files).
import os
def absoluteFilePath(directory):
    for dirpath, _, filenames in os.walk(directory):
        for f in filenames:
            yield os.path.abspath(os.path.join(dirpath, f))
for files in absoluteFilePath("/my-huge-files"):
    # use some starts-with-dot logic here? or any better solution
    pass
Example:
/my-huge-files/project1/file{1..100} # Consider all files from file1 to 100
/my-huge-files/.project1/file{1..100} # ignore the .project1 directory and its files (no files under dot directories are needed)
/my-huge-files/project1/.file1000 # ignore .file1000 because it starts with a dot
os.walk by definition visits every file in a hierarchy, but you can select which ones you actually print with a simple textual filter.
for file in absoluteFilePath("/my-huge-files"):
    if '/.' not in file:
        print(file)
When your starting path is already absolute, calling os.path.abspath on it is redundant, but I guess in the grand scheme of things, you can just leave it in.
Don't use os.walk() as it will visit every file
Instead, fall back to .scandir() or .listdir() and write your own implementation
You can use pathlib.Path(test_path).expanduser().resolve() to fully expand a path
import os
from pathlib import Path
def walk_ignore(search_root, ignore_prefixes=(".",)):
    """ recursively walk directories, ignoring files with some prefix
    pass search_root as an absolute directory to get absolute results
    """
    for dir_entry in os.scandir(Path(search_root)):
        if dir_entry.name.startswith(ignore_prefixes):
            continue
        if dir_entry.is_dir():
            yield from walk_ignore(dir_entry, ignore_prefixes=ignore_prefixes)
        else:
            yield Path(dir_entry)
You may be able to save some overhead with a closure, coercing to Path once, yielding only .name, etc., but that's really up to your needs
Also, not an answer to your question but related to it: if the files are very small, you'll likely find that packing them together (several files in one) or tuning the filesystem block size gives tremendously better performance
Finally, some filesystems come with bizarre caveats specific to them and you can likely break this with oddities like symlink loops
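For comparison, os.walk() itself can be kept out of dot directories: when walking top-down, names removed from the dirnames list in place are not descended into. The sketch below is one possible way to do that, not a benchmarked solution for hundreds of millions of files; the function name walk_visible is made up for illustration.

```python
import os

def walk_visible(directory):
    """Yield absolute paths of files, skipping dot files and dot directories.

    walk_visible is a hypothetical name used only for this sketch.
    """
    for dirpath, dirnames, filenames in os.walk(directory, topdown=True):
        # Slice assignment edits the very list os.walk() holds, so the
        # hidden directories removed here are never entered.
        dirnames[:] = [d for d in dirnames if not d.startswith(".")]
        for name in filenames:
            if not name.startswith("."):
                yield os.path.abspath(os.path.join(dirpath, name))
```

Because the pruning happens before os.walk() descends, the contents of .project1 in the example layout would not be listed at all, rather than listed and then filtered out.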

Getting the absolute paths of all files in a folder, without traversing the subfolders

Let
my_dir = "/raid/user/my_dir"
be a folder on my filesystem, which is not the current folder (i.e., it's not the result of os.getcwd()). I want to retrieve the absolute paths of all files at the first level of hierarchy in my_dir (i.e., the absolute paths of all files which are in my_dir, but not in a subfolder of my_dir) as a list of strings absolute_paths. I need it, in order to later delete those files with os.remove().
This is nearly the same use case as
Get absolute paths of all files in a directory
but the difference is that I don't want to traverse the folder hierarchy: I only need the files at the first level of hierarchy (at depth 0? not sure about terminology here).
It's easy to adapt that solution: Call os.walk() just once, and don't let it continue:
root, dirs, files = next(os.walk(my_dir, topdown=True))
files = [ os.path.join(root, f) for f in files ]
print(files)
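You can use the os.path module and a list comprehension. Note that os.listdir() returns bare names, so each one has to be joined with my_dir before it is tested or converted; otherwise it is resolved against the current working directory.
import os
absolute_paths = [os.path.abspath(os.path.join(my_dir, f)) for f in os.listdir(my_dir) if os.path.isfile(os.path.join(my_dir, f))]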
You can use os.scandir which returns an os.DirEntry object that has a variety of options including the ability to distinguish files from directories.
with os.scandir(somePath) as it:
    paths = [entry.path for entry in it if entry.is_file()]
print(paths)
If you want directories in the list as well, remove the condition from the list comprehension.
The documentation also has this note under os.listdir():
See also: The scandir() function returns directory entries along with file attribute information, giving better performance for many common use cases.
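A small self-contained sketch of the os.scandir() approach, wrapped in a function so the result is absolute even when a relative directory is passed in. The function name first_level_files is an assumption made for this example.

```python
import os

def first_level_files(directory):
    """Return absolute paths of the regular files directly inside directory.

    first_level_files is a hypothetical helper name for this sketch.
    """
    directory = os.path.abspath(directory)
    with os.scandir(directory) as entries:
        # entry.path joins the scanned directory with the entry name, so it
        # is absolute whenever the directory given to scandir() is absolute.
        return sorted(entry.path for entry in entries if entry.is_file())
```

The returned strings could then be passed to os.remove() one by one; subdirectories and their contents are left out because scandir() does not recurse.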

Making os.walk work in a non-standard way

I'm trying to do the following, in this order:
Use os.walk() to go down each directory.
Each directory has subfolders, but I'm only interested in the first level of subfolders. For example, a directory looks like:
/home/RawData/SubFolder1/SubFolder2
In RawData2, I want to have folders that stop at the SubFolder1 level.
The thing is, it seems like os.walk() goes down through ALL of the RawData folder, and I'm not certain how to make it stop.
The below is what I have so far - I've tried a number of other combinations of substituting variable dirs for root, or files, but that doesn't seem to get me what I want.
import os
for root, dirs, files in os.walk("/home/RawData"):
    os.chdir("/home/RawData2/")
    make_path("/home/RawData2/"+str(dirs))
I suggest you use glob instead.
As the help on glob describes:
glob(pathname)
Return a list of paths matching a pathname pattern.
The pattern may contain simple shell-style wildcards a la
fnmatch. However, unlike fnmatch, filenames starting with a
dot are special cases that are not matched by '*' and '?'
patterns.
So, your pattern is every first level directory, which I think would be something like this:
/root_path/*/sub_folder1/sub_folder2
So, you start at your root, get everything in that first level, and then look for sub_folder1/sub_folder2. I think that works.
To put it all together:
from glob import glob
dirs = glob('/root_path/*/sub_folder1/sub_folder2')
# Then iterate for each path
for i in dirs:
    print(i)
Beware: Documentation for os.walk says:
don’t change the current working directory between resumptions of walk(). walk() never changes the current directory, and assumes that its caller doesn’t either
so you should avoid os.chdir("/home/RawData2/") in the walk loop.
You can easily ask walk not to recurse by using topdown=True and clearing dirs in place:
for root, dirs, files in os.walk("/home/RawData", topdown=True):
    for rep in dirs:
        make_path(os.path.join("/home/RawData2/", rep))  # make_path is the asker's own helper
        # add processing here
    del dirs[:]  # tell walk not to recurse into any subdirectory
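As a runnable variant of the same idea, the sketch below takes only the first tuple from os.walk() and creates one directory per first-level subfolder. It assumes os.makedirs() is an acceptable stand-in for the asker's make_path helper, and mirror_first_level is a name invented for this example.

```python
import os

def mirror_first_level(source_root, target_root):
    """Create a directory in target_root for each direct subdirectory of source_root.

    mirror_first_level is a hypothetical name; os.makedirs() stands in for
    the asker's make_path helper, whose behavior is not shown in the question.
    """
    # next() consumes only the first tuple, which describes source_root itself.
    # The default covers the case where source_root cannot be walked.
    _, dirs, _ = next(os.walk(source_root), (source_root, [], []))
    for name in dirs:
        os.makedirs(os.path.join(target_root, name), exist_ok=True)
    return sorted(dirs)
```

With the paths from the question this would be called as mirror_first_level("/home/RawData", "/home/RawData2"), and no os.chdir() is needed because every path is built with os.path.join().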

os.walk iteration not walking in Python

I'm using os.walk() to check a directory for redundant files and list them out. The pseudo-code looks something like this:
def checkPath(path):
    do the for dirname, dirnames, filenames in os.walk(path) thing here...
pathList = ["path1", "path2"]
for each in pathList:
    checkPath(each)
This works fine on the first run through and I get everything as expected, but on the next os.walk, for the second path, it skips right through: there is nothing in dirname, dirnames, filenames. I added some print statements to check, and it is entering the function but not doing anything in the os.walk() part.
Before I moved the os.walk() part into a function to see if that would fix the problem, it was in a for loop inline with the main body. When I tried (just for fun) cleaning up the dirname, dirnames, filenames variables with del, the cleanup for the second path reported that the variable dirname did not exist.
So it looks like, whether within a function or not, the successive iterations of os.walk() aren't populating.
Any ideas?
Thanks!
To add some working code as an example, something like this. It doesn't really matter what it's doing; I'm just trying to get os.walk to walk multiple paths:
import os
def checkPath(path):
    for dirname, dirnames, filenames in os.walk(path):
        for filename in filenames:
            print(filename)
pathList = ["c:\temp\folder1", "c:\temp\folder2"]
for path in pathList:
    checkPath(path)
print("done")
It could be done this way (was trying to see if calling os.walk in a different way, like one of the other commenters suggested, might help), or it can be done inline, whatever works obviously...
thanks again all,
Your code works for me, if I use actual paths on my system which refer to non-empty directories.
I suspect you might have an issue with the line...
pathList = ["c:\temp\folder1", "c:\temp\folder2"]
...since both \t and \f are valid escape sequences.
Try...
pathList = ["c:\\temp\\folder1", "c:\\temp\\folder2"]
...and if that's not the problem, then it would help to cite the actual code you're using.
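A short sketch of what the escape sequences do to the path string, and of a raw string literal as an alternative to doubling every backslash. The variable names are made up for this example.

```python
plain = "c:\temp\folder1"      # \t becomes a tab, \f becomes a form feed
escaped = "c:\\temp\\folder1"  # doubled backslashes give literal backslashes
raw = r"c:\temp\folder1"       # raw string literal, backslashes kept as typed

print(repr(plain))                   # shows the control characters in the string
print(escaped == raw)                # both spell the intended path
print("\t" in plain, "\f" in plain)  # the mangled path contains both characters
```

Since the mangled string names a directory that does not exist, os.walk() yields nothing for it by default instead of raising an error, which would match the silent behavior described in the question.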
os.walk returns a generator :-) http://wiki.python.org/moin/Generators
There are a few workarounds:
use a list
ll = list(os.walk(path))
call os.walk() each time
use itertools.chain
The code you posted should not have this problem (you call os.walk each time), but it makes me really think about generator exhaustion. So post your code as you wrote it [0]
[0] for example, do you have some sort of predefined argument in your function?
Here is a working example
import os
def checkPath(list_path):
    for path in list_path:
        for (dirpath, dirs, files) in os.walk(path):
            print(len(files))
checkPath(["F:/", "F:/"])
See doc:
Generate the file names in a directory tree by walking the tree either
top-down or bottom-up. For each directory in the tree rooted at
directory top (including top itself), it yields a 3-tuple (dirpath,
dirnames, filenames).
EDIT:
As mentioned in the other answers, os.walk() returns a generator. A generator can be iterated only once. It is not a structure that stores values; it generates the values on the fly as they are requested. That's why a second loop over the same os.walk() generator has no more results. You can call os.walk() again each time you need it, or store its results in a list.
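The exhaustion behavior can be reproduced with a throwaway directory. This is only an illustrative sketch; the temporary directory and the file name a.txt are assumptions made for the example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # One file, no subdirectories: os.walk() yields exactly one tuple.
    open(os.path.join(tmp, "a.txt"), "w").close()

    walker = os.walk(tmp)        # a single generator object
    first_pass = list(walker)    # consumes it completely
    second_pass = list(walker)   # nothing left to yield

    fresh_pass = list(os.walk(tmp))  # a new call starts from the top again

print(len(first_pass), len(second_pass), len(fresh_pass))  # prints: 1 0 1
```

Reusing one generator object gives an empty second pass, while calling os.walk() again, as the code in the question does, starts a fresh traversal.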

How do I prevent Python's os.walk from walking across mount points?

In Unix all disks are exposed as paths in the main filesystem, so os.walk('/') would traverse, for example, /media/cdrom as well as the primary hard disk, and that is undesirable for some applications.
How do I get an os.walk that stays on a single device?
Related:
Is there a way to determine if a subdirectory is in the same filesystem from python when using os.walk?
From the os.walk docs:
When topdown is true, the caller can modify the dirnames list in-place (perhaps using del or slice assignment), and walk() will only recurse into the subdirectories whose names remain in dirnames; this can be used to prune the search
So something like this should work:
for root, dirnames, filenames in os.walk(...):
    dirnames[:] = [
        dir for dir in dirnames
        if not os.path.ismount(os.path.join(root, dir))]
    ...
I think os.path.ismount might work for you. Your code might look something like this:
import os
import os.path
for root, dirs, files in os.walk('/'):
    # Handle files.
    dirs[:] = filter(lambda dir: not os.path.ismount(os.path.join(root, dir)),
                     dirs)
You may also find this answer helpful in building your solution.
*Thanks for the comments on filtering dirs correctly.
os.walk() can't tell (as far as I know) that it is browsing a different drive. You will need to check that yourself.
Try using os.stat(), or checking that the root variable from os.walk() is not /media
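One way the os.stat() suggestion could look is to compare the st_dev field, which identifies the device a path resides on, against that of the starting directory. This is a sketch under that assumption, and walk_one_device is a name invented for it.

```python
import os

def walk_one_device(top):
    """Like os.walk(top), but do not descend into other filesystems.

    walk_one_device is a hypothetical name used only for this sketch.
    """
    top_dev = os.stat(top).st_dev
    for root, dirs, files in os.walk(top, topdown=True):
        # Keep only subdirectories that live on the same device as top.
        # lstat() is used so a symlink is judged by itself, not its target.
        dirs[:] = [
            d for d in dirs
            if os.lstat(os.path.join(root, d)).st_dev == top_dev
        ]
        yield root, dirs, files
```

Directories on other devices are dropped from dirs before they are yielded, so they are neither reported nor entered; a variant that still reports them would yield before pruning.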
