Avoid the base path in recursive tree walking - Python

I know how to list recursively all files/folders of d:\temp with various methods, see How to use glob() to find files recursively?.
But often I'd like to avoid having the d:\temp\ prefix in the results, and get paths relative to this base instead.
This can be done with:
import os, glob
for f in glob.glob('d:\\temp\\**\\*', recursive=True):
    print(os.path.relpath(f, 'd:\\temp'))
The same can be done by stripping the prefix manually, e.g. with slicing: f[len('d:\\temp\\'):] (note that f.lstrip('d:\\temp\\') appears to work here but is wrong: lstrip removes a set of characters, not a prefix).
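For reference, a small sketch of stripping the prefix safely (str.removeprefix exists on Python 3.9+; slicing works everywhere):
f = 'd:\\temp\\New folder\\New Text Document.txt'
# removeprefix removes an exact prefix, unlike lstrip which strips a character set
print(f.removeprefix('d:\\temp\\'))  # New folder\New Text Document.txt
print(f[len('d:\\temp\\'):])         # same result via slicing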
import pathlib
root = pathlib.Path("d:\\temp")
print([p.relative_to(root) for p in root.glob("**/*")])
These three approaches work. But in fact, if you read the source of glob.py, it accumulates/joins all the parts of the path. So the solutions above are ... "removing something that was just added before"! They work, but it's not very elegant. The same goes for pathlib with relative_to, which removes the prefix.
Question: how to modify the next few lines to not have d:\temp in the output (without removing something that was concatenated before!)?
import os
def listpath(path):
    for f in os.scandir(path):
        f2 = os.path.join(path, f)
        if os.path.isdir(f):
            yield f2
            yield from listpath(f2)
        else:
            yield f2

for f in listpath('d:\\temp'):
    print(f)
#d:\temp\New folder
#d:\temp\New folder\New Text Document - Copy.txt
#d:\temp\New folder\New Text Document.txt
#d:\temp\New Text Document - Copy.txt
#d:\temp\New Text Document.txt

You can do something like shown in the following example. Basically, we recursively return the path parts, joining them together, but we never join in the initial root.
import os
def listpath(root, parent=''):
    scan = os.path.join(root, parent)
    for f in os.scandir(scan):
        f2 = os.path.join(parent, f.name)
        yield f2
        if f.is_dir():
            yield from listpath(root, f2)

for f in listpath('d:\\temp'):
    print(f)
Python 3.10 added a new root_dir option, which allows you to do this with the built-in glob with no problem:
import glob
glob.glob('**/*', root_dir='d:\\temp', recursive=True)
You could also use a 3rd-party library such as the wcmatch library (of which I am the author), which has already implemented this behavior; a hedged sketch of its usage follows. But in this simple case, your listpath approach may be sufficient.
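For illustration only, assuming wcmatch's glob module exposes a glob() function with a GLOBSTAR flag and a root_dir keyword as its documentation describes (check the library's docs for the exact API), usage could look something like:
from wcmatch import glob as wglob

# hypothetical usage: GLOBSTAR enables '**' recursion, root_dir keeps results
# relative to the base directory
files = wglob.glob('**/*', flags=wglob.GLOBSTAR, root_dir='d:\\temp')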

Related

Python: get a complete file name based on a partial file name

In a directory, there are two files that share most of their names:
my_file_0_1.txt
my_file_word_0_1.txt
I would like to open my_file_0_1.txt
I need to avoid specifying the exact filename, and instead need to search the directory for a filename that matches the partial string my_file_0.
From this answer here, and this one, I tried the following:
import numpy as np
import os, fnmatch, glob

def find(pattern, path):
    result = []
    for root, dirs, files in os.walk(path):
        for name in files:
            if fnmatch.fnmatch(name, pattern):
                result.append(os.path.join(root, name))
    return result

if __name__ == '__main__':
    #filename = find('my_file_0*.txt', '/path/to/file')
    #print(filename)
    print(glob.glob('my_file_0' + '*' + '.txt'))
Neither of these would print the actual filename, for me to read in later using np.loadtxt.
How can I find and store the name of a file, based on the result of a string match?
glob.glob() needs a path to search in; if you are running the script from another directory, it will not find what you expect.
(You can check the current directory with os.getcwd().)
It should work with the line below:
print(glob.glob('path/to/search/my_file_0' + '*.txt'))
or
print(glob.glob(r'C:\path\to\search\my_file_0' + '*.txt'))  # for Windows
A solution using os.listdir()
Could you not also use the os module to search through os.listdir()? So for instance:
import os
partialFileName = "my_file_0"
for f in os.listdir():
    if partialFileName == f[:len(partialFileName)]:
        print(f)
I just developed the approach below and, while searching to see if there was a better way, came across your question. I think you may like this approach. I needed pretty much the same thing you are asking for, and came up with this clean one-liner using a list comprehension and a sure expectation that there would only be one file name matching my criteria. I modified my code to match your question.
import os
file_name = [n for n in os.listdir("C:/Your/Path") if 'my_file_0' in n][0]
print(file_name)
Now, if this is in a looping / repeated call situation, you can modify as below:
for i in range(1, 4):
    file = [n for n in os.listdir("C:/Your/Path") if f'my_file_{i}' in n][0]
    print(file)
or, probably more practically ...
def get_file_name_with_number(num):
    file = [n for n in os.listdir("C:/Your/Path") if f'my_file_{num}' in n][0]
    return file

print(get_file_name_with_number(0))

How to find files and skip directories in os.listdir

I use os.listdir and it works fine, but I get sub-directories in the list also, which is not what I want: I need only files.
What function do I need to use for that?
I also looked at os.walk and it seems to be what I want, but I'm not sure how it works.
You need to filter out directories; os.listdir() lists all names in a given path. You can use os.path.isdir() for this:
import os

basepath = '/path/to/directory'
for fname in os.listdir(basepath):
    path = os.path.join(basepath, fname)
    if os.path.isdir(path):
        # skip directories
        continue
Note that this only filters out directories after following symlinks. fname is not necessarily a regular file, it could also be a symlink to a file. If you need to filter out symlinks as well, you'd need to use not os.path.islink() first.
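For instance, a minimal sketch combining both tests:
import os

basepath = '/path/to/directory'
for fname in os.listdir(basepath):
    path = os.path.join(basepath, fname)
    # keep regular files only, and also skip symlinks that point at files
    if os.path.isfile(path) and not os.path.islink(path):
        print(fname)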
On a modern Python version (3.5 or newer), an even better option is to use the os.scandir() function; this produces DirEntry() instances. In the common case, this is faster, as each DirEntry already has cached enough information to determine whether an entry is a directory or not:
basepath = '/path/to/directory'
for entry in os.scandir(basepath):
    if entry.is_dir():
        # skip directories
        continue
    # use entry.path to get the full path of this entry, or use
    # entry.name for the base filename
You can use entry.is_file(follow_symlinks=False) if only regular files (and not symlinks) are needed.
os.walk() does the same work under the hood; unless you need to recurse down subdirectories, you don't need to use os.walk() here.
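That said, if you do reach for os.walk(), a small sketch of staying at the top level is to take just its first result:
import os

# os.walk() yields one (dirpath, dirnames, filenames) tuple per directory;
# the first tuple describes the top-level directory itself
dirpath, dirnames, filenames = next(os.walk('/path/to/directory'))
print(filenames)  # only the names of files directly inside the directory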
Here is a nice little one-liner in the form of a list comprehension:
[f for f in os.listdir(your_directory) if os.path.isfile(os.path.join(your_directory, f))]
This will return a list of filenames within the specified your_directory.
import os
directoryOfChoice = "C:\\"  # Replace with a directory of choice!!!
# join each name to the directory so isfile() checks the right location
files = list(filter(os.path.isfile,
                    (os.path.join(directoryOfChoice, f) for f in os.listdir(directoryOfChoice))))
P.S: os.getcwd() returns the current directory.
for fname in os.listdir('.'):
    if os.path.isdir(fname):
        pass  # do your stuff here for a directory
    else:
        pass  # do your stuff here for a regular file
The solution with os.walk() would be:
for r, d, f in os.walk('path/to/dir'):
    for name in f:
        # f holds all files found in the currently visited directory
        print(os.path.join(r, name))
Even though this is an older post, let me add, for the sake of completeness, the pathlib library, introduced in 3.4, which provides an OOP style of handling directories and files. To get all files in a directory, you can use:
from pathlib import Path

def get_list_of_files_in_dir(directory: str, file_types: str = '*') -> list:
    return [f for f in Path(directory).glob(file_types) if f.is_file()]
Following your example, you could use it like this:
mypath = '/path/to/directory'
files = get_list_of_files_in_dir(mypath)
If you only want a subset of files depending on the file extension (e.g. "only csv files"), you can use:
files = get_list_of_files_in_dir(mypath, '*.csv')
Note that per PEP 471, the DirEntry object provides is_dir(*, follow_symlinks=True), so:
from os import scandir

folder = '/home/myfolder/'
for entry in scandir(folder):
    if entry.is_dir():
        # do code or skip
        continue
    myfile = folder + entry.name
    # do something with myfile

Iterating through directories with Python

I need to iterate through the subdirectories of a given directory and search for files. If I find a file, I have to open it, change the content, and replace it with my own lines.
I tried this:
import os
rootdir = 'C:/Users/sid/Desktop/test'
for subdir, dirs, files in os.walk(rootdir):
    for file in files:
        f = open(file, 'r')
        lines = f.readlines()
        f.close()
        f = open(file, 'w')
        for line in lines:
            newline = "No you are not"
            f.write(newline)
        f.close()
but I am getting an error. What am I doing wrong?
The actual walk through the directories works as you have coded it. If you replace the contents of the inner loop with a simple print statement you can see that each file is found:
import os
rootdir = 'C:/Users/sid/Desktop/test'
for subdir, dirs, files in os.walk(rootdir):
    for file in files:
        print(os.path.join(subdir, file))
If you still get errors when running the above, please provide the error message.
Another way of returning all files in subdirectories is to use the pathlib module, introduced in Python 3.4, which provides an object-oriented approach to handling filesystem paths (pathlib is also available on Python 2.7 via the pathlib2 module on PyPI):
from pathlib import Path
rootdir = Path('C:/Users/sid/Desktop/test')
# Return a list of regular files only, not directories
file_list = [f for f in rootdir.glob('**/*') if f.is_file()]
# For absolute paths instead of paths relative to the current dir
file_list = [f for f in rootdir.resolve().glob('**/*') if f.is_file()]
Since Python 3.5, the glob module also supports recursive file finding:
import os
from glob import iglob
rootdir_glob = 'C:/Users/sid/Desktop/test/**/*' # Note the added asterisks
# This will return absolute paths
file_list = [f for f in iglob(rootdir_glob, recursive=True) if os.path.isfile(f)]
The file_list from either of the above approaches can be iterated over without the need for a nested loop:
for f in file_list:
    print(f)  # Replace with desired operations
From Python 3.5 onward, you can use ** with glob.iglob(path + '/**', recursive=True), and it seems the most Pythonic solution, i.e.:
import glob, os
for filename in glob.iglob('/pardadox-music/**', recursive=True):
    if os.path.isfile(filename):  # filter dirs
        print(filename)
Output:
/pardadox-music/modules/her1.mod
/pardadox-music/modules/her2.mod
...
Notes:
glob.iglob
glob.iglob(pathname, recursive=False)
Return an iterator which yields the same values as glob() without actually storing them all simultaneously.
If recursive is true, the pattern '**' will match any files and zero or more directories and subdirectories.
If the directory contains files starting with . they won’t be matched by default. For example, consider a directory containing card.gif and .card.gif:
>>> import glob
>>> glob.glob('*.gif')
['card.gif']
>>> glob.glob('.c*')
['.card.gif']
You can also use rglob(pattern),
which is the same as calling glob() with **/ added in front of the given relative pattern.
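For instance, a short sketch of the rglob equivalent of the iglob loop above:
from pathlib import Path

# Path.rglob('*') is equivalent to Path('...').glob('**/*')
for path in Path('/pardadox-music').rglob('*'):
    if path.is_file():  # filter dirs
        print(path)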

How can I search sub-folders using glob.glob module? [duplicate]

This question already has answers here: How to use glob() to find files recursively?
I want to open a series of subfolders in a folder and find some text files and print some lines of the text files. I am using this:
configfiles = glob.glob('C:/Users/sam/Desktop/file1/*.txt')
But this cannot access the subfolders as well. Does anyone know how I can use the same command to access subfolders as well?
In Python 3.5 and newer use the new recursive **/ functionality:
configfiles = glob.glob('C:/Users/sam/Desktop/file1/**/*.txt', recursive=True)
When recursive is set, ** followed by a path separator matches 0 or more subdirectories.
In earlier Python versions, glob.glob() cannot list files in subdirectories recursively.
In that case I'd use os.walk() combined with fnmatch.filter() instead:
import os
import fnmatch
path = 'C:/Users/sam/Desktop/file1'
configfiles = [os.path.join(dirpath, f)
               for dirpath, dirnames, files in os.walk(path)
               for f in fnmatch.filter(files, '*.txt')]
This'll walk your directories recursively and return all absolute pathnames to matching .txt files. In this specific case the fnmatch.filter() may be overkill; you could also use a .endswith() test:
import os
path = 'C:/Users/sam/Desktop/file1'
configfiles = [os.path.join(dirpath, f)
               for dirpath, dirnames, files in os.walk(path)
               for f in files if f.endswith('.txt')]
There's a lot of confusion on this topic. Let me see if I can clarify it (Python 3.7):
1. glob.glob('*.txt'): matches all files ending in '.txt' in the current directory
2. glob.glob('*/*.txt'): matches all files ending in '.txt' in the immediate subdirectories only, but not in the current directory (without recursive=True, '*' never crosses a path separator)
3. glob.glob('**/*.txt'): same as 2, since '**' behaves like '*' when recursive is not set
4. glob.glob('*.txt', recursive=True): same as 1
5. glob.glob('*/*.txt', recursive=True): same as 2
6. glob.glob('**/*.txt', recursive=True): matches all files ending in '.txt' in the current directory and in all subdirectories
So it's best to always specify recursive=True.
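As a quick, self-contained check of the list above (a hypothetical throwaway tree; separators in the output depend on the OS):
import glob, os, tempfile

# build a throwaway tree: one .txt at the top level, one in a subdirectory
tmp = tempfile.mkdtemp()
os.makedirs(os.path.join(tmp, 'sub'))
for name in ('top.txt', os.path.join('sub', 'nested.txt')):
    open(os.path.join(tmp, name), 'w').close()

old = os.getcwd()
os.chdir(tmp)
try:
    print(glob.glob('*.txt'))                     # ['top.txt']
    print(glob.glob('*/*.txt'))                   # ['sub/nested.txt'] (case 2)
    print(glob.glob('**/*.txt', recursive=True))  # both files (case 6)
finally:
    os.chdir(old)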
To find files in immediate subdirectories:
configfiles = glob.glob(r'C:\Users\sam\Desktop\*\*.txt')
For a recursive version that traverses all subdirectories, you could use ** and pass recursive=True since Python 3.5:
configfiles = glob.glob(r'C:\Users\sam\Desktop\**\*.txt', recursive=True)
Both function calls return lists. You could use glob.iglob() to return paths one by one. Or use pathlib:
from pathlib import Path
path = Path(r'C:\Users\sam\Desktop')
txt_files_only_subdirs = path.glob('*/*.txt')
txt_files_all_recursively = path.rglob('*.txt') # including the current dir
Both methods return iterators (you can get paths one by one).
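A small sketch of consuming such an iterator lazily, reusing the path object from above:
it = path.rglob('*.txt')
first = next(it, None)   # first match, or None if there are no matches
for txt in it:           # continue with the remaining matches
    print(txt)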
The glob2 package supports wildcards and is reasonably fast:
import timeit

code = '''
import glob2
glob2.glob("files/*/**")
'''
print(timeit.timeit(code, number=1))
On my laptop it takes approximately 2 seconds to match >60,000 file paths.
You can use Formic with Python 2.6
import formic
fileset = formic.FileSet(include="**/*.txt", directory="C:/Users/sam/Desktop/")
Disclosure - I am the author of this package.
Here is an adapted version that enables glob.glob-like functionality without using glob2.
import os
import fnmatch

def find_files(directory, pattern='*'):
    if not os.path.exists(directory):
        raise ValueError("Directory not found {}".format(directory))
    matches = []
    for root, dirnames, filenames in os.walk(directory):
        for filename in filenames:
            full_path = os.path.join(root, filename)
            if fnmatch.filter([full_path], pattern):
                matches.append(os.path.join(root, filename))
    return matches
So if you have the following dir structure
tests/files
├── a0
│   ├── a0.txt
│   ├── a0.yaml
│   └── b0
│       ├── b0.yaml
│       └── b00.yaml
└── a1
You can do something like this
files = find_files('tests/files', '**/b0/b*.yaml')
> ['tests/files/a0/b0/b0.yaml', 'tests/files/a0/b0/b00.yaml']
This is pretty much an fnmatch pattern match on the whole path, rather than on the filename only.
(The first options are of course mentioned in other answers; here the goal is to show that glob uses os.scandir internally, and to provide a direct answer based on this.)
Using glob
As explained before, with Python 3.5+, it's easy:
import glob
for f in glob.glob('d:/temp/**/*', recursive=True):
    print(f)
#d:\temp\New folder
#d:\temp\New Text Document - Copy.txt
#d:\temp\New folder\New Text Document - Copy.txt
#d:\temp\New folder\New Text Document.txt
Using pathlib
from pathlib import Path
for f in Path('d:/temp').glob('**/*'):
    print(f)
Using os.scandir
os.scandir is what glob uses internally. So here is how to do it directly, making use of yield:
import os

def listpath(path):
    for f in os.scandir(path):
        f2 = os.path.join(path, f)
        if os.path.isdir(f):
            yield f2
            yield from listpath(f2)
        else:
            yield f2

for f in listpath('d:\\temp'):
    print(f)
configfiles = glob.glob('C:/Users/sam/Desktop/**/*.txt')
doesn't work for all cases (note it also needs recursive=True to descend into subdirectories); instead use glob2:
configfiles = glob2.glob('C:/Users/sam/Desktop/**/*.txt')
If you can install the glob2 package...
import glob2
filenames = glob2.glob("C:\\top_directory\\**\\*.ext") # Where ext is a specific file extension
folders = glob2.glob("C:\\top_directory\\**\\")
All filenames and folders:
all_ff = glob2.glob("C:\\top_directory\\**\\**")
If you're running Python 3.4+, you can use the pathlib module. The Path.glob() method supports the ** pattern, which means “this directory and all subdirectories, recursively”. It returns a generator yielding Path objects for all matching files.
from pathlib import Path
configfiles = Path("C:/Users/sam/Desktop/file1/").glob("**/*.txt")
You can use the functions glob.glob() or glob.iglob() directly from the glob module to retrieve paths recursively from inside directories/files and subdirectories/subfiles.
Syntax:
glob.glob(pathname, *, recursive=False) # pathname = '/path/to/the/directory' or subdirectory
glob.iglob(pathname, *, recursive=False)
In your example, it is possible to write like this:
import glob
import os
configfiles = glob.glob("C:/Users/sam/Desktop/*.txt")

for f in configfiles:
    print(f'Filename with path: {f}')
    print(f'Only filename: {os.path.basename(f)}')
    print(f'Filename without extensions: {os.path.splitext(os.path.basename(f))[0]}')
Output:
Filename with path: C:/Users/sam/Desktop/test_file.txt
Only filename: test_file.txt
Filename without extensions: test_file
Help:
Documentation for os.path.splitext and documentation for os.path.basename.
As pointed out by Martijn, glob can only do this through the ** operator introduced in Python 3.5. Since the OP explicitly asked for the glob module, the following will return a lazily evaluated iterator that behaves similarly:
import os, glob, itertools

configfiles = itertools.chain.from_iterable(
    glob.iglob(os.path.join(root, '*.txt'))
    for root, dirs, files in os.walk('C:/Users/sam/Desktop/file1/'))
Note that you can only iterate once over configfiles in this approach though. If you require a real list of configfiles that can be used in multiple operations you would have to create this explicitly by using list(configfiles).
The rglob method recurses all the way down to the deepest sub-level of your directory structure. If you only want one level deep, do not use it.
I realize the OP was talking about using glob.glob. I believe this answers the intent, however, which is to search all subfolders recursively.
The rglob function recently produced a 100x speed increase for a data processing algorithm that had been using the folder structure as a fixed assumption for the order of data reading. With rglob we were able to do a single scan through all files at or below a specified parent directory, save their names to a list (over a million files), then use that list to determine which files we needed to open at any point in the future, based on the file naming conventions alone rather than on which folder they were in.
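A hedged sketch of that single-scan-then-lookup pattern (the paths and naming convention here are hypothetical):
from pathlib import Path

# scan once, keeping every file path in memory ...
all_files = [p for p in Path('/data/parent').rglob('*') if p.is_file()]
# ... then resolve later lookups by file name alone, ignoring folders
by_name = {p.name: p for p in all_files}
wanted = by_name.get('sample_0001.dat')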

Simplest way to get the equivalent of "find ." in python?

What is the simplest way to get the full recursive list of files inside a folder with python? I know about os.walk(), but it seems overkill for just getting the unfiltered list of all files. Is it really the only option?
There's nothing preventing you from creating your own function:
import os

def listfiles(folder):
    for root, folders, files in os.walk(folder):
        for filename in folders + files:
            yield os.path.join(root, filename)

You can use it like so:
for filename in listfiles('/etc/'):
    print(filename)
os.walk() is not overkill by any means. It can generate your list of files and directories in a jiffy:
files = [os.path.join(dirpath, filename)
         for (dirpath, dirs, files) in os.walk('.')
         for filename in (dirs + files)]
You can turn this into a generator, to only process one path at a time and save on memory.
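A sketch of the generator version hinted at above:
import os

def iter_paths(top='.'):
    # yields one path at a time instead of materializing the whole list
    for dirpath, dirs, files in os.walk(top):
        for name in dirs + files:
            yield os.path.join(dirpath, name)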
You could also use the find program itself from Python by using sh:
import sh
text_files = sh.find(".", "-iname", "*.txt")
Either that, or manually recurse with isdir()/isfile() and listdir(); or you could use subprocess.check_output() and call find . directly. Basically, os.walk() is the highest-level option, a semi-manual solution based on listdir() is slightly lower level, and if you want the same output find . would give you, you can make a system call with subprocess.
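For example, a minimal sketch of the subprocess route (POSIX systems with find on the PATH):
import subprocess

# capture the output of `find .` and split it into one path per line
listing = subprocess.check_output(['find', '.']).decode().splitlines()
print(listing[:5])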
pathlib.Path.rglob is pretty simple. It lists the entire directory tree (the argument is a filepath search pattern; "*" means list everything):
import pathlib

for path in pathlib.Path("directory_to_list/").rglob("*"):
    print(path)
os.walk() is hard to use; just skip it and use pathlib instead.
Here is a Python function mimicking the list.files function from the R language:
import pathlib

def list_files(path, pattern, full_names=False, recursive=True):
    if recursive:
        files = pathlib.Path(path).rglob(pattern)
    else:
        files = pathlib.Path(path).glob(pattern)
    if full_names:
        files = [str(f) for f in files]
    else:
        files = [f.name for f in files]
    return files
import os

path = "path/to/your/dir"
for (path, dirs, files) in os.walk(path):
    print(files)
Is this overkill, or am I missing something?
