glob exclude pattern - python

I have a directory with a bunch of files inside: eee2314, asd3442 ... and eph.
I want to exclude all files that start with eph with the glob function.
How can I do it?

The pattern rules for glob are not regular expressions. Instead, they follow standard Unix path-expansion rules. There are only a few special characters: two different wildcards and character ranges are supported [from PyMOTW: glob – Filename pattern matching].
So you can exclude some files with patterns.
For example, to exclude manifest files (files starting with _) with glob, you can use:
files = glob.glob('files_path/[!_]*')
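Note that a character class matches exactly one character, so this trick can only discriminate on a single position; for the asker's multi-character eph prefix it is too coarse. A minimal sketch of the limitation, assuming the same files_path:
import glob
# [!e]* drops everything starting with "e", so eee2314 would vanish along
# with the eph files; the class trick only fits one-character prefixes like "_".
files = glob.glob('files_path/[!e]*')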

You can take the set difference and convert the result back to a list:
list(set(glob("*")) - set(glob("eph*")))
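A runnable version with the import this assumes; note that the set difference loses glob's ordering, so sort if order matters:
from glob import glob
# Everything minus the eph* matches; sets drop ordering, hence sorted().
files = sorted(set(glob("*")) - set(glob("eph*")))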

You can't exclude patterns with the glob function; globs only allow for inclusion patterns. Globbing syntax is very limited (even a [!..] character class must match a character, so it is an inclusion pattern for every character that is not in the class).
You'll have to do your own filtering; a list comprehension usually works nicely here:
files = [fn for fn in glob('somepath/*.txt')
         if not os.path.basename(fn).startswith('eph')]

I recommend pathlib over glob; filtering on a single pattern is very simple:
from pathlib import Path
p = Path(YOUR_PATH)
filtered = [x for x in p.glob("**/*") if not x.name.startswith("eph")]
And if you want to filter on a more complex pattern, you can define a function to do it, like this:
def not_in_pattern(x):
    return (not x.name.startswith("eph")) and not x.name.startswith("epi")

filtered = [x for x in p.glob("**/*") if not_in_pattern(x)]
Using that code, you can filter out all files that start with eph or with epi.
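A small aside, not from the original answer: str.startswith also accepts a tuple of prefixes, so the same filter can be written without a helper function:
# Equivalent filter; startswith checks both prefixes in one call.
filtered = [x for x in p.glob("**/*") if not x.name.startswith(("eph", "epi"))]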

Late to the game, but you could alternatively just apply a Python filter to the result of a glob:
files = glob.iglob('your_path_here')
files_i_care_about = filter(lambda x: not x.startswith("eph"), files)
or replacing the lambda with an appropriate regex search, etc...
EDIT: I just realized that if you're using full paths, startswith won't work, so you'd need a regex:
In [10]: a
Out[10]: ['/some/path/foo', 'some/path/bar', 'some/path/eph_thing']
In [11]: filter(lambda x: not re.search('/eph', x), a)
Out[11]: ['/some/path/foo', 'some/path/bar']
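A sketch of the basename-based alternative to that regex, reusing the placeholder path from above as a directory:
import glob
import os
# Checking the basename keeps startswith working even for full paths.
files = glob.iglob('your_path_here/*')
files_i_care_about = [f for f in files
                      if not os.path.basename(f).startswith("eph")]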

How about skipping particular files while iterating over all the files in the folder?
The code below skips all Excel files that start with 'eph':
import glob
import re
for file in glob.glob('*.xlsx'):
    if re.match(r'eph.*\.xlsx', file):
        continue
    else:
        # do your stuff here
        print(file)
This way you can use more complex regex patterns to include/exclude a particular set of files in a folder.

More generally, to exclude files that don't comply with some shell pattern, you could use the fnmatch module:
import fnmatch
from glob import glob
file_list = glob('somepath')
file_list = [f for f in file_list if fnmatch.fnmatch(f, 'bash_regexp_with_exclude')]
This first generates a list from the given path, then keeps only the files that satisfy the shell pattern with the desired constraint. A list comprehension is used because popping items from a list while iterating over it skips elements.
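fnmatch.filter does the same keep-if-it-matches step in one call, with the same placeholders as above:
import fnmatch
from glob import glob
# Keep only the paths matching the shell pattern.
file_list = fnmatch.filter(glob('somepath'), 'bash_regexp_with_exclude')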

Suppose you have this directory structure:
.
├── asd3442
├── eee2314
├── eph334
├── eph_dir
│   ├── asd330
│   ├── eph_file2
│   ├── exy123
│   └── file_with_eph
├── eph_file
├── not_eph_dir
│   ├── ephXXX
│   └── with_eph
└── not_eph_rest
You can use full globs to filter full path results with pathlib and a generator for the top level directory:
i_want = (fn for fn in Path(path_to).glob('*') if not fn.match('**/*/eph*'))
>>> list(i_want)
[PosixPath('/tmp/test/eee2314'), PosixPath('/tmp/test/asd3442'), PosixPath('/tmp/test/not_eph_rest'), PosixPath('/tmp/test/not_eph_dir')]
The pathlib method match uses globs to match a path object; the glob '**/*/eph*' is any full path that leads to a file with a name starting with 'eph'.
Alternatively, you can use the .name attribute with name.startswith('eph'):
i_want = (fn for fn in Path(path_to).glob('*') if not fn.name.startswith('eph'))
If you want only files, no directories:
i_want = (fn for fn in Path(path_to).glob('*') if fn.is_file() and not fn.match('**/*/eph*'))
# [PosixPath('/tmp/test/eee2314'), PosixPath('/tmp/test/asd3442'), PosixPath('/tmp/test/not_eph_rest')]
The same method works for recursive globs:
i_want = (fn for fn in Path(path_to).glob('**/*')
          if fn.is_file() and not fn.match('**/*/eph*'))
# [PosixPath('/tmp/test/eee2314'), PosixPath('/tmp/test/asd3442'),
#  PosixPath('/tmp/test/not_eph_rest'), PosixPath('/tmp/test/eph_dir/asd330'),
#  PosixPath('/tmp/test/eph_dir/file_with_eph'), PosixPath('/tmp/test/eph_dir/exy123'),
#  PosixPath('/tmp/test/not_eph_dir/with_eph')]

As mentioned by the accepted answer, you can't exclude patterns with glob, so the following is a method to filter your glob result.
The accepted answer is probably the best Pythonic way to do things, but if you think list comprehensions look a bit ugly and want to make your code maximally numpythonic anyway (like I did), then you can do this (though note that it is probably less efficient than the list comprehension):
import glob
import numpy as np
data_files = glob.glob("path_to_files/*.fits")
light_files = np.setdiff1d(data_files, glob.glob("path_to_files/*BIAS*"))
light_files = np.setdiff1d(light_files, glob.glob("path_to_files/*FLAT*"))
(In my case, I had some image frames, bias frames, and flat frames all in one directory and I just wanted the image frames)

If the position of the character doesn't matter, for example to exclude manifest files (with _ anywhere in the name), you can combine glob with re (regular expression operations):
import glob
import re
for file in glob.glob('*.txt'):
    if re.match(r'.*_.*', file):
        continue
    else:
        print(file)
Or, more elegantly, with a list comprehension:
filtered = [f for f in glob.glob('*.txt') if not re.match(r'.*_.*', f)]
for fname in filtered:
    print(fname)

To exclude an exact word, you may want to implement a custom regex directive, which you then replace with an empty string before glob processing.
#!/usr/bin/env python3
import glob
import re
# glob (and fnmatch) does not support exact-word matching; this custom directive overcomes that
glob_exact_match_regex = r"\[\^.*\]"
path = "[^exclude.py]*py"  # [^...] is a custom directive that excludes an exact match
# Process the custom directive
try:  # Try to parse the exact-match directive
    exact_match = re.findall(glob_exact_match_regex, path)[0].replace('[^', '').replace(']', '')
except IndexError:
    exact_match = None
else:  # Remove the custom directive from the glob pattern
    path = re.sub(glob_exact_match_regex, "", path)
paths = glob.glob(path)
# Apply the custom directive
if exact_match is not None:  # Exclude all paths containing the specified string
    paths = [p for p in paths if exact_match not in p]
print(paths)

import glob
import re
# This is a path that should be excluded
EXCLUDE = "/home/koosha/Documents/Excel"
files = glob.glob("/home/koosha/Documents/**/*.*", recursive=True)
for file in files:
    if re.search(EXCLUDE, file):
        pass
    else:
        print(file)

Related

Deleting the useless output files using Python

After I execute a Python script from a particular directory, I get many output files, but apart from 5-6 files I want to delete the rest from that directory. What I have done is put those 5-6 useful files into a list and deleted all the other files that are not in that list. Below is my code:
list1 = ['prog_1.py', 'prog_2.py', 'prog_3.py']  # Extend
import os
dir = '/home/dev/codes'  # Change accordingly
for f in os.listdir(dir):
    if f not in list1:
        os.remove(os.path.join(dir, f))
Now here I just want to add one more thing, if the output files start with output_of_final, then I don't want them to be deleted. How can I do it? Should I use regex?
You could use Regex, but that's overkill here. Just use the str.startswith method.
Also, it's bad practice to use reserved keywords, built-in types and functions as variable names. I have renamed dir to directory. (https://docs.python.org/3/library/functions.html#dir)
list1 = ['prog_1.py', 'prog_2.py', 'prog_3.py']  # Extend
import os
directory = '/home/dev/codes'  # Change accordingly
for f in os.listdir(directory):
    if f not in list1 and not f.startswith('output_of_final'):
        os.remove(os.path.join(directory, f))
Yes, the regex works here, but there are easier options, like using the startswith method for strings:
list1 = ['prog_1.py', 'prog_2.py', 'prog_3.py']  # Extend
import os
dir = '/home/dev/codes'  # Change accordingly
for f in os.listdir(dir):
    if (f not in list1) and (not f.startswith('output_of_final')):
        os.remove(os.path.join(dir, f))
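For completeness, a hedged pathlib variant of the same loop (not from either answer; the is_file() check also skips subdirectories, which os.remove cannot delete):
from pathlib import Path
keep = {'prog_1.py', 'prog_2.py', 'prog_3.py'}  # Extend
directory = Path('/home/dev/codes')  # Change accordingly
for p in directory.iterdir():
    # Delete only regular files that are neither whitelisted nor named output_of_final*.
    if p.is_file() and p.name not in keep and not p.name.startswith('output_of_final'):
        p.unlink()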

Python RE Directories and slashes

Let's say I have a string that is a root directory that has been entered
'C:/Users/Me/'
Then I use os.listdir() and join with it to create a list of subdirectories.
I end up with a list of strings that are like below:
'C:/Users/Me/Adir\Asubdir\'
and so on.
I want to split the subdirectories and capture each directory name as its own element. Below is one attempt. I am seemingly having issues with the \ and / characters. I assume \ is escaping, so '[\\/]' to me says: look for \ or /. Then '[\\/]([\w\s]+)[\\/]' as a match pattern should find any word between two slashes... but the output is only ['/Users/'] and nothing else is matched. So I then add an escape for the forward slash.
'[\\\/]([\w\s]+)[\\\/]'
However, my output then only becomes ['Users', 'ADir'], which is confusing the crud out of me.
My question is namely: how do I tokenize each directory from a string using both \ and /? But also, why is my RE not working as I expect?
Minimal Example:
import re, os

info = re.compile('[\\\/]([\w ]+)[\\\/]')
root = 'C:/Users/i12500198/Documents/Projects/'

def getFiles(wdir=os.getcwd()):
    files = (os.path.join(wdir, file) for file in os.listdir(wdir)
             if os.path.isfile(os.path.join(wdir, file)))
    return list(files)

def getDirs(wdir=os.getcwd()):
    dirs = (os.path.join(wdir, adir) for adir in os.listdir(wdir)
            if os.path.isdir(os.path.join(wdir, adir)))
    return list(dirs)

def walkSubdirs(root, below=[]):
    subdirs = getDirs(root)
    for aDir in subdirs:
        below.append(aDir)
        walkSubdirs(aDir, below)
    return below

subdirs = walkSubdirs(root)
for aDir in subdirs:
    files = getFiles(aDir)
    for f in files:
        finfo = info.findall(f)
        print(f)
        print(finfo)
I want to split the subdirectories and capture each directory name as its own element
Instead of regular expressions, I suggest you use one of Python's standard functions for parsing filesystem paths.
Here is one using pathlib:
from pathlib import Path
p = Path("C:/Users/Me/ADir\ASub Dir\2 x 2 Dir\\")
p.parts
#=> ('C:\\', 'Users', 'Me', 'ADir', 'ASub Dir\x02 x 2 Dir')
Note that the behaviour of pathlib.Path depends on the system running Python. Since I'm on a Linux machine, I actually used pathlib.PureWindowsPath here. I believe the output should be accurate for those of you on Windows.
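As for why the original regex matched only alternating names: each findall match consumes its closing slash, so that slash cannot also serve as the opening delimiter of the next directory name. If a regex approach is still preferred, here is a minimal sketch using re.split instead of findall (my example path, not the asker's):
import re
# Splitting on runs of either slash sidesteps the shared-delimiter problem;
# the raw string keeps the backslash from being treated as an escape.
path = r'C:/Users/Me/Adir\Asubdir'
parts = [p for p in re.split(r'[\\/]+', path) if p]
# => ['C:', 'Users', 'Me', 'Adir', 'Asubdir']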

In Python, How do I check whether a file exists starting or ending with a substring?

I know about os.path.isfile(fname), but now I need to search if a file exists that is named FILEnTEST.txt where n could be any positive integer (so it could be FILE1TEST.txt or FILE9876TEST.txt)
I guess a solution to this could involve substrings that the filename starts/ends with OR one that involves somehow calling os.path.isfile('FILE' + n + 'TEST.txt') and replacing n with any number, but I don't know how to approach either solution.
You would need to write your own filtering system, by getting all the files in a directory and then matching them to a regex string and seeing if they fail the test or not:
import os
import re
pattern = re.compile(r"FILE\d+TEST\.txt")
dir = "/test/"
for filepath in os.listdir(dir):
    if pattern.match(filepath):
        pass  # do stuff with the matching file
I'm not near a machine with Python installed on it to test the code, but it should be something along those lines.
You can use a regular expression:
/FILE\d+TEST\.txt/
Example: regexr.com.
Then you can use said regular expression and iterate through all of the files in a directory.
import re
import os
filename_re = r'FILE\d+TEST\.txt'
for filename in os.listdir(directory):
    if re.search(filename_re, filename):
        # this file has the form FILEnTEST.txt
        # do what you want with it now
        pass
You can also do it as such:
import os
import re
if len([file for file in os.listdir(directory) if re.search('regex', file)]):
    pass  # there's at least 1 such file
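A minor variant, not from the answer: any() short-circuits on the first hit, so it avoids building the whole list (directory is assumed to be defined, as above):
import os
import re
# True as soon as one FILEnTEST.txt is found; stops scanning early.
exists = any(re.fullmatch(r'FILE\d+TEST\.txt', f) for f in os.listdir(directory))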

Reading all files in a particular directory exclude some with particular regex in python

I need to read all files (with some exclusions) in a particular directory, parse the content of each, and write the result to another file. I want to exclude some files based on a regex. How can I do that in Python?
Concept script:
import re, os
files = os.listdir(path)  # list of files returned
test = re.compile(...)    # your regular expression
sublist = [f for f in files if test.search(f)]
Just use glob.glob()
import glob
glob.glob(path)
path can contain unix wildcards, not regex though.
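If you need both, a small sketch combining the two: glob for the inclusion pass and a compiled regex for the exclusion (the 'eph' pattern is a hypothetical stand-in):
import glob
import re
exclude = re.compile(r'eph')  # hypothetical exclusion rule
files = [f for f in glob.glob('somepath/*') if not exclude.search(f)]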

How can I search sub-folders using glob.glob module? [duplicate]

This question already has answers here:
How to use glob() to find files recursively?
(28 answers)
Closed 1 year ago.
I want to open a series of subfolders in a folder and find some text files and print some lines of the text files. I am using this:
configfiles = glob.glob('C:/Users/sam/Desktop/file1/*.txt')
But this cannot access the subfolders as well. Does anyone know how I can use the same command to access subfolders as well?
In Python 3.5 and newer use the new recursive **/ functionality:
configfiles = glob.glob('C:/Users/sam/Desktop/file1/**/*.txt', recursive=True)
When recursive is set, ** followed by a path separator matches 0 or more subdirectories.
In earlier Python versions, glob.glob() cannot list files in subdirectories recursively.
In that case I'd use os.walk() combined with fnmatch.filter() instead:
import os
import fnmatch
path = 'C:/Users/sam/Desktop/file1'
configfiles = [os.path.join(dirpath, f)
               for dirpath, dirnames, files in os.walk(path)
               for f in fnmatch.filter(files, '*.txt')]
This'll walk your directories recursively and return all absolute pathnames to matching .txt files. In this specific case, fnmatch.filter() may be overkill; you could also use an .endswith() test:
import os
path = 'C:/Users/sam/Desktop/file1'
configfiles = [os.path.join(dirpath, f)
               for dirpath, dirnames, files in os.walk(path)
               for f in files if f.endswith('.txt')]
There's a lot of confusion on this topic. Let me see if I can clarify it (Python 3.7):
1. glob.glob('*.txt') : matches all files ending in '.txt' in the current directory only
2. glob.glob('*/*.txt') : matches all files ending in '.txt' in the immediate subdirectories only, but not in the current directory
3. glob.glob('**/*.txt') : same as 2, because without recursive=True the ** wildcard behaves like a single *
4. glob.glob('*.txt', recursive=True) : same as 1, because recursive only affects the ** wildcard
5. glob.glob('*/*.txt', recursive=True) : same as 2, for the same reason
6. glob.glob('**/*.txt', recursive=True) : matches all files ending in '.txt' in the current directory and in all subdirectories
So for a recursive search, use '**' in the pattern and pass recursive=True.
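A quick self-check of those rules on a throwaway tree (result order is OS-dependent, and the separator is '\' on Windows):
import glob
import os
import pathlib
import tempfile
# Build <root>/a.txt and <root>/sub/b.txt, then glob from <root>.
root = tempfile.mkdtemp()
pathlib.Path(root, 'a.txt').touch()
pathlib.Path(root, 'sub').mkdir()
pathlib.Path(root, 'sub', 'b.txt').touch()
os.chdir(root)
print(glob.glob('*.txt'))                     # ['a.txt']
print(glob.glob('*/*.txt'))                   # ['sub/b.txt']
print(glob.glob('**/*.txt'))                  # ['sub/b.txt']
print(glob.glob('**/*.txt', recursive=True))  # ['a.txt', 'sub/b.txt']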
To find files in immediate subdirectories:
configfiles = glob.glob(r'C:\Users\sam\Desktop\*\*.txt')
For a recursive version that traverse all subdirectories, you could use ** and pass recursive=True since Python 3.5:
configfiles = glob.glob(r'C:\Users\sam\Desktop\**\*.txt', recursive=True)
Both function calls return lists. You could use glob.iglob() to return paths one by one. Or use pathlib:
from pathlib import Path
path = Path(r'C:\Users\sam\Desktop')
txt_files_only_subdirs = path.glob('*/*.txt')
txt_files_all_recursively = path.rglob('*.txt') # including the current dir
Both methods return iterators (you can get paths one by one).
The glob2 package supports wildcards and is reasonably fast:
import timeit
code = '''
import glob2
glob2.glob("files/*/**")
'''
timeit.timeit(code, number=1)
On my laptop it takes approximately 2 seconds to match >60,000 file paths.
You can use Formic with Python 2.6
import formic
fileset = formic.FileSet(include="**/*.txt", directory="C:/Users/sam/Desktop/")
Disclosure - I am the author of this package.
Here is an adapted version that enables glob.glob-like functionality without using glob2:
import fnmatch
import os

def find_files(directory, pattern='*'):
    if not os.path.exists(directory):
        raise ValueError("Directory not found {}".format(directory))
    matches = []
    for root, dirnames, filenames in os.walk(directory):
        for filename in filenames:
            full_path = os.path.join(root, filename)
            if fnmatch.filter([full_path], pattern):
                matches.append(os.path.join(root, filename))
    return matches
So if you have the following dir structure:
tests/files
├── a0
│   ├── a0.txt
│   ├── a0.yaml
│   └── b0
│       ├── b0.yaml
│       └── b00.yaml
└── a1
You can do something like this
files = utils.find_files('tests/files','**/b0/b*.yaml')
> ['tests/files/a0/b0/b0.yaml', 'tests/files/a0/b0/b00.yaml']
This is essentially an fnmatch pattern match against the full path, rather than just the filename.
(The first options are of course mentioned in other answers; here the goal is to show that glob uses os.scandir internally, and to provide a direct answer based on that.)
Using glob
As explained before, with Python 3.5+, it's easy:
import glob
for f in glob.glob('d:/temp/**/*', recursive=True):
    print(f)
#d:\temp\New folder
#d:\temp\New Text Document - Copy.txt
#d:\temp\New folder\New Text Document - Copy.txt
#d:\temp\New folder\New Text Document.txt
Using pathlib
from pathlib import Path
for f in Path('d:/temp').glob('**/*'):
    print(f)
Using os.scandir
os.scandir is what glob does internally. So here is how to do it directly, with a use of yield:
import os

def listpath(path):
    for entry in os.scandir(path):
        if entry.is_dir():
            yield entry.path
            yield from listpath(entry.path)
        else:
            yield entry.path

for f in listpath('d:\\temp'):
    print(f)
configfiles = glob.glob('C:/Users/sam/Desktop/**/*.txt')
doesn't work in all cases; instead, use glob2:
configfiles = glob2.glob('C:/Users/sam/Desktop/**/*.txt')
If you can install the glob2 package...
import glob2
filenames = glob2.glob("C:\\top_directory\\**\\*.ext") # Where ext is a specific file extension
folders = glob2.glob("C:\\top_directory\\**\\")
All filenames and folders:
all_ff = glob2.glob("C:\\top_directory\\**\\**")
If you're running Python 3.4+, you can use the pathlib module. The Path.glob() method supports the ** pattern, which means “this directory and all subdirectories, recursively”. It returns a generator yielding Path objects for all matching files.
from pathlib import Path
configfiles = Path("C:/Users/sam/Desktop/file1/").glob("**/*.txt")
You can use the functions glob.glob() and glob.iglob() from the glob module to retrieve paths recursively from inside directories/files and subdirectories/subfiles.
Syntax:
glob.glob(pathname, *, recursive=False) # pathname = '/path/to/the/directory' or subdirectory
glob.iglob(pathname, *, recursive=False)
In your example, it is possible to write it like this:
import glob
import os
configfiles = glob.glob("C:/Users/sam/Desktop/*.txt")
for f in configfiles:
    print(f'Filename with path: {f}')
    print(f'Only filename: {os.path.basename(f)}')
    print(f'Filename without extensions: {os.path.splitext(os.path.basename(f))[0]}')
Output:
Filename with path: C:/Users/sam/Desktop/test_file.txt
Only filename: test_file.txt
Filename without extensions: test_file
Help:
Documentation for os.path.splitext and documentation for os.path.basename.
As pointed out by Martijn, glob can only do this through the ** operator introduced in Python 3.5. Since the OP explicitly asked for the glob module, the following returns a lazily evaluated iterator that behaves similarly:
import glob
import itertools
import os
configfiles = itertools.chain.from_iterable(
    glob.iglob(os.path.join(root, '*.txt'))
    for root, dirs, files in os.walk('C:/Users/sam/Desktop/file1/'))
Note that you can only iterate once over configfiles with this approach, though. If you require a real list of configfiles that can be used in multiple operations, you would have to create it explicitly with list(configfiles).
The rglob function will recurse all the way down to the deepest sub-level of your directory structure, so do not use it if you only want to go one level deep.
I realize the OP was talking about using glob.glob. I believe this answers the intent, however, which is to search all subfolders recursively.
The rglob function recently produced a 100x speed-up for a data-processing algorithm that had been using the folder structure as a fixed assumption for the order of data reading. With rglob we were able to do a single scan through all files at or below a specified parent directory, save their names to a list (over a million files), and then use that list to determine which files we needed to open at any point in the future, based on the file naming conventions alone rather than on which folder they were in.
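A minimal sketch of that scan-once-then-filter approach ('data_root' and the 'eph' prefix are stand-ins, not from the answer):
from pathlib import Path
# One recursive scan; cache every file path, then filter by name later.
all_files = [p for p in Path('data_root').rglob('*') if p.is_file()]
wanted = [p for p in all_files if not p.name.startswith('eph')]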
