I came across this in the IBM Applied AI course:
path_for_license_plates = os.getcwd() + "/license-plates/**/*.jpg"
What does **/*.jpg mean in the above path?
From https://docs.python.org/3/library/glob.html :
glob.glob(pathname, *, recursive=False)
If recursive is true, the
pattern “**” will match any files and zero or more directories,
subdirectories and symbolic links to directories. If the pattern is
followed by an os.sep or os.altsep then files will not match.
It is apparently meant to be a glob pattern in "recursive" mode, as
the "**" suggests.
Given the directory tree
license-plates/
├── a
│ ├── b
│ │ └── x.jpg
│ └── x.jpg
└── x.jpg
The function glob.glob() works as follows:
>>> import glob
>>> glob.glob('license-plates/**/*.jpg', recursive=True)
['license-plates/x.jpg', 'license-plates/a/x.jpg', 'license-plates/a/b/x.jpg']
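To check the "zero or more directories" behaviour yourself, here is a small self-contained sketch. It rebuilds the tree above in a temporary directory and compares the result with and without recursive=True. The temporary location and the helper name build_tree are only for illustration.

```python
import glob
import os
import tempfile


def build_tree(root):
    """Create license-plates/{x.jpg, a/x.jpg, a/b/x.jpg} under root."""
    for sub in ("", "a", os.path.join("a", "b")):
        folder = os.path.join(root, "license-plates", sub)
        os.makedirs(folder, exist_ok=True)
        open(os.path.join(folder, "x.jpg"), "w").close()


with tempfile.TemporaryDirectory() as root:
    build_tree(root)
    pattern = os.path.join(glob.escape(root), "license-plates", "**", "*.jpg")
    # recursive=True: "**" matches zero or more directory levels.
    recursive = sorted(
        os.path.relpath(p, root) for p in glob.glob(pattern, recursive=True)
    )
    # recursive=False (the default): "**" behaves like a single "*",
    # so exactly one directory level is matched.
    flat = sorted(os.path.relpath(p, root) for p in glob.glob(pattern))

print(recursive)  # all three x.jpg files
print(flat)       # only license-plates/a/x.jpg
```

Without recursive=True the same pattern silently finds fewer files, which is an easy mistake to miss.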
path_for_license_plates is literally a string.
That's it.
It can be used to say get me all the jpg files recursively looking in all the directories under "license-plates".
A better question is "how is it going to be used later in the program?"
It's likely (since they used the os module) that this is an older program. Such code tends to use the glob module, as others have shown. But if you are changing this program, you can modernize it.
With modern Python (3.6+), here is how you can ask for the same information:
from pathlib import Path
path_for_license_plates = Path('license-plates').glob("**/*.jpg")
for license_plate_file_location in path_for_license_plates:
    print(license_plate_file_location)
This assumes license-plates is in the current working directory. Path.glob() gives you a generator, the code is much shorter, and it works on the major platforms as well (Windows/Linux/macOS).
In order to order some files into folders, I have to get the number (as if it was some sort of ID) of both folders (named as p.X, p. fixed, being X a number that can range from 1 to 200150) and files (being PX_N.gmspr, where P is fixed, X is the ID number of the folder and N an identifier of the file, which can be 2,3,6,8,9,A and H).
An example would be p.24 and P24_2.gmspr, P24_3.gmspr, P24_6.gmspr, P24_8.gmspr, P24_9.gmspr, P24_A.gmspr and P24_H.gmspr, in order to move all P24_N.gmspr to p.24
The PX_N.gmspr files are in a different folder than the target folders p.X . A little of os.chdir and os.rename and the files can be moved easily so I believe that is not a problem.
What I want is to obtain the X number of the filename to compare with the folder number, forgetting about both the P and the _N.gmspr string.
Whereas I can obtain the folder number via foldername.split(".",1)[1], I don't really know how to do it for the file number.
To sum up, I want to move some files called PX_N.gmspr to another folder identified almost the same p.X
Any idea? Thank you!!!
EDIT:
Regarding the answer given, I have to clarify myself about what I am trying to do, specially with the file and folder format:
Mother folder
├── Unclassified
│ └── All PX_N.gmspr being PX certain files that gotta be moved to another folders, X a number that ranges from 1 to 200150 (but not exactly 200150, is just a number ID) and N can be only 2, 3, 6, 9, A or H, nothing more. In total 15435 elements with each of the X having one of the 6 possibles N gmspr.
├──First Folder
│ └── p.X folders (X from 1 to 151), the aim is to select all the PX_N.gmspr files that agree with the X number that matches the p.X of the folder and move it to each folder.
├──Second Folder
│ └── p.X folders (X from 152 to 251, plus p.602 to p.628, p.823, p.824,
│ p.825, p.881 and p.882)
└──Third Folder
└── p.X folders (X from 252 to 386, plus p.585, p.586 and p. 587)
There are some other folders in order to order some more of the 15435 files.
I am currently searching about regex; unluckily for me, it is the first time I actually have to use them.
EDIT (solved): the point was to play with regex and get only the numbers. As nested lists appeared, only the first number was useful.
This is the perfect job for regexes.
First, let's create a temporary dir and fill it with some files to demonstrate.
from pathlib import Path
from random import choices, randint
from string import ascii_letters
from tempfile import TemporaryDirectory

tmpdir = TemporaryDirectory()
for _ in range(4):
    n = randint(1, 999)
    for _ in range(randint(1, 5)):
        Path(
            tmpdir.name, f"P{n}.{''.join(choices(ascii_letters, k=10))}"
        ).touch()
Now we have up to 4 groups of files (named PN.<random suffix>), with between 1 and 5 files in each group.
Then, we just need to iterate through those files, extract the N from the file name with the regex P(\d+)\..+, and finally create the destination dir and move the file.
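from pathlib import Path
import re

dir_re = re.compile(r"P(\d+)\..+")
# Snapshot the listing first: new p.N folders are created in the same directory.
for filepath in list(Path(tmpdir.name).iterdir()):
    m = dir_re.match(filepath.name)
    if m is None or not filepath.is_file():
        continue  # skip anything that is not a PN.* file
    dirpath = filepath.parent / f"p.{m.group(1)}"
    dirpath.mkdir(exist_ok=True)
    filepath.rename(dirpath / filepath.name)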
For instance, starting from a flat temp directory, we now have the following sorted tree.
/var/folders/lf/z7ftpkws0vn7svq8n212czm40000gn/T/tmppve5_m1u/
├── p.413
│ └── P413.yJvxPtuzfz
├── p.705
│ ├── P705.DbwPyiFxum
│ ├── P705.FVwMuSqFms
│ ├── P705.PZyGIQEqSG
│ ├── P705.baRrkcNaZR
│ └── P705.tZKFTKwDah
├── p.794
│ ├── P794.CQTBgXOckQ
│ ├── P794.JNoKsUtgRU
│ └── P794.iSdrdohKYq
└── p.894
└── P894.XbzFxnqYOY
And finally, clean up the temporary directory.
tmpdir.cleanup()
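Note that the demo files above are named P<number>.<suffix>, while the files in the question look like P24_2.gmspr, so the pattern needs a small change for them. A possible sketch follows. The helper name folder_for is made up for illustration, and the identifier set (2, 3, 6, 8, 9, A, H) is taken from the question.

```python
import re

# "P", the folder ID (digits), an underscore, one identifier character,
# then the .gmspr extension.
FILE_RE = re.compile(r"P(\d+)_([23689AH])\.gmspr")


def folder_for(filename):
    """Return the target folder name (e.g. 'p.24'), or None if the name does not match."""
    m = FILE_RE.fullmatch(filename)
    return f"p.{m.group(1)}" if m else None


print(folder_for("P24_2.gmspr"))      # p.24
print(folder_for("P200150_H.gmspr"))  # p.200150
print(folder_for("notes.txt"))        # None
```

Returning None for non-matching names lets the calling loop skip unrelated files instead of failing on them.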
I prefer to work mainly with pathlib.Paths. However, in the following example code, I ended up comparing strings like this:
if str(dir2).startswith(str(dir1))
dir1 and dir2 are Paths. Is there a better way to find out if dir2 is an arbitrarily nested subdirectory of dir1?
from pathlib import Path

mydirs = [Path('/a/a/a/a'), Path('/a/a/'), Path('/a/b/'), Path('/a/a/b'), Path('/b/a/a/'), Path('/a/a/a/a/a/a/'), Path('/a/b/a/a/a/')]
mydirs.sort(key = lambda x: len(x.parts))
roots = mydirs.copy()
for dir1 in mydirs:
    for dir2 in mydirs:
        if dir1 == dir2 or len(dir2.parts) <= len(dir1.parts):
            continue
        if str(dir2).startswith(str(dir1)):
            roots.remove(dir2)
            print(f'pruned subdir {str(dir2)} of {str(dir1)}')
print(roots)
I think you're looking for is_relative_to():
if dir2.is_relative_to(dir1):
    ...  # Do something
Depending on your exact use case you might find it helpful to resolve() one or both of the Paths:
if dir2.resolve().is_relative_to(dir1.resolve()):
    ...  # Do something
This should ensure that something like /foo/../bar/baz is seen as relative to /quux/../bar.
Note that is_relative_to() was introduced in Python 3.9.
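One more reason to avoid the string comparison: it can be fooled by a sibling directory whose name merely shares a text prefix. A short sketch, using PurePosixPath so it behaves the same on any OS (the sample paths are invented):

```python
from pathlib import PurePosixPath

dir1 = PurePosixPath("/a/a")
sibling = PurePosixPath("/a/ab")    # not inside /a/a, but shares the text prefix
nested = PurePosixPath("/a/a/b/c")  # really inside /a/a

# String comparison is fooled by the shared prefix.
print(str(sibling).startswith(str(dir1)))  # True (a false positive)

# Component-wise comparison is not.
print(sibling.is_relative_to(dir1))  # False
print(nested.is_relative_to(dir1))   # True
```

Keep in mind that a path counts as relative to itself, so dir1.is_relative_to(dir1) is True. Add an explicit dir1 != dir2 check if you need strict nesting.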
Instead of
str(dir2).startswith(str(dir1))
try
dir1 in dir2.parents
If you have many paths, you probably shouldn't try all pairs. Here's a set solution:
mydirs.sort(key=lambda d: len(d.parts))
roots = set()
for d in mydirs:
    if roots.isdisjoint(d.parents):
        roots.add(d)
Btw, your solution is buggy, as you're incorrectly removing from a list while iterating it. For example for
mydirs = [Path('/a/'), Path('/a/a/'), Path('/a/b/')]
you only prune '/a/a/' but keep '/a/b/'.
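For reference, the set-based approach can be wrapped in a function and run against the directories from the question. This is only a sketch: the function name top_level_dirs is hypothetical, and PurePosixPath is used so the example does not depend on the OS.

```python
from pathlib import PurePosixPath


def top_level_dirs(dirs):
    """Return the directories that are not nested inside another one in dirs."""
    roots = set()
    # Shortest paths first, so a parent is always seen before its children.
    for d in sorted(dirs, key=lambda p: len(p.parts)):
        if roots.isdisjoint(d.parents):
            roots.add(d)
    return roots


mydirs = [PurePosixPath(p) for p in (
    "/a/a/a/a", "/a/a", "/a/b", "/a/a/b", "/b/a/a", "/a/a/a/a/a/a", "/a/b/a/a/a",
)]
print(sorted(str(p) for p in top_level_dirs(mydirs)))
# ['/a/a', '/a/b', '/b/a/a']
```

Each path is checked once against a set, so the work grows with the number of paths times their depth rather than with the number of pairs.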
If folder X is empty, I would like to delete X.
If folder Y only contains folder X and other folders that are empty, I would like to delete Y.
If folder Z only contains folders like X and/or Y, I would like to delete Z.
How do I do this recursively for anything under a specific dir, with Python?
I tried something like the following, but it is only able to identify folders like X, not folders like Y or Z.
from pathlib import Path

folder = '/home/abc'
for path in Path(folder).glob("**/*"):
    if path.is_dir() and len(list(path.iterdir())) == 0:
        logger.info(f"remove {path}")
May be a bit verbose, but this seems to do the job.
from pathlib import Path

def dir_empty(path):
    empty = True
    for item in path.glob('*'):
        if item.is_file():
            empty = False
        if item.is_dir() and not dir_empty(item):
            empty = False
    if empty:
        path.rmdir()  # Drop this line if you only want the result
    return empty

dir_empty(Path('Z'))
os.rmdir() will fail on any directory with contents. So one method here is to just rmdir every directory from bottom to top while suppressing the OSError exception which is thrown when attempting to remove a non-empty directory. All empty ones will be removed, all with contents will remain. Technically, checking if a directory is empty before attempting the removal is a race condition (though typically, a harmless one).
Let's take this filesystem with 2 files in it.
testtree/
├── a
│ └── aa
│ └── filea
├── b
│ ├── bb
│ └── fileb
└── c
└── cc
And run this:
import os
from pathlib import Path
from contextlib import suppress

for root,dirs,_ in os.walk("testtree", topdown=False):
    for d in dirs:
        with suppress(OSError):
            os.rmdir(Path(root,d))
Then the tree is transformed to
testtree/
├── a
│ └── aa
│ └── filea
└── b
└── fileb
bb, cc, and c were all removed.
The problem here is that glob("**/*") appears to do a preorder traversal. In other words, it returns a parent folder before returning any children.
You need to do a postorder traversal instead. In your example, if you delete a folder like X, then Y just becomes the same as X.
You could do this with manual recursion. But if you want to use glob(), you need to reverse the items returned. Since glob() returns a generator, turn it into a list first: reversed(list(Path(folder).glob("**/*"))).
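A variant that does not rely on the order in which glob() yields its results is to sort the directories by depth, deepest first. The sketch below tries this on a throwaway tree. The function name remove_empty_dirs and the sample folder names are assumptions for illustration.

```python
import tempfile
from pathlib import Path


def remove_empty_dirs(folder):
    """Delete empty directories below folder, deepest first; return what was removed."""
    # Deepest first guarantees children are handled before their parents,
    # whatever order glob() yields them in.
    candidates = sorted(
        (p for p in Path(folder).glob("**/*") if p.is_dir() and not p.is_symlink()),
        key=lambda p: len(p.parts),
        reverse=True,
    )
    removed = []
    for path in candidates:
        if not any(path.iterdir()):
            path.rmdir()
            removed.append(path)
    return removed


with tempfile.TemporaryDirectory() as tmp:
    # Z contains Y, Y contains X, X is empty; "keep" holds a file.
    (Path(tmp) / "Z" / "Y" / "X").mkdir(parents=True)
    (Path(tmp) / "keep").mkdir()
    (Path(tmp) / "keep" / "file.txt").touch()

    removed = [p.relative_to(tmp).as_posix() for p in remove_empty_dirs(tmp)]
    left = sorted(p.relative_to(tmp).as_posix() for p in Path(tmp).rglob("*"))

print(removed)  # ['Z/Y/X', 'Z/Y', 'Z']
print(left)     # ['keep', 'keep/file.txt']
```

Once X is gone, Y is empty by the time it is inspected, and the same then holds for Z.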
In my Python 3 program I need to delete files and folders that are older than X days. I know there are many similar questions here, but in my case I don't need to check the modification times of these folders and files. Instead I have the following folder structure:
/root_folder/<year>/<month>/<day>/<files>
So for example something like this:
.
└── 2020
├── 04
│ └── 30
│ ├── file.1
│ └── file.2
└── 05
├── 14
│ ├── file.1
│ └── file.2
├── 19
├── 21
│ └── file.1
└── 22
├── file.1
├── file.2
└── file.3
What I want now is to delete all the folders and their files that represent a date older than X days. I have created a solution, but coming from Java it seems to me that it is not very Pythonic, or it might be easier to solve in Python. Can you Python experts guide me a bit here, of course taking into account "jumps" over months and years?
Not a python expert here either, but here's something simple:
Find the oldest date that you want to keep. Anything older than this will be deleted. Let's say it is 28/04/2020.
From that date, you can build a string "/root_folder/2020/04/28"
List all the files. If a file's path (as a string) is less than the string from the previous step, you can delete it. This works because the year, month and day folder names are zero-padded, so string order matches date order.
Example:
import os

files = []
# r=root, d=directories, f = files
for r, d, f in os.walk(path):
    for file in f:
        if '.txt' in file:
            files.append(os.path.join(r, file))
Source of that snippet: https://mkyong.com/python/python-how-to-list-all-files-in-a-directory/
Now, you can do:
for f in files:
    if f < date_limit:
        os.remove(f)
Note: This is not optimal
It deletes file by file, but the moment you enter the if you could just delete the whole folder where this file is (but then the list of files points to files that have been deleted).
You actually don't care about the files. You could apply the logic to folders alone and remove them recursively.
Update: doing both steps as we browse the folders:
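import shutil
for r, d, f in os.walk(path):
    # Only day-level folders: a parent like ".../2020" also sorts before date_limit.
    if os.path.relpath(r, path).count(os.sep) == 2 and r < date_limit:
        print(f"Deleting {r}")
        shutil.rmtree(r)
        d.clear()  # do not descend into the removed folder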
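If you would rather not depend on string ordering, another option is to parse the folder names into real dates and compare those, which handles the jumps over months and years automatically. A minimal sketch, assuming the <year>/<month>/<day> layout from the question. The function name expired_day_dirs and its parameters are made up for illustration.

```python
import tempfile
from datetime import date, timedelta
from pathlib import Path


def expired_day_dirs(root, keep_days, today):
    """Yield <root>/<year>/<month>/<day> folders dated before today - keep_days."""
    cutoff = today - timedelta(days=keep_days)
    for day_dir in Path(root).glob("*/*/*"):
        if not day_dir.is_dir():
            continue
        try:
            year, month, day = (int(part) for part in day_dir.relative_to(root).parts)
            folder_date = date(year, month, day)
        except ValueError:
            continue  # not a date-shaped folder, leave it alone
        if folder_date < cutoff:
            yield day_dir


with tempfile.TemporaryDirectory() as root:
    for name in ("2020/04/30", "2020/05/14", "2020/05/19", "2020/05/21", "2020/05/22"):
        Path(root, name).mkdir(parents=True)

    old = sorted(
        p.relative_to(root).as_posix()
        for p in expired_day_dirs(root, keep_days=7, today=date(2020, 5, 22))
    )

print(old)  # ['2020/04/30', '2020/05/14']
```

In real use you would pass date.today() and call shutil.rmtree() on each yielded folder. The example only lists them so nothing is deleted by accident.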
Glob your paths to get your file paths in a list, then run each one through something like this (below). Good luck!
import os
import time

def is_file_access_older_than(file_path, seconds, from_time=None):
    """based on st_atime --> https://docs.python.org/3/library/os.html#os.stat_result.st_atime"""
    if not from_time:
        from_time = time.time()
    if (from_time - os.stat(file_path).st_atime) > seconds:
        return file_path
    return False
I have a directory with a bunch of files inside: eee2314, asd3442 ... and eph.
I want to exclude all files that start with eph with the glob function.
How can I do it?
The pattern rules for glob are not regular expressions. Instead, they follow standard Unix path expansion rules. There are only a few special characters: two different wild-cards, and character ranges are supported [from pymotw: glob – Filename pattern matching].
So you can exclude some files with patterns.
For example, to exclude manifest files (files starting with _) with glob, you can use:
files = glob.glob('files_path/[!_]*')
You can subtract sets and cast the result back to a list:
list(set(glob("*")) - set(glob("eph*")))
You can't exclude patterns with the glob function, globs only allow for inclusion patterns. Globbing syntax is very limited (even a [!..] character class must match a character, so it is an inclusion pattern for every character that is not in the class).
You'll have to do your own filtering; a list comprehension usually works nicely here:
files = [fn for fn in glob('somepath/*.txt')
         if not os.path.basename(fn).startswith('eph')]
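To make the difference concrete, here is a sketch that creates a few empty files in a temporary directory (the names are borrowed from the question, plus two invented ones) and compares a negated character class with filtering on the basename.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    for name in ("eee2314", "asd3442", "eph", "eph334", "with_eph"):
        open(os.path.join(folder, name), "w").close()

    base = glob.escape(folder)  # protect any special characters in the temp path

    # "[!e]*" rejects every name starting with "e", including eee2314.
    by_class = sorted(
        os.path.basename(p) for p in glob.glob(os.path.join(base, "[!e]*"))
    )

    # Filtering on the basename rejects only names that start with "eph".
    filtered = sorted(
        os.path.basename(p)
        for p in glob.glob(os.path.join(base, "*"))
        if not os.path.basename(p).startswith("eph")
    )

print(by_class)  # ['asd3442', 'with_eph']
print(filtered)  # ['asd3442', 'eee2314', 'with_eph']
```

The character class works one character at a time, so it cannot express "does not start with the three letters eph". The filter can.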
Compared with glob, I recommend pathlib. Filtering one pattern is very simple.
from pathlib import Path
p = Path(YOUR_PATH)
filtered = [x for x in p.glob("**/*") if not x.name.startswith("eph")]
And if you want to filter a more complex pattern, you can define a function to do that, just like:
def not_in_pattern(x):
    return (not x.name.startswith("eph")) and not x.name.startswith("epi")

filtered = [x for x in p.glob("**/*") if not_in_pattern(x)]
Using that code, you can filter out all files that start with eph or start with epi.
Late to the game but you could alternatively just apply a python filter to the result of a glob:
files = glob.iglob('your_path_here')
files_i_care_about = filter(lambda x: not x.startswith("eph"), files)
or replacing the lambda with an appropriate regex search, etc...
EDIT: I just realized that if you're using full paths the startswith won't work, so you'd need a regex
In [10]: a
Out[10]: ['/some/path/foo', 'some/path/bar', 'some/path/eph_thing']
In [11]: filter(lambda x: not re.search('/eph', x), a)
Out[11]: ['/some/path/foo', 'some/path/bar']
How about skipping the particular file while iterating over all the files in the folder?
The code below would skip all Excel files that start with 'eph':
import glob
import re

for file in glob.glob('*.xlsx'):
    if re.match(r'eph.*\.xlsx', file):
        continue
    else:
        # do your stuff here
        print(file)
This way you can use more complex regex patterns to include/exclude a particular set of files in a folder.
More generally, to exclude files that don't comply with some shell-style pattern, you could use the fnmatch module:
import fnmatch
from glob import glob

file_list = glob('somepath')
# Build a new list rather than popping while enumerating, which skips items.
file_list = [ii for ii in file_list if fnmatch.fnmatch(ii, 'bash_regexp_with_exclude')]
The above first generates a list from a given path and then drops the files that don't satisfy the shell pattern with the desired constraint.
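The same idea extends to several exclusion patterns at once. The sketch below works on plain file names so it runs anywhere. Both the sample names and the exclude_patterns list are hypothetical.

```python
import fnmatch

names = ["asd3442", "eee2314", "eph", "eph334", "notes.bak", "with_eph"]
exclude_patterns = ["eph*", "*.bak"]  # hypothetical patterns, for illustration

# Keep a name only if it matches none of the exclusion patterns.
# fnmatchcase() compares case-sensitively on every platform.
kept = [
    name for name in names
    if not any(fnmatch.fnmatchcase(name, pattern) for pattern in exclude_patterns)
]
print(kept)  # ['asd3442', 'eee2314', 'with_eph']
```

When working with full paths, apply the patterns to os.path.basename(path) so that a pattern like eph* is tested against the file name rather than the leading directories.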
Suppose you have this directory structure:
.
├── asd3442
├── eee2314
├── eph334
├── eph_dir
│ ├── asd330
│ ├── eph_file2
│ ├── exy123
│ └── file_with_eph
├── eph_file
├── not_eph_dir
│ ├── ephXXX
│ └── with_eph
└── not_eph_rest
You can use full globs to filter full path results with pathlib and a generator for the top level directory:
i_want=(fn for fn in Path(path_to).glob('*') if not fn.match('**/*/eph*'))
>>> list(i_want)
[PosixPath('/tmp/test/eee2314'), PosixPath('/tmp/test/asd3442'), PosixPath('/tmp/test/not_eph_rest'), PosixPath('/tmp/test/not_eph_dir')]
The pathlib method match uses globs to match a path object; The glob '**/*/eph*' is any full path that leads to a file with a name starting with 'eph'.
Alternatively, you can use the .name attribute with name.startswith('eph'):
i_want=(fn for fn in Path(path_to).glob('*') if not fn.name.startswith('eph'))
If you want only files, no directories:
i_want=(fn for fn in Path(path_to).glob('*') if fn.is_file() and not fn.match('**/*/eph*'))
# [PosixPath('/tmp/test/eee2314'), PosixPath('/tmp/test/asd3442'), PosixPath('/tmp/test/not_eph_rest')]
The same method works for recursive globs:
i_want=(fn for fn in Path(path_to).glob('**/*')
        if fn.is_file() and not fn.match('**/*/eph*'))
# [PosixPath('/tmp/test/eee2314'), PosixPath('/tmp/test/asd3442'),
#  PosixPath('/tmp/test/not_eph_rest'), PosixPath('/tmp/test/eph_dir/asd330'),
#  PosixPath('/tmp/test/eph_dir/file_with_eph'), PosixPath('/tmp/test/eph_dir/exy123'),
#  PosixPath('/tmp/test/not_eph_dir/with_eph')]
As mentioned by the accepted answer, you can't exclude patterns with glob, so the following is a method to filter your glob result.
The accepted answer is probably the best pythonic way to do things but if you think list comprehensions look a bit ugly and want to make your code maximally numpythonic anyway (like I did) then you can do this (but note that this is probably less efficient than the list comprehension method):
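import glob
import numpy as np

data_files = glob.glob("path_to_files/*.fits")
# Use the same directory prefix so the names compare equal to those in data_files.
light_files = np.setdiff1d( data_files, glob.glob("path_to_files/*BIAS*"))
light_files = np.setdiff1d(light_files, glob.glob("path_to_files/*FLAT*"))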
(In my case, I had some image frames, bias frames, and flat frames all in one directory and I just wanted the image frames)
If the position of the character isn't important, for example to exclude manifest files (names containing _ anywhere), you can combine glob with re (regular expression operations):
import glob
import re

for file in glob.glob('*.txt'):
    if re.match(r'.*\_.*', file):
        continue
    else:
        print(file)
Or, more elegantly, with a list comprehension:
filtered = [f for f in glob.glob('*.txt') if not re.match(r'.*\_.*', f)]
for match in filtered:
    print(match)
To exclude an exact word, you may want to implement a custom regex directive, which you then strip from the pattern before glob processing.
#!/usr/bin/env python3
import glob
import re

# glob (or fnmatch) does not support exact word matching. This is a custom directive to overcome this issue
glob_exact_match_regex = r"\[\^.*\]"
path = "[^exclude.py]*py"  # [^...] is a custom directive that excludes an exact match

# Process custom directive
try:  # Try to parse exact match directive
    exact_match = re.findall(glob_exact_match_regex, path)[0].replace('[^', '').replace(']', '')
except IndexError:
    exact_match = None
else:  # Remove custom directive
    path = re.sub(glob_exact_match_regex, "", path)

paths = glob.glob(path)
# Implement custom directive
if exact_match is not None:  # Exclude all paths with specified string
    paths = [p for p in paths if exact_match not in p]
print(paths)
import glob
import re

""" This is a path that should be excluded """
EXCLUDE = "/home/koosha/Documents/Excel"
files = glob.glob("/home/koosha/Documents/**/*.*" , recursive=True)
for file in files:
    if re.search(EXCLUDE,file):
        pass
    else:
        print(file)
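One caveat with the loop above: re.search() treats EXCLUDE as a regular expression and matches it anywhere in the path, so a sibling folder such as .../Excel_notes would be dropped too. A possible alternative is to compare path components instead. The sample paths below are invented, and PurePosixPath is used so the sketch runs on any OS.

```python
from pathlib import PurePosixPath

EXCLUDE = PurePosixPath("/home/koosha/Documents/Excel")

files = [
    "/home/koosha/Documents/report.txt",
    "/home/koosha/Documents/Excel/budget.xlsx",
    "/home/koosha/Documents/Excel/2020/q1.xlsx",
    "/home/koosha/Documents/Excel_notes/readme.md",
]  # hypothetical sample paths

# A file is excluded only if EXCLUDE is one of its parent directories.
kept = [f for f in files if EXCLUDE not in PurePosixPath(f).parents]
print(kept)
```

Here only the two files under Excel/ are filtered out. The file in Excel_notes/ is kept because Excel is not one of its parents.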