File name matching - middle of the string - python

I have a directory with files that follow the format: LnLnnnnLnnn.txt
where L = letters and n = numbers. E.g: p2c0789c001.txt
I would like to separate these files based on whether the second number (i.e. 0789) is odd or even.
I've only managed to get this to work if the second number ranges between 0001-0009 using the code:
odd_files = []
for root, dirs, filenames in os.walk('.'):
    for filename in fnmatch.filter(filenames, 'p2c000[13579]*.txt'):
        odd_files.append(os.path.join(root, filename))
This will return the files: ['./p2c0001c054.txt', './p2c0003c055.txt', './p2c0005c056.txt', './p2c0007c057.txt', './p2c0009c058.txt']
Any suggestions on how I could get this to work for any given four-digit number?

The easiest solution would be to expand your wildcard to match a wider array of things.
To that end I would probably do something like:
for filename in fnmatch.filter(filenames, '??????[13579]*.txt'):
This matches any six characters, then one odd digit from the character class, and then accepts anything before the .txt at the end.
This is a bit gross because, as it stands, aaaaaa3alkjfdhalkjfshglkjzsdhfgs.txt would match, and that is super gross. If you know that the data in the directories you are walking is well controlled, that might be OK. A better solution might be to specify things a bit more. This could be done with the following expression:
'[a-z][0-9][a-z][0-9][0-9][0-9][13579][a-z][0-9][0-9][0-9].txt'
The fnmatch.filter method uses Unix-style wildcards. That means you can use the following:
? - match any single character
* - matches anything from nothing to everything
[] - this matches a class of things, use a - for a range and ! for exclusion
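To see the difference between the loose and the strict wildcard, here is a minimal sketch run against a hand-made list of names. The sample names and the constants STRICT and LOOSE are made up for illustration. It uses fnmatch.fnmatchcase, which always compares case-sensitively; fnmatch.filter may fold case on some platforms.

```python
import fnmatch

# Hypothetical sample names, not taken from a real directory.
names = ['p2c0001c054.txt', 'p2c0789c001.txt', 'p2c0022c056.txt',
         'aaaaaa3junk.txt']

STRICT = '[a-z][0-9][a-z][0-9][0-9][0-9][13579][a-z][0-9][0-9][0-9].txt'
LOOSE = '??????[13579]*.txt'

# fnmatchcase compares case-sensitively regardless of platform.
strict_odd = [n for n in names if fnmatch.fnmatchcase(n, STRICT)]
loose_odd = [n for n in names if fnmatch.fnmatchcase(n, LOOSE)]
print(strict_odd)
print(loose_odd)
```

The loose pattern also lets the junk name through, because it only checks that the seventh character is an odd digit.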

Would this do it?
import re
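regex = re.compile(r"[a-z][0-9][a-z]([0-9]{4})[a-z][0-9]{3}\.txt$")
# filenames is the list of names from os.walk; names that do not match are skipped
matches = (regex.match(name) for name in filenames)
odd_files = [m.group(0) for m in matches if m and int(m.group(1)) % 2 == 1]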

If it's getting a little hairy, you could always turn that into a generator and code the tests by hand:
def odd_files_generator():
    for root, dirs, filenames in os.walk('.'):
        for filename in filenames:
            # index 6 is the last digit of the four-digit number
            if filename[6:7] in tuple('13579'):
                yield filename
odd_files = list(odd_files_generator())
If your test is growing exceedingly hard to express tersely, replace the if filename ... line with your explicit test code.

There's no particular magic to constructing this kind of filter. It just
requires carefully constructing the appropriate regular expression and testing
against it. When using complex patterns with a lot of repetitive components,
errors can easily creep in. I like to define helper functions that make the
specification more human-readable and easier to modify later if need be.
import re
import os
# helper functions for legible re construction
LETTER = lambda n='': '({0}{1})'.format('[A-Za-z]', n)
NUM = lambda n='': '({0}{1})'.format(r'\d', n)
FILENAME = LETTER() + NUM() + LETTER() + NUM('{4}') + LETTER() + NUM('{3}') + r'\.txt'
FILENAME_RE = re.compile(FILENAME)
is_odd = lambda n: int(n) % 2 > 0
def odd_nnnn(f):
    """
    Determine if the given filename `f` matches our desired LnLnnnnLnnn.txt pattern
    with the second group of numbers (nnnn) odd.
    """
    m = FILENAME_RE.search(f)
    return m is not None and is_odd(m.group(4))
if __name__ == '__main__':
    print("Search pattern:", FILENAME)
    files = ['./p2c0001c054.txt', './p2c0001c055.txt', './p2c0003c055.txt',
             './p2c0005c056.txt', './p2c0022c056.txt', './p2c0004c056.txt',
             './p2c0007c057.txt', './p2c0009c058.txt', './p2c8888c056.txt']
    files = [os.path.normpath(f) for f in files]
    root = '/users/test/whatever'
    odd_paths = [os.path.join(root, f) for f in files if odd_nnnn(f)]
    print(odd_paths)
The only real downside to this is that it's a little more verbose, especially compared to a hyper-compact answer like Brad Beattie's.
[Update] It later occurred to me that a more compact way to define the regular expression might be:
FILENAME = r"LnL(nnnn)Lnnn\.txt"
FILENAME_PAT = FILENAME.replace('L', r'[A-Za-z]').replace('n', r'\d')
FILENAME_RE = re.compile(FILENAME_PAT)
This more closely follows the original 'LnLnnnnLnnn.txt' description. The match expression would have to change from m.group(4) to m.group(1), because just one group is captured this way.
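Since the question asks to separate the files by odd and even, here is a minimal sketch that uses the compact template to fill both lists in one pass. The function name split_odd_even and the sample names are assumptions made for illustration; fullmatch is used so that names with extra leading or trailing characters are skipped.

```python
import re

# Compact template: L = letter, n = digit; the parentheses capture nnnn.
TEMPLATE = r"LnL(nnnn)Lnnn\.txt"
FILENAME_RE = re.compile(TEMPLATE.replace('L', '[A-Za-z]').replace('n', r'\d'))

def split_odd_even(names):
    """Return (odd, even) lists; names that do not fit the pattern are skipped."""
    odd, even = [], []
    for name in names:
        m = FILENAME_RE.fullmatch(name)
        if m is None:
            continue
        (odd if int(m.group(1)) % 2 else even).append(name)
    return odd, even

# Hypothetical sample names.
odd, even = split_odd_even(['p2c0789c001.txt', 'p2c0022c056.txt',
                            'p2c8888c056.txt', 'notes.txt'])
print(odd)
print(even)
```

To use it with os.walk, pass each filenames list to split_odd_even and join the results with root.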

Related

Python: how to search for specific "string" in directory name (not individual file names)

I want to create a list of all the filepath names that match a specific string e.g. "04_DEM" so I can do further processing on the files inside those directories?
e.g.
INPUT
C:\directory\NewZealand\04DEM\DEM_CD23_1232.tif
C:\directory\Australia\04DEM\DEM_CD23_1233.tif
C:\directory\NewZealand\05DSM\DSM_CD23_1232.tif
C:\directory\Australia\05DSM\DSM_CD23_1232.tif
WANTED OUTPUT
C:\directory\NewZealand\04DEM\
C:\directory\Australia\04DEM\
This makes sure that only those files are processed, as some other files in the directories also have the same string "DEM" included in their filename, which I do not want to modify.
This is my bad attempt due to being a rookie with Py code
import os
for dirnames in os.walk('D:\Canterbury_2017Copy'):
print dirnames
if dirnames=='04_DEM' > listofdirectoriestoprocess.txt
print "DONE CHECK TEXT FILE"
You can use os.path for this:
import os
lst = [r'C:\directory\NewZealand\04DEM\DEM_CD23_1232.tif',
       r'C:\directory\Australia\04DEM\DEM_CD23_1233.tif',
       r'C:\directory\NewZealand\05DSM\DSM_CD23_1232.tif',
       r'C:\directory\Australia\05DSM\DSM_CD23_1232.tif']
def filter_paths(lst, x):
    return [os.path.split(i)[0] for i in lst if os.path.normpath(i).split(os.sep)[3] == x]
res = list(filter_paths(lst, '04DEM'))
# ['C:\\directory\\NewZealand\\04DEM',
#  'C:\\directory\\Australia\\04DEM']
Use in to check if a required string is in another string.
This is one quick way:
new_list = []
for path in path_list:
    if '04DEM' in path:
        new_list.append(path)
Demo:
s = 'C:/directory/NewZealand/04DEM/DEM_CD23_1232.tif'
if '04DEM' in s:
    print(True)
# True
Make sure you use / or \\ as directory separator instead of \ because the latter escapes characters.
First, you select via regex using re, and then use pathlib:
import os
import re
import pathlib
pattern = re.compile('04DEM')
# pattern.search() finds the pattern anywhere in the string.
# pattern.match() only matches at the start; pattern.fullmatch() must match the whole string.
# Apply the correct function to your use case.
files = [s for s in list_of_files if pattern.search(s)]
all_pruned_paths = set()
for p in files:
    total = ""
    # walk the path one component at a time and stop at the matching directory
    for d in pathlib.Path(p).parts:
        total = os.path.join(total, d)
        if pattern.search(d):
            break
    all_pruned_paths.add(total)
result = list(all_pruned_paths)
This is more robust than using in because you might need to form more complicated queries in the future.
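As one more variation, here is a minimal sketch that compares whole directory names instead of substrings, so a file called DEM_04DEM.tif in another folder would not be picked up. It assumes Windows-style paths and uses pathlib.PureWindowsPath so that it behaves the same on any platform; the helper name dirs_named is made up for illustration.

```python
from pathlib import PureWindowsPath

def dirs_named(paths, name):
    """Return, in first-seen order, each path cut off at the directory called `name`."""
    found = []
    for p in paths:
        parts = PureWindowsPath(p).parts
        # Only look at directory components; the last part is the file name.
        if name in parts[:-1]:
            cut = str(PureWindowsPath(*parts[:parts.index(name) + 1]))
            if cut not in found:
                found.append(cut)
    return found

paths = [r'C:\directory\NewZealand\04DEM\DEM_CD23_1232.tif',
         r'C:\directory\Australia\04DEM\DEM_CD23_1233.tif',
         r'C:\directory\NewZealand\05DSM\DSM_CD23_1232.tif']
print(dirs_named(paths, '04DEM'))
```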

how to cut the end of a string by some condition in python?

I have searched possible ways but I am unable to mix those up yet. I have a string that is a path to the image.
myString= "D:/Train/16_partitions_annotated/partition1/images/AAAAA/073-1_00191.jpeg"
What I want to do is replace images with IMAGES and cut off the 073-1_00191.jpeg part at the end. Thus, the new string should be
newString = "D:/Train/16_partitions_annotated/partition1/IMAGES/AAAAA/"
And the chopped part (073-1_00191.jpeg) will be used separately as the name of processed image. The function .replace() doesn't work here as I need to provide path and filename as separate parameters.
The reason why I want to do is that I am accessing images through their paths and doing some stuff on them and when saving them I need to create another directory (in this case IMAGES) and the next directories after that (in this case AAAAA) should remain the same ( together with the name of corresponding image).
Note that images may have different names and extensions
If something is not clear by my side please ask, I will try to clear up
As alluded to in the comments, os.path is useful for manipulating paths represented as strings.
>>> import os
>>> myString= "D:/Train/16_partitions_annotated/partition1/images/AAAAA/073-1_00191.jpeg"
>>> dirname, basename = os.path.split(myString)
>>> dirname
'D:/Train/16_partitions_annotated/partition1/images/AAAAA'
>>> basename
'073-1_00191.jpeg'
At this point, how you want to handle capitalizing "images" is a function of your broader goal. If you want to simply capitalize that specific word, dirname.replace('images', 'IMAGES') should suffice. But you seem to be asking for a more generalized way to capitalize the second to last directory in the absolute path:
>>> def cap_penultimate(dirname):
...     h, t = os.path.split(dirname)
...     hh, ht = os.path.split(h)
...     return os.path.join(hh, ht.upper(), t)
...
>>> cap_penultimate(dirname)
'D:/Train/16_partitions_annotated/partition1/IMAGES/AAAAA'
It's a game of slicing. Here you can try this:
myString = "D:/Train/16_partitions_annotated/partition1/images/AAAAA/073-1_00191.jpeg"
myString1 = myString.split('/')
pre_data = myString1[:myString1.index('images')]
after_data = myString1[myString1.index('images'):]
# swap in the new directory name and keep only the directory that follows it
after_data = ['IMAGES'] + after_data[1:2]
print("/".join(pre_data + after_data))
output:
D:/Train/16_partitions_annotated/partition1/IMAGES/AAAAA
The simple way:
myString = "D:/Train/16_partitions_annotated/partition1/images/AAAAA/073-1_00191.jpeg"
a = myString.rfind('/')
filename = myString[a+1:]
restofstring = myString[0:a]
alteredstring = restofstring.replace('images', 'IMAGES')
print(alteredstring)
output:
D:/Train/16_partitions_annotated/partition1/IMAGES/AAAAA
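For comparison, here is a minimal sketch that returns the directory and the file name as two separate values and only renames a path component that is exactly equal to the old name, which str.replace cannot guarantee. It assumes forward-slash paths like the one in the question, so it uses pathlib.PurePosixPath; the helper name split_and_rename is made up for illustration.

```python
from pathlib import PurePosixPath

def split_and_rename(path, old, new):
    """Return (directory with component `old` replaced by `new`, file name)."""
    p = PurePosixPath(path)
    # Compare whole components so that e.g. 'my_images' is left untouched.
    parts = [new if part == old else part for part in p.parent.parts]
    return str(PurePosixPath(*parts)) + '/', p.name

new_dir, filename = split_and_rename(
    "D:/Train/16_partitions_annotated/partition1/images/AAAAA/073-1_00191.jpeg",
    "images", "IMAGES")
print(new_dir)
print(filename)
```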

How to perform a case-insensitive search for files of a given suffix?

I'm looking for the equivalent of find $DIR -iname '*.mp3', and I don't want to do the kooky ['mp3', 'Mp3', 'MP3', etc] thing. But I can't figure out how to combine the re.IGNORECASE stuff with the simple endswith() approach. My goal is to not miss a single file, and I'd like to eventually expand this to other media/file types/suffixes.
import os
import re
suffix = ".mp3"
mp3_count = 0
for root, dirs, files in os.walk("/Volumes/audio"):
    for file in files:
        # if file.endswith(suffix):
        if re.findall('mp3', suffix, flags=re.IGNORECASE):
            mp3_count += 1
print(mp3_count)
TIA for any feedback
Don't bother with os.walk. Learn to use the easier, awesome pathlib.Path instead. Like so:
from pathlib import Path
suffix = ".mp3"
mp3_count = 0
p = Path('/Volumes')/'audio'  # note the easy path creation syntax
# OR even:
p = Path('/')/'Volumes'/'audio'
for subp in p.rglob('*'):  # recursively iterate all items matching the glob pattern
    # .suffix property refers to .ext extension
    ext = subp.suffix
    # use the .lower() method to get lowercase version of extension
    if ext.lower() == suffix:
        mp3_count += 1
print(mp3_count)
"One-liner", if you're into that sort of thing (multiple lines for clarity):
sum(1 for subp in (Path('/Volumes')/'audio').rglob('*')
    if subp.suffix.lower() == suffix)
You can try this :)
import os
# import re
suffix = "mp3"
mp3_count = 0
for root, dirs, files in os.walk("/Volumes/audio"):
    for file in files:
        # if file.endswith(suffix):
        if file.split('.')[-1].lower() == suffix:
            mp3_count += 1
print(mp3_count)
Python's str.split() separates the string into a list on the given delimiter, and [-1] picks the last element of that list, which is the suffix.
The regex equivalent of .endswith is the $ anchor.
To fix your example above, test the file name (not suffix) against an anchored pattern:
if re.search('mp3$', file, flags=re.IGNORECASE):
Though it might be more accurate to do this:
if re.search(r'\.mp3$', file, flags=re.IGNORECASE):
which makes sure that the filename ends with .mp3 rather than picking up files such as test.amp3.
This is a pretty good example of a situation that doesn't really require regex - so while you're welcome to learn from these examples, it's worth considering the alternatives provided by other answerers.
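One regex-free alternative, sketched here under the assumption that lowercasing the file name is acceptable: str.endswith accepts a tuple, so the same loop extends to several suffixes. The function name count_by_suffix and the file names in the demo are made up for illustration; the demo builds a throwaway directory so it can run anywhere.

```python
import os
import tempfile

def count_by_suffix(top, suffixes):
    """Count files under `top` whose names end with any of `suffixes`, ignoring case."""
    wanted = tuple(s.lower() for s in suffixes)
    count = 0
    for root, dirs, files in os.walk(top):
        # str.endswith accepts a tuple, so several suffixes are tested at once.
        count += sum(1 for name in files if name.lower().endswith(wanted))
    return count

# Self-contained demo on a temporary directory with made-up file names.
with tempfile.TemporaryDirectory() as top:
    os.mkdir(os.path.join(top, 'sub'))
    for name in ('a.mp3', 'b.MP3', 'sub/c.Mp3', 'sub/d.flac', 'e.txt', 'test.amp3'):
        open(os.path.join(top, name), 'w').close()
    mp3_total = count_by_suffix(top, ['.mp3'])
    media_total = count_by_suffix(top, ['.mp3', '.FLAC'])
print(mp3_total, media_total)
```

Including the leading dot in each suffix keeps names like test.amp3 out of the count.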

How to identify files that have increasing numbers and a similar form of filename?

I have a directory of files, some of them image files. Some of those image files are a sequence of images. They could be named image-000001.png, image-000002.png and so on, or perhaps 001_sequence.png, 002_sequence.png etc.
How can we identify images that would, to a human, appear by their names to be fairly obviously in a sequence? This would mean identifying only those image filenames that have increasing numbers and all have a similar form of filename.
The similar part of the filename would not be pre-defined.
You can use a regular expression to get files adhering to a certain pattern, e.g. .*\d+.*\.(jpg|png) for anything, then a number, then more anything, and an image extension.
import collections
import re
files = ["image-000001.png", "image-000002.png", "001_sequence.png",
         "002_sequence.png", "not an image 1.doc", "not an image 2.doc",
         "other stuff.txt", "singular image.jpg"]
image_files = [f for f in files if re.match(r".*\d+.*\.(jpg|png)", f)]
Now, group those image files by replacing the number with some generic string, e.g. XXX:
patterns = collections.defaultdict(list)
for f in image_files:
    p = re.sub(r"\d+", "XXX", f)
    patterns[p].append(f)
As a result, patterns is
{'image-XXX.png': ['image-000001.png', 'image-000002.png'],
'XXX_sequence.png': ['001_sequence.png', '002_sequence.png']}
Similarly, it should not be too hard to check whether all those numbers are consecutive, but maybe that's not really necessary after all. Note, however, that this will have problems discriminating numbered series such as "series1_001.jpg", and "series2_001.jpg".
What I would suggest is to run a regex over the files and group each matching pattern with the list of numbers found in the file names.
Once this is done, just loop through the dictionary's keys and check that the count of elements equals the range of matched numbers.
import re
from collections import defaultdict
from os import listdir
files = listdir("/the/path/")
found_patterns = defaultdict(list)
p = re.compile(r"(.*?)(\d+)(.*)\.png")
for f in files:
    s = p.match(f)
    if s:
        # key: the file name with its first number replaced by a placeholder
        pattern = s.group(1) + "___" + s.group(3)
        num = int(s.group(2))
        found_patterns[pattern].append(num)
for pattern, found in found_patterns.items():
    mini, maxi = min(found), max(found)
    if len(found) == maxi - mini + 1:
        print("Pattern correct: %s" % pattern)
Of course, this will not work if some values are missing, but you can allow for some margin of error.
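The margin of error could be sketched roughly as follows. The max_missing parameter and the function name find_sequences are assumptions made for illustration; a group is accepted when no more than max_missing numbers are absent between its smallest and largest value, and a lone file is never treated as a sequence.

```python
import re
from collections import defaultdict

NUMBERED = re.compile(r"(.*?)(\d+)(.*)\.png$")

def find_sequences(names, max_missing=0):
    """Map each name pattern to its sorted numbers, keeping only near-complete runs."""
    groups = defaultdict(list)
    for name in names:
        m = NUMBERED.match(name)
        if m:
            groups[m.group(1) + "___" + m.group(3)].append(int(m.group(2)))
    sequences = {}
    for pattern, nums in groups.items():
        expected = max(nums) - min(nums) + 1
        # A lone file is not a sequence; allow up to max_missing gaps in the run.
        if len(nums) > 1 and expected - len(set(nums)) <= max_missing:
            sequences[pattern] = sorted(nums)
    return sequences

# Hypothetical file names: image-000003.png is deliberately absent.
names = ["image-000001.png", "image-000002.png", "image-000004.png",
         "001_sequence.png", "002_sequence.png", "logo7.png"]
print(find_sequences(names))
print(find_sequences(names, max_missing=1))
```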

Rename a group of files in python

I'm trying to rename some files in a directory using Python. I've looked around the forums here, and because I'm a newbie, I can't adapt what I need from what is out there.
Say in a directory I have a group of files called
FILENAME_002_S_0295_MR_3_Plane_Localizer__br_raw_20110602125225754_7_S110472_I238620.jpg
FILENAME_002_S_0295_MR_3_Plane_Localizer__br_raw_20110602125236347_8_S110472_I238620.jpg
FILENAME_002_S_0295_MR_3_Plane_Localizer__br_raw_20110602125236894_5_S110472_I238621.jpg
FILENAME_002_S_0295_MR_3_Plane_Localizer__br_raw_20110602125248691_6_S110472_I238621.jpg
and I want to remove "125225754", "125236347", "125236894" and "125248691" here so my resulting filename will be
FILENAME_002_S_0295_MR_3_Plane_Localizer__br_raw_20110602_7_S110472_I238620.jpg
FILENAME_002_S_0295_MR_3_Plane_Localizer__br_raw_20110602_8_S110472_I238620.jpg
FILENAME_002_S_0295_MR_3_Plane_Localizer__br_raw_20110602_5_S110472_I238621.jpg
FILENAME_002_S_0295_MR_3_Plane_Localizer__br_raw_20110602_6_S110472_I238621.jpg
I'm trying to use the os.path.split but it's not working properly.
I have also considered using string manipulations, but have not been successful with that either.
Any help would be greatly appreciated. Thanks.
os.path.split splits a path (/home/mattdmo/work/projects/python/2014/website/index.html) into its component directories and file name.
As @wim suggested, if the file names are all exactly the same length, you can use string slicing to cut out whatever occurs between two indexes, then join the remaining pieces back together. So, in your example,
filename = "FILENAME_002_S_0295_MR_3_Plane_Localizer__br_raw_20110602125248691_6_S110472_I238621.jpg"
newname = filename[:57] + filename[66:]
print(newname)
# FILENAME_002_S_0295_MR_3_Plane_Localizer__br_raw_20110602_6_S110472_I238621.jpg
This takes the first 57 characters of the string (indexes 0 through 56; remember that Python string indexes are 0-based and the end of a slice is exclusive) and joins them to everything from index 66, the 67th character, onward.
Now that you can do this, just put all the filenames into a list and iterate over it to get your new filenames:
import os
filelist = os.listdir('.') # get files in current directory
for filename in filelist:
    if ".jpg" in filename: # only process pictures
        newname = filename[:57] + filename[66:]
        print(filename + " will be renamed as " + newname)
        os.rename(filename, newname)
Can we assume that the files are all the same name up to the date _20110602[difference here]?
If that's the case then it's actually fairly easy to do.
First you need the index of that difference. Starting from character 0, which is 'F' in this case, count right until you hit the first difference. You can do this programmatically:
s1 = 'String1'
s2 = 'String2'
i = 0
# advance while both strings still have characters and those characters agree
while i < len(s1) and i < len(s2) and s1[i] == s2[i]:
    i += 1
And i is now set to the first difference of s1 and s2 (or if there is none, their length).
From here you know that you want to strip everything from this index to the following _.
j = i
# advance to the next underscore at or after i
while j < len(s1) and s1[j] != '_':
    j += 1
# j is the index of the _ character after i
p1 = s1[:i] # Everything up to i
p2 = s1[j:] # Everything from j onward, including the _
s1 = p1 + p2
# Do the same for s2, or even better, do this in a loop.
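The only caveat here is that the names have to be identical up to this point for this to work, and the parts you want to remove must differ at their very first character; the sample timestamps in the question all begin with 1252, so those four digits would be treated as part of the shared prefix. If the names are all the same length then this is still fairly easy, but you have to figure out the indices yourself rather than using the string difference method.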
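The string-difference idea can be sketched with os.path.commonprefix, which compares strings character by character and so does the same job as the first loop. The function name strip_after_common_prefix and the shortened sample names are assumptions made for illustration. The second call shows what happens when the variable parts share leading digits, as the timestamps in the question do.

```python
import os

def strip_after_common_prefix(names, sep='_'):
    """Drop the part of each name between the shared prefix and the next `sep`."""
    prefix_len = len(os.path.commonprefix(names))
    stripped = []
    for name in names:
        end = name.find(sep, prefix_len)
        # Leave the name alone if there is no separator after the prefix.
        stripped.append(name if end == -1 else name[:prefix_len] + name[end:])
    return stripped

# Hypothetical shortened names whose variable parts differ at their first digit.
clean = strip_after_common_prefix(['raw_20110602125225754_7.jpg',
                                   'raw_20110602936347111_8.jpg'])
# Names shaped like the question's: both timestamps begin with 1252.
leaky = strip_after_common_prefix(['raw_20110602125225754_7.jpg',
                                   'raw_20110602125236347_8.jpg'])
print(clean)
print(leaky)
```

In the second case the digits 1252 survive in the new names, so for data like the question's a fixed slice or a regex on the date is the safer choice.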
If you always have exact string: '20110602' in the file names stored in 'my_directory' folder:
import re #for regular expression
from os import rename
from glob import glob
for filename in glob('my_directory/*.jpg'):
    match = re.search('20110602', filename)
    if match:
        newname = re.sub(r'20110602[0-9]+_','20110602_', filename)
        rename(filename, newname)
A more general code to match any YYYYMMDD (or YYYYDDMM):
import re #for regular expression
from os import rename
from glob import glob
for filename in glob('my_directory/*.jpg'):
    match = re.search(r'\d{4}\d{2}\d{2}\d+_', filename)
    if match:
        newname = re.sub(r'(\d{4}\d{2}\d{2})(\d+)(_)', '\\1'+'\\3', filename)
        rename(filename, newname)
'\\1': This is match.group(1) that refers to the first set of parentheses
'\\3': This is match.group(3) that refers to the third set of parentheses
\d or [0-9]: are the same. They match any digit
{number}: the number of times the previous token (in this case a digit) is repeated
+ : 1 or more of previous expression (in this case a digit)
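Before renaming anything on disk, it can help to preview the substitution. Here is a minimal dry-run sketch that only computes the (old, new) pairs; the function name plan_renames and the constant STAMP are made up for illustration, and the pattern assumes the date is the first run of at least nine digits followed by an underscore.

```python
import re

# Date (8 digits) followed by extra digits and an underscore; keep only the date.
STAMP = re.compile(r'(\d{8})\d+_')

def plan_renames(names):
    """Return (old, new) pairs for names that would change; nothing is renamed here."""
    plan = []
    for old in names:
        new = STAMP.sub(r'\1_', old, count=1)
        if new != old:
            plan.append((old, new))
    return plan

names = ["FILENAME_002_S_0295_MR_3_Plane_Localizer__br_raw_20110602125225754_7_S110472_I238620.jpg",
         "notes.txt"]
for old, new in plan_renames(names):
    print(old, '->', new)
```

Once the printed pairs look right, each pair can be passed to os.rename.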
