Using glob to find duplicate filenames with the same number in it - python

I am currently writing a script that cycles through all the files in a folder and renames them according to a naming convention.
What I would like to achieve is the following; if the script finds 2 files that have the same number in the filename (e.g. '101 test' and '101 real') it will move those two files to a different folder named 'duplicates'.
My original plan was to use glob to cycle through all the files in the folder and add every file containing a certain number to a list. The list would then be checked in length, and if the length exceeded 1 (i.e. there are 2 files with the same number), then the files would be located to this 'duplicates' folder. However for some reason this does not work.
Here is my code, I was hoping someone with more experience than me can give me some insight into how to achieve my goal, Thanks!:
app = askdirectory(parent=root)
for x in range(804):
listofnames = []
real = os.path.join(app, '*{}*').format(x)
for name in glob.glob(real):
listofnames.append(name)
y = len(listofnames)
if y > 1:
for names in listofnames:
path = os.path.join(app, names)
shutil.move(path,app + "/Duplicates")

A simple way is to collect filenames with numbers in a structure like this:
numbers = {
101: ['101 test', '101 real'],
93: ['hugo, 93']
}
and if a list in this dict is longer than one do the move.
import re, os
from collections import defaultdict
app = askdirectory(parent=root)
# a magic dict
numbers = defaultdict(list)
# list all files in this dir
for filename in os.listdir(app):
# \d+ means a decimal number of any length
match = re.search('\d+', filename)
if match is None:
# no digits found
continue
#extract the number
number = int(match.group())
# defaultdict magic
numbers[number].append(filename)
for number, filenames in numbers.items():
if len(filenames) < 2:
# not a dupe
continue
for filename in filenames:
shutil.move(os.path.join(app, filename),
os.path.join(app, "Duplicates"))
defaultdict magic is just a short hand for the following code:
if number not in numbers:
numbers.append(list())
numbers[number] = filename

Related

Find file in directory with the highest number in the filename

My question is closely related to Python identify file with largest number as part of filename
I want to append files to a certain directory. The name of the files are: file1, file2......file^n. This works if i do it in one go, but when i want to add files again, and want to find the last file added (in this case the file with the highest number), it recognises 'file6' to be higher than 'file100'.
How can i solve this.
import glob
import os
latest_file = max(sorted(list_of_files, key=os.path.getctime))
print latest_file
As you can see i tried looking at created time and i also tried looking at modified time, but these can be the same so that doesn't help.
EDIT my filenames have the extention ".txt" after the number
I'll try to solve it only using filenames, not dates.
You have to convert to integer before appling criteria or alphanum sort applies to the whole filename
Proof of concept:
import re
list_of_files = ["file1","file100","file4","file7"]
def extract_number(f):
s = re.findall("\d+$",f)
return (int(s[0]) if s else -1,f)
print(max(list_of_files,key=extract_number))
result: file100
the key function extracts the digits found at the end of the file and converts to integer, and if nothing is found returns -1
you don't need to sort to find the max, just pass the key to max directly
if 2 files have the same index, use full filename to break tie (which explains the tuple key)
Using the following regular expression you can get the number of each file:
import re
maxn = 0
for file in list_of_files:
num = int(re.search('file(\d*)', file).group(1)) # assuming filename is "filexxx.txt"
# compare num to previous max, e.g.
maxn = num if num > maxn else maxn
At the end of the loop, maxn will be your highest filename number.

Python: moving file to a newly created directory

I've got my script creating a bunch of files (size varies depending on inputs) and I want to be certain files in certain folders based on the filenames.
So far I've got the following but although directories are being created no files are being moved, I'm not sure if the logic in the final for loop makes any sense.
In the below code I'm trying to move all .png files ending in _01 into the sub_frame_0 folder.
Additionally is their someway to increment both the file endings _01 to _02 etc., and the destn folder ie. from sub_frame_0 to sub_frame_1 to sub_frame_2 and so on.
for index, i in enumerate(range(num_sub_frames+10)):
path = os.makedirs('./sub_frame_{}'.format(index))
# Slice layers into sub-frames and add to appropriate directory
list_of_files = glob.glob('*.tif')
for fname in list_of_files:
image_slicer.slice(fname, num_sub_frames) # Slices the .tif frames into .png sub-frames
list_of_sub_frames = glob.glob('*.png')
for i in list_of_sub_frames:
if i == '*_01.png':
shutil.move(os.path.join(os.getcwd(), '*_01.png'), './sub_frame_0/')
As you said, the logic of the final loop does not make sense.
if i == '*_01.ng'
It would evaluate something like 'image_01.png' == '*_01.png' and be always false.
Regexp should be the way to go, but for this simple case you just can slice the number from the file name.
for i in list_of_sub_frames:
frame = int(i[-6:-4]) - 1
shutil.move(os.path.join(os.getcwd(), i), './sub_frame_{}/'.format(frame))
If i = 'image_01.png' then i[-6:-4] would take '01', convert it to integer and then just subtract 1 to follow your schema.
A simple fix would be to check if '*_01.png' is in the file name i and change the shutil.move to include i, the filename. (It's also worth mentioning that iis not a good name for a filepath
list_of_sub_frames = glob.glob('*.png')
for i in list_of_sub_frames:
if '*_01.png' in i:
shutil.move(os.path.join(os.getcwd(), i), './sub_frame_0/')
Additionally is [there some way] to increment both the file endings _01 to _02 etc., and the destn folder ie. from sub_frame_0 to sub_frame_1 to sub_frame_2 and so on.
You could create file names doing something as simple as this:
for i in range(10):
#simple string parsing
file_name = 'sub_frame_'+str(i)
folder_name = 'folder_sub_frame_'+str(i)
Here is a complete example using regular expressions. This also implements the incrementing of file names/destination folders
import os
import glob
import shutil
import re
num_sub_frames = 3
# No need to enumerate range list without start or step
for index in range(num_sub_frames+10):
path = os.makedirs('./sub_frame_{0:02}'.format(index))
# Slice layers into sub-frames and add to appropriate directory
list_of_files = glob.glob('*.tif')
for fname in list_of_files:
image_slicer.slice(fname, num_sub_frames) # Slices the .tif frames into .png sub-frames
list_of_sub_frames = glob.glob('*.png')
for name in list_of_sub_frames:
m = re.search('(?P<fname>.+?)_(?P<num>\d+).png', name)
if m:
num = int(m.group('num'))+1
newname = '{0}_{1:02}.png'.format(m.group('fname'), num)
newpath = os.path.join('./sub_frame_{0:02}/'.format(num), newname)
print m.group() + ' -> ' + newpath
shutil.move(os.path.join(os.getcwd(), m.group()), newpath)

How to identify files that have increasing numbers and a similar form of filename?

I have a directory of files, some of them image files. Some of those image files are a sequence of images. They could be named image-000001.png, image-000002.png and so on, or perhaps 001_sequence.png, 002_sequence.png etc.
How can we identify images that would, to a human, appear by their names to be fairly obviously in a sequence? This would mean identifying only those image filenames that have increasing numbers and all have a similar form of filename.
The similar part of the filename would not be pre-defined.
You can use a regular expression to get files adhering to a certain pattern, e.g. .*\d+.*\.(jpg|png) for anything, then a number, then more anything, and an image extension.
files = ["image-000001.png", "image-000002.png", "001_sequence.png",
"002_sequence.png", "not an image 1.doc", "not an image 2.doc",
"other stuff.txt", "singular image.jpg"]
import re
image_files = [f for f in files if re.match(r".*\d+.*\.(jpg|png)", f)]
Now, group those image files by replacing the number with some generic string, e.g. XXX:
patterns = collections.defaultdict(list)
for f in image_files:
p = re.sub("\d+", "XXX", f)
patterns[p].append(f)
As a result, patterns is
{'image-XXX.png': ['image-000001.png', 'image-000002.png'],
'XXX_sequence.png': ['001_sequence.png', '002_sequence.png']}
Similarly, it should not be too hard to check whether all those numbers are consecutive, but maybe that's not really necessary after all. Note, however, that this will have problems discriminating numbered series such as "series1_001.jpg", and "series2_001.jpg".
What I would suggest is to use regex trough files and group matching pattern with list of associated numbers from the file-name.
Once this is done, just loop trough the dictionnaries keys and ensure that count of elements is the same that the range of matched numbers.
import re
from collections import defaultdict
from os import listdir
files = listdir("/the/path/")
found_patterns = defaultdict(list)
p = re.compile("(.*?)(\d+)(.*)\.png")
for f in files:
if p.match(f):
s = p.search(f)
pattern = s.group(1) + "___" + s.group(3)
num = int(s.group(2))
found_patterns[pattern].append(num)
for pattern, found in found_patterns.items():
mini, maxi = min(found), max(found)
if len(found) == maxi - mini + 1:
print("Pattern correct: %s" % pattern)
Of course, this will not work if there are some missing value but you can use some acceptance error.

Rename files in a folder based on names in a different folder

I have 2 folders, each with the same number of files. I want to rename the files in folder 2 based on the names of the files in folder 1. So in folder 1there might be three files titled:
Landsat_1,
Landsat_2,
Landsat_3
and in folder 2 these files are called:
1,
2,
3
and I want to rename them based on folder 1 names. I thought about turning the item names of each folder into a a .txt file and then turning the .txt file in a list and then renaming but I'm not sure if this is the best way to do it. Any suggestions?
Edit:
I have simplified the file names above, so just appending with Landsat_ wil not work for me.
The real file names in folder 1 are more like LT503002011_band1, LT5040300201_band1, LT50402312_band4. In folder 2 they are extract1, extract2, extract3. There are 500 files in total and in folder 2 it is just a running count of extract and a number for each file.
As someone said, "sort each list and zip them together in order to rename".
Notes:
the key() function extracts all of the numbers so that sorted() can sort the lists numerically based on the embedded numbers.
we sort both lists: os.listdir() returns files in arbitrary order.
The for loop is a common way to use zip: for itemA, itemB in zip(listA, listB):
os.path.join() provides portability: no worries about / or \
A typical invocation on Windows: python doit.py c:\data\lt c:\data\extract, assuming those are directories you have described.
A typical invocation on *nix: : python doit.py ./lt ./extract
import sys
import re
import os
assert len(sys.argv) == 3, "Usage: %s LT-dir extract-dir"%sys.argv[0]
_, ltdir, exdir = sys.argv
def key(x):
return [int(y) for y in re.findall('\d+', x)]
ltfiles = sorted(os.listdir(ltdir), key=key)
exfiles = sorted(os.listdir(exdir), key=key)
for exfile,ltfile in zip(exfiles, ltfiles):
os.rename(os.path.join(exdir,exfile), os.path.join(exdir,ltfile))
You might want to use the glob package which takes a filename pattern and outputs it into a list. For example, in that directory
glob.glob('*')
gives you
['Landsat_1', 'Landsat_2', 'Landsat_3']
Then you can loop over the filenames in the list and change the filenames accordingly:
import glob
import os
folderlist = glob.glob('*')
for folder in folderlist:
filelist = glob.glob(folder + '*')
for fil in filelist:
os.rename(fil, folder + fil)
Hope this helps
I went for more completeness :D.
# WARNING: BACKUP your data before running this code. I've checked to
# see that it mostly works, but I would want to test this very well
# against my actual data before I trusted it with that data! Especially
# if you're going to be modifying anything in the directories while this
# is running. Also, make sure you understand what this code is expecting
# to find in each directory.
import os
import re
main_dir_demo = 'main_dir_path'
extract_dir_demo = 'extract_dir_path'
def generate_paths(directory, filenames, target_names):
for filename, target_name in zip(filenames, target_names):
yield (os.path.join(directory, filename),
os.path.join(directory, target_name))
def sync_filenames(main_dir, main_regex, other_dir, other_regex, key=None):
main_files = [f for f in os.listdir(main_dir) if main_regex.match(f)]
other_files = [f for f in os.listdir(other_dir) if other_regex.match(f)]
# Do not proceed if there aren't the same number of things in each
# directory; better safe than sorry.
assert len(main_files) == len(other_files)
main_files.sort(key=key)
other_files.sort(key=key)
path_pairs = generate_paths(other_dir, other_files, main_files)
for other_path, target_path in path_pairs:
os.rename(other_path, target_path)
def demo_key(item):
"""Sort by the numbers in a string ONLY; not the letters."""
return [int(y) for y in re.findall('\d+', item)]
def main(main_dir, extract_dir, key=None):
main_regex = re.compile('LT\d+_band\d')
other_regex = re.compile('extract\d+')
sync_filenames(main_dir, main_regex, extract_dir, other_regex, key=key)
if __name__ == '__main__':
main(main_dir_demo, extract_dir_demo, key=demo_key)

How to read filenames in a folder and access them in an alphabetical and increasing number order?

I would like to ask how to efficiently handle accessing of filenames in a folder in the right order (alphabetical and increasing in number).
For example, I have the following files in a folder: apple1.dat, apple2.dat, apple10.dat, banana1.dat, banana2.dat, banana10.dat. I would like to read the contents of the files such that apple1.dat will be read first and banana10.dat will be read last.
Thanks.
This is what I did so far.
from glob import glob
files=glob('*.dat')
for list in files
# I read the files here in order
But as pointed out, apple10.dat comes before apple2.dat
from glob import glob
import os
files_list = glob(os.path.join(my_folder, '*.dat'))
for a_file in sorted(files_list):
# do whatever with the file
# 'open' or 'with' statements depending on your python version
try this one.
import os
def get_sorted_files(Directory)
filenamelist = []
for root, dirs, files in os.walk(Directory):
for name in files:
fullname = os.path.join(root, name)
filenamelist.append(fullname)
return sorted(filenamelist)
You have to cast the numbers to an int first. Doing it the long way would require breaking the names into the strings and numbers, casting the numbers to an int and sorting. Perhaps someone else has a shorter or more efficient way.
def split_in_two(str_in):
## go from right to left until a letter is found
## assume first letter of name is not a digit
for ctr in range(len(str_in)-1, 0, -1):
if not str_in[ctr].isdigit():
return str_in[:ctr+1], str_in[ctr+1:] ## ctr+1 = first digit
## default for no letters found
return str_in, "0"
files=['apple1.dat', 'apple2.dat', 'apple10.dat', 'apple11.dat',
'banana1.dat', 'banana10.dat', 'banana2.dat']
print sorted(files) ## sorted as you say
sort_numbers = []
for f in files:
## split off '.dat.
no_ending = f[:-4]
str_1, str_2 = split_in_two(no_ending)
sort_numbers.append([str_1, int(str_2), ".dat"])
sort_numbers.sort()
print sort_numbers

Categories

Resources