How to compare two files from filelist using regex? - python

The file is reading from a folder with os.listdir. After I entered regex of the file r'^[1-9\w]{2}_[1-9\w]{4}[1][7][\d\w]+\.[\d\w]+' and the similar for another file r'^[1-9\w]{2}_[1-9\w]{4}[1][8]+' . The condition of the comparison is that when the first seven symbols are matching then os.remove(os.path.join(dir_name, each)) . Example of a few: bh_txbh171002.xml, bh_txbh180101.xml, ce_txce170101.xml...
As I understand it, we can't use match because there is no string to match against and it returns None; moreover, it only compares a file name with the regex. I am thinking about a condition like if folder.itself(file) and file.startswith("......."): but I can't figure out how to refer to the first seven symbols of the file names that should be compared.
Honestly, I placed my worst version of the code in that request, and since then I have learnt a little more: the link - click to check it out

Regex is the wrong tool here I do not have your files so I create randomized demodata:
import random
import string

random.seed(42)  # make random repeatable


def generateFileNames(amount):
    """Yield 2*amount names of the form XX_XXXX (X in [a-z0-9]), with likely duplicates."""
    def rndName():
        """Generate one random name XX_XXXX with X in [a-z0-9], as a list of chars."""
        characters = string.ascii_lowercase + string.digits
        return random.choices(characters, k=2) + ['_'] + random.choices(characters, k=4)

    for _ in range(amount):  # create 2*amount names, some duplicates
        name = rndName()
        yield ''.join(name)  # yield name once
        if random.randint(1, 10) > 3:  # more likely to get same names twice
            yield ''.join(name)  # same name twice
        else:
            yield ''.join(rndName())  # different 2nd name


def generateNumberParts(amount):
    """Yield 2*amount 6-digit strings; some pairs share digits and start with 17/18."""
    def rndNums(nr):
        """Generate nr random digit characters as a list of strings."""
        return random.choices(string.digits, k=nr)

    for _ in range(amount):
        choi = rndNums(4)
        # yielding 18 first to demonstrate that the sorting later works
        yield ''.join(['18'] + choi)  # 18xxxx numbers
        if random.randint(1, 10) > 5:
            yield ''.join(['17'] + choi)  # 17xxxx with the same tail digits
        else:
            yield ''.join(rndNums(6))  # make it something other


# half the amount of files generated
m = 10

# generate filenames — the original was missing the closing parenthesis of zip(...)
filenames = [''.join(x) + '.xml' for x in zip(generateFileNames(m),
                                              generateNumberParts(m))]
Now I have my names as list and can start to find out which are dupes with newer timestamps:
# Index the file names by their first 7 characters ("xx_yyyy"): each key maps
# to the list of file names sharing that prefix.
fileDict = {}
for name in filenames:
    fileDict.setdefault(name[0:7], []).append(name)
for k, v in fileDict.items():
    print (k, " " , v)

# Collect the files to delete: within every group that holds more than one
# entry, keep only the newest (highest 17/18 number) and mark the rest.
filesToDelete = []
for k, v in fileDict.items():
    if len(v) == 1:
        # single file for this prefix - nothing to delete
        continue
    print(v, " to ", end = "" )  # debugging output
    # sort by the two digits that follow the 7-char prefix (17 before 18)
    v.sort(key=lambda entry: int(entry[7:9]))
    print (v)  # debugging output
    filesToDelete.extend(v[:-1])  # everything except the newest file
print("")
print(filesToDelete)
Output:
# the created filenames in your dict by "key [values]"
xa_ji0y ['xa_ji0y188040.xml', 'xa_ji0y501652.xml']
v3_a3zm ['v3_a3zm181930.xml']
mm_jbqe ['mm_jbqe171930.xml']
ck_w5ng ['ck_w5ng180679.xml', 'ck_w5ng348136.xml']
zy_cwti ['zy_cwti184296.xml', 'zy_cwti174296.xml']
41_iblj ['41_iblj182983.xml', '41_iblj172983.xml']
5x_ff0t ['5x_ff0t187453.xml']
sd_bdw2 ['sd_bdw2177453.xml']
vn_vqjt ['vn_vqjt189618.xml', 'vn_vqjt179618.xml']
ep_q85j ['ep_q85j185198.xml', 'ep_q85j175198.xml']
vf_1t2t ['vf_1t2t180309.xml', 'vf_1t2t089040.xml']
11_ertj ['11_ertj188425.xml', '11_ertj363842.xml']
# sorting the names by its integer at 8/9 position of name
['xa_ji0y188040.xml','xa_ji0y501652.xml'] to ['xa_ji0y188040.xml','xa_ji0y501652.xml']
['ck_w5ng180679.xml','ck_w5ng348136.xml'] to ['ck_w5ng180679.xml','ck_w5ng348136.xml']
['zy_cwti184296.xml','zy_cwti174296.xml'] to ['zy_cwti174296.xml','zy_cwti184296.xml']
['41_iblj182983.xml','41_iblj172983.xml'] to ['41_iblj172983.xml','41_iblj182983.xml']
['vn_vqjt189618.xml','vn_vqjt179618.xml'] to ['vn_vqjt179618.xml','vn_vqjt189618.xml']
['ep_q85j185198.xml','ep_q85j175198.xml'] to ['ep_q85j175198.xml','ep_q85j185198.xml']
['vf_1t2t180309.xml','vf_1t2t089040.xml'] to ['vf_1t2t089040.xml','vf_1t2t180309.xml']
['11_ertj188425.xml','11_ertj363842.xml'] to ['11_ertj188425.xml','11_ertj363842.xml']
# list of files to delete
['xa_ji0y188040.xml', 'ck_w5ng180679.xml', 'zy_cwti174296.xml', '41_iblj172983.xml',
'vn_vqjt179618.xml', 'ep_q85j175198.xml', 'vf_1t2t089040.xml', '11_ertj188425.xml']

I can't understand what's wrong with my code. There I defined the list from certain folder, so that I could work at the strings in each file, right? Then I applied the conditions for filtering and further choice of the one file to delete.
import os

dir_name = "/Python/Test_folder/Schems"
filenames = os.listdir(dir_name)

# setdefault() is a dict method: the original called it on the `filenames`
# list, which raises AttributeError.  Group the names into a dict instead,
# keyed by their first 7 characters.
fileDict = {}
for name in filenames:
    fileDict.setdefault(name[0:7], []).append(name)

# Create the result list ONCE, outside the loop; the original re-created it
# on every iteration (and its first for-loop over items() did nothing).
filesToDelete = []
for k, v in fileDict.items():
    if len(v) == 1:
        continue
    # sort each group by the two-digit year field right after the prefix
    v.sort(key=lambda x: int(x[7:9]))
    # mark everything but the newest file of the group for deletion
    filesToDelete.extend(v[:-1])

Related

Store smallest number from a list based on criteria

I have a list of file names as strings where I want to store, in a list, the file name with the minimum ending number relative to file names that have the same beginning denotation.
Example: For any file names in the list beginning with '2022-04-27_Cc1cPL3punY', I'd only want to store the file name with the minimum value of the number at the end. In this case, it would be the file name with 2825288523641594007, and so on for other beginning denotations.
# Burst-shot file names: "<date>_<post id>_<per-image number>.jpg".
# Within each post id the numbers appear in ascending order.
files = ['2022-04-27_Cc1a6yWpUeQ_2825282726106954381.jpg',
'2022-04-27_Cc1a6yWpUeQ_2825282726106985382.jpg',
'2022-04-27_Cc1cPL3punY_2825288523641594007.jpg',
'2022-04-27_Cc1cPL3punY_2825288523641621697.jpg',
'2022-04-27_Cc1cPL3punY_2825288523650051140.jpg',
'2022-04-27_Cc1cPL3punY_2825288523650168421.jpg',
'2022-04-27_Cc1cPL3punY_2825288523708854776.jpg',
'2022-04-27_Cc1cPL3punY_2825288523717189707.jpg',
'2022-04-27_Cc1dN3Rp0es_2825292832374568690.jpg',
'2022-04-27_Cc1dN3Rp0es_2825292832383025904.jpg',
'2022-04-27_Cc1dN3Rp0es_2825292832383101420.jpg',
'2022-04-27_Cc1dN3Rp0es_2825292832383164193.jpg',
'2022-04-27_Cc1dN3Rp0es_2825292832399945744.jpg',
'2022-04-27_Cc1dN3Rp0es_2825292832458472617.jpg']
Given that your files would already be sorted in ascending order from your OS/file manager, you can just find the first one from each common prefix
files = ['2022-04-27_Cc1a6yWpUeQ_2825282726106954381.jpg',
         '2022-04-27_Cc1a6yWpUeQ_2825282726106985382.jpg',
         '2022-04-27_Cc1cPL3punY_2825288523641594007.jpg',
         '2022-04-27_Cc1cPL3punY_2825288523641621697.jpg',
         '2022-04-27_Cc1cPL3punY_2825288523650051140.jpg',
         '2022-04-27_Cc1cPL3punY_2825288523650168421.jpg',
         '2022-04-27_Cc1cPL3punY_2825288523708854776.jpg',
         '2022-04-27_Cc1cPL3punY_2825288523717189707.jpg',
         '2022-04-27_Cc1dN3Rp0es_2825292832374568690.jpg',
         '2022-04-27_Cc1dN3Rp0es_2825292832383025904.jpg',
         '2022-04-27_Cc1dN3Rp0es_2825292832383101420.jpg',
         '2022-04-27_Cc1dN3Rp0es_2825292832383164193.jpg',
         '2022-04-27_Cc1dN3Rp0es_2825292832399945744.jpg',
         '2022-04-27_Cc1dN3Rp0es_2825292832458472617.jpg']

# The list is sorted, so the first entry of each new prefix is that group's
# minimum: report a value whenever the prefix changes.
prefix_old = None
prefix = None
for name in files:
    date_part, post_id, tail = name.split('_', 2)
    prefix = date_part + '_' + post_id
    if prefix != prefix_old:
        value = tail.split('.')[0]
        print(f'Min value with prefix {prefix} is {value}')
    prefix_old = prefix
Output
Min value with prefix 2022-04-27_Cc1a6yWpUeQ is 2825282726106954381
Min value with prefix 2022-04-27_Cc1cPL3punY is 2825288523641594007
Min value with prefix 2022-04-27_Cc1dN3Rp0es is 2825292832374568690
It seems that the list of files you have is already sorted according to groups of prefixes, and then according to the numbers. If that's indeed the case, you just need to take the first path of each prefix group. This can be done easily with itertools.groupby:
def _group_key(file):
    """Group key: everything before the trailing '_<number>.jpg' part."""
    return file.rsplit('_', 1)[0]


# Files come pre-sorted, so the first member of each prefix group is its minimum.
for key, group in groupby(files, _group_key):
    print(key, "min:", next(group))
If you can't rely that they are internally ordered, find the minimum of each group according to the number:
def _prefix_of(file):
    """Group key: the path without its trailing '_<number>.jpg'."""
    return file.rsplit('_', 1)[0]


def _number_of(file):
    """The numeric suffix, used to pick the minimum within a group."""
    return int(file.rsplit('_', 1)[1].removesuffix(".jpg"))


# Groups are contiguous, but the minimum inside a group is found explicitly.
for key, group in groupby(files, _prefix_of):
    print(key, "min:", min(group, key=_number_of))
And if you can't even rely that it's ordered by groups, just sort the list beforehand:
def _stem(file):
    """Group key: the path without its trailing '_<number>.jpg'."""
    return file.rsplit('_', 1)[0]


# Sort by prefix first so groupby() sees each group as one contiguous run.
files.sort(key=_stem)
for key, group in groupby(files, _stem):
    print(key, "min:", min(group, key=lambda file: int(file.rsplit('_', 1)[1].removesuffix(".jpg"))))
If the same pattern is being followed, you can try to split each name by a separator (in your example '.' and '_'; documentation on how split works here), and then sort that list by sorting a list of lists, as explained here. This will need to be done for each "ID", as I will call each group's identifier, so we'll first need to get the unique IDs and then iterate over them. After that, we can proceed with the splitting. By doing this, you'll get a list of lists with the complete file name in position 0 and the number from the suffix in position 1.
# Unique group IDs (the middle '_'-separated field of each name).
prefix = list({entry.split('_')[1] for entry in names})
names_split = []
for pre in prefix:
    # Pair every matching name with its numeric suffix (text after the last
    # '_' with the extension stripped).
    group = [[name, name.split('.')[0].split('_')[2]]
             for name in names if name.split('_')[1] == pre]
    names_split.append([pre, group])
for i in range(len(prefix)):
    # Order each group numerically by that suffix.
    names_split[i][1] = sorted(names_split[i][1], key=lambda pair: int(pair[1]))
print(names_split)
The file you need should be names_split[x][0][0] where x identifies each ID.
PS: If you need to find a particular ID, you can use
searched_index = [value[0] for value in names_split].index(ID)
and then names_split[searched_index][0][0]]
Edit: Changed the splitted characters order and added docs on split method
Edit 2: Added prefix grouping
Your best bet is probably to use the pandas library, it is very good at dealing with tabular data.
import pandas as pd

file_name_list = []  # Fill in here
file_name_list = [file_name[:-4] for file_name in file_name_list]  # Get rid of .jpg
# Build the table from pre-split rows: unlike Series.str.split(expand=True),
# this keeps the three named columns even when the input list is empty.
file_name_table = pd.DataFrame((name.split('_') for name in file_name_list),
                               columns=['date', 'prefix', 'number'])
file_name_table['number'] = file_name_table['number'].astype('int64')
# Smallest number per (date, prefix) group.
smallest_file_names = file_name_table.groupby(by=['date', 'prefix'])['number'].min()
# Rebuild the complete file names: the original appended '.jpg' to the bare
# numbers, which both dropped the date/prefix part and failed on int + str.
smallest_file_names_list = ['%s_%s_%d.jpg' % (date, prefix, number)
                            for (date, prefix), number in smallest_file_names.items()]

Find file in directory with the max number given a set of different file names

Problem Description
I have a list of files ["FA_1","FA_2","FB_1","FB_2","FC_1","FC_2"]. That list has 3 different file names FA, FB and FC. For each of FA, FB and FC, I am trying to retrieve the one with the max number. The following script that I coded does that. But it's so complicated and ugly.
Is there a way to make it simpler?
A similar question was asked in Find file in directory with the highest number in the filename. But, they are only using the same file name.
#!/usr/bin/env python
import sys
import os
from collections import defaultdict


def load_newest_files():
    """Print and return, for each file-name prefix, the file with the highest
    numeric suffix (e.g. FA_2 out of FA_1/FA_2)."""
    # Bucket the files by the prefix before the underscore.
    list_of_files = defaultdict(list)
    files = ["FA_1", "FA_2", "FB_1", "FB_2", "FC_1", "FC_2"]
    for file in files:
        list_of_files[file.split("_")[0]].append(file)
    # Compare by the numeric suffix, not lexicographically: a plain max()
    # on strings would rank "FA_9" above "FA_10".
    new_list_of_files = [
        max(group, key=lambda name: int(name.split("_")[1]))
        for group in list_of_files.values()
    ]
    print(new_list_of_files)
    return new_list_of_files  # returned as well (was None) for easier reuse


def main():
    load_newest_files()


if __name__ == "__main__":
    main()
You can use itertools.groupby and create custom grouping and maximum functions for the key arguments. Example is shown below.
from itertools import groupby


def custom_group(item):
    """Grouping key: the part before the underscore (the base file name)."""
    prefix, _ = item.split("_")
    return prefix


def custom_max(item):
    """Comparison key: the numeric suffix after the underscore, as an int."""
    _, suffix = item.split("_")
    return int(suffix)


# For each contiguous prefix group, keep the file with the largest suffix.
for _, v in groupby(files, key=custom_group):
    new_list_of_files.append(max(v, key=custom_max))
print(new_list_of_files)
> ['FA_2', 'FB_2', 'FC_2']
Please make sure to read the caveats surrounding itertools.groupby regarding the sort order of your input data.
You can use the regex library and sort(). An example is shown below.
import re


def load_newest_files():
    """Return one file name per prefix: the one with the largest numeric suffix."""
    files = ["FA_1", "FA_2", "FB_1", "FB_2", "FC_1", "FC_2"]
    # Sort numerically by suffix so the dict() below keeps the largest value
    # per prefix; a plain files.sort() would put "FA_10" before "FA_2".
    files.sort(key=lambda name: int(name.rsplit("_", 1)[1]))
    concat_files = " ".join(files)
    # '+' matches multi-digit suffixes; the original '[0-9]' matched one digit.
    # dict() keeps the LAST pair per key, i.e. the largest suffix.
    a = dict(re.findall(r'(.*?)_([0-9]+)[ ]?', concat_files))
    new_list_of_files = ["%s_%s" % (i, j) for i, j in a.items()]
    return new_list_of_files


def main():
    newest_files = load_newest_files()
    print(newest_files)


if __name__ == "__main__":
    main()
Why do you think it is complicated and ugly?
You could use a list comprehension instead of these 3 lines:
new_list_of_files = []
# [...]
# (later, once list_of_files has been filled, collect each group's maximum)
for key,value in list_of_files.items():
new_list_of_files.append(max(value))
Like so:
new_list_of_files = [max(value) for value in list_of_files.values()]
Alternatively you can sort the list of files in reverse, then iterate over the list, adding only the first instance (which will be the highest) of each filename prefix to a new list, using a set to keep track of what filename prefixes have already been added.
files = ["FA_1", "FA_2", "FB_1", "FB_2", "FC_1", "FC_2"]
# Highest suffix first, so the first file seen per prefix is the one to keep.
files.sort(reverse=True)
already_seen = set()
new_filenames = []
for name in files:
    stem = name.split("_")[0]
    if stem in already_seen:
        continue  # a larger-suffix file for this prefix was already kept
    already_seen.add(stem)
    new_filenames.append(name)
print(new_filenames)
Output: ['FC_2', 'FB_2', 'FA_2']
You can get it down to 2 lines with a complicated and ugly list comprehension:
files = ["FA_1", "FA_2", "FB_1", "FB_2", "FC_1", "FC_2"]
already_seen = set()
# Walk the list in reverse (highest suffix first).  The walrus ':=' binds the
# prefix inside the filter, and the (file, already_seen.add(prefix))[0] tuple
# trick records the prefix as a side effect while still producing the file.
new_filenames = [(file, already_seen.add(prefix))[0] for file in files[::-1] if (prefix := file.split("_")[0]) not in already_seen]
print(new_filenames)

Python: Retrieving and renaming indexed files in a directory

I created a script to rename indexed files in a given directory
E.g., if the directory has the following files >> (bar001.txt, bar004.txt, bar007.txt, foo2.txt, foo5.txt, morty.dat, rick.py), my script should be able to rename only the indexed files and close gaps like this >> (bar001.txt, bar002.txt, bar003.txt, foo1.txt, foo2.txt...).
I put the full script below which doesn't work. The error is logical because no error messages are given but files in the directory remain unchanged.
#! python3
import os, re

working_dir = os.path.abspath('.')

# A regex pattern that matches files with prefix, numbering, then extension
pattern = re.compile(r'''
    ^(.*?)          # text before the file number
    (\d+)           # file index
    (\.([a-z]+))$   # file extension
    ''', re.VERBOSE)


def rename(array):
    """Return a NEW list in which the matched names are re-indexed 1..len(array).

    The whole number is rebuilt, zero-padded to its original width, so the
    2nd file 'bar004.txt' becomes 'bar002.txt'.  The old code replaced only
    the last digit of the number, which breaks for indexes >= 10.
    """
    result = []
    for i, item in enumerate(array):
        matchObj = pattern.search(item)
        index = str(i + 1).zfill(len(matchObj.group(2)))
        result.append(matchObj.group(1) + index + matchObj.group(3))
    return result


def _close_gaps(group):
    """Rename one prefix group on disk so its indexes run 1..n without gaps.

    The original aliased the old and new name lists (rename() mutated its
    argument in place), so every os.rename() call renamed a file to its own
    name — which is why the files appeared unchanged.
    """
    newNames = rename(group)
    for old, new in zip(group, newNames):
        if old != new:
            os.rename(os.path.join(working_dir, old),
                      os.path.join(working_dir, new))


def main():
    array = []
    for item in sorted(os.listdir('.')):
        matchObj = pattern.search(item)
        if not matchObj:
            continue
        if len(array) == 0 or matchObj.group(1) in array[0]:
            array.append(item)
        else:
            _close_gaps(array)
            array.clear()  # reset array for other files
            array.append(item)
    # The original never flushed the final group, so e.g. the last 'foo*'
    # files were never renamed.
    if array:
        _close_gaps(array)


if __name__ == '__main__':
    main()
To summarise, you want to find every file whose name ends with a number and
fill in the gaps for every set of files that have the same name, save for the number suffix. You don't want to create any new files; rather, the ones with the highest numbers should be used to fill the gaps.
Since this summary translates rather nicely into code, I will do so rather than working off of your code.
import re
import os
from os import path
folder = 'path/to/folder/'
# Matches '<prefix><digits><.ext>' at the end of a name,
# e.g. 'bar001.txt' -> groups ('bar', '001', '.txt')
pattern = re.compile(r'(.*?)(\d+)(\.[a-z]+)$')
summary = {}
for fn in os.listdir(folder):
m = pattern.match(fn)
if m and path.isfile(path.join(folder, fn)):
# Create a key if there isn't one, add the 'index' to the set
# The first item in the tuple - len(n) - tells us how the numbers should be zero-padded later on
name, n, ext = m.groups()
summary.setdefault((name, ext), (len(n), set()))[1].add(int(n))
# For each (prefix, extension) group: n is the digit width, `current` the set
# of index numbers actually present on disk.
for (name, ext), (n, current) in summary.items():
required = set(range(1, len(current)+1)) # You want these
gaps = required - current # You're missing these
superfluous = current - required # You don't need these, so they should be renamed to fill the gaps
# Both difference sets always have equal size, so zip() pairs them one-to-one.
assert(len(gaps) == len(superfluous)), 'Something has gone wrong'
for old, new in zip(superfluous, gaps):
# '{n:>0{pad}}' right-aligns and zero-pads the number to the group's
# original digit width (pad=n from the loop header; n=old/new is the value).
oldname = '{name}{n:>0{pad}}{ext}'.format(pad=n, name=name, n=old, ext=ext)
newname = '{name}{n:>0{pad}}{ext}'.format(pad=n, name=name, n=new, ext=ext)
print('{old} should be replaced with {new}'.format(old=oldname, new=newname))
That about covers it I think.

Filename string comparison in list search fails [Python]

I am trying to associate some filepaths from 2 list elements in Python. These files have a part of their name identical, while the extension and some extra words are different.
This means the extension of the file, extra characters and their location can differ. The files are in different folders, hence their filepath name differs. What is exactly equal: their Numbering index: 0033, 0061 for example.
Example code:
original_files = ['C:/0001.jpg',
                  'C:/0033.jpg',
                  'C:/0061.jpg',
                  'C:/0080.jpg',
                  'C:/0204.jpg',
                  'C:/0241.jpg']
related_files = ['C:/0001_PM.png',
                 'C:/0033_PMA.png',
                 'C:/0033_NM.png',
                 'C:/0061_PMLTS.png',
                 'C:/0080_PM.png',
                 'C:/0080_RS.png',
                 'C:/0204_PM.png']

# For every original file, list the related files whose name contains the
# original's base name (directory and extension stripped).
for idx, filename in enumerate(original_files):
    stem = filename.rsplit('/', 1)[1][:-4]  # e.g. 'C:/0241.jpg' -> '0241'
    related_filename = [cand for cand in related_files if stem in cand]
    print(related_filename)
At filename = 'C:/0241.jpg' it should return [], but instead it returns all the filenames from related_files.
For privacy reasons I didn't post the entire filepath, just the names of the files. In this example, the comparison works, but for the entire filepath it fails.
I suppose my comparison condition is not correct but I don't know how to write it.
Note: I am looking for something with as few code lines as possible to do this.
I suggest something along the line of
from collections import defaultdict

original_files = ['C:/0001.jpg',
                  'C:/0033.jpg',
                  'C:/0061.jpg',
                  'C:/0080.jpg',
                  'C:/0204.jpg',
                  'C:/0241.jpg']
related_files = ['C:/0001_PM.png',
                 'C:/0033_PMA.png',
                 'C:/0033_NM.png',
                 'C:/0061_PMLTS.png',
                 'C:/0080_PM.png',
                 'C:/0080_RS.png',
                 'C:/0204_PM.png']


def key1(filename):
    """Base name without directory and extension: 'C:/0033.jpg' -> '0033'."""
    basename = filename.rsplit('/', 1)[-1]
    return basename.rsplit('.', 1)[0]


def key2(filename):
    """Shared numeric index: key1() minus any '_SUFFIX' part."""
    return key1(filename).split('_', 1)[0]


# Bucket the related files by their numeric index once, ...
d = defaultdict(list)
for related_name in related_files:
    d[key2(related_name)].append(related_name)

# ... then each original file looks its bucket up in O(1).
for original_name in original_files:
    related = d.get(key1(original_name), [])
    print(original_name, '->', related)
In key1() and key2() you could alternately use os.path functions or pathlib.Path methods.
Here's a solution that returns only the matched relative_files.
import os, re

def get_index(filename):
    """Leading digits of the file's base name, or False if there are none."""
    m = re.match('([0-9]+)', os.path.split(filename)[1])
    return m.group(1) if m else False

# Materialize the indexes as a set: the original bound a lazy filter()
# iterator here, which is consumed by the first 'in' test, so every later
# membership test silently failed.
indexes = set(filter(bool, map(get_index, original_files)))
matched = [f for f in related_files if get_index(f) in indexes]
Make use of defaultdict.
import os, re
from collections import defaultdict

stragglers = []                     # names whose base name has no leading digits
grouped_files = defaultdict(list)   # numeric index -> all paths carrying it
file_index = re.compile('([0-9]+)')

for path in original_files + related_files:
    basename = os.path.split(path)[1]
    match = file_index.match(basename)
    if match:
        grouped_files[match.group(1)].append(path)
    else:
        stragglers.append(path)
You now have grouped_files, a dict (or dictionary-like object) of key-value pairs where the key is the regex matched part of the filename and the value is a list of matching filenames.
# Inspect the grouping: each key is the numeric index, each value the list of
# every path (original or related) that starts with it.
for x in grouped_files.items():
print(x)
# ('0204', ['C:/0204.jpg', 'C:/0204_PM.png'])
# ('0001', ['C:/0001.jpg', 'C:/0001_PM.png'])
# ('0033', ['C:/0033.jpg', 'C:/0033_PM.png'])
# ('0061', ['C:/0061.jpg', 'C:/0061_PM.png'])
# ('0241', ['C:/0241.jpg'])
# ('0080', ['C:/0080.jpg', 'C:/0080_PM.png'])
In stragglers you have any filenames that didn't match your regex.
# Names that did not match the digits regex end up here (none in this data).
print(stragglers)
# []
For python 3.X you can try to use this:
# Compare the full 4-digit index (characters 3..6 of 'C:/NNNN...'); the
# original's [3:6] covered only three digits, so indexes such as '0001'
# and '0002' would wrongly compare equal.
for origfiles in original_files:
    for relfiles in related_files:
        if origfiles[3:7] == relfiles[3:7]:
            print(origfiles)

removing iterated string from string array

I am writing a small script that lists the currently connected hard disks on my machine. I only need the disk identifier(disk0), not the partition ID(disk0s1, disk0s2, etc.)
How can I iterate through an array that contains diskID and partitionID and remove the partitionID entries? Here's what I'm trying so far:
import os
allDrives = os.listdir("/dev/")
parsedDrives = []
def parseAllDrives():
parsedDrives = []
matching = []
for driveName in allDrives:
if 'disk' in driveName:
parsedDrives.append(driveName)
else:
continue
for itemName in parsedDrives:
if len(parsedDrives) != 0:
if 'rdisk' in itemName:
parsedDrives.remove(itemName)
else:
continue
else:
continue
#### this is where the problem starts: #####
# iterate through possible partition identifiers
for i in range(5):
#create a string for the partitionID
systemPostfix = 's' + str(i)
matching.append(filter(lambda x: systemPostfix in x, parsedDrives))
for match in matching:
if match in parsedDrives:
parsedDrives.remove(match)
print("found a mactch and removed it")
print("matched: %s" % matching)
print(parsedDrives)
parseAllDrives()
That last bit is just the most recent thing I've tried. Definitely open to going a different route.
try beginning with
allDrives = os.listdir("/dev/")
# Keep only entries mentioning 'disk' (disk identifiers and partitions alike).
disks = [drive for drive in allDrives if ('disk' in drive)]
then, given that disk IDs are only 5 characters long,
# NOTE(review): [:6] keeps SIX characters, so 'disk0s1' becomes 'disk0s', not
# 'disk0' — if disk IDs really are 5 chars long as stated, this likely wants
# [:5]; confirm against real /dev entries.
short_disks = [disk[:6] for disk in disks]
# De-duplicate via a set and convert back to a list; order is not preserved.
unique_short_disks = list(set(short_disks))

Categories

Resources