Store smallest number from a list based on criteria - python

I have a list of file names as strings, and I want to store, in a new list, the file name with the minimum ending number among file names that share the same beginning denotation.
Example: For any file names in the list beginning with '2022-04-27_Cc1cPL3punY', I'd only want to store the file name with the minimum value of the number at the end. In this case, that would be the file name ending in 2825288523641594007, and so on for the other beginning denotations.
files = ['2022-04-27_Cc1a6yWpUeQ_2825282726106954381.jpg',
'2022-04-27_Cc1a6yWpUeQ_2825282726106985382.jpg',
'2022-04-27_Cc1cPL3punY_2825288523641594007.jpg',
'2022-04-27_Cc1cPL3punY_2825288523641621697.jpg',
'2022-04-27_Cc1cPL3punY_2825288523650051140.jpg',
'2022-04-27_Cc1cPL3punY_2825288523650168421.jpg',
'2022-04-27_Cc1cPL3punY_2825288523708854776.jpg',
'2022-04-27_Cc1cPL3punY_2825288523717189707.jpg',
'2022-04-27_Cc1dN3Rp0es_2825292832374568690.jpg',
'2022-04-27_Cc1dN3Rp0es_2825292832383025904.jpg',
'2022-04-27_Cc1dN3Rp0es_2825292832383101420.jpg',
'2022-04-27_Cc1dN3Rp0es_2825292832383164193.jpg',
'2022-04-27_Cc1dN3Rp0es_2825292832399945744.jpg',
'2022-04-27_Cc1dN3Rp0es_2825292832458472617.jpg']

Given that your files would already be sorted in ascending order from your OS/file-manager, you can just find the first one for each common prefix:
files = [...]  # the same list as in the question
prefix_old = None
for f in files:
    parts = f.split('_', 2)
    prefix = '_'.join(parts[:2])
    if prefix != prefix_old:
        value = parts[2].split('.')[0]
        print(f'Min value with prefix {prefix} is {value}')
    prefix_old = prefix
Output
Min value with prefix 2022-04-27_Cc1a6yWpUeQ is 2825282726106954381
Min value with prefix 2022-04-27_Cc1cPL3punY is 2825288523641594007
Min value with prefix 2022-04-27_Cc1dN3Rp0es is 2825292832374568690
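If the list weren't sorted, a single pass keeping a per-prefix minimum in a dict would also work; a minimal sketch under that assumption:
minima = {}
for f in files:
    prefix, number = f.rsplit('_', 1)
    num = int(number[:-4])  # strip '.jpg' and compare numerically
    if prefix not in minima or num < minima[prefix][0]:
        minima[prefix] = (num, f)
smallest = [name for _, name in minima.values()]
print(smallest)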

It seems that the list of files you have is already sorted into groups of prefixes, and within each group by number. If that's indeed the case, you just need to take the first path of each prefix group. This can be done easily with itertools.groupby:
from itertools import groupby

for key, group in groupby(files, lambda file: file.rsplit('_', 1)[0]):
    print(key, "min:", next(group))
If you can't rely on the groups being internally ordered, find the minimum of each group according to the number:
for key, group in groupby(files, lambda file: file.rsplit('_', 1)[0]):
    # str.removesuffix requires Python 3.9+
    print(key, "min:", min(group, key=lambda file: int(file.rsplit('_', 1)[1].removesuffix(".jpg"))))
And if you can't even rely on it being ordered by groups, just sort the list beforehand:
files.sort(key=lambda file: file.rsplit('_', 1)[0])
for key, group in groupby(files, lambda file: file.rsplit('_', 1)[0]):
    print(key, "min:", min(group, key=lambda file: int(file.rsplit('_', 1)[1].removesuffix(".jpg"))))

If the same pattern is being followed, you can split each name by a separator (in your example '.' and '_'; see the documentation on how split works), and then sort the result by sorting a list of lists. This needs to be done for each "ID", as I'll call each group identifier, so we first get the unique IDs and then iterate over them. After that, we can proceed with the splitting. By doing this, you get a list of lists with the complete file name in position 0 and the number from the suffix in position 1:
names = files  # the list of file names from the question
prefix = list(set([pre.split('_')[1] for pre in names]))
names_split = []
for pre in prefix:
    names_split.append([pre, [[name, name.split('.')[0].split('_')[2]] for name in names if name.split('_')[1] == pre]])
for i in range(len(prefix)):
    names_split[i][1] = sorted(names_split[i][1], key=lambda x: int(x[1]))
print(names_split)
The file you need should be names_split[x][1][0][0], where x identifies each ID.
PS: If you need to find a particular ID, you can use
searched_index = [value[0] for value in names_split].index(ID)
and then names_split[searched_index][1][0][0]
Edit: Changed the order of the split characters and added docs on the split method
Edit 2: Added prefix grouping

Your best bet is probably to use the pandas library; it is very good at dealing with tabular data.
import pandas as pd
file_name_list = [] # Fill in here
stems = [file_name[:-4] for file_name in file_name_list] # Get rid of .jpg
file_name_table = pd.Series(stems).str.split("_", expand=True) # Split the strings
file_name_table.columns = ['date', 'prefix', 'number'] # Renaming for readability
file_name_table['number'] = file_name_table['number'].astype(int)
# Keep the row holding the smallest number of each (date, prefix) group
smallest = file_name_table.loc[file_name_table.groupby(['date', 'prefix'])['number'].idxmin()]
# Rebuild the full file names, putting the .jpg back
smallest_file_names_list = [f"{row.date}_{row.prefix}_{row.number}.jpg" for row in smallest.itertuples()]
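Applied to the files list from the question, this should give ['2022-04-27_Cc1a6yWpUeQ_2825282726106954381.jpg', '2022-04-27_Cc1cPL3punY_2825288523641594007.jpg', '2022-04-27_Cc1dN3Rp0es_2825292832374568690.jpg'].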

Related

How to sort a Python dictionary by a substring contained in the keys, according to the order set in a list?

I'm very new to Python and I'm stuck on a task. First I read a file containing a number of fasta sequences with their names into a dictionary, then managed to select only those I want, based on substrings included in the keys, which are defined in the list "flu_genes".
Now I'm trying to reorder the items in this dictionary based on the order of substrings defined in the list "flu_genes". I'm completely stuck; I found a way of reordering based on the order of keys in a list, BUT that is not my case, as the order is defined not by the keys but by a substring within the keys.
I should also add that in this case the substring is at the end, in the format "_GENE"; however, it could also sit in the middle of the string with the same format, perhaps "GENE", so I'd rather not rely on code that finds the substring at the end of the string.
I hope this is clear enough and thanks in advance for any help!
"full_genome.fasta"
>A/influenza/1/1_NA
atgcg
>A/influenza/1/1_NP
ctgat
>A/influenza/1/1_FluB
agcta
>A/influenza/1/1_HA
tgcat
>A/influenza/1/1_FluC
agagt
>A/influenza/1/1_M
tatag
consensus = {}
flu_genes = ['_HA', '_NP', '_NA', '_M']
with open("full_genome.fasta", 'r') as myseq:
    for line in myseq:
        line = line.rstrip()
        if line.startswith('>'):
            key = line[1:]
        else:
            if key in consensus:
                consensus[key] += line
            else:
                consensus[key] = line
flu_fas = {key: val for key, val in consensus.items() if any(ele in key for ele in flu_genes)}
print("Dictionary after removal of keys : " + str(flu_fas))
>>>Dictionary after removal of keys : {'>A/influenza/1/1_NA': 'atgcg', '>A/influenza/1/1_NP': 'ctgat', '>A/influenza/1/1_HA': 'tgcat', '>A/influenza/1/1_M': 'tatag'}
#reordering by keys order (not going to work!) as in: https://try2explore.com/questions/12586065
reordered_dict = {k: flu_fas[k] for k in flu_genes}
A dictionary was traditionally unsorted, but since Python 3.7 it is guaranteed to remember its insertion order, and you're not going to change anything later, so you can do what you're doing.
The problem is, of course, that you're not working with the actual keys. So let's just set up a list of the keys, and sort that according to your criteria. Then you can do the other thing you did, except using the actual keys.
flu_genes = ['_HA', '_NP', '_NA', '_M']

def get_gene_index(k):
    for index, gene in enumerate(flu_genes):
        if k.endswith(gene):
            return index
    raise ValueError('I thought you removed those already')

reordered_keys = sorted(flu_fas.keys(), key=get_gene_index)
reordered_dict = {k: flu_fas[k] for k in reordered_keys}
for k, v in reordered_dict.items():
    print(k, v)
A/influenza/1/1_HA tgcat
A/influenza/1/1_NP ctgat
A/influenza/1/1_NA atgcg
A/influenza/1/1_M tatag
Normally, I wouldn't do an n-squared sort, but I'm assuming the number of lines in the data file is much larger than the number of flu_genes, making the latter essentially a fixed constant.
This may or may not be the best data structure for your application, but I'll leave that to code review.
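If flu_genes ever grows large, a precomputed rank dict turns the key function into a constant-time lookup; a minimal sketch, assuming every remaining key ends in one of the listed suffixes:
gene_rank = {gene: i for i, gene in enumerate(flu_genes)}
reordered_keys = sorted(flu_fas, key=lambda k: gene_rank['_' + k.rsplit('_', 1)[-1]])
reordered_dict = {k: flu_fas[k] for k in reordered_keys}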
It's because you are trying to reorder it with non-existent dictionary keys. Your keys are
['>A/influenza/1/1_NA', '>A/influenza/1/1_NP', '>A/influenza/1/1_HA', '>A/influenza/1/1_M']
which don't match the list
['_HA', '_NP', '_NA', '_M']
You first need to transform them to make them match, and since we know the pattern (it's at the end of the string, starting with an underscore), we can split at underscores and take the last part.
consensus = {}
flu_genes = ['_HA', '_NP', '_NA', '_M']
with open("full_genome.fasta", 'r') as myseq:
    for line in myseq:
        line = line.rstrip()
        if line.startswith('>'):
            sequence = line
            gene = line.split('_')[-1]
            key = f"_{gene}"
        else:
            consensus[key] = {
                'sequence': sequence,
                'data': line
            }
flu_fas = {key: val for key, val in consensus.items() if any(ele in key for ele in flu_genes)}
print("Dictionary after removal of keys : " + str(flu_fas))
reordered_dict = {k: flu_fas[k] for k in flu_genes}

How to delete paths that contain the same names?

I have a list of paths that look like this (see below). As you can see, the file naming is inconsistent, but I would like to keep only one file per person. I already have a function that removes duplicates if they have the exact same file name but different file extensions; however, this inconsistent-naming case seems trickier.
The list of files looks something like this (but assume there are thousands of paths and words that aren't part of the full names e.g. cv, curriculum vitae etc.):
all_files = ['cv_bob_johnson.pdf',
             'bob_johnson_cv.pdf',
             'curriculum_vitae_bob_johnson.pdf',
             'cv_lara_kroft_cv.pdf',
             'cv_lara_kroft.pdf']
Desired output:
unique_files = ['cv_bob_johnson.pdf', 'cv_lara_kroft.pdf']
Given that the names mostly follow a written pattern (e.g. first name precedes last name), I assume there has to be a way of getting a unique set of the paths when the names are repeated?
If you want to keep your algorithm relatively simple (i.e., not using ML etc), you'll need to have some idea about the typical substrings that you want to remove. Let's make a list of such substrings, for example:
remove = ['cv_', '_cv', 'curriculum_vitae_', '_curriculum_vitae']
Then you can process your list of files this way:
import re
all_files = ['cv_bob_johnson.pdf', 'bob_johnson_cv.pdf', 'curriculum_vitae_bob_johnson.pdf', 'cv_lara_kroft_cv.pdf', 'cv_lara_kroft.pdf']
remove = ['cv_', '_cv', 'curriculum_vitae_', '_curriculum_vitae']
unique = []
for file in all_files:
    # strip a suffix, if any:
    try:
        name, suffix = file.rsplit('.', 1)
    except ValueError:
        name, suffix = file, None
    # remove the excess parts:
    for rem in remove:
        name = re.sub(rem, '', name)
    # append the result to the list:
    unique.append(f'{name}.{suffix}' if suffix else name)

# remove duplicates:
unique = list(set(unique))
print(unique)
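A more compact variant with the same result, joining the substrings into one alternation pattern (my addition, same remove list assumed):
import re
pattern = re.compile('|'.join(remove))
unique = sorted({pattern.sub('', f) for f in all_files})
print(unique)  # ['bob_johnson.pdf', 'lara_kroft.pdf']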

Update values in string based with values from pandas data frame

Given the following Data Frame:
import pandas as pd

df = pd.DataFrame({'term': ['analys', 'applic', 'architectur', 'assess', 'item', 'methodolog', 'research', 'rs', 'studi', 'suggest', 'test', 'tool', 'viewer', 'work'],
                   'newValue': [0.810419, 0.631963, 0.687348, 0.810554, 0.725366, 0.742715, 0.799152, 0.599030, 0.652112, 0.683228, 0.711307, 0.625563, 0.604190, 0.724763]})
df = df.set_index('term')
print(df)
newValue
term
analys 0.810419
applic 0.631963
architectur 0.687348
assess 0.810554
item 0.725366
methodolog 0.742715
research 0.799152
rs 0.599030
studi 0.652112
suggest 0.683228
test 0.711307
tool 0.625563
viewer 0.604190
work 0.724763
I am trying to update the values behind each "^" in this string with the values from the Data Frame.
(analysi analys^0.8046919107437134 studi^0.6034331321716309 framework methodolog^0.7360332608222961 architectur^0.6806665658950806)^0.0625 (recommend suggest^0.6603200435638428 rs^0.5923488140106201)^0.125 (system tool^0.6207902431488037 applic^0.610009491443634)^0.25 (evalu assess^0.7828741073608398 test^0.6444937586784363)^0.5
Additionally, this should be done with regard to the corresponding word such that I get this:
(analysi analys^0.810419 studi^0.652112 framework methodolog^0.742715 architectur^0.687348)^0.0625 (recommend suggest^0.683228 rs^0.599030)^0.125 (system tool^0.625563 applic^0.631963)^0.25 (evalu assess^0.810554 test^0.711307)^0.5
Thanks in advance for helping!
The best way I could come up with does this in multiple stages.
First, take the old string and extract all the values that you want to replace; that can be done with a regular expression:
import re

old_string = "(analysi analys^0.8046919107437134 studi^0.6034331321716309 framework methodolog^0.7360332608222961 architectur^0.6806665658950806)^0.0625 (recommend suggest^0.6603200435638428 rs^0.5923488140106201)^0.125 (system tool^0.6207902431488037 applic^0.610009491443634)^0.25 (evalu assess^0.7828741073608398 test^0.6444937586784363)^0.5"
pattern = re.compile(r"(\w+\^(0|[1-9]\d*)(\.\d+)?)")
# pattern.findall(old_string) returns a list of tuples,
# so we need to keep just the outer capturing group for each match.
matches = [m[0] for m in pattern.findall(old_string)]
print("Matches:", matches)
In the next part, we build two dictionaries. The first maps the prefix (the word part before ^) of each value to replace to the whole value. We use it to create the second dictionary, from the values to replace to the new values (from the dataframe).
prefix_dict = {}
for m in matches:
    pre, post = m.split('^')
    prefix_dict[pre] = m
print("Prefixes:", prefix_dict)

matches_dict = {}
for i, row in df.iterrows():  # df is the dataframe from the question
    if i in prefix_dict:
        old_val = prefix_dict[i]
        new_val = "%s^%s" % (i, row.newValue)
        matches_dict[old_val] = new_val
print("Matches dict:", matches_dict)
With that done, we can loop through the items in the old value > new value dictionary and replace all the old values in the input string.
new_string = old_string
for key, val in matches_dict.items():
    new_string = new_string.replace(key, val)
print("New string:", new_string)

How to compare two files from filelist using regex?

The files are read from a folder with os.listdir. I entered a regex for one kind of file, r'^[1-9\w]{2}_[1-9\w]{4}[1][7][\d\w]+\.[\d\w]+', and a similar one for the other, r'^[1-9\w]{2}_[1-9\w]{4}[1][8]+'. The condition of the comparison is that when the first seven symbols match, then os.remove(os.path.join(dir_name, each)). Example of a few names: bh_txbh171002.xml, bh_txbh180101.xml, ce_txce170101.xml...
As I understand it, we can't use match because there is no string to compare against; it returns None and only compares a file against the regex. I am thinking about a condition like if folder.itself(file) and file.startswith("......."):, but I can't figure out how I could point at the first seven symbols of the file names that should be compared.
Honestly, I placed a worse version of my code in that request, and since that time I've learnt a little bit more: the link - press to check it up
Regex is the wrong tool here. I do not have your files, so I create randomized demo data:
import random
import string

random.seed(42)  # make random repeatable

def generateFileNames(amount):
    """Generate 2*amount of names XX_XXXX with X in [a-z0-9], with duplicates in it"""
    def rndName():
        """generate one random name XX_XXXX with X in [a-z0-9]"""
        characters = string.ascii_lowercase + string.digits
        return random.choices(characters, k=2) + ['_'] + random.choices(characters, k=4)
    for _ in range(amount):  # create 2*amount names, some duplicates
        name = rndName()
        yield ''.join(name)  # yield name once
        if random.randint(1, 10) > 3:  # more likely to get same names twice
            yield ''.join(name)  # same name twice
        else:
            yield ''.join(rndName())  # different 2nd name

def generateNumberParts(amount):
    """Generate 2*amount of 6-digit strings, some with 17/18 as starting numbers"""
    def rndNums(nr):
        """Generate nr digits as string list"""
        return random.choices(string.digits, k=nr)
    for _ in range(amount):
        choi = rndNums(4)
        # I am yielding 18 first to demonstrate that sorting later works
        yield ''.join(['18'] + choi)  # 18xxxx numbers
        if random.randint(1, 10) > 5:
            yield ''.join(['17'] + choi)  # 17xxxx
        else:
            yield ''.join(rndNums(6))  # make it something other

# half the amount of files generated
m = 10

# generate filenames
filenames = [''.join(x) + '.xml' for x in zip(generateFileNames(m),
                                              generateNumberParts(m))]
Now I have my names as a list and can start to find out which are dupes with newer timestamps:
# make a dict out of your filenames, use the first 7 characters as key,
# with a list of all files starting with this key as values:
fileDict = {}
for name in filenames:
    fileDict.setdefault(name[0:7], []).append(name)  # create key=[] or/and append name

for k, v in fileDict.items():
    print(k, " ", v)

# get files to delete (all but the highest number of the value list if multiple in it)
filesToDelete = []
for k, v in fileDict.items():
    if len(v) == 1:  # nothing to do, it's only 1 file
        continue
    print(v, " to ", end="")  # debugging output
    v.sort(key=lambda x: int(x[7:9]))  # sort by a lambda that turns the 17/18 part into an int
    print(v)  # debugging output
    filesToDelete.extend(v[:-1])  # add all but the last file to the delete list

print("")
print(filesToDelete)
Output:
# the created filenames in your dict by "key [values]"
xa_ji0y ['xa_ji0y188040.xml', 'xa_ji0y501652.xml']
v3_a3zm ['v3_a3zm181930.xml']
mm_jbqe ['mm_jbqe171930.xml']
ck_w5ng ['ck_w5ng180679.xml', 'ck_w5ng348136.xml']
zy_cwti ['zy_cwti184296.xml', 'zy_cwti174296.xml']
41_iblj ['41_iblj182983.xml', '41_iblj172983.xml']
5x_ff0t ['5x_ff0t187453.xml']
sd_bdw2 ['sd_bdw2177453.xml']
vn_vqjt ['vn_vqjt189618.xml', 'vn_vqjt179618.xml']
ep_q85j ['ep_q85j185198.xml', 'ep_q85j175198.xml']
vf_1t2t ['vf_1t2t180309.xml', 'vf_1t2t089040.xml']
11_ertj ['11_ertj188425.xml', '11_ertj363842.xml']
# sorting the names by its integer at 8/9 position of name
['xa_ji0y188040.xml','xa_ji0y501652.xml'] to ['xa_ji0y188040.xml','xa_ji0y501652.xml']
['ck_w5ng180679.xml','ck_w5ng348136.xml'] to ['ck_w5ng180679.xml','ck_w5ng348136.xml']
['zy_cwti184296.xml','zy_cwti174296.xml'] to ['zy_cwti174296.xml','zy_cwti184296.xml']
['41_iblj182983.xml','41_iblj172983.xml'] to ['41_iblj172983.xml','41_iblj182983.xml']
['vn_vqjt189618.xml','vn_vqjt179618.xml'] to ['vn_vqjt179618.xml','vn_vqjt189618.xml']
['ep_q85j185198.xml','ep_q85j175198.xml'] to ['ep_q85j175198.xml','ep_q85j185198.xml']
['vf_1t2t180309.xml','vf_1t2t089040.xml'] to ['vf_1t2t089040.xml','vf_1t2t180309.xml']
['11_ertj188425.xml','11_ertj363842.xml'] to ['11_ertj188425.xml','11_ertj363842.xml']
# list of files to delete
['xa_ji0y188040.xml', 'ck_w5ng180679.xml', 'zy_cwti174296.xml', '41_iblj172983.xml',
'vn_vqjt179618.xml', 'ep_q85j175198.xml', 'vf_1t2t089040.xml', '11_ertj188425.xml']
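To actually delete them, assuming dir_name points at the folder the names came from:
import os
for name in filesToDelete:
    os.remove(os.path.join(dir_name, name))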
I can't understand what's wrong with my code. There I defined the list from a certain folder, so that I could work on the strings in each file name, right? Then I applied the conditions for filtering and the choice of the file to delete.
import os
dir_name = "/Python/Test_folder/Schems"
filenames = os.listdir(dir_name)
for names in filenames:
    filenames.setdefault(names[0:7], []).append(names)  # create key=[] or/and append names
for k, v in filenames.items():
    filesToDelete = []  # there's a syntax mistake. But I can't get it - is the list there or not?
for k, v in filenames.items():
    if len(v) == 1:
        continue
    v.sort(key=lambda x: int(x[7:9]))
    filesToDelete.extend(v[:-1])
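For reference, a minimal fix keeping the same approach: the names have to be collected into a dict first (setdefault exists on dicts, not on the list os.listdir returns), and filesToDelete has to be created once, before the loop:
import os

dir_name = "/Python/Test_folder/Schems"
fileDict = {}
for name in os.listdir(dir_name):
    fileDict.setdefault(name[0:7], []).append(name)  # group by the first 7 characters

filesToDelete = []  # create the list once, outside the loop
for k, v in fileDict.items():
    if len(v) > 1:
        v.sort(key=lambda x: int(x[7:9]))
        filesToDelete.extend(v[:-1])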

How to group similarly named elements in a list into tuples in python?

I have read the names of all of the files in a directory in a python list like this:
files = ['ch1.txt', 'ch2.txt', 'ch3_1.txt', 'ch4_2.txt', 'ch3_2.txt', 'ch4_1.txt']
What I want to do is group similar files as tuples in the list. The above example should look like
files_grouped = ['ch1.txt', 'ch2.txt', ('ch3_1.txt', 'ch3_2.txt'), ('ch4_1.txt', 'ch4_2.txt')]
One way I have tried is to separate the elements I need to group from the list like so
groups = tuple([file for file in files if '_' in file])
single = [file for file in files if not '_' in file]
And I would create a new list appending both. But how do I create the groups as a list of tuples for ch3 and ch4, like [('ch3_1.txt', 'ch3_2.txt'), ('ch4_1.txt', 'ch4_2.txt')], instead of one big tuple?
None of the answers gives you a generic solution that works for any kind of file names. I think you should be using regex if you want to account for that.
import itertools
import re

sorted_files = sorted(files, key=lambda x: re.findall(r'(\d+)_(\d+)', x))
out = [list(g) for _, g in itertools.groupby(sorted_files,
                                             key=lambda x: re.search(r'\d+', x).group())]
print(out)
[['ch1.txt'],
['ch2.txt'],
['ch3_1.txt', 'ch3_2.txt'],
['ch4_1.txt', 'ch4_2.txt']]
Note that this should work for any naming format, not just chX_X.
If you want your output in the exact format described, you could do a little extra post-processing:
out = [o[0] if len(o) == 1 else tuple(o) for o in out]
print(out)
['ch1.txt', 'ch2.txt', ('ch3_1.txt', 'ch3_2.txt'), ('ch4_1.txt', 'ch4_2.txt')]
Regex Details
The first regex sorts by chapter section and subsection.
( # first group
\d+ # 1 or more digits
)
_ # literal underscore
( # second group
\d+ # 1 or more digits
)
The second regex groups by chapter sections only - all chapters with the same section are grouped together.
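One caveat worth noting (my addition): the sort key compares the captured digits as strings, so 'ch10' would sort before 'ch2'. Converting to ints gives a numeric sort:
sorted_files = sorted(files, key=lambda x: [int(n) for n in re.findall(r'\d+', x)])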
You could use a dictionary (or, for simpler initialising, a collections.defaultdict):
from collections import defaultdict
from pprint import pprint

files = ['ch1.txt', 'ch2.txt', 'ch3_1.txt', 'ch4_2.txt', 'ch3_2.txt', 'ch4_1.txt']
grouped = defaultdict(list)  # creates an empty list for non-existent entries
for f in files:
    key = f[:3]
    grouped[key].append(f)
pprint(grouped)
Result:
defaultdict(<class 'list'>,
{'ch1': ['ch1.txt'],
'ch2': ['ch2.txt'],
'ch3': ['ch3_1.txt', 'ch3_2.txt'],
'ch4': ['ch4_2.txt', 'ch4_1.txt']})
If you want your list of tuples, you can do:
grouped = [tuple(l) for l in grouped.values()]
Which is
[('ch1.txt',),
('ch2.txt',),
('ch3_1.txt', 'ch3_2.txt'),
('ch4_2.txt', 'ch4_1.txt')]
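Note that the fixed key = f[:3] assumes three-character prefixes; splitting on the underscore and extension is more general (a sketch):
key = f.rsplit('.', 1)[0].split('_')[0]  # 'ch3_1.txt' -> 'ch3', 'ch12_4.txt' -> 'ch12'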
Maybe you can sort the list of file names, and then use groupby() to do this, e.g.:
from itertools import groupby

files = ['ch1.txt', 'ch2.txt', 'ch3_1.txt', 'ch4_2.txt', 'ch3_2.txt', 'ch4_1.txt']
print([tuple(g) for k, g in groupby(sorted(files), key=lambda x: x[:-4].split("_")[0])])
Result:
[('ch1.txt',), ('ch2.txt',), ('ch3_1.txt', 'ch3_2.txt'), ('ch4_1.txt', 'ch4_2.txt')]
Hope this helps.
