How to delete paths that contain the same names? - python

I have a list of paths that look like this (see below). As you can see, the file naming is inconsistent, but I would like to keep only one file per person. I already have a function that removes duplicates if they have the exact same file name but different file extensions; however, this inconsistent-naming case seems trickier.
The list of files looks something like this (but assume there are thousands of paths and words that aren't part of the full names e.g. cv, curriculum vitae etc.):
all_files =
['cv_bob_johnson.pdf',
'bob_johnson_cv.pdf',
'curriculum_vitae_bob_johnson.pdf',
'cv_lara_kroft_cv.pdf',
'cv_lara_kroft.pdf' ]
Desired output:
unique_files = ['cv_bob_johnson.pdf', 'cv_lara_kroft.pdf']
Given that the names mostly follow a pattern (e.g. first name precedes last name), I assume there has to be a way of getting a unique set of the paths when the names are repeated?

If you want to keep your algorithm relatively simple (i.e., not using ML etc), you'll need to have some idea about the typical substrings that you want to remove. Let's make a list of such substrings, for example:
remove = ['cv_', '_cv', 'curriculum_vitae_', '_curriculum_vitae']
Then you can process your list of files this way:
import re
all_files = ['cv_bob_johnson.pdf', 'bob_johnson_cv.pdf', 'curriculum_vitae_bob_johnson.pdf', 'cv_lara_kroft_cv.pdf', 'cv_lara_kroft.pdf']
remove = ['cv_', '_cv', 'curriculum_vitae_', '_curriculum_vitae']
unique = []
for file in all_files:
    # strip the extension, if any:
    try:
        name, suffix = file.rsplit('.', 1)
    except ValueError:  # no '.' in the name
        name, suffix = file, None
    # remove the excess parts:
    for rem in remove:
        name = re.sub(rem, '', name)
    # append the result to the list:
    unique.append(f'{name}.{suffix}' if suffix else name)
# remove duplicates:
unique = list(set(unique))
print(unique)
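If you'd rather keep one of the original file names per person (instead of the stripped-down name), one variant, sketched under the assumption that the shortest original name is the one worth keeping, maps each stripped name to the shortest path seen. It uses plain str.replace, since the remove entries are literal substrings, not regex patterns:

```python
all_files = ['cv_bob_johnson.pdf', 'bob_johnson_cv.pdf',
             'curriculum_vitae_bob_johnson.pdf',
             'cv_lara_kroft_cv.pdf', 'cv_lara_kroft.pdf']
remove = ['cv_', '_cv', 'curriculum_vitae_', '_curriculum_vitae']

best = {}
for file in all_files:
    name = file.rsplit('.', 1)[0]
    for rem in remove:
        name = name.replace(rem, '')  # strip every occurrence of the marker
    # keep the shortest original filename seen so far for this person
    if name not in best or len(file) < len(best[name]):
        best[name] = file

unique_files = list(best.values())
print(unique_files)  # ['cv_bob_johnson.pdf', 'cv_lara_kroft.pdf']
```

"Shortest name wins" is just one heuristic; any tie-breaker (first seen, newest file, etc.) fits the same dict-based grouping.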

Related

How can I remove files that have an unknown number in them?

I have some code that writes out files with names like this:
body00123.txt
body00124.txt
body00125.txt
body-1-2126.txt
body-1-2127.txt
body-1-2128.txt
body-3-3129.txt
body-3-3130.txt
body-3-3131.txt
Such that the first two numbers in the file can be 'negative', but the last 3 numbers are not.
I have a list such as this:
123
127
129
And I want to remove all the files that don't end with one of these numbers. An example of the desired leftover files would be like this:
body00123.txt
body-1-2127.txt
body-3-3129.txt
My code is running in python, so I have tried:
if i not in myList:
    os.system('rm body*' + str(i) + '.txt')
And this resulted in every file being deleted.
The solution should be such that any .txt file that ends with a number contained by myList should be kept. All other .txt files should be deleted. This is why I'm trying to use a wildcard in my attempt.
Because all the files end in .txt, you can cut that part out and use the str.endswith() function. str.endswith() accepts a tuple of strings, and sees if your string ends in any of them. As a result, you can do something like this:
all_file_list = [...]
keep_list = [...]
keep_tup = tuple(str(n) for n in keep_list)  # endswith() needs strings
files_to_remove = []
for name in all_file_list:
    if not name[:-4].endswith(keep_tup):  # strip '.txt', then check the number
        files_to_remove.append(name)
        # or os.remove(name)
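As a runnable sketch of that idea using the sample names from the question (keeping 123, 127 and 129):

```python
all_file_list = ['body00123.txt', 'body00124.txt', 'body00125.txt',
                 'body-1-2126.txt', 'body-1-2127.txt', 'body-1-2128.txt',
                 'body-3-3129.txt', 'body-3-3130.txt', 'body-3-3131.txt']
keep_tup = tuple(str(n) for n in [123, 127, 129])

# keep a file iff its stem (name minus '.txt') ends with a kept number
kept = [name for name in all_file_list if name[:-4].endswith(keep_tup)]
to_remove = [name for name in all_file_list if not name[:-4].endswith(keep_tup)]
print(kept)  # ['body00123.txt', 'body-1-2127.txt', 'body-3-3129.txt']
```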

Store smallest number from a list based on criteria

I have a list of file names as strings where I want to store, in a list, the file name with the minimum ending number relative to file names that have the same beginning denotation.
Example: For any file names in the list beginning with '2022-04-27_Cc1cPL3punY', I'd only want to store the file name with the minimum value of the number at the end. In this case, it would be the file name with 2825288523641594007, and so on for other beginning denotations.
files = ['2022-04-27_Cc1a6yWpUeQ_2825282726106954381.jpg',
'2022-04-27_Cc1a6yWpUeQ_2825282726106985382.jpg',
'2022-04-27_Cc1cPL3punY_2825288523641594007.jpg',
'2022-04-27_Cc1cPL3punY_2825288523641621697.jpg',
'2022-04-27_Cc1cPL3punY_2825288523650051140.jpg',
'2022-04-27_Cc1cPL3punY_2825288523650168421.jpg',
'2022-04-27_Cc1cPL3punY_2825288523708854776.jpg',
'2022-04-27_Cc1cPL3punY_2825288523717189707.jpg',
'2022-04-27_Cc1dN3Rp0es_2825292832374568690.jpg',
'2022-04-27_Cc1dN3Rp0es_2825292832383025904.jpg',
'2022-04-27_Cc1dN3Rp0es_2825292832383101420.jpg',
'2022-04-27_Cc1dN3Rp0es_2825292832383164193.jpg',
'2022-04-27_Cc1dN3Rp0es_2825292832399945744.jpg',
'2022-04-27_Cc1dN3Rp0es_2825292832458472617.jpg']
Given that your files will already be sorted in ascending order by your OS/file manager, you can just find the first one in each common-prefix group:
files = ['2022-04-27_Cc1a6yWpUeQ_2825282726106954381.jpg',
'2022-04-27_Cc1a6yWpUeQ_2825282726106985382.jpg',
'2022-04-27_Cc1cPL3punY_2825288523641594007.jpg',
'2022-04-27_Cc1cPL3punY_2825288523641621697.jpg',
'2022-04-27_Cc1cPL3punY_2825288523650051140.jpg',
'2022-04-27_Cc1cPL3punY_2825288523650168421.jpg',
'2022-04-27_Cc1cPL3punY_2825288523708854776.jpg',
'2022-04-27_Cc1cPL3punY_2825288523717189707.jpg',
'2022-04-27_Cc1dN3Rp0es_2825292832374568690.jpg',
'2022-04-27_Cc1dN3Rp0es_2825292832383025904.jpg',
'2022-04-27_Cc1dN3Rp0es_2825292832383101420.jpg',
'2022-04-27_Cc1dN3Rp0es_2825292832383164193.jpg',
'2022-04-27_Cc1dN3Rp0es_2825292832399945744.jpg',
'2022-04-27_Cc1dN3Rp0es_2825292832458472617.jpg']
prefix_old = None
for f in files:
    parts = f.split('_', 2)
    prefix = '_'.join(parts[:2])
    if prefix != prefix_old:
        value = parts[2].split('.')[0]
        print(f'Min value with prefix {prefix} is {value}')
        prefix_old = prefix
Output
Min value with prefix 2022-04-27_Cc1a6yWpUeQ is 2825282726106954381
Min value with prefix 2022-04-27_Cc1cPL3punY is 2825288523641594007
Min value with prefix 2022-04-27_Cc1dN3Rp0es is 2825292832374568690
It seems that the list of files you have is already sorted according to groups of prefixes, and then according to the numbers. If that's indeed the case, you just need to take the first path of each prefix group. This can be done easily with itertools.groupby:
from itertools import groupby

for key, group in groupby(files, lambda file: file.rsplit('_', 1)[0]):
    print(key, "min:", next(group))
If you can't rely that they are internally ordered, find the minimum of each group according to the number:
for key, group in groupby(files, lambda file: file.rsplit('_', 1)[0]):
    print(key, "min:", min(group, key=lambda file: int(file.rsplit('_', 1)[1].removesuffix(".jpg"))))
And if you can't even rely that it's ordered by groups, just sort the list beforehand:
files.sort(key=lambda file: file.rsplit('_', 1)[0])
for key, group in groupby(files, lambda file: file.rsplit('_', 1)[0]):
    print(key, "min:", min(group, key=lambda file: int(file.rsplit('_', 1)[1].removesuffix(".jpg"))))
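Putting the last variant together as a runnable sketch on a subset of the sample list (removesuffix needs Python 3.9+):

```python
from itertools import groupby

files = ['2022-04-27_Cc1a6yWpUeQ_2825282726106954381.jpg',
         '2022-04-27_Cc1a6yWpUeQ_2825282726106985382.jpg',
         '2022-04-27_Cc1cPL3punY_2825288523641594007.jpg',
         '2022-04-27_Cc1cPL3punY_2825288523641621697.jpg']  # subset of the question's list

def prefix(file):
    return file.rsplit('_', 1)[0]

def number(file):
    return int(file.rsplit('_', 1)[1].removesuffix('.jpg'))

files.sort(key=prefix)  # harmless if the list is already grouped
mins = [min(group, key=number) for _, group in groupby(files, key=prefix)]
print(mins)
# ['2022-04-27_Cc1a6yWpUeQ_2825282726106954381.jpg',
#  '2022-04-27_Cc1cPL3punY_2825288523641594007.jpg']
```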
If the same pattern is followed, you can split each name by a separator (in your example, '.' and '_'; see the documentation on how split works), and then sort the result as a list of lists, as explained here. This needs to be done per "ID", as I will call each group's identifier, so we first get the unique IDs and then iterate over them. After the splitting, you get a list of lists with the complete file name in position 0 and the number from the suffix in position 1:
names = files  # the question's list
prefix = list(set([pre.split('_')[1] for pre in names]))
names_split = []
for pre in prefix:
    names_split.append([pre, [[name, name.split('.')[0].split('_')[2]] for name in names if name.split('_')[1] == pre]])
for i in range(len(prefix)):
    names_split[i][1] = sorted(names_split[i][1], key=lambda x: int(x[1]))
print(names_split)
The file you need is names_split[x][1][0][0], where x identifies each ID.
PS: If you need to find a particular ID, you can use
searched_index = [value[0] for value in names_split].index(ID)
and then names_split[searched_index][1][0][0]
Edit: Changed the order of the split characters and added docs on the split method
Edit 2: Added prefix grouping
Your best bet is probably to use the pandas library, it is very good at dealing with tabular data.
import pandas as pd

file_name_list = []  # Fill in here
file_name_list = [file_name[:-4] for file_name in file_name_list]  # Get rid of .jpg
file_name_series = pd.Series(file_name_list)  # Put the data in pandas
file_name_table = file_name_series.str.split("_", expand=True)  # Split the strings
file_name_table.columns = ['date', 'prefix', 'number']  # Renaming for readability
file_name_table['number'] = file_name_table['number'].astype(int)
min_rows = file_name_table.groupby(by=['date', 'prefix'])['number'].idxmin()  # row index of the smallest number per group
smallest_file_names_list = file_name_series[min_rows].to_list()
smallest_file_names_list = [file_name + '.jpg' for file_name in smallest_file_names_list]  # Putting the .jpg back

Extracting numbers from a filename string in python

I have a number of html files in a directory. I am trying to store the filenames in a list so that I can use it later to compare with another list.
Eg: Prod224_0055_00007464_20170930.html is one of the filenames. From the filename, I want to extract '00007464' and store this value in a list and repeat the same for all the other files in the directory. How do I go about doing this? I am new to Python and any help would be greatly appreciated!
Please let me know if you need more information to answer the question.
Split the filename on underscores and select the third element (index 2).
>>> 'Prod224_0055_00007464_20170930.html'.split('_')[2]
'00007464'
In context that might look like this:
nums = [f.split('_')[2] for f in os.listdir(dir) if f.endswith('.html')]
You may try this (assuming you are in the folder with the files):
import os

num_list = []
r, d, files = next(os.walk('.'))  # top-level files only
for f in files:
    parts = f.split('_')  # now `parts` contains ['Prod224', '0055', '00007464', '20170930.html']
    print(parts[2])  # this outputs '00007464'
    num_list.append(parts[2])
Assuming you have a certain pattern for your files, you can use a regex:
>>> import re
>>> s = 'Prod224_0055_00007464_20170930.html'
>>> desired_number = re.findall(r"\d+", s)[2]
>>> desired_number
'00007464'
Using a regex will help you getting not only that specific number you want, but also other numbers in the file name.
This will work if the name of your files follow the pattern "[some text][number]_[number]_[desired_number]_[a date].html". After getting the number, I think it will be very simple to use the append method to add that number to any list you want.
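As a quick sketch, the same regex applied over several names at once (the second filename here is made up for illustration):

```python
import re

filenames = ['Prod224_0055_00007464_20170930.html',
             'Prod231_0061_00008112_20171015.html']  # second name is illustrative
# third run of digits in each name is the wanted number
num_list = [re.findall(r'\d+', name)[2] for name in filenames]
print(num_list)  # ['00007464', '00008112']
```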

Python: Extract all characters between a two groups of characters in a string

Say I have a list of strings corresponding to a file name:
files = ['variable_timestep_model321_experiment123.csv',
'variable_timestep_model2_experiment21.csv',
'variable_timestep_model4321_experimentname1234.csv',
'variable_timestep_model0_experiment0.csv']
Where each file name has the format:
'variable_timestep_modelname_experimentname.csv'
In this filename, variable and timestep are not changing; however, modelname and experimentname are. I want to extract the model name, so the expected output would be:
['model321',
'model2',
'model4321',
'model0']
Because I know variable_timestemp_ stays the same, is there an easy way to extract X amount of characters between variable_timestep_ on the left and _ on the right?
One way:
models = [name.split('_')[2] for name in files]
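Another way, if you'd rather anchor on the known 'variable_timestep_' prefix than on underscore positions, is a regex; this is just a sketch that is equivalent for these names:

```python
import re

files = ['variable_timestep_model321_experiment123.csv',
         'variable_timestep_model2_experiment21.csv',
         'variable_timestep_model4321_experimentname1234.csv',
         'variable_timestep_model0_experiment0.csv']

# capture everything between the fixed prefix and the next '_'
models = [re.search(r'variable_timestep_([^_]+)_', f).group(1) for f in files]
print(models)  # ['model321', 'model2', 'model4321', 'model0']
```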

Exclude the last delimited item from my list

I want to remove the last string in each path, i.e. the library name (delimited by '/'). The text string that I have contains the paths of libraries used at compilation time. These paths are delimited by spaces. I want to retain each path, but only up to one level above the library name.
Example:
text = " /opt/gcc/4.4.4/snos/lib/gcc/x86_64-suse-linux/4.4.4/crtbeginT.o /opt/gcc/4.4.4/snos/lib/gcc/x86_64-suse-linux/4.4.4/crtfastmath.o /opt/cray/cce/8.2.5/craylibs/x86-64/no_mmap.o /opt/cray/cce/8.2.5/craylibs/x86-64/libcraymath.a /opt/cray/cce/8.2.5/craylibs/x86-64/libcraymp.a /opt/cray/atp/1.7.1/lib/libAtpSigHandler.a /opt/cray/atp/1.7.1/lib/libAtpSigHCommData.a "
I want my output to be like -
Output_list =
[/opt/gcc/4.4.4/snos/lib/gcc/x86_64-suse-linux/4.4.4,
/opt/gcc/4.4.4/snos/lib/gcc/x86_64-suse-linux/4.4.4,
/opt/cray/cce/8.2.5/craylibs/x86-64,
/opt/cray/cce/8.2.5/craylibs/x86-64,
/opt/cray/cce/8.2.5/craylibs/x86-64,
/opt/cray/atp/1.7.1/lib,
/opt/cray/atp/1.7.1/lib]
and finally I want to remove the duplicates in the output_list so that the list looks like.
New_output_list =
[/opt/gcc/4.4.4/snos/lib/gcc/x86_64-suse-linux/4.4.4,
/opt/cray/cce/8.2.5/craylibs/x86-64,
/opt/cray/atp/1.7.1/lib]
I am getting partial results using the split() function, but I am struggling to discard the library names from the paths.
Any help would be appreciated.
You seem to want (don't try and do string operations with paths, it's bound to end badly):
import os
New_output_List = list(set(os.path.dirname(pt) for pt in text.split()))
os.path.dirname gets the directory name from a path. We do this for every item in the text, split into a list based on whitespace.
To remove the duplicates, we just convert it to a set and then finally to a list.
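If the original path order matters (a set is unordered), a small variant using dict.fromkeys (relying on Python 3.7+ insertion-ordered dicts) keeps first-seen order; the text here is a shortened subset of the question's sample:

```python
import os

text = (" /opt/gcc/4.4.4/snos/lib/gcc/x86_64-suse-linux/4.4.4/crtbeginT.o"
        " /opt/cray/cce/8.2.5/craylibs/x86-64/no_mmap.o"
        " /opt/cray/cce/8.2.5/craylibs/x86-64/libcraymath.a"
        " /opt/cray/atp/1.7.1/lib/libAtpSigHandler.a ")  # shortened sample

# dict.fromkeys keeps first-seen order while dropping duplicates
new_output_list = list(dict.fromkeys(os.path.dirname(p) for p in text.split()))
print(new_output_list)
# ['/opt/gcc/4.4.4/snos/lib/gcc/x86_64-suse-linux/4.4.4',
#  '/opt/cray/cce/8.2.5/craylibs/x86-64',
#  '/opt/cray/atp/1.7.1/lib']
```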
Try this:
text = " /opt/gcc/4.4.4/snos/lib/gcc/x86_64-suse-linux/4.4.4/crtbeginT.o /opt/gcc/4.4.4/snos/lib/gcc/x86_64-suse-linux/4.4.4/crtfastmath.o /opt/cray/cce/8.2.5/craylibs/x86-64/no_mmap.o /opt/cray/cce/8.2.5/craylibs/x86-64/libcraymath.a /opt/cray/cce/8.2.5/craylibs/x86-64/libcraymp.a /opt/cray/atp/1.7.1/lib/libAtpSigHandler.a /opt/cray/atp/1.7.1/lib/libAtpSigHCommData.a "
New_output_list = []
for x in text.split():
    New_output_list.append("".join("/" + y if y else '' for y in x.split("/")[:-1]))
New_output_list = list(set(New_output_list))  # de-duplicate the directories, not the file paths
