Remove file name duplicates in a list - python

I have a list l:
l = ['Abc.xlsx', 'Wqe.csv', 'Abc.csv', 'Xyz.xlsx']
In this list, I need to remove duplicates without considering the extension. The expected output is below.
l = ['Wqe.csv', 'Abc.csv', 'Xyz.xlsx']
I tried:
l = list(set(x.split('.')[0] for x in l))
But this gives me only the unique filenames, without their extensions.
How can I achieve this?

You can use a dictionary comprehension that uses the name part as the key and the full file name as the value, exploiting the fact that dict keys must be unique; later entries overwrite earlier ones, so the last file seen for each name is kept (which is why Abc.csv survives here rather than Abc.xlsx):
>>> list({x.split(".")[0]: x for x in l}.values())
['Abc.csv', 'Wqe.csv', 'Xyz.xlsx']
If the file names can be in more sophisticated formats (such as with directory names, or in the foo.bar.xls format) you should use os.path.splitext:
>>> import os
>>> list({os.path.splitext(x)[0]: x for x in l}.values())
['Abc.csv', 'Wqe.csv', 'Xyz.xlsx']
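For example (a quick illustration, with a made-up path, of where the naive split('.') falls short):
>>> os.path.splitext("some/dir/report.v2.xlsx")
('some/dir/report.v2', '.xlsx')
>>> "some/dir/report.v2.xlsx".split(".")[0]
'some/dir/report'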

If the order of the end result doesn't matter, we could split each item on the period. We'll treat the part before the first period as the key and keep an item only if its key hasn't been seen yet.
oldList = l
setKeys = set()
l = []
for item in oldList:
    itemKey = item.split(".")[0]
    if itemKey not in setKeys:
        setKeys.add(itemKey)
        l.append(item)
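For reference (my own trace, not part of the original answer), with the sample list this keeps the first file seen for each name:
print(l)
# ['Abc.xlsx', 'Wqe.csv', 'Xyz.xlsx']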

Try this
l = ['Abc.xlsx', 'Wqe.csv', 'Abc.csv', 'Xyz.xlsx']
for x in l:
    name = x.split('.')[0]
    find = 0
    for index, d in enumerate(l, start=0):
        txt = d.split('.')[0]
        if name == txt:
            find += 1
            if find > 1:
                l.pop(index)
print(l)

@Selcuk Definitely the best solution; unfortunately I don't have enough reputation to upvote your answer.
But I would rather use el[:el.rfind('.')] as my dictionary key than os.path.splitext(x)[0], in order to handle the case where we have sophisticated formats in the name. That would give something like this:
list({x[:x.rfind('.')]: x for x in l}.values())
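One caveat worth adding (my note, not part of the original comment): str.rfind returns -1 when there is no dot at all, so a name without an extension would be silently truncated by one character:
>>> 'README'[:'README'.rfind('.')]
'READM'
os.path.splitext handles that case gracefully, returning ('README', '').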

Related

Python get latest files from list

I have a list of files like this:
my_list=['l.txt','PPT_6_202008062343HLC.txt','PPT_6_202008070522HLC.txt','PPT_12_202008062343HLC.txt','PPT_12_202008070522HLC.txt']
and I want a final list with only the latest of the files that begin with ppt_6 and ppt_12, keeping the other items as they are, like this:
final_list=
['PPT_6_202008070522HLC.txt', 'PPT_12_202008070522HLC.txt', 'l.txt']
right now I'm doing this:
from datetime import datetime
now = datetime.now()
new_arc = []
time_6 = []
time_12 = []
for i in my_list:
    if i[4:5] == '6':
        time_6.append(i)
    elif i[4:5] == '1':
        time_12.append(i)
    else:
        new_arc.append(i)
time_6 = [max(t for t in time_6 if datetime.strptime(t[-15:-3], '%Y%m%d%H%M') < now)]
time_12 = [max(t for t in time_12 if datetime.strptime(t[-15:-3], '%Y%m%d%H%M') < now)]
final_list = time_6 + time_12 + new_arc
Is there a better way of doing this?
The datetime format in these file names means you don't need datetime functions at all; alphabetical order is enough.
You can keep all items matching neither pattern and finally append the most recent file for each pattern, which is simply the (alphabetically) maximum element.
p1 = [x for x in my_list if x.startswith("PPT_6")]
p2 = [x for x in my_list if x.startswith("PPT_12")]
result = [x for x in my_list if x not in p1 and x not in p2]
result.append(max(p1))
result.append(max(p2))
print(result)
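A minimal sketch generalizing the same idea to any number of prefixes (the prefixes list below is my assumption, not taken from the question):
prefixes = ["PPT_6_", "PPT_12_"]
# latest file per prefix: the alphabetical max is the most recent, given the timestamp format
latest = {p: max(x for x in my_list if x.startswith(p)) for p in prefixes}
# everything that matches none of the prefixes, in original order
others = [x for x in my_list if not any(x.startswith(p) for p in prefixes)]
print(others + list(latest.values()))
# ['l.txt', 'PPT_6_202008070522HLC.txt', 'PPT_12_202008070522HLC.txt']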
Since the file names already have a date order, you could simply sort on them. Then group by the prefix (PPT_6 and PPT_12). Finally get the top row from each group.
from itertools import groupby

# get prefix up to nth '_'
def split_nth(text, n):
    grp = text.split('_')
    return '_'.join(grp[:n])

my_list = ['l.txt', 'PPT_6_202008062343HLC.txt', 'PPT_6_202008070522HLC.txt',
           'PPT_12_202008062343HLC.txt', 'PPT_12_202008070522HLC.txt']
sorted_list = sorted(my_list[1:], reverse=True)
groups = groupby(sorted_list, key=lambda x: split_nth(x, 2))
result = [next(v) for _, v in groups]
result.append(my_list[0])
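If I trace it correctly, this yields the requested order:
print(result)
# ['PPT_6_202008070522HLC.txt', 'PPT_12_202008070522HLC.txt', 'l.txt']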
The best I could come up with was this:
import re
my_list = [
    'l.txt', 'PPT_6_202008062343HLC.txt', 'PPT_6_202008070522HLC.txt',
    'PPT_12_202008062343HLC.txt', 'PPT_12_202008070522HLC.txt'
]
patterns = (re.compile("PPT_6"), re.compile("PPT_12"))
# the alphabetically last match is the most recent one
final_list = [sorted(filter(pattern.match, my_list))[-1]
              for pattern in patterns]
final_list += list(filter(re.compile("[^PPT]").match, my_list))
Depending on how many file names you're going to be working with, I don't think it should be too bad.

Sorting a list based on another shorter list with incomplete values

I have a list of file paths which I need to order in a specific way prior to reading and processing the files. The specific way is defined by a smaller list which contains only some file names, but not all of them. All other file paths which are not listed in presorted_list need to stay in the order they had previously.
Examples:
some_list = ['path/to/bar_foo.csv',
'path/to/foo_baz.csv',
'path/to/foo_bar(ignore_this).csv',
'path/to/foo(ignore_this).csv',
'other/path/to/foo_baz.csv']
presorted_list = ['foo_baz', 'foo']
expected_list = ['path/to/foo_baz.csv',
'other/path/to/foo_baz.csv',
'path/to/foo(ignore_this).csv',
'path/to/bar_foo.csv',
'path/to/foo_bar(ignore_this).csv']
I've found some relating posts:
Sorting list based on values from another list?
How to sort a list according to another list?
But as far as I can tell, those questions and answers always rely either on two lists of the same length, which I don't have (this results in errors like ValueError: 'bar_foo' is not in list), or on a presorted list that must contain all possible values, which I can't provide.
My Idea:
I've come up with a solution which seems to work but I'm unsure if this is a good way to approach the problem:
import os
import re

EXPECTED_LIST = ['path/to/foo_baz.csv',
                 'other/path/to/foo_baz.csv',
                 'path/to/foo(ignore_this).csv',
                 'path/to/bar_foo.csv',
                 'path/to/foo_bar(ignore_this).csv']
PRESORTED_LIST = ["foo_baz", "foo"]

def sort_function(item, len_list):
    # strip path and unwanted parts
    filename = re.sub(r"[\(\[].*?[\)\]]", "", os.path.basename(item)).split('.')[0]
    if filename in PRESORTED_LIST:
        return PRESORTED_LIST.index(filename)
    return len_list

def main():
    some_list = ['path/to/bar_foo.csv',
                 'path/to/foo_baz.csv',
                 'path/to/foo_bar(ignore_this).csv',
                 'path/to/foo(ignore_this).csv',
                 'other/path/to/foo_baz.csv']
    list_length = len(some_list)
    sorted_list = sorted(some_list, key=lambda x: sort_function(x, list_length))
    assert sorted_list == EXPECTED_LIST

if __name__ == "__main__":
    main()
Are there other (shorter, more pythonic) ways of solving this problem?
Here is how I think I would do it:
import re
from collections import OrderedDict
from itertools import chain

some_list = ['path/to/bar_foo.csv',
             'path/to/foo_baz.csv',
             'path/to/foo_bar(ignore_this).csv',
             'path/to/foo(ignore_this).csv',
             'other/path/to/foo_baz.csv']
presorted_list = ['foo_baz', 'foo']
expected_list = ['path/to/foo_baz.csv',
                 'other/path/to/foo_baz.csv',
                 'path/to/foo(ignore_this).csv',
                 'path/to/bar_foo.csv',
                 'path/to/foo_bar(ignore_this).csv']

def my_sort(lst, presorted_list):
    rgx = re.compile(r"^(.*/)?([^/(.]*)(\(.*\))?(\.[^.]*)?$")
    d = OrderedDict((n, []) for n in presorted_list)
    d[None] = []
    for p in lst:  # iterate over the parameter, not the global list
        m = rgx.match(p)
        n = m.group(2) if m else None
        if n not in d:
            n = None
        d[n].append(p)
    return list(chain.from_iterable(d.values()))

print(my_sort(some_list, presorted_list) == expected_list)
# True
An easy implementation is to add sentinels to the entries before sorting, so there is no need for a specific ordering step. A regex can also be avoided if all file names follow the pattern you gave:
for n, file1 in enumerate(presorted_list):
    for m, file2 in enumerate(some_list):
        if '/' + file1 + '.' in file2 or '/' + file1 + '(' in file2:
            some_list[m] = "%03d%03d:%s" % (n, m, file2)
some_list.sort()
some_list = [file.split(':', 1)[-1] for file in some_list]
print(some_list)
Result:
['path/to/foo_baz.csv',
'other/path/to/foo_baz.csv',
'path/to/foo(ignore_this).csv',
'path/to/bar_foo.csv',
'path/to/foo_bar(ignore_this).csv']
Let me think. It is a unique problem; I'll try to suggest a solution:
only_sorted_elements = filter(lambda x:x.rpartition("/")[-1].partition(".")[0] in presorted_list , some_list)
only_sorted_elements.sort(key = lambda x:presorted_list.index(x.rpartition("/")[-1].partition(".")[0]))
expected_list = []
count = 0
for ind, each_element in enumerate(some_list):
    if each_element not in presorted_list:
        expected_list.append(each_element)
    else:
        expected_list[ind].append(only_sorted_elements[count])
        count += 1
Hope this solves your problem.
I first filter for only those elements which are in presorted_list,
then I sort those elements according to their order in presorted_list,
then I iterate over the list and append accordingly.
Edited:
Changed the index parameters from the filename with path to the exact filename.
This will retain the original indexes of those files which are not in the presorted list.
EDITED:
The newly edited code below changes the parameters and gives the sorted results first and the unsorted ones after.
some_list = ['path/to/bar_foo.csv',
'path/to/foo_baz.csv',
'path/to/foo_bar(ignore_this).csv',
'path/to/foo(ignore_this).csv',
'other/path/to/foo_baz.csv']
presorted_list = ['foo_baz', 'foo']
only_sorted_elements = list(filter(lambda x: x.rpartition("/")[-1].partition("(")[0].partition(".")[0] in presorted_list, some_list))
unsorted_all = list(filter(lambda x: x.rpartition("/")[-1].partition("(")[0].partition(".")[0] not in presorted_list, some_list))
only_sorted_elements.sort(key=lambda x: presorted_list.index(x.rpartition("/")[-1].partition("(")[0].partition(".")[0]))
expected_list = only_sorted_elements + unsorted_all
print(expected_list)
Result :
['path/to/foo_baz.csv',
'other/path/to/foo_baz.csv',
'path/to/foo(ignore_this).csv',
'path/to/bar_foo.csv',
'path/to/foo_bar(ignore_this).csv']
Since python's sort is already stable, you only need to provide it with a coarse grouping for the sort key.
Given the specifics of your sorting requirements this is better done using a function. For example:
def presort(presorted):
    def sortkey(name):
        filename = name.split("/")[-1].split(".")[0].split("(")[0]
        if filename in presorted:
            return presorted.index(filename)
        return len(presorted)
    return sortkey

sorted_list = sorted(some_list, key=presort(['foo_baz', 'foo']))
In order to keep the process generic and simple to use, the presorted_list should be provided as a parameter and the sort key function should use it to produce the grouping keys. This is achieved by returning a function (sortkey) that captures the presorted list parameter.
This sortkey() function returns the index of the file name in the presorted_list or a number beyond that for unmatched file names. So, if you have 2 names in the presorted_list, they will group the corresponding files under sort key values 0 and 1. All other files will be in group 2.
The conditions that you use to determine which part of the file name should be found in presorted_list are somewhat unclear so I only covered the specific case of the opening parenthesis. Within the sortkey() function, you can add more sophisticated parsing to meet your needs.
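As one illustration of that last point (a sketch of my own, not part of the original answer), the parsing inside sortkey() could be swapped for os.path plus a small regex so that directories, extensions and any "(...)" parts are all stripped:
import os
import re

def presort(presorted):
    def sortkey(name):
        # strip the directory and the extension, then any "(...)" part
        base = os.path.splitext(os.path.basename(name))[0]
        base = re.sub(r"\(.*?\)", "", base)
        if base in presorted:
            return presorted.index(base)
        return len(presorted)
    return sortkey

sorted_list = sorted(some_list, key=presort(['foo_baz', 'foo']))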

While loop within for loop for list of lists

I'm trying to create a big list that will contain lists of strings. I iterate over the input list of strings and create a temporary list.
Input:
['Mike','Angela','Bill','\n','Robert','Pam','\n',...]
My desired output:
[['Mike','Angela','Bill'],['Robert','Pam']...]
What i get:
[['Mike','Angela','Bill'],['Angela','Bill'],['Bill']...]
Code:
for i in range(0, len(temp)):
    temporary = []
    while(temp[i] != '\n' and i < len(temp)-1):
        temporary.append(temp[i])
        i += 1
    bigList.append(temporary)
Use itertools.groupby
from itertools import groupby
names = ['Mike','Angela','Bill','\n','Robert','Pam']
[list(g) for k,g in groupby(names, lambda x:x=='\n') if not k]
#[['Mike', 'Angela', 'Bill'], ['Robert', 'Pam']]
Fixing your code, I'd recommend iterating over each element directly, appending to a nested list -
r = [[]]
for i in temp:
    if i.strip():
        r[-1].append(i)
    else:
        r.append([])
Note that if temp ends with a newline, r will have a trailing empty [] list. You can get rid of that though:
if not r[-1]:
    del r[-1]
Another option would be using itertools.groupby, which the other answerer has already mentioned, although your method is more performant.
Your for loop was scanning over the temp array just fine, but the while loop inside advanced that index, and the for loop then reset it on the next pass, so the same elements were scanned again. This caused the repetition.
temp = ['mike','angela','bill','\n','robert','pam','\n','liz','anya','\n']
# !make sure to include this '\n' at the end of temp!
bigList = []
temporary = []
for i in range(0, len(temp)):
    if(temp[i] != '\n'):
        temporary.append(temp[i])
        print(temporary)
    else:
        print(temporary)
        bigList.append(temporary)
        temporary = []
You could try:
a_list = ['Mike','Angela','Bill','\n','Robert','Pam','\n']
result = []
start = 0
end = 0
for indx, name in enumerate(a_list):
    if name == '\n':
        end = indx
        sublist = a_list[start:end]
        if sublist:
            result.append(sublist)
        start = indx + 1
>>> result
[['Mike', 'Angela', 'Bill'], ['Robert', 'Pam']]

Group elements of a list based on repetition of values

I am really new to Python and I am having an issue figuring out the problem below.
I have a list like:
my_list = ['testOne:100', 'testTwo:88', 'testThree:76', 'testOne:78', 'testTwo:88', 'testOne:73', 'testTwo:66', 'testThree:90']
And I want to group the elements based on the occurrence of elements that start with 'testOne'.
Expected Result:
new_list=[['testOne:100', 'testTwo:88', 'testThree:76'], ['testOne:78', 'testTwo:88'], ['testOne:73', 'testTwo:66', 'testThree:90']]
Just start a new list at every testOne.
>>> new_list = []
>>> for item in my_list:
...     if item.startswith('testOne:'):
...         new_list.append([])
...     new_list[-1].append(item)
...
>>> new_list
[['testOne:100', 'testTwo:88', 'testThree:76'], ['testOne:78', 'testTwo:88'], ['testOne:73', 'testTwo:66', 'testThree:90']]
Not a cool one-liner, but this works also with more general labels:
result = [[]]
seen = set()
for entry in my_list:
    test, val = entry.split(":")
    if test in seen:
        result.append([entry])
        seen = {test}
    else:
        result[-1].append(entry)
        seen.add(test)
Here, we are keeping track of the test labels we've already seen in a set and starting a new list whenever we encounter a label we've already seen in the same list.
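A quick check with the sample list (my own trace):
print(result)
# [['testOne:100', 'testTwo:88', 'testThree:76'], ['testOne:78', 'testTwo:88'], ['testOne:73', 'testTwo:66', 'testThree:90']]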
Alternatively, assuming the lists always start with testOne, you could just start a new list whenever the label is testOne:
result = []
for entry in my_list:
    test, val = entry.split(":")
    if test == "testOne":
        result.append([entry])
    else:
        result[-1].append(entry)
It'd be nice to have an easy one liner, but I think it'd end up looking a bit too complicated if I tried that. Here's what I came up with:
# Create a list of the starting indices:
ind = [i for i, e in enumerate(my_list) if e.split(':')[0] == 'testOne']
# Create a list of slices using pairs of indices:
new_list = [my_list[i:j] for (i, j) in zip(ind, ind[1:] + [None])]
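For the sample my_list, those two lines come out as (my own check):
print(ind)       # [0, 3, 5]
print(new_list)  # [['testOne:100', 'testTwo:88', 'testThree:76'],
                 #  ['testOne:78', 'testTwo:88'],
                 #  ['testOne:73', 'testTwo:66', 'testThree:90']]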
Not very sophisticated but it works:
my_list = ['testOne:100', 'testTwo:88', 'testThree:76', 'testOne:78', 'testTwo:88', 'testOne:73', 'testTwo:66', 'testThree:90']
splitting_word = 'testOne'
new_list = list()
partial_list = list()
for item in my_list:
    if item.startswith(splitting_word) and partial_list:
        new_list.append(partial_list)
        partial_list = list()
    partial_list.append(item)
new_list.append(partial_list)
Joining the list into a string with delimiter |:
step1 = "|".join(my_list)
Splitting the string based on 'testOne':
step2 = step1.split("testOne")
Appending "testOne" back to the elements to get the result:
new_list = [[i for i in str('testOne' + i).split("|") if len(i) > 0] for i in step2[1:]]
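Checking the result against the expected output (my own trace):
print(new_list)
# [['testOne:100', 'testTwo:88', 'testThree:76'], ['testOne:78', 'testTwo:88'], ['testOne:73', 'testTwo:66', 'testThree:90']]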

String matching in lists - Python

I have a list like below. From this I have to filter the tables that begin with 'test:SF.AcuraUsage_' (string matching):
test:SF.AcuraUsage_20150311
test:SF.AcuraUsage_20150312
test:SF.AcuraUsage_20150313
test:SF.AcuraUsage_20150314
test:SF.AcuraUsage_20150315
test:SF.AcuraUsage_20150316
test:SF.AcuraUsage_20150317
test:SF.ClientUsage_20150318
test:SF.ClientUsage_20150319
test:SF.ClientUsage_20150320
test:SF.ClientUsage_20150321
I am using this for loop but not sure why it does not work:
for x in list:
    if(x 'test:SF.AcuraUsage_'):
        print x
I tried this out:
for x in list:
    alllist = x
vehiclelist = [x for x in alllist if x.startswith('geotab-bigdata-test:StoreForward.VehicleInfo')]
Still I get the error 'dictionary object has no attribute startswith'.
You shouldn't name your list list, since it shadows the built-in type list.
But, if you'd like to filter that list using Python, consider using this list comprehension:
acura = [x for x in list if x.startswith('test:SF.AcuraUsage')]
then, if you'd like to output it
for x in acura:
    print(x)
List comprehensions are good for that.
Get a list with all the items that begin with 'test:SF.AcuraUsage_':
new_list = [x for x in list if x.startswith('test:SF.AcuraUsage_')]
Or the items that do not begin with 'test:SF.AcuraUsage_':
new_list = [x for x in list if not x.startswith('test:SF.AcuraUsage_')]
Using the re module:
import re
for x in list:
    ret = re.match('test:SF.AcuraUsage_(.*)', x)
    if ret:
        print(ret.group())
