Python get latest files from list - python

I have a list of files like this:
my_list=['l.txt','PPT_6_202008062343HLC.txt','PPT_6_202008070522HLC.txt','PPT_12_202008062343HLC.txt','PPT_12_202008070522HLC.txt']
and I want to have a final list with the latest that files that begins with ppt_6 and ppt_12 and keep the other elements items, like this:
final_list=
['PPT_6_202008070522HLC.txt', 'PPT_12_202008070522HLC.txt', 'l.txt']
right now I'm doing this:
from datetime import datetime
now = datetime.now()
new_arc=[]
time_6=[]
time_12=[]
for i in my_list:
if i[4:5]=='6':
time_6.append(i)
elif i[4:5]=='1':
time_12.append(i)
else:
new_arc.append(i)
time_6 = [max(t for t in time_6 if datetime.strptime(t[-15:-3], '%Y%m%d%H%M') < now)]
time_12 = [max(t for t in time_12 if datetime.strptime(t[-15:-3], '%Y%m%d%H%M') < now)]
final_list=time_6+time_12+new_arc
is there a better way of doing this ?

The datetime format into these filenames allows you not to use datetime functions, alphabetical order is enough.
You can remove all items matching the two patterns and finally append the most recent of them, which are the maximum (alphabetically) elements.
p1 = [x for x in my_list if x.startswith("PPT_6")]
p2 = [x for x in my_list if x.startswith("PPT_12")]
result = [x for x in my_list if x not in p1 and x not in p2]
result.append(max(p1))
result.append(max(p2))
print(result)

Since the file names already have a date order, you could simply sort on them. Then group by the prefix (PPT_6 and PPT_12). Finally get the top row from each group.
from itertools import groupby
#get prefix up to nth _
def split_nth(text, n):
grp = text.split('_')
return '_'.join(grp[:n])
my_list =['l.txt','PPT_6_202008062343HLC.txt','PPT_6_202008070522HLC.txt',
'PPT_12_202008062343HLC.txt','PPT_12_202008070522HLC.txt']
sorted_list = sorted(my_list[1:], reverse=True)
groups = groupby(sorted_list, key=lambda x: split_nth(x, 2))
result = [next(v) for _, v in groups]
result.append(my_list[0])

The best I could come up with was this:
import re
my_list = [
'l.txt','PPT_6_202008062343HLC.txt','PPT_6_202008070522HLC.txt',
'PPT_12_202008062343HLC.txt','PPT_12_202008070522HLC.txt'
]
patterns = (re.compile("PPT_6"), re.compile("PPT_12"))
final_list = [sorted(list(filter(pattern.match, problem_list)))[0]
for pattern in patterns]
final_list += list(filter(re.compile("[^PPT]").match, problem_list))
Depending on how many file names you're going to be working with, I don't think it should be too bad.

Related

Remove file name duplicates in a list

I have a list l:
l = ['Abc.xlsx', 'Wqe.csv', 'Abc.csv', 'Xyz.xlsx']
In this list, I need to remove duplicates without considering the extension. The expected output is below.
l = ['Wqe.csv', 'Abc.csv', 'Xyz.xlsx']
I tried:
l = list(set(x.split('.')[0] for x in l))
But getting only unique filenames without extension
How could I achieve it?
You can use a dictionary comprehension that uses the name part as key and the full file name as the value, exploiting the fact that dict keys must be unique:
>>> list({x.split(".")[0]: x for x in l}.values())
['Abc.csv', 'Wqe.csv', 'Xyz.xlsx']
If the file names can be in more sophisticated formats (such as with directory names, or in the foo.bar.xls format) you should use os.path.splitext:
>>> import os
>>> list({os.path.splitext(x)[0]: x for x in l}.values())
['Abc.csv', 'Wqe.csv', 'Xyz.xlsx']
If the order of the end result doesn't matter, we could split each item on the period. We'll regard the first item in the list as the key and then keep the item if the key is unique.
oldList = l
setKeys = set()
l = []
for item in oldList:
itemKey = item.split(".")[0]
if itemKey in setKeys:
pass
else:
setKeys.add(itemKey)
l.append(item)
Try this
l = ['Abc.xlsx', 'Wqe.csv', 'Abc.csv', 'Xyz.xlsx']
for x in l:
name = x.split('.')[0]
find = 0
for index,d in enumerate(l, start=0):
txt = d.split('.')[0]
if name == txt:
find += 1
if find > 1:
l.pop(index)
print(l)
#Selcuk Definitely the best solution, unfortunately I don't have enough reputation to vote you answer.
But I would rather use el[:el.rfind('.')] as my dictionary key than os.path.splitext(x)[0] in order to handle the case where we have sophisticated formats in the name. that will give something like this:
list({x[:x.rfind('.')]: x for x in l}.values())

Python - filter list from another other list with condition

list1 = ['/mnt/1m/a_pre.geojson','/mnt/2m/b_pre.geojson']
list2 = ['/mnt/1m/a_post.geojson']
I have multiple lists and I want to find all the elements of list1 which do not have entry in list2 with a filtering condition.
The condition is it should match 'm' like 1m,2m.. and name of geojson file excluding 'pre or post' substring.
For in e.g. list1 '/mnt/1m/a_pre.geojson' is processed but '/mnt/2m/b_pre.geojson' is not so the output should have a list ['/mnt/2m/b_pre.geojson']
I am using 2 for loops and then splitting the string which I am sure is not the only one and there might be easier way to do this.
for i in list1:
for j in list2:
pre_tile = i.split("/")[-1].split('_pre', 1)[0]
post_tile = j.split("/")[-1].split('_post', 1)[0]
if pre_tile == post_tile:
...
I believe you have similar first part of the file paths. If so, you can try this:
list1 = ['/mnt/1m/a_pre.geojson','/mnt/2m/b_pre.geojson']
list2 = ['/mnt/1m/a_post.geojson']
res = [x for x in list1 if x[:7] not in [y[:7] for y in list2]]
res:
['/mnt/2m/b_pre.geojson']
If I understand you correctly, using a regular expression to do this kind of string manipulation can be fast and easy.
Additionally, to do multiple member-tests in list2, it's more efficient to convert the list to a set.
import re
list1 = ['/mnt/1m/a_pre.geojson', '/mnt/2m/b_pre.geojson']
list2 = ['/mnt/1m/a_post.geojson']
pattern = re.compile(r'(.*?/[0-9]m/.*?)_pre.geojson')
set2 = set(list2)
result = [
m.string
for m in map(pattern.fullmatch, list1)
if m and f"{m[1]}_post.geojson" not in set2
]
print(result)

How to parse the month from a list of strings that contain dates and other strings inside the list?

I have list of lists
list1 = [['0', '2015-12-27', '64236.62'],
['1', '2015-12-12', '65236.12'],
... ]
This list contains data from 2015 to 2018
how to figure out the value for each month?
So, I would like to create a dictionary with data for each month for a certain year.
I have tried like this:
import re
years_month_count = {}
for i in list1:
match = re.search("[2][0][1][5-8]-[0-9][0-9]", i[1])
if match not in years_month_count:
years_month_count[match] = 0
else:
years_month_count[match] += float(i[2])
Using str.rsplit and a collections.defaultdict, you can do the following:
from collections import defaultdict
list1 = [['0', '2015-12-27', '64236.62'],
['1', '2015-11-12', '65236.12'],
['2', '2015-12-27', '64236.62']]
d = defaultdict(float)
for x in list1:
d[x[1].rsplit('-', 1)[0]] += float(x[2])
The output will be a dict like:
{'2015-12': 128473.24, '2015-11': 65236.12}
You should not use the else clause, since you always want to add the value, even for the first item of a month.
Also, you don't need a regular expression. If all the datestamps are well formed, you you can simply use string slicing.
years_month_count = {}
for _, date, value in list1:
month = date[:7]
years_month_count[month] = float(value) + years_month_count.get(month, 0)
create your dictionary and initialize it to 0
d = {i:0 for i in range(1,13)}
loop through your list, split string to get month, and add value to dictionary.
for l in list1:
splt = l[1].split("-")
d[int(splt[1])] += float(l[2])

Sensing Two Consecutive Strings in String

If I have aaabbbccc, I'd like to change them in to a3b3c3.
I am using if statement for this.. but it looks too inefficient.
Maybe Regex would be helpful, but regex for searching only the consecutives is possible?
if I have aaabbbcccaaa then I'd like to change them a3b3c3a3 list this.. which means the algorithm only search the "consecutives and count them" change into integer.
Any hint to proceed would be appreciated.
def comp(string):
index = []
for i in range(len(string)):
try:
if string[i] is not string[i+1]:
index.append(i)
except:
pass
first = string[index[0]] + str(index[0]+1)
print(first)
message_comp = [first]
for i in range(1, len(message_comp)):
message_comp.append(message[index[i]]*(index[i-1]+1))
final = ''.join(message_comp)
return final
itertools.groupby:
Make an iterator that returns consecutive keys and groups from the iterable
import itertools
x = 'aaabbbcccaaa'
groups = [i + str(len([*j])) for i, j in itertools.groupby(x)]
# ['a3', 'b3', 'c3', 'a3']
join to finish up:
''.join(groups)
# a3b3c3a3
If needed, replace to remove 1:
''.join(groups).replace('1', '') instead of ''.join(groups)
Maybe itertools groupby?
from itertools import groupby
s = "aaabbbcccaaa"
groups = groupby(s)
a = [(label, sum(1 for _ in group)) for label, group in groups]
b = [i for sub in a for i in sub]
print("".join(map(str,b)))
output: a3b3c3a3

Sorting a list based on another shorter list with incomplete values

I have a list of file paths which I need to order in a specific way prior to reading and processing the files. The specific way is defined by a smaller list which contains only some file names, but not all of them. All other file paths which are not listed in presorted_list need to stay in the order they had previously.
Examples:
some_list = ['path/to/bar_foo.csv',
'path/to/foo_baz.csv',
'path/to/foo_bar(ignore_this).csv',
'path/to/foo(ignore_this).csv',
'other/path/to/foo_baz.csv']
presorted_list = ['foo_baz', 'foo']
expected_list = ['path/to/foo_baz.csv',
'other/path/to/foo_baz.csv',
'path/to/foo(ignore_this).csv',
'path/to/bar_foo.csv',
'path/to/foo_bar(ignore_this).csv']
I've found some relating posts:
Sorting list based on values from another list?
How to sort a list according to another list?
But as far as I can tell the questions and answers always rely on two lists of the same length which I don't have (which results in errors like ValueError: 'bar_foo' is not in list) or a presorted list which needs to contain all possible values which I can't provide.
My Idea:
I've come up with a solution which seems to work but I'm unsure if this is a good way to approach the problem:
import os
import re
EXCPECTED_LIST = ['path/to/foo_baz.csv',
'other/path/to/foo_baz.csv',
'path/to/foo(ignore_this).csv',
'path/to/bar_foo.csv',
'path/to/foo_bar(ignore_this).csv']
PRESORTED_LIST = ["foo_baz", "foo"]
def sort_function(item, len_list):
# strip path and unwanted parts
filename = re.sub(r"[\(\[].*?[\)\]]", "", os.path.basename(item)).split('.')[0]
if filename in PRESORTED_LIST:
return PRESORTED_LIST.index(filename)
return len_list
def main():
some_list = ['path/to/bar_foo.csv',
'path/to/foo_baz.csv',
'path/to/foo_bar(ignore_this).csv',
'path/to/foo(ignore_this).csv',
'other/path/to/foo_baz.csv',]
list_length = len(some_list)
sorted_list = sorted(some_list, key=lambda x: sort_function(x, list_length))
assert sorted_list == EXCPECTED_LIST
if __name__ == "__main__":
main()
Are there other (shorter, more pythonic) ways of solving this problem?
Here is how I think I would do it:
import re
from collections import OrderedDict
from itertools import chain
some_list = ['path/to/bar_foo.csv',
'path/to/foo_baz.csv',
'path/to/foo_bar(ignore_this).csv',
'path/to/foo(ignore_this).csv',
'other/path/to/foo_baz.csv']
presorted_list = ['foo_baz', 'foo']
expected_list = ['path/to/foo_baz.csv',
'other/path/to/foo_baz.csv',
'path/to/foo(ignore_this).csv',
'path/to/bar_foo.csv',
'path/to/foo_bar(ignore_this).csv']
def my_sort(lst, presorted_list):
rgx = re.compile(r"^(.*/)?([^/(.]*)(\(.*\))?(\.[^.]*)?$")
d = OrderedDict((n, []) for n in presorted_list)
d[None] = []
for p in some_list:
m = rgx.match(p)
n = m.group(2) if m else None
if n not in d:
n = None
d[n].append(p)
return list(chain.from_iterable(d.values()))
print(my_sort(some_list, presorted_list) == expected_list)
# True
An easy implementation is to add some sentinels to the lines before sorting. So there is no need for specific ordering. Also regex may be avoid if all filenames respect the pattern you gave:
for n,file1 in enumerate(presorted_list):
for m,file2 in enumerate(some_list):
if '/'+file1+'.' in file2 or '/'+file1+'(' in file2:
some_list[m] = "%03d%03d:%s" % (n, m, file2)
some_list.sort()
some_list = [file.split(':',1)[-1] for file in some_list]
print(some_list)
Result:
['path/to/foo_baz.csv',
'other/path/to/foo_baz.csv',
'path/to/foo(ignore_this).csv',
'path/to/bar_foo.csv',
'path/to/foo_bar(ignore_this).csv']
Let me think. It is a unique problem, I'll try to suggest a solution
only_sorted_elements = filter(lambda x:x.rpartition("/")[-1].partition(".")[0] in presorted_list , some_list)
only_sorted_elements.sort(key = lambda x:presorted_list.index(x.rpartition("/")[-1].partition(".")[0]))
expected_list = []
count = 0
for ind, each_element in enumerate(some_list):
if each_element not in presorted_list:
expected_list.append(each_element)
else:
expected_list[ind].append(only_sorted_elements[count])
count += 1
Hope this solves your problem.
I first filter for only those elements which are there in presorted_list,
then I sort those elements according to its order in presorted_list
Then I iterate over the list and append accordingly.
Edited :
Changed index parameters from filename with path to exact filename.
This will retain the original indexes of those files which are not in presorted list.
EDITED :
The new edited code will change the parameters and gives sorted results first and unsorted later.
some_list = ['path/to/bar_foo.csv',
'path/to/foo_baz.csv',
'path/to/foo_bar(ignore_this).csv',
'path/to/foo(ignore_this).csv',
'other/path/to/foo_baz.csv']
presorted_list = ['foo_baz', 'foo']
only_sorted_elements = filter(lambda x:x.rpartition("/")[-1].partition("(")[0].partition(".")[0] in presorted_list , some_list)
unsorted_all = filter(lambda x:x.rpartition("/")[-1].partition("(")[0].partition(".")[0] not in presorted_list , some_list)
only_sorted_elements.sort(key = lambda x:presorted_list.index(x.rpartition("/")[-1].partition("(")[0].partition(".")[0]))
expected_list = only_sorted_elements + unsorted_all
print expected_list
Result :
['path/to/foo_baz.csv',
'other/path/to/foo_baz.csv',
'path/to/foo(ignore_this).csv',
'path/to/bar_foo.csv',
'path/to/foo_bar(ignore_this).csv']
Since python's sort is already stable, you only need to provide it with a coarse grouping for the sort key.
Given the specifics of your sorting requirements this is better done using a function. For example:
def presort(presorted):
def sortkey(name):
filename = name.split("/")[-1].split(".")[0].split("(")[0]
if filename in presorted:
return presorted.index(filename)
return len(presorted)
return sortkey
sorted_list = sorted(some_list,key=presort(['foo_baz', 'foo']))
In order to keep the process generic and simple to use, the presorted_list should be provided as a parameter and the sort key function should use it to produce the grouping keys. This is achieved by returning a function (sortkey) that captures the presorted list parameter.
This sortkey() function returns the index of the file name in the presorted_list or a number beyond that for unmatched file names. So, if you have 2 names in the presorted_list, they will group the corresponding files under sort key values 0 and 1. All other files will be in group 2.
The conditions that you use to determine which part of the file name should be found in presorted_list are somewhat unclear so I only covered the specific case of the opening parenthesis. Within the sortkey() function, you can add more sophisticated parsing to meet your needs.

Categories

Resources