How can I get the keys from the ftp_json dictionary with the largest date for each mask in the daily_updated list?
daily_updated = ('kgrd', 'cvhd', 'metd')
ftp_json = {'kgrd0118.arj': 'Jan-18-2007',
'kgrd0623.arj': 'Jun-23-2005',
'kgrd0624.arj': 'Jun-24-2005',
'cvhd0629.ARJ': 'Jan-29-2021',
'cvhd1026.arj': 'Oct-26-2015',
'cvhd1125.ARJ': 'Nov-25-2019',
'cvhd0222.ARJ': 'Feb-22-2022',
'metd0228.ARJ': 'Feb-28-2022',
'metd0321.ARJ': 'Mar-26-2021',
}
result = ['kgrd0118.arj', 'cvhd0222.arj', 'metd0228.ARJ']
You can take advantage of the key parameter of the max (and min) built-in function to impose an ordering criterion. Before that, you need to turn the strings containing the dates into datetime objects, which come with their own ordering implementation (__lt__ etc.). Here is the doc for the date formatting.
Notice that a minimum date object is needed: it is used as a "fake" value so that entries with other masks cannot interfere in the max-term search. I simply fixed it as the minimum of all the dates.
import datetime
daily_updated = ('kgrd', 'cvhd', 'metd')
ftp_json = {'kgrd0118.arj': 'Jan-18-2007',
'kgrd0623.arj': 'Jun-23-2005',
'kgrd0624.arj': 'Jun-24-2005',
'cvhd0629.ARJ': 'Jan-29-2021',
'cvhd1026.arj': 'Oct-26-2015',
'cvhd1125.ARJ': 'Nov-25-2019',
'cvhd0222.ARJ': 'Feb-22-2022',
'metd0228.ARJ': 'Feb-28-2022',
'metd0321.ARJ': 'Mar-26-2021',
}
def date_formatter(mydate):
    return datetime.datetime.strptime(mydate, '%b-%d-%Y').date()

# smallest date
day_zero = min(date_formatter(d) for d in ftp_json.values())
# get the maximum for each mask
m = [max(ftp_json.items(), key=lambda pair: date_formatter(pair[1]) if pair[0].startswith(pattern) else day_zero) for pattern in daily_updated]
print([i for i, _ in m])
Output
['kgrd0118.arj', 'cvhd0222.ARJ', 'metd0228.ARJ']
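As a side note, datetime.date.min can serve directly as the sentinel, which avoids the extra pass over the values to find the smallest date; a minimal sketch reusing date_formatter and the data from above:
import datetime

day_zero = datetime.date.min  # earliest representable date, smaller than any real entry

m = [max(ftp_json.items(),
         key=lambda pair: date_formatter(pair[1]) if pair[0].startswith(pattern) else day_zero)
     for pattern in daily_updated]
print([name for name, _ in m])  # ['kgrd0118.arj', 'cvhd0222.ARJ', 'metd0228.ARJ']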
EDIT
To keep it more readable (and not single-line-like), a small key-function factory (a closure) can be introduced and passed to the key parameter of max (or min).
# ...
def date_formatter(mydate):
    return datetime.datetime.strptime(mydate, '%b-%d-%Y').date()

# smallest date
day_zero = min(date_formatter(d) for d in ftp_json.values())

# key factory containing the logic of the comparison criterion
def ordering(pattern):
    def _wrapper(pair):
        if pair[0].startswith(pattern):
            # cast to a date object if the "mask"/pattern matches
            return date_formatter(pair[1])
        else:
            # return the default smallest date object -> it cannot win the max search
            return day_zero
    return _wrapper
# get the maximum for each mask
m = [max(ftp_json.items(), key=ordering(pattern)) for pattern in daily_updated]
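As before, only the filenames are needed, so the keys can be picked out of the resulting (key, value) pairs:
print([name for name, _ in m])  # ['kgrd0118.arj', 'cvhd0222.ARJ', 'metd0228.ARJ']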
This can no doubt be done more simply, but I think this example is a descriptive way to do this with the standard library.
from datetime import datetime
ftp_json = {
"kgrd0118.arj": "Jan-18-2007",
"kgrd0623.arj": "Jun-23-2005",
"kgrd0624.arj": "Jun-24-2005",
"cvhd0629.ARJ": "Jan-29-2021",
"cvhd1026.arj": "Oct-26-2015",
"cvhd1125.ARJ": "Nov-25-2019",
"cvhd0222.ARJ": "Feb-22-2022",
"metd0228.ARJ": "Feb-28-2022",
"metd0321.ARJ": "Mar-26-2021",
}
max_dates = {} # New dict for storing running maximums.
for k, v in ftp_json.items():
    d = datetime.strptime(v, "%b-%d-%Y")  # Use datetime for comparison.
    # Fetch the previously stored tuple values, if set, for comparison.
    # If they weren't set, setdefault stores them now.
    maxk, maxv, maxd = max_dates.setdefault(k[:4], (k, v, d))
    if d > maxd:  # Update the values if the current date is more recent.
        max_dates[k[:4]] = (k, v, d)

# Validate we stored the correct values.
assert [v[0] for v in max_dates.values()] == [
    "kgrd0118.arj",
    "cvhd0222.ARJ",
    "metd0228.ARJ",
]
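If only the newest filename per prefix is needed afterwards, it can be read straight out of the running-maximum dict; a small usage sketch on top of the loop above:
latest = {prefix: name for prefix, (name, _, _) in max_dates.items()}
print(latest)
# {'kgrd': 'kgrd0118.arj', 'cvhd': 'cvhd0222.ARJ', 'metd': 'metd0228.ARJ'}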
Related
I have a list and a dictionary of dictionaries, as below, and I want to run a query against them.
Every element in the list is a person's name plus a query_date (in string format).
The dictionary's keys are people's names; each value is a dictionary that uses an announcement_date (in string format) as the key and a bonus as the value.
Given a person's name and query_date from the list, I want to find the person's bonus whose announcement_date is the closest to (and earlier than) the query_date.
For example, "Mike 20191022" shall return Mike's bonus announced on 20190630 (i.e. 105794.62).
What I have tried is to compute the difference between the query_date and every announcement_date. From the smallest difference I get an index, and I use that index to return the corresponding bonus.
import numpy as np
to_do = ["Mike 20191022"]
bonus = {'Mike': {'20200630': '105794.62', '20191231': '105794.62', '20190630': '105794.62', '20181231': '105794.62', '20180630': '95122.25', '20171231': '95122.25', '20170630': '95122.25'}}
for ox in to_do:
    to_do_people = ox.split(' ')[0]
    to_do_date = ox.split(' ')[1]
    for key, s_dict in bonus.items():
        if to_do_people == key:
            tem_list = []
            for k, v in s_dict.items():
                tem_list.append(int(to_do_date) - int(k))
            idx = np.argmin(tem_list)
            print(ox + '#Bonus#' + list(s_dict.keys())[idx] + '#' + list(s_dict.values())[idx])
However it doesn't work. The output is:
Mike 20191022#Bonus#20200630#105794.62
What went wrong, and how can I correct it? Thank you.
Actually, you should append the absolute value, since min(-1000, 100, 0) is -1000, but what you want is 0.
tem_list.append(abs(int(to_do_date) - int(k)))
which gives the output as
Mike 20191022#Bonus#20191231#105794.62
But that alone won't enforce that the result date must come before the query date. That problem can be solved easily by assigning a very big number to all the dates that come after the query date.
tem_list.append(abs(int(to_do_date) - int(k)) if (int(to_do_date)>=int(k)) else float('inf'))
What it does is set infinity for all the dates that come after the query date, so argmin can never pick one of them.
So, the solution becomes
import numpy as np
to_do = ["Mike 20191022"]
bonus = {'Mike': {'20200630': '105794.62', '20191231': '105794.62', '20190630': '105794.62', '20181231': '105794.62', '20180630': '95122.25', '20171231': '95122.25', '20170630': '95122.25'}}
for ox in to_do:
    to_do_people = ox.split(' ')[0]
    to_do_date = ox.split(' ')[1]
    for key, s_dict in bonus.items():
        if to_do_people == key:
            tem_list = []
            for k, v in s_dict.items():
                tem_list.append(abs(int(to_do_date) - int(k)) if int(to_do_date) >= int(k) else float('inf'))
            idx = np.argmin(tem_list)
            print(ox + '#Bonus#' + list(s_dict.keys())[idx] + '#' + list(s_dict.values())[idx])
output
Mike 20191022#Bonus#20190630#105794.62
Since you are importing numpy, use numpy.searchsorted:
>>> for query in to_do:
...     name, date = query.split()
...     keys = sorted(map(int, bonus[name].keys()))
...     ix = np.searchsorted(keys, int(date), side='right') - 1
...     print(f"{name} {date}#bonus#{keys[ix]}#{bonus[name][str(keys[ix])]}")
...
Mike 20191022#bonus#20190630#105794.62
EXPLANATION:
For first iteration, query == "Mike 20191022"
>>> name, date = query.split()
>>> name
"Mike"
>>> date
"20191022"
# then we sort the keys of the dict:
>>> sorted(map(int,bonus[name].keys()))
[20170630, 20171231, 20180630, 20181231, 20190630, 20191231, 20200630]
# Now we search for the insertion index, i.e. the index of the first date strictly greater than query_date
>>> np.searchsorted(keys, int(date), side='right')
5
# So index 5 holds the first date greater than query_date
# If we check, we will find: keys[5] == 20191231 > 20191022
# However, we want the last date at or before the query, so subtract 1 from the index
>>> ix = np.searchsorted(keys, int(date), side='right') - 1
>>> ix
4
# Now we first get the required date:
>>> keys[ix] # == keys[4]
20190630
# And then we access the bonus:
>>> bonus[name][str(keys[ix])]
"105794.62"
Now we print it using f-strings
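If numpy is not otherwise needed, the same binary-search idea is available in the standard library; a hedged sketch using bisect, assuming the same to_do and bonus structures as above:
from bisect import bisect_right

for query in to_do:
    name, date = query.split()
    keys = sorted(map(int, bonus[name].keys()))
    ix = bisect_right(keys, int(date)) - 1  # index of the last announcement date <= query date
    print(f"{name} {date}#bonus#{keys[ix]}#{bonus[name][str(keys[ix])]}")
# Mike 20191022#bonus#20190630#105794.62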
I have a use case with a list of Java Maven semantic version ranges, and I am trying to find the list of versions that I can exclude in my application.
Example:
SemVersion1 = "[,1.8.8.1)"
SemVersion2 = "[,1.8.8.2)"
SemVersion3 = "[,1.8.8.3)"
SemVersion4 = "[1.0.0, 1.4.4)"
SemVersion5 = "[,1.3.11),[,1.7.0), [,1.8.0)"
How do I filter the various SemVersions to keep only the one that includes all the other versions? In this case, SemVersion3 should be picked because it includes all the others.
Thinking about something like this in python:
output = []
for a in SemVersions:
    is_redundant = False
    for b in SemVersions:
        if b != a and affected_versions[b].issuperset(affected_versions[a]):
            is_redundant = True
            break
    if not is_redundant:
        output.append(a)
The problem is that I would need to inflate the SemVersions to get the affected_versions; is there an easier way to do this?
Steps:
convert your ranges into a stricter notation to eliminate implicit lower bounds like ',1.2.3.4' and unequal-length versions like '1.0' vs '2.3.4' (we need exact lower bounds for the last step)
convert your strings into actual Version objects for ease of comparison (you could also convert to numeric tuples, but they break on versions containing non-numeric characters)
from packaging.version import Version
from functools import cmp_to_key
from collections import defaultdict

def adj_lengths(s, c=3):
    """Add missing .0 parts to a version up to a predefined '.'-count.
    Example: '9.0.2' with c == 3 -> '9.0.2.0'"""
    cnt = s.count(".")
    if c > cnt:
        s = s + ".0" * (c - cnt)
    return s

def split_mult(s):
    """Make versions more precise by replacing implicit range starts.
    Uses adj_lengths to add missing .0 parts to equalize version lengths.
    Handles splitting multiple given ranges as well.
    Returns an iterable.
    Example: '[,1.3.11),[,1.7.0), [,1.8.0)'
    -> (['0.0.0.0', '1.3.11.0'], ['0.0.0.0', '1.7.0.0'], ['0.0.0.0', '1.8.0.0'])"""
    s = s.replace("[,", "[0,")
    s1 = [adj_lengths(b) for b in (t.strip("[()] ") for t in s.split(","))]
    yield from [s1[i:i + 2] for i in range(0, len(s1), 2)]

def make_version(l):
    """Transform a text pair into a Version tuple."""
    return (Version(l[0]), Version(l[1]))
Program:
vers = ["[,1.8.8.1)",
        "[,1.8.8.2)",
        "[,1.8.8.3)",
        "[1.0.0, 1.4.4)",
        "[,1.3.11),[,1.7.0), [,1.8.0)"]

# preprocessing to make them nicer
vers2 = []
for tmp in (split_mult(a) for a in vers):
    vers2.extend(make_version(k) for k in tmp)
print(vers2)

# bring into order
vers2.sort()

# use the lower bound as key - append all upper bounds to a defaultdict(list)
d = defaultdict(list)
for fr, to in vers2:
    d[fr].append(to)

# simplify lower bound: [list of upper bounds] to (lower bound, max(upper bound list values))
vers3 = [(k, max(v)) for k, v in d.items()]

# eliminate ranges that lie inside the 1st element's range
for item in vers3[1:][:]:
    if item[0] >= vers3[0][0] and item[1] <= vers3[0][1]:
        vers3.remove(item)

print(vers3)
Output:
[(<Version('0.0.0.0')>, <Version('1.8.8.3')>)]
If you have more than one resulting range, you would have to do the last step for every element, not just the first one, e.g. when your data in the last step looks like:
[(<Version('0.0.0.0')>, <Version('1.8.8.3')>),
(<Version('1.0.0.0')>, <Version('2.8.8.3')>),
(<Version('1.2.0.0')>, <Version('1.8.8.3')>), ] # last elem inside before-last elem
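For that general case, a hedged sketch of the last step as a pairwise check over the (lower, upper) Version tuples in vers3, dropping any range that is fully contained in another:
def drop_contained(ranges):
    """Keep only ranges that are not fully contained in another range.
    Note: assumes no two entries are identical (true after the defaultdict merge above)."""
    kept = []
    for i, (lo, hi) in enumerate(ranges):
        contained = any(j != i and other_lo <= lo and hi <= other_hi
                        for j, (other_lo, other_hi) in enumerate(ranges))
        if not contained:
            kept.append((lo, hi))
    return kept

print(drop_contained(vers3))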
I have a script to pull random numbers from a set of values. However, it broke today because min() and max() compare the values in lexicographic order (so 200 is considered greater than 10000). How can I avoid lexicographic order here? The len key is on the right track but not quite right. I couldn't find any other key(s) that would help.
data_set = 1600.csv, 2405.csv, 6800.csv, 10000.csv, 21005.csv
First try:
highest_value = os.path.splitext(max(data_set))[0]
lowest_value = os.path.splitext(min(data_set))[0]
returns: lowest_value = 10000 highest_value = 6800
Second try:
highest_value = os.path.splitext(max(data_set,key=len))[0]
lowest_value = os.path.splitext(min(data_set,key=len))[0]
returns: lowest_value = 1600 highest_value = 10000
Thanks.
You can use key to order by the numeric part of the file name:
data_set = ['1600.csv', '2405.csv', '6800.csv', '10000.csv', '21005.csv']
highest = max(data_set, key=lambda x: int(x.split('.')[0]))
lowest = min(data_set, key=lambda x: int(x.split('.')[0]))
print(highest) # >> 21005.csv
print(lowest) # >> 1600.csv
You were close. Rather than using the result of splitext with the len function, use the int function instead:
>>> from os.path import splitext
>>> data_set = ['1600.csv', '2405.csv', '6800.csv', '10000.csv', '21005.csv']
>>> def convert_to_int(file_name):
...     return int(splitext(file_name)[0])
...
>>> min(data_set, key=convert_to_int)
'1600.csv'
>>> max(data_set, key=convert_to_int)
'21005.csv'
Of course, this solution assumes that your file name will consist solely of numerical values.
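If the names might contain non-numeric parts as well, a hedged variant can pull out the first run of digits with a regular expression instead:
import re

def numeric_part(file_name):
    match = re.search(r'\d+', file_name)
    return int(match.group()) if match else 0

print(max(data_set, key=numeric_part))  # 21005.csv
print(min(data_set, key=numeric_part))  # 1600.csv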
I have two equal-length 1D numpy arrays, id and data, where id is a sequence of repeating, ordered integers that define sub-windows on data. For example:
id data
1 2
1 7
1 3
2 8
2 9
2 10
3 1
3 -10
I would like to aggregate data by grouping on id and taking either the max or the min.
In SQL, this would be a typical aggregation query like SELECT MAX(data) FROM tablename GROUP BY id ORDER BY id.
Is there a way I can avoid Python loops and do this in a vectorized manner?
I've been seeing some very similar questions on stack overflow the last few days. The following code is very similar to the implementation of numpy.unique and because it takes advantage of the underlying numpy machinery, it is most likely going to be faster than anything you can do in a python loop.
import numpy as np
def group_min(groups, data):
    # sort with major key groups, minor key data
    order = np.lexsort((data, groups))
    groups = groups[order]  # this is only needed if groups is unsorted
    data = data[order]
    # construct an index which marks borders between groups
    index = np.empty(len(groups), 'bool')
    index[0] = True
    index[1:] = groups[1:] != groups[:-1]
    return data[index]

# max is very similar
def group_max(groups, data):
    order = np.lexsort((data, groups))
    groups = groups[order]  # this is only needed if groups is unsorted
    data = data[order]
    index = np.empty(len(groups), 'bool')
    index[-1] = True
    index[:-1] = groups[1:] != groups[:-1]
    return data[index]
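A quick sanity check with the arrays from the question (a usage sketch; the expected values follow from the table above):
groups = np.array([1, 1, 1, 2, 2, 2, 3, 3])
data = np.array([2, 7, 3, 8, 9, 10, 1, -10])
print(group_max(groups, data))  # [ 7 10  1]
print(group_min(groups, data))  # [  2   8 -10]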
In pure Python:
from itertools import groupby
from operator import itemgetter as ig

print([max(map(ig(1), g)) for k, g in groupby(zip(id, data), key=ig(0))])
# -> [7, 10, 1]
A variation:
print([data[id == i].max() for i, _ in groupby(id)])
# -> [7, 10, 1]
Based on #Bago's answer:
import numpy as np
# sort by `id` then by `data`
ndx = np.lexsort(keys=(data, id))
id, data = id[ndx], data[ndx]
# get max()
print(data[np.r_[np.diff(id), True].astype(bool)])
# -> [ 7 10 1]
If pandas is installed:
from pandas import DataFrame
df = DataFrame(dict(id=id, data=data))
print(df.groupby('id')['data'].max())
# id
# 1 7
# 2 10
# 3 1
I'm fairly new to Python and NumPy, but it seems like you can use the .at method of ufuncs rather than reduceat:
import numpy as np
data_id = np.array([0,0,0,1,1,1,1,2,2,2,3,3,3,4,5,5,5])
data_val = np.random.rand(len(data_id))
ans = np.empty(data_id[-1]+1) # might want to use max(data_id) and zeros instead
np.maximum.at(ans,data_id,data_val)
For example:
data_val = array([ 0.65753453, 0.84279716, 0.88189818, 0.18987882, 0.49800668,
0.29656994, 0.39542769, 0.43155428, 0.77982853, 0.44955868,
0.22080219, 0.4807312 , 0.9288989 , 0.10956681, 0.73215416,
0.33184318, 0.10936647])
ans = array([ 0.98969952, 0.84044947, 0.63460516, 0.92042078, 0.75738113,
0.37976055])
Of course this only makes sense if your data_id values are suitable for use as indices (i.e. non-negative integers and not huge...presumably if they are large/sparse you could initialize ans using np.unique(data_id) or something).
I should point out that the data_id doesn't actually need to be sorted.
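For instance, one hedged way to handle the initialization caveat above is to fill the output with a value that can never win the comparison:
ans = np.full(data_id.max() + 1, -np.inf)  # -inf loses every maximum comparison
np.maximum.at(ans, data_id, data_val)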
With only numpy and without loops:
import numpy as np
import pandas as pd

id = np.asarray([1, 1, 1, 2, 2, 2, 3, 3])
data = np.asarray([2, 7, 3, 8, 9, 10, 1, -10])

# max
_ndx = np.argsort(id)
_id, _pos = np.unique(id[_ndx], return_index=True)
g_max = np.maximum.reduceat(data[_ndx], _pos)

# min
_ndx = np.argsort(id)
_id, _pos = np.unique(id[_ndx], return_index=True)
g_min = np.minimum.reduceat(data[_ndx], _pos)

# compare results with pandas groupby
np_group = pd.DataFrame({'min': g_min, 'max': g_max}, index=_id)
pd_group = pd.DataFrame({'id': id, 'data': data}).groupby('id').agg(['min', 'max'])
(pd_group.values == np_group.values).all()  # True
I've packaged a version of my previous answer in the numpy_indexed package; it's nice to have this all wrapped up and tested in a neat interface. Plus, it has a lot more functionality as well:
import numpy_indexed as npi
group_id, group_max_data = npi.group_by(id).max(data)
And so on
A slightly faster and more general answer than the already accepted one; like the answer by joeln, it avoids the more expensive lexsort, and it works for arbitrary ufuncs. Moreover, it only demands that the keys be sortable, rather than being ints in a specific range. The accepted answer may still be faster, though, considering the max/min isn't explicitly computed. The accepted solution's ability to ignore NaNs is neat, but one may also simply assign NaN values a dummy key.
import numpy as np
def group(key, value, operator=np.add):
    """
    Group the values by key.
    Any ufunc operator can be supplied to perform the reduction (np.maximum, np.minimum, np.subtract, and so on).
    Returns the unique keys, their corresponding per-key reduction over the operator, and the key counts.
    """
    # upcast to numpy arrays
    key = np.asarray(key)
    value = np.asarray(value)
    # first, sort by key
    I = np.argsort(key)
    key = key[I]
    value = value[I]
    # the slicing points of the bins to reduce over
    slices = np.concatenate(([0], np.where(key[:-1] != key[1:])[0] + 1))
    # first entry of each bin is a unique key
    unique_keys = key[slices]
    # reduce over the slices specified by index
    per_key_sum = operator.reduceat(value, slices)
    # number of counts per key is the difference of our slice points; cap off with the number of keys for the last bin
    key_count = np.diff(np.append(slices, len(key)))
    return unique_keys, per_key_sum, key_count
names = ["a", "b", "b", "c", "d", "e", "e"]
values = [1.2, 4.5, 4.3, 2.0, 5.67, 8.08, 9.01]
unique_keys, reduced_values, key_count = group(names, values)
print('per group mean')
print(reduced_values / key_count)

unique_keys, reduced_values, key_count = group(names, values, np.minimum)
print('per group min')
print(reduced_values)

unique_keys, reduced_values, key_count = group(names, values, np.maximum)
print('per group max')
print(reduced_values)
I think this accomplishes what you're looking for:
[max([val for idx,val in enumerate(data) if id[idx] == k]) for k in sorted(set(id))]
For the outer list comprehension, from right to left, set(id) groups the ids, sorted() sorts them, for k ... iterates over them, and max takes the max of, in this case, another list comprehension. So moving to that inner list comprehension: enumerate(data) returns both the index and value from data, and if id[idx] == k picks out the data members corresponding to id k.
This iterates over the full data list for each id. With some preprocessing into sublists, it might be possible to speed it up, but it won't be a one-liner then.
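For example, one pass with a defaultdict builds those sublists first, so each data value is visited only once; a sketch, assuming the same id and data lists as in the question:
from collections import defaultdict

sublists = defaultdict(list)
for group, value in zip(id, data):
    sublists[group].append(value)
print([max(values) for _, values in sorted(sublists.items())])  # [7, 10, 1]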
The following solution only requires a sort on the data (not a lexsort) and does not require finding boundaries between groups. It relies on the fact that if o is an array of indices into r, then r[o] = x fills r with the latest value of x for each value of o, so that r[[0, 0]] = [1, 2] leaves r[0] == 2. It requires that your groups be integers from 0 to number of groups - 1, as for numpy.bincount, and that there is a value for every group:
def group_min(groups, data):
    n_groups = np.max(groups) + 1
    result = np.empty(n_groups)
    order = np.argsort(data)[::-1]
    result[groups.take(order)] = data.take(order)
    return result

def group_max(groups, data):
    n_groups = np.max(groups) + 1
    result = np.empty(n_groups)
    order = np.argsort(data)
    result[groups.take(order)] = data.take(order)
    return result
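Since this approach needs group labels running from 0 to n_groups - 1, the ids from the question can first be remapped with np.unique; a small usage sketch:
id = np.array([1, 1, 1, 2, 2, 2, 3, 3])
data = np.array([2, 7, 3, 8, 9, 10, 1, -10])
_, groups = np.unique(id, return_inverse=True)  # remap ids to 0..n_groups-1
print(group_max(groups, data))  # [ 7. 10.  1.]
print(group_min(groups, data))  # [  2.   8. -10.]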
I have an ordered (i.e. sorted) list that contains dates (as datetime objects) in ascending order.
I want to write a function that iterates through this list and generates another list of the first available dates for each month.
For example, suppose my sorted list contains the following data:
A = [
'2001/01/01',
'2001/01/03',
'2001/01/05',
'2001/02/04',
'2001/02/05',
'2001/03/01',
'2001/03/02',
'2001/04/10',
'2001/04/11',
'2001/04/15',
'2001/05/07',
'2001/05/12',
'2001/07/01',
'2001/07/10',
'2002/03/01',
'2002/04/01',
]
The returned list would be
B = [
'2001/01/01',
'2001/02/04',
'2001/03/01',
'2001/04/10',
'2001/05/07',
'2001/07/01',
'2002/03/01',
'2002/04/01',
]
The logic I propose would be something like this:
def extract_month_first_dates(input_list, start_date, end_date):
    # note: start_date and end_date DEFINITELY exist in the passed-in list
    prev_dates, output = [], []  # <- is this even legal?
    for curr_date in input_list:
        if curr_date < start_date or curr_date > end_date:
            continue
        curr_month = curr_date.date.month
        curr_year = curr_date.date.year
        date_key = "{0}-{1}".format(curr_year, curr_month)
        if date_key in prev_dates:
            continue
        else:
            output.append(curr_date)
            prev_dates.append(date_key)
    return output
Any comments, suggestions? - can this be improved to be more 'Pythonic' ?
>>> import itertools
>>> [min(j) for i, j in itertools.groupby(A, key=lambda x: x[:7])]
['2001/01/01', '2001/02/04', '2001/03/01', '2001/04/10', '2001/05/07', '2001/07/01', '2002/03/01', '2002/04/01']
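Since the question says the list actually holds datetime objects, the same one-liner can key on year and month instead of a string slice; a hedged variant with a small illustrative list:
>>> from datetime import date
>>> dates = [date(2001, 1, 1), date(2001, 1, 3), date(2001, 2, 4), date(2001, 3, 1)]
>>> [min(g) for _, g in itertools.groupby(dates, key=lambda d: (d.year, d.month))]
[datetime.date(2001, 1, 1), datetime.date(2001, 2, 4), datetime.date(2001, 3, 1)]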
Searching lists is an O(n) operation. I think you can simply check whether the key is new:
def extract_month_first_dates(input_list):
    output = []
    last_key = None
    for curr_date in input_list:
        date_key = curr_date.date.month, curr_date.date.year  # no string key required
        if date_key != last_key:
            output.append(curr_date)
            last_key = date_key
    return output
Here is a simple, self-explanatory solution in classic Python, i.e. no itertools ;)
visited = {}
B = []
for a in A:
    month = a[:7]
    if month not in visited:
        B.append(a)
        visited[month] = 1
print(B)
Output:
['2001/01/01', '2001/02/04', '2001/03/01', '2001/04/10', '2001/05/07', '2001/07/01', '2002/03/01', '2002/04/01']