Avoid lexicographic ordering of numerical values with Python min() max() - python

I have a script to pull random numbers from a set of values. However, it broke today because min() and max() compare the file names as strings, i.e. lexicographically (so 200 is considered greater than 10000). How can I avoid lexicographic order here? The len key is on the right track but not quite right, and I couldn't find any other key(s) that would help.
data_set = 1600.csv, 2405.csv, 6800.csv, 10000.csv, 21005.csv
First try:
highest_value = os.path.splitext(max(data_set))[0]
lowest_value = os.path.splitext(min(data_set))[0]
returns: lowest_value = 10000 highest_value = 6800
Second try:
highest_value = os.path.splitext(max(data_set,key=len))[0]
lowest_value = os.path.splitext(min(data_set,key=len))[0]
returns: lowest_value = 1600 highest_value = 10000
Thanks.

You can use key to order by the numeric part of the file name:
data_set = ['1600.csv', '2405.csv', '6800.csv', '10000.csv', '21005.csv']
highest = max(data_set, key=lambda x: int(x.split('.')[0]))
lowest = min(data_set, key=lambda x: int(x.split('.')[0]))
print(highest) # >> 21005.csv
print(lowest) # >> 1600.csv

You were close. Rather than using the result of splitext with the len function, use the int function instead:
>>> from os.path import splitext
>>> data_set = ['1600.csv', '2405.csv', '6800.csv', '10000.csv', '21005.csv']
>>> def convert_to_int(file_name):
...     return int(splitext(file_name)[0])
>>> min(data_set, key=convert_to_int)
'1600.csv'
>>> max(data_set, key=convert_to_int)
'21005.csv'
Of course, this solution assumes that your file name will consist solely of numerical values.
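If the base names can also contain non-numeric characters, a hedged variation is to pull the digits out with a regular expression (the extract_number helper and the sample names below are made up for illustration, not from the question):
import re
from os.path import splitext
def extract_number(file_name):
    # grab the first run of digits in the base name; fall back to 0 if there is none
    match = re.search(r'\d+', splitext(file_name)[0])
    return int(match.group()) if match else 0
data_set = ['run_1600.csv', 'run_10000.csv']
print(max(data_set, key=extract_number))  # run_10000.csv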

How do I find the index of the greatest integer in a list that contains integers and strings in Python?

This is what the list looks like:
incomes = ['Kozlowski', 52000, 'Kasprzak', 51000, 'Kowalski', 26000]
I want to print the biggest income and the surname of the person with that income (index of the income - 1)
You can try this:
import numpy as np
index_of_highest_salary = incomes.index(np.max(incomes[1::2]))
You can create a dict from your data, then use max to find the key with the largest corresponding value.
>>> incomes = ['Kozlowski', 52000, 'Kasprzak', 51000, 'Kowalski', 26000]
>>> data = dict(zip(incomes[::2], incomes[1::2]))
>>> data
{'Kozlowski': 52000, 'Kasprzak': 51000, 'Kowalski': 26000}
>>> max(data.items(), key=lambda i: i[1])
('Kozlowski', 52000)
Then you don't need indexing and the data is more structured.
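If you only need the surname, a small extra step (my own addition, not from the answer above) is to pass the dict's own get method as the key:
>>> max(data, key=data.get)
'Kozlowski'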
If your pattern is ["SURNAME_1", INCOME_1, "SURNAME_2", INCOME_2, ... ], then you can do this:
prices = incomes[1::2] # This will return all integers
names = incomes[::2] # This will return all surnames
max_price = max(prices)
max_price_index = prices.index(max_price)
person = names[max_price_index]
but you should really change it to a dictionary, as it is easier to work with and more efficient
Your title question is different from what you're asking, so here's an answer for your title:
If you don't want to use other libraries like numpy, you can do this:
index = 0
max_n = None
for i in range(len(incomes)):
    element = incomes[i]
    if type(element) == int or type(element) == float:
        if max_n is None or element > max_n:
            max_n = element
            index = i
index will hold the index of the largest number of the list.
This answer assumes you want the index of the entry as stated in the title of your question.
Use enumerate to create data where index and value are combined.
incomes = ['Kozlowski', 52000, 'Kasprzak', 51000, 'Kowalski', 26000]
print(list(enumerate(incomes[1::2])))
This will give you [(0, 52000), (1, 51000), (2, 26000)].
We can now feed this data to max and use a key function that gives us the second entry of each tuple. When we get this tuple we can get the index from the first entry in the tuple. Since we left out every second element (the name) this index must be multiplied by 2.
incomes = ['Kozlowski', 52000, 'Kasprzak', 51000, 'Kowalski', 26000]
max_income = max(enumerate(incomes[1::2]), key=lambda x: x[1])
print(max_income[0] * 2)
If you want the index and the entries the code can be adjusted.
max_income = max(enumerate(zip(incomes[::2], incomes[1::2])), key=lambda x: x[1][1])
print(max_income)
This will give you a tuple with an index as the first entry and a tuple with name and income as the second entry. To map this index to your incomes list it will have to be multiplied by 2.
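To turn that result back into the name and income, a hedged follow-up (my own check, reusing the snippet above):
idx, (name, income) = max(enumerate(zip(incomes[::2], incomes[1::2])), key=lambda x: x[1][1])
print(name, income)      # Kozlowski 52000
print(incomes[idx * 2])  # Kozlowski, the same entry in the original flat list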

Find keys in the dictionary with the largest date by masks

How can I get, for each mask in the daily_updated list, the key from the ftp_json dictionary with the largest date?
daily_updated = ('kgrd', 'cvhd', 'metd')
ftp_json = {'kgrd0118.arj': 'Jan-18-2007',
'kgrd0623.arj': 'Jun-23-2005',
'kgrd0624.arj': 'Jun-24-2005',
'cvhd0629.ARJ': 'Jan-29-2021',
'cvhd1026.arj': 'Oct-26-2015',
'cvhd1125.ARJ': 'Nov-25-2019',
'cvhd0222.ARJ': 'Feb-22-2022',
'metd0228.ARJ': 'Feb-28-2022',
'metd0321.ARJ': 'Mar-26-2021',
}
result = ['kgrd0118.arj', 'cvhd0222.ARJ', 'metd0228.ARJ']
You can take advantage of the key parameter of the max (and min) built-in functions to impose an ordering criterion. Before that you need to turn the strings containing the dates into datetime objects, which come with their own ordering implementation (__lt__ etc.). See the documentation for the date format codes.
Notice that a minimum date object is needed: it is used as a "fake" value so that entries belonging to other masks do not interfere with the max search. I naturally fixed it as the minimum of all the dates.
import datetime
daily_updated = ('kgrd', 'cvhd', 'metd')
ftp_json = {'kgrd0118.arj': 'Jan-18-2007',
'kgrd0623.arj': 'Jun-23-2005',
'kgrd0624.arj': 'Jun-24-2005',
'cvhd0629.ARJ': 'Jan-29-2021',
'cvhd1026.arj': 'Oct-26-2015',
'cvhd1125.ARJ': 'Nov-25-2019',
'cvhd0222.ARJ': 'Feb-22-2022',
'metd0228.ARJ': 'Feb-28-2022',
'metd0321.ARJ': 'Mar-26-2021',
}
def date_formatter(mydate):
    return datetime.datetime.strptime(mydate, '%b-%d-%Y').date()
# smallest date
day_zero = datetime.datetime.strptime(min(ftp_json.values(), key=lambda d: date_formatter(d)), '%b-%d-%Y').date()
# get the maximum for each mask
m = [max(ftp_json.items(), key=lambda pair: date_formatter(pair[1]) if pair[0].startswith(pattern) else day_zero) for pattern in daily_updated]
print([i for i, _ in m])
Output
['kgrd0118.arj', 'cvhd0222.ARJ', 'metd0228.ARJ']
EDIT
To keep it more readable (and not crammed into a single line), a small key-function factory (a closure) can be introduced and passed to the key parameter of max (or min).
# ...
def date_formatter(mydate):
    return datetime.datetime.strptime(mydate, '%b-%d-%Y').date()
# smallest date
day_zero = datetime.datetime.strptime(min(ftp_json.values(), key=lambda d: date_formatter(d)), '%b-%d-%Y').date()
# key-function factory containing the logic of the comparison criteria
def ordering(pattern):
    def _wrapper(pair):
        if pair[0].startswith(pattern):
            # cast to a date object if the "mask"/pattern matches
            return date_formatter(pair[1])
        else:
            # return the default smallest date object -> will not influence the max function
            return day_zero
    return _wrapper
# get the maximum for each mask
m = [max(ftp_json.items(), key=ordering(pattern)) for pattern in daily_updated]
This can no doubt be done more simply, but I think this example is a descriptive way to do this with the standard library.
from datetime import datetime
ftp_json = {
"kgrd0118.arj": "Jan-18-2007",
"kgrd0623.arj": "Jun-23-2005",
"kgrd0624.arj": "Jun-24-2005",
"cvhd0629.ARJ": "Jan-29-2021",
"cvhd1026.arj": "Oct-26-2015",
"cvhd1125.ARJ": "Nov-25-2019",
"cvhd0222.ARJ": "Feb-22-2022",
"metd0228.ARJ": "Feb-28-2022",
"metd0321.ARJ": "Mar-26-2021",
}
max_dates = {} # New dict for storing running maximums.
for k, v in ftp_json.items():
    d = datetime.strptime(v, "%b-%d-%Y")  # Use datetime for comparison.
    # Get the previously stored tuple for this mask, or store the current one if none exists yet.
    maxk, maxv, maxd = max_dates.setdefault(k[:4], (k, v, d))
    if d > maxd:  # Update the values if the current date is more recent.
        max_dates[k[:4]] = (k, v, d)
# Validate we stored the correct values.
assert [v[0] for v in max_dates.values()] == [
"kgrd0118.arj",
"cvhd0222.ARJ",
"metd0228.ARJ",
]
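For reference, a more compact sketch of the same idea (my own variant, assuming the ftp_json and daily_updated values above; latest_for_mask is a made-up helper name): filter the keys by prefix, then let max compare the parsed dates directly.
from datetime import datetime
def latest_for_mask(mask):
    # keys matching the mask, compared by their parsed dates
    candidates = [k for k in ftp_json if k.lower().startswith(mask)]
    return max(candidates, key=lambda k: datetime.strptime(ftp_json[k], '%b-%d-%Y'))
print([latest_for_mask(m) for m in daily_updated])
# ['kgrd0118.arj', 'cvhd0222.ARJ', 'metd0228.ARJ']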

Prepare my bigdata with Spark via Python

My quantized data, 100m in size:
('1424411938', [3885, 7898])
('3333333333', [3885, 7898])
Desired result:
(3885, [3333333333, 1424411938])
(7898, [3333333333, 1424411938])
So what I want is to transform the data so that 3885 (for example) is grouped with all the data[0] values whose lists contain it. Here is what I did in Python:
def prepare(data):
    result = []
    for point_id, cluster in data:
        for index, c in enumerate(cluster):
            found = 0
            for res in result:
                if c == res[0]:
                    found = 1
            if(found == 0):
                result.append((c, []))
            for res in result:
                if c == res[0]:
                    res[1].append(point_id)
    return result
but when I mapPartitions()'ed the data RDD with prepare(), it seems to do what I want only within each partition, and thus returns a larger result than desired.
For example, if the 1st record was in the 1st partition and the 2nd record in the 2nd, then I would get this as a result:
(3885, [3333333333])
(7898, [3333333333])
(3885, [1424411938])
(7898, [1424411938])
How to modify my prepare() to get the desired effect? Alternatively, how to process the result that prepare() produces, so that I can get the desired result?
As you may already have noticed from the code, I do not care about speed at all.
Here is a way to create the data:
data = []
from random import randint
for i in xrange(0, 10):
    data.append((randint(0, 100000000), (randint(0, 16000), randint(0, 16000))))
data = sc.parallelize(data)
You can use a bunch of basic pyspark transformations to achieve this.
>>> rdd = sc.parallelize([(1424411938, [3885, 7898]),(3333333333, [3885, 7898])])
>>> r = rdd.flatMap(lambda x: ((a,x[0]) for a in x[1]))
We used flatMap to emit a key, value pair for every item in x[1], changing each record to the format (a, x[0]), where a is each item in x[1]. To understand flatMap better you can refer to the documentation.
>>> r2 = r.groupByKey().map(lambda x: (x[0],tuple(x[1])))
We just grouped all key, value pairs by their keys and used the tuple function to convert the resulting iterable to a tuple.
>>> r2.collect()
[(3885, (1424411938, 3333333333)), (7898, (1424411938, 3333333333))]
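If you want plain lists instead of tuples, as in your desired result, a small variation (mine, not from the steps above) is mapValues:
>>> r.groupByKey().mapValues(list).collect()
[(3885, [1424411938, 3333333333]), (7898, [1424411938, 3333333333])]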
As you said, you can use [:150] to keep the first 150 elements; I guess this would be the proper usage:
r2 = r.groupByKey().map(lambda x: (x[0],tuple(x[1])[:150]))
I tried to be as explanatory as possible. I hope this helps.

get minimum and maximum values from a 'for in' loop

First post & I've probably got no business being here, but here goes...
How do I find the maximum and minimum values from the output of a 'for in' loop?
I've tried the min() and max() and get the following error...
TypeError: 'int' object is not iterable
here's my code...
import urllib2
import json
def printResults(data):
    # Use the json module to load the string data into a dictionary
    theJSON = json.loads(data)
    # test bed for accessing the data
    for i in theJSON["features"]:
        t = i["properties"]["time"]
        print t

def main():
    # define a variable to hold the source URL
    urlData = "http://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/2.5_day.geojson"
    # Open the URL and read the data
    webUrl = urllib2.urlopen(urlData)
    #print webUrl.getcode()
    if (webUrl.getcode() == 200):
        data = webUrl.read()
        # print out our customized results
        printResults(data)
    else:
        print "Received an error from server, cannot retrieve results " + str(webUrl.getcode())

if __name__ == "__main__":
    main()
Any pointers will be greatly appreciated!
You can use min and max on iterables. Since you are looping through theJSON["features"], you can use:
print min(e["properties"]["time"] for e in theJSON["features"])
print max(e["properties"]["time"] for e in theJSON["features"])
You can also store the result in a variable, so you can use it later:
my_min = min(...)
my_max = max(...)
As @Sabyasachi commented, you can also use:
print min(theJSON["features"], key = lambda x:x["properties"]["time"])
Here is an example of how you can manually keep track of a min and max.
minVal = None
maxVal = None
for i in yourJsonThingy:
    if minVal is None or i < minVal:
        minVal = i
    if maxVal is None or i > maxVal:
        maxVal = i
You can't do this:
for i in yourJsonThingy:
    maxVal = max(i)
Because i is just an integer and doesn't have a max
But you can perform those operations on a list of ints
maxVal = max(yourJsonThingy)
minVal = min(yourJsonThingy)
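Applied to the question's code, a minimal sketch (assuming the same theJSON structure as above) collects the times into a plain list first and then calls min and max on it:
import json
def printResults(data):
    theJSON = json.loads(data)
    # build a list of all times, then take min/max of the whole list
    times = [f["properties"]["time"] for f in theJSON["features"]]
    print(min(times))
    print(max(times))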
In the case that you only want to go through your iterable once (say it's expensive to iterate, which is really the only reason to do this instead of calling max and min separately), the function below is still a performance improvement over calling both separately; see the numbers below:
def max_min(iterable, key=None):
    '''
    returns a tuple of the max, min of iterable, optional function key
    tuple items are None if iterable is of length 0
    '''
    it = iter(iterable)
    _max = _min = next(it, None)
    if key is None:
        for i in it:
            if i > _max:
                _max = i
            elif i < _min:
                _min = i
    else:
        _max_key = _min_key = key(_max)
        for i in it:
            key_i = key(i)
            if key_i > _max_key:
                _max, _max_key = i, key_i
            elif key_i < _min_key:
                _min, _min_key = i, key_i
    return _max, _min
usage:
>>> max_min(range(100))
(99, 0)
>>> max_min(range(100), key=lambda x: -x)
(0, 99)
To check performance:
>>> timeit.timeit('max(range(1000)), min(range(1000))', setup=setup)
70.95577674100059
>>> timeit.timeit('max_min(range(1000))', setup=setup)
65.00369232000958
Which is about a 9% improvement on calling both builtins, max and min, without a lambda, separately. With a lambda:
>>> timeit.timeit('max(range(1000), key=lambda x: -x),min(range(1000), key=lambda x: -x)', setup=setup)
294.17539755300095
>>> timeit.timeit('max_min(range(1000), key=lambda x: -x)', setup=setup)
208.95339999899443
Which is a more than 40% improvement on calling each separately with lambdas.

Group by max or min in a numpy array

I have two equal-length 1D numpy arrays, id and data, where id is a sequence of repeating, ordered integers that define sub-windows on data. For example:
id data
1 2
1 7
1 3
2 8
2 9
2 10
3 1
3 -10
I would like to aggregate data by grouping on id and taking either the max or the min.
In SQL, this would be a typical aggregation query like SELECT MAX(data) FROM tablename GROUP BY id ORDER BY id.
Is there a way I can avoid Python loops and do this in a vectorized manner?
I've been seeing some very similar questions on stack overflow the last few days. The following code is very similar to the implementation of numpy.unique and because it takes advantage of the underlying numpy machinery, it is most likely going to be faster than anything you can do in a python loop.
import numpy as np
def group_min(groups, data):
    # sort with major key groups, minor key data
    order = np.lexsort((data, groups))
    groups = groups[order]  # this is only needed if groups is unsorted
    data = data[order]
    # construct an index which marks borders between groups
    index = np.empty(len(groups), 'bool')
    index[0] = True
    index[1:] = groups[1:] != groups[:-1]
    return data[index]

# max is very similar
def group_max(groups, data):
    order = np.lexsort((data, groups))
    groups = groups[order]  # this is only needed if groups is unsorted
    data = data[order]
    index = np.empty(len(groups), 'bool')
    index[-1] = True
    index[:-1] = groups[1:] != groups[:-1]
    return data[index]
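A quick usage sketch on the question's sample data (my own check, not part of the original answer):
groups = np.array([1, 1, 1, 2, 2, 2, 3, 3])
data = np.array([2, 7, 3, 8, 9, 10, 1, -10])
print(group_min(groups, data))  # [  2   8 -10]
print(group_max(groups, data))  # [ 7 10  1]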
In pure Python:
from itertools import groupby, imap, izip
from operator import itemgetter as ig
print [max(imap(ig(1), g)) for k, g in groupby(izip(id, data), key=ig(0))]
# -> [7, 10, 1]
A variation:
print [data[id==i].max() for i, _ in groupby(id)]
# -> [7, 10, 1]
Based on @Bago's answer:
import numpy as np
# sort by `id` then by `data`
ndx = np.lexsort(keys=(data, id))
id, data = id[ndx], data[ndx]
# get max()
print data[np.r_[np.diff(id), True].astype(np.bool)]
# -> [ 7 10 1]
If pandas is installed:
from pandas import DataFrame
df = DataFrame(dict(id=id, data=data))
print df.groupby('id')['data'].max()
# id
# 1 7
# 2 10
# 3 1
I'm fairly new to Python and NumPy, but it seems like you can use the .at method of ufuncs rather than reduceat:
import numpy as np
data_id = np.array([0,0,0,1,1,1,1,2,2,2,3,3,3,4,5,5,5])
data_val = np.random.rand(len(data_id))
ans = np.full(data_id[-1] + 1, -np.inf)  # size from the last id (or use max(data_id)); start at -inf so np.maximum.at produces correct maxima
np.maximum.at(ans, data_id, data_val)
For example:
data_val = array([ 0.65753453, 0.84279716, 0.88189818, 0.18987882, 0.49800668,
0.29656994, 0.39542769, 0.43155428, 0.77982853, 0.44955868,
0.22080219, 0.4807312 , 0.9288989 , 0.10956681, 0.73215416,
0.33184318, 0.10936647])
ans = array([ 0.88189818, 0.49800668, 0.77982853, 0.9288989 , 0.10956681, 0.73215416])
Of course this only makes sense if your data_id values are suitable for use as indices (i.e. non-negative integers and not huge...presumably if they are large/sparse you could initialize ans using np.unique(data_id) or something).
I should point out that the data_id doesn't actually need to be sorted.
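For the large/sparse case mentioned above, a hedged sketch of the np.unique route (return_inverse remaps the ids onto 0..n-1):
uniq_ids, inverse = np.unique(data_id, return_inverse=True)
ans = np.full(len(uniq_ids), -np.inf)  # start at -inf so every real value wins the maximum
np.maximum.at(ans, inverse, data_val)
# ans[i] now holds the max of data_val where data_id == uniq_ids[i]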
With only numpy and without loops:
import numpy as np
import pandas as pd  # pandas is only needed for the comparison at the end
id = np.asarray([1,1,1,2,2,2,3,3])
data = np.asarray([2,7,3,8,9,10,1,-10])
# max
_ndx = np.argsort(id)
_id, _pos = np.unique(id[_ndx], return_index=True)
g_max = np.maximum.reduceat(data[_ndx], _pos)
# min
_ndx = np.argsort(id)
_id, _pos = np.unique(id[_ndx], return_index=True)
g_min = np.minimum.reduceat(data[_ndx], _pos)
# compare results with pandas groupby
np_group = pd.DataFrame({'min':g_min, 'max':g_max}, index=_id)
pd_group = pd.DataFrame({'id':id, 'data':data}).groupby('id').agg(['min','max'])
(pd_group.values == np_group.values).all() # TRUE
I've packaged a version of my previous answer in the numpy_indexed package; it's nice to have this all wrapped up and tested in a neat interface; plus it has a lot more functionality as well:
import numpy_indexed as npi
group_id, group_max_data = npi.group_by(id).max(data)
And so on
A slightly faster and more general answer than the already accepted one; like the answer by joeln it avoids the more expensive lexsort, and it works for arbitrary ufuncs. Moreover, it only demands that the keys are sortable, rather than being ints in a specific range. The accepted answer may still be faster though, considering the max/min isn't explicitly computed. The ability to ignore nans of the accepted solution is neat; but one may also simply assign nan values a dummy key.
import numpy as np
def group(key, value, operator=np.add):
"""
group the values by key
any ufunc operator can be supplied to perform the reduction (np.maximum, np.minimum, np.substract, and so on)
returns the unique keys, their corresponding per-key reduction over the operator, and the keycounts
"""
#upcast to numpy arrays
key = np.asarray(key)
value = np.asarray(value)
#first, sort by key
I = np.argsort(key)
key = key[I]
value = value[I]
#the slicing points of the bins to sum over
slices = np.concatenate(([0], np.where(key[:-1]!=key[1:])[0]+1))
#first entry of each bin is a unique key
unique_keys = key[slices]
#reduce over the slices specified by index
per_key_sum = operator.reduceat(value, slices)
#number of counts per key is the difference of our slice points. cap off with number of keys for last bin
key_count = np.diff(np.append(slices, len(key)))
return unique_keys, per_key_sum, key_count
names = ["a", "b", "b", "c", "d", "e", "e"]
values = [1.2, 4.5, 4.3, 2.0, 5.67, 8.08, 9.01]
unique_keys, reduced_values, key_count = group(names, values)
print 'per group mean'
print reduced_values / key_count
unique_keys, reduced_values, key_count = group(names, values, np.minimum)
print 'per group min'
print reduced_values
unique_keys, reduced_values, key_count = group(names, values, np.maximum)
print 'per group max'
print reduced_values
I think this accomplishes what you're looking for:
[max([val for idx,val in enumerate(data) if id[idx] == k]) for k in sorted(set(id))]
For the outer list comprehension, from right to left, set(id) groups the ids, sorted() sorts them, for k ... iterates over them, and max takes the max of, in this case, another list comprehension. So moving to that inner list comprehension: enumerate(data) returns both the index and value from data, and if id[idx] == k picks out the data members corresponding to id k.
This iterates over the full data list for each id. With some preprocessing into sublists, it might be possible to speed it up, but it won't be a one-liner then.
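For the id/data pair from the question, this gives (my own check, not part of the original answer):
id = [1, 1, 1, 2, 2, 2, 3, 3]
data = [2, 7, 3, 8, 9, 10, 1, -10]
print([max([val for idx, val in enumerate(data) if id[idx] == k]) for k in sorted(set(id))])
# [7, 10, 1]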
The following solution only requires a sort on the data (not a lexsort) and does not require finding boundaries between groups. It relies on the fact that if o is an array of indices into r then r[o] = x will fill r with the latest value x for each value of o, such that r[[0, 0]] = [1, 2] will return r[0] = 2. It requires that your groups are integers from 0 to number of groups - 1, as for numpy.bincount, and that there is a value for every group:
import numpy as np

def group_min(groups, data):
    n_groups = np.max(groups) + 1
    result = np.empty(n_groups)
    order = np.argsort(data)[::-1]
    result[groups.take(order)] = data.take(order)
    return result

def group_max(groups, data):
    n_groups = np.max(groups) + 1
    result = np.empty(n_groups)
    order = np.argsort(data)
    result[groups.take(order)] = data.take(order)
    return result
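A short usage sketch (my own example; note that the question's ids are shifted down to start at 0, as this approach requires):
groups = np.array([0, 0, 0, 1, 1, 1, 2, 2])
data = np.array([2.0, 7.0, 3.0, 8.0, 9.0, 10.0, 1.0, -10.0])
print(group_max(groups, data))  # [ 7. 10.  1.]
print(group_min(groups, data))  # [  2.   8. -10.]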
