How to find Median [duplicate] - python

I have data like this.
Ram,500
Sam,400
Test,100
Ram,800
Sam,700
Test,300
Ram,900
Sam,800
Test,400
What is the shortest way to find the median from the above data?
My result should be something like...
The median is the value at position (n+1)/2 in the sorted sample, where n is the number of data values in the sample.
Test 500
Sam 700
Ram 800

Python 3.4 includes the statistics module in the standard library, so you can use statistics.median:
>>> from statistics import median
>>> median([1, 3, 5])
3
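For the per-name medians the question asks for, a minimal sketch (assuming each record is a "name,value" string, as in the data above):

from collections import defaultdict
from statistics import median

rows = ["Ram,500", "Sam,400", "Test,100", "Ram,800", "Sam,700",
        "Test,300", "Ram,900", "Sam,800", "Test,400"]

groups = defaultdict(list)
for row in rows:
    name, value = row.split(",")
    groups[name].append(int(value))

for name, values in sorted(groups.items(), key=lambda kv: median(kv[1])):
    print(name, median(values))
# Test 300
# Sam 700
# Ram 800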

Use numpy's median function.
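For example (numpy.median returns a float and averages the middle two values for even-length input):

import numpy as np

print(np.median([1, 3, 5]))           # 3.0
print(np.median([5, 2, 4, 3, 1, 6]))  # 3.5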

It's a little unclear how your data is actually represented, so I've assumed it is a list of tuples:
data = [('Ram',500), ('Sam',400), ('Test',100), ('Ram',800), ('Sam',700),
('Test',300), ('Ram',900), ('Sam',800), ('Test',400)]
from collections import defaultdict
def median(mylist):
    sorts = sorted(mylist)
    length = len(sorts)
    if not length % 2:
        return (sorts[length // 2] + sorts[length // 2 - 1]) / 2.0
    return sorts[length // 2]

data_dict = defaultdict(list)
for el in data:
    data_dict[el[0]].append(el[1])

print [(key, median(val)) for key, val in data_dict.items()]
print median([5, 2, 4, 3, 1])
print median([5, 2, 4, 3, 1, 6])
#output:
[('Test', 300), ('Ram', 800), ('Sam', 700)]
3
3.5
The function median returns the median of a list. If there is an even number of entries, it takes the mean of the middle two entries (this is standard).
I've used defaultdict to build a dict mapping each name to its list of values, which is a more useful representation of your data.

Check this out:
def median(lst):
    even = (0 if len(lst) % 2 else 1) + 1
    half = (len(lst) - 1) // 2  # integer division, so the slice indices are ints on Python 2 and 3
    return sum(sorted(lst)[half:half + even]) / float(even)
Note:
sorted(lst) produces a sorted copy of lst;
sum([1]) == 1;
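A quick check of the sketch above on odd- and even-length lists:
>>> median([1, 3, 5])
3.0
>>> median([1, 2, 3, 4])
2.5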

Easiest way to get the median of a list with integer data:
x = [1,3,2]
print "The median of x is:",sorted(x)[len(x)//2]

I started with user3100512's answer and quickly realized it doesn't work for an even number of items. I added some conditionals to it to compute the median.
def median(x):
    if len(x) % 2 != 0:
        return sorted(x)[len(x) // 2]
    else:
        midavg = (sorted(x)[len(x) // 2] + sorted(x)[len(x) // 2 - 1]) / 2.0
        return midavg
median([4,5,6,7])
should return 5.5

Related

Finding similar numbers in a list and getting the average

I currently have the numbers above in a list. How would you go about adding up similar numbers (within 850 of each other) and taking their average, to make the list smaller?
For example, I have the list
l = [2000,2200,5000,2350]
In this list, I want to find numbers that are similar, meaning within 500 of each other.
So I want all the numbers within 500 of one another, which are 2000, 2200 and 2350, to be added together and divided by their count (3) to find the mean. The mean then replaces those three numbers, so the list becomes l = [2183,5000].
For the longer list (posted as an image in the original question; its values appear in Step 1 of the answer below), I would likewise want the numbers within 850 of one another to be selected and their mean found.
It seems that you are looking for a clustering algorithm, something like K-means.
This algorithm is implemented in the scikit-learn package.
After you find your K means, you can count how many of your data points were clustered with each mean, and make your computations.
However, it's not clear in your case what K is. You can try running the algorithm for several K values until your constraint (the n+500 distance between the means) is satisfied; see the sketch below.
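For illustration, a minimal sketch using scikit-learn's KMeans, assuming K=2 for the example list (picking K is the open question above):

import numpy as np
from sklearn.cluster import KMeans

l = np.array([2000, 2200, 5000, 2350]).reshape(-1, 1)  # KMeans expects a 2-D array
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(l)
# replace each cluster with the integer mean of its members
new_list = [int(l[km.labels_ == c].mean()) for c in np.unique(km.labels_)]
print(new_list)  # e.g. [2183, 5000] (cluster order may vary)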
You can use:
import numpy as np
l = np.array([2000,2200,5000,2350])
# group numbers that fall in the same 500-wide bin
similar = l // 500
# for each similar group get the average and convert it to integer (as in the desired output)
new_list = [np.average(l[similar == num]).astype(int) for num in np.unique(similar)]
print(new_list)
Output:
[2183, 5000]
Step 1:
values = [5620.77978515625,
7388.43017578125,
7683.580078125,
8296.6513671875,
8320.82421875,
8557.51953125,
8743.5,
9163.220703125,
9804.7939453125,
9913.86328125,
9940.1396484375,
9951.74609375,
10074.23828125,
10947.0419921875,
11048.662109375,
11704.099609375,
11958.5,
11964.8232421875,
12335.70703125,
13103.0,
13129.529296875,
16463.177734375,
16930.900390625,
17712.400390625,
18353.400390625,
19390.96484375,
20089.0,
34592.15625,
36542.109375,
39478.953125,
40782.078125,
41295.26953125,
42541.6796875,
42893.58203125,
44578.27734375,
45077.578125,
48022.2890625,
52535.13671875,
58330.5703125,
61597.91796875,
62757.12890625,
64242.79296875,
64863.09765625,
66930.390625]
Step 2:
import numpy as np

seen = []      # to log used index pairs
diff_dic = {}  # to record index pairs and their diffs
# `values` is the list from Step 1 (renamed from `list` to avoid shadowing the builtin)
for i, a in enumerate(values):
    for j, b in enumerate(values):
        if i != j and (i, j)[::-1] not in seen:
            seen.append((i, j))
            diff_dic[(i, j)] = abs(a - b)

# keep the index pairs whose difference is within 850
keys = []
for ind, diff in diff_dic.items():
    if diff <= 850:
        keys.append(ind)

# flatten to unique indices
uniques_k = []
for pair in keys:
    for key in pair:
        if key not in uniques_k:
            uniques_k.append(key)

list_arr = np.array(values)
nearest_avg = np.mean(list_arr[uniques_k])
list_arr = np.delete(list_arr, uniques_k)
list_arr = np.append(list_arr, nearest_avg)
list_arr
output:
array([ 5620.77978516, 34592.15625, 36542.109375, 39478.953125, 48022.2890625, 52535.13671875, 58330.5703125 , 61597.91796875, 62757.12890625, 66930.390625 , 20566.00205365])
You just need a conditional list comprehension like this:
l = [2000,2200,5000,2350]
n = 2000
a = [x for x in l if (n - 250) < x < (n + 250)]
Then you can average with
np.mean(a)
(with numpy imported as np) or whatever method you prefer.

Python: How do you apply a math formula to all elements in two lists?

I have converted three columns from an Excel document to three lists in Python.
I now wish to make a function, where I loop through all three lists and insert items from each list into a formula.
Example:
list1[1] + list2[1] / list3[1]
There are over 3000 items in all 3 lists, so writing out a formula for every single item would take forever. In the function, I want the program to automatically go from
list1[1] + list2[1] / list3[1]
to
list1[2] + list2[2] / list3[2],
then to
list1[3] + list2[3] / list3[3]
and so on.
How can I accomplish this?
Here is the (unfinished) code that I wrote so far.
df = pd.read_excel(r'C:\Users\KOM\Downloads\PO case study 1 - volume factor check NEW.xlsx')
wb = load_workbook(r'C:\Users\KOM\Downloads\PO case study 1 - volume factor check NEW.xlsx') # Work Book
ws1 = wb.get_sheet_by_name("DPPIV & SGLT2") # Work Sheet
pack_size = ws1['F'] # Column F
quantity = ws1['H'] # Column H
conversion = ws1['K'] # Column K
column_list_1 = [pack_size[x].value for x in range(len(pack_size))]
column_list_2 = [quantity[x].value for x in range(len(quantity))]
column_list_3 = [conversion[x].value for x in range(len(conversion))]
for (x, y, z) in zip(column_list_1[7:3030], column_list_2[7:3030], column_list_3[7:3030]):
NumPy implements well optimized broadcasting operations, so that's what I would use.
import numpy as np
...
column_list_1 = np.array([x.value for x in pack_size])
column_list_2 = np.array([x.value for x in quantity])
column_list_3 = np.array([x.value for x in conversion])
# note the list comprehensions: np.array on a bare generator produces a useless 0-d object array
result = column_list_1[7:3030] + column_list_2[7:3030] / column_list_3[7:3030]
I also took the liberty to make your comprehensions more Pythonic by iterating directly over the elements. You rarely actually need to use list indices in Python.
You can use the x,y,z values you loop through and just append the answer to a new list:
answer = []
for (x, y, z) in zip(column_list_1[7:3030], column_list_2[7:3030], column_list_3[7:3030]):
    answer.append(x + y / z)
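Equivalently, the same loop collapses to a single list comprehension (a sketch under the same assumptions about the column slice):

answer = [x + y / z for x, y, z in
          zip(column_list_1[7:3030], column_list_2[7:3030], column_list_3[7:3030])]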

Fastest way to extract and increase latest number from end of string

I have a list of strings that have numbers as suffixes. I'm trying to extract the highest number so I can increase it by 1. Here's what I came up with but I'm wondering if there's a faster way to do this:
data = ["object_1", "object_2", "object_3", "object_blah", "object_123asdfd"]
numbers = [int(obj.split("_")[-1]) for obj in data if obj.split("_")[-1].isdigit()] or [0]
print sorted(numbers)[-1] + 1 # Output is 4
A few conditions:
It's very possible that the suffix is not a number at all, and should be skipped.
If no input is valid, then the output should be 1 (this is why I have or [0])
No Python 3 solutions, only 2.7.
Maybe some regex magic would be faster to find the highest number to increment on? I don't like the fact that I have to split twice.
Edit
I did some benchmarks on the current answers using 100 iterations on data that has 10000 items:
Alex Noname's method: 1.65s
Sushanth's method: 1.95s
Balaji Ambresh's method: 2.12s
My original method: 2.16s
I've accepted an answer for now, but feel free to contribute.
Using heapq.nlargest is a pretty efficient way. Maybe someone will compare it with other methods.
import heapq
a = heapq.nlargest(1, map(int, filter(lambda b: b.isdigit(), (c.split('_')[-1] for c in data))))[0]
Comparing with the original method (Python 3.8)
import heapq
import random
from time import time
data = []
for i in range(0, 1000000):
    data.append(f'object_{random.randrange(10000000)}')
begin = time()
a = heapq.nlargest(1, map(int, filter(lambda b: b.isdigit(), (c.split('_')[-1] for c in data))))[0]
print('nlargest method: ', time() - begin)
print(a)
begin = time()
numbers = [int(obj.split("_")[-1]) for obj in data if obj.split("_")[-1].isdigit()] or [0]
a = sorted(numbers)[-1]
print('original method: ', time() - begin)
print(a)
nlargest method: 0.4306185245513916
9999995
original method: 0.8409149646759033
9999995
Try this: a list comprehension gets all the suffix numbers, and max returns the highest value.
max([
    int(x.split("_")[-1]) if x.split("_")[-1].isdigit() else 0 for x in data
]) + 1
Try:
import re
res = max([int((re.findall(r'_(\d+)$', item) or [0])[0]) for item in data]) + 1
Value:
4

Change the string output to int to obtain the maximum number?

I am quite new to python so still getting to grips with the language.
I have the following function, which takes a string and applies it to an algorithm that tells us whether it aligns with models 1, 2, 3, 4, or 5.
Currently this piece of code:
def apply_text(text):
    test_str = [text]
    test_new = tfidf_m.transform(test_str)
    prediction = 0
    for m in range(0, 5):
        percentage = '{P:.1%}'.format(M=cat[m], P=lr_m[m].predict_proba(test_new)[0][1])
        print(percentage)
And running the following function: apply_text('Terrible idea.')
Gives the following output:
71.4%
33.1%
2.9%
1.6%
4.9%
With Model 1 = 71.4%, Model 2 = 33.1%, ... Model 5 = 4.9%.
I want to only output the Model number where there is the highest percentage. So in the above example, the answer would be 1 as this has 71.4%.
As the output is a string type, I am finding it difficult to convert it to an int and then compare each value (probably in a loop of some sort) to obtain the maximum value.
I think you want to save the percentages along with the model number, sort them, and then return the highest.
This can be done by something like this:
def apply_text(text):
    test_str = [text]
    test_new = tfidf_m.transform(test_str)
    prediction = 0
    percentage_list = []
    for m in range(0, 5):
        # the '%' is dropped from the format spec so the string parses as a float
        percentage = '{P:.1}'.format(M=cat[m], P=lr_m[m].predict_proba(test_new)[0][1])
        percentage_list.append([m + 1, float(percentage)])
    percentage_list.sort(reverse=True, key=lambda a: a[1])
    return percentage_list[0][0]
Things to note:
Sorting in reverse order, since the default is ascending. You could skip reversing and access the last element of percentage_list with index -1.
The key function is used because we need to sort by the percentage.
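As a side note, the sort can be avoided entirely: a small sketch, equivalent to the last two lines of the function above, using max with a key function.

# take the [model, percentage] pair with the largest percentage
best_model = max(percentage_list, key=lambda a: a[1])[0]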
Try putting the values in a list; then you can utilize list methods:
percentage = []
for m in range(0, 5):
    percentage.append('{P:.1%}'.format(M=cat[m], P=lr_m[m].predict_proba(test_new)[0][1]))
print(*percentage, sep='\n')
# compare numerically (strip the '%'); max() on the raw strings would compare lexicographically
print('Max on model', max(range(len(percentage)), key=lambda i: float(percentage[i].rstrip('%'))) + 1)
Or using a dictionary:
percentage = {}
for m in range(0, 5):
    percentage['Model ' + str(m + 1)] = '{P:.1%}'.format(M=cat[m], P=lr_m[m].predict_proba(test_new)[0][1])
print(*percentage.items(), sep='\n')  # print key/value pairs
# again, compare numerically by stripping the '%' sign
print('Max on', max(percentage.keys(), key=lambda key: float(percentage[key].rstrip('%'))))

Group by max or min in a numpy array

I have two equal-length 1D numpy arrays, id and data, where id is a sequence of repeating, ordered integers that define sub-windows on data. For example:
id  data
1      2
1      7
1      3
2      8
2      9
2     10
3      1
3    -10
I would like to aggregate data by grouping on id and taking either the max or the min.
In SQL, this would be a typical aggregation query like SELECT MAX(data) FROM tablename GROUP BY id ORDER BY id.
Is there a way I can avoid Python loops and do this in a vectorized manner?
I've been seeing some very similar questions on Stack Overflow the last few days. The following code is very similar to the implementation of numpy.unique, and because it takes advantage of the underlying numpy machinery, it is most likely going to be faster than anything you can do in a Python loop.
import numpy as np
def group_min(groups, data):
    # sort with major key groups, minor key data
    order = np.lexsort((data, groups))
    groups = groups[order]  # this is only needed if groups is unsorted
    data = data[order]
    # construct an index which marks borders between groups
    index = np.empty(len(groups), 'bool')
    index[0] = True
    index[1:] = groups[1:] != groups[:-1]
    return data[index]

# max is very similar
def group_max(groups, data):
    order = np.lexsort((data, groups))
    groups = groups[order]  # this is only needed if groups is unsorted
    data = data[order]
    index = np.empty(len(groups), 'bool')
    index[-1] = True
    index[:-1] = groups[1:] != groups[:-1]
    return data[index]
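A quick check on the question's example data (a sketch):

groups = np.array([1, 1, 1, 2, 2, 2, 3, 3])
data = np.array([2, 7, 3, 8, 9, 10, 1, -10])
print(group_max(groups, data))  # [ 7 10  1]
print(group_min(groups, data))  # [  2   8 -10]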
In pure Python:
from itertools import groupby, imap, izip
from operator import itemgetter as ig
print [max(imap(ig(1), g)) for k, g in groupby(izip(id, data), key=ig(0))]
# -> [7, 10, 1]
A variation:
print [data[id==i].max() for i, _ in groupby(id)]
# -> [7, 10, 1]
Based on #Bago's answer:
import numpy as np
# sort by `id` then by `data`
ndx = np.lexsort(keys=(data, id))
id, data = id[ndx], data[ndx]
# get max()
print data[np.r_[np.diff(id), True].astype(np.bool)]
# -> [ 7 10 1]
If pandas is installed:
from pandas import DataFrame
df = DataFrame(dict(id=id, data=data))
print df.groupby('id')['data'].max()
# id
# 1 7
# 2 10
# 3 1
I'm fairly new to Python and NumPy, but it seems like you can use the .at method of ufuncs rather than reduceat:
import numpy as np
data_id = np.array([0,0,0,1,1,1,1,2,2,2,3,3,3,4,5,5,5])
data_val = np.random.rand(len(data_id))
# initialize with -inf so stale memory can't leak into the maxima
# (or use np.zeros if all values are known to be non-negative)
ans = np.full(data_id[-1] + 1, -np.inf)  # might want to use max(data_id) + 1 for the size
np.maximum.at(ans, data_id, data_val)
For example:
data_val = array([ 0.65753453, 0.84279716, 0.88189818, 0.18987882, 0.49800668,
0.29656994, 0.39542769, 0.43155428, 0.77982853, 0.44955868,
0.22080219, 0.4807312 , 0.9288989 , 0.10956681, 0.73215416,
0.33184318, 0.10936647])
ans = array([ 0.98969952, 0.84044947, 0.63460516, 0.92042078, 0.75738113,
0.37976055])
Of course this only makes sense if your data_id values are suitable for use as indices (i.e. non-negative integers and not huge; presumably, if they are large or sparse, you could initialize ans using np.unique(data_id) or something).
I should point out that data_id doesn't actually need to be sorted.
With only numpy and without explicit loops:
id = np.asarray([1,1,1,2,2,2,3,3])
data = np.asarray([2,7,3,8,9,10,1,-10])
# max
_ndx = np.argsort(id)
_id, _pos = np.unique(id[_ndx], return_index=True)
g_max = np.maximum.reduceat(data[_ndx], _pos)
# min
_ndx = np.argsort(id)
_id, _pos = np.unique(id[_ndx], return_index=True)
g_min = np.minimum.reduceat(data[_ndx], _pos)
# compare results with pandas groupby
import pandas as pd
np_group = pd.DataFrame({'min': g_min, 'max': g_max}, index=_id)
pd_group = pd.DataFrame({'id': id, 'data': data}).groupby('id').agg(['min', 'max'])
(pd_group.values == np_group.values).all()  # True
I've packaged a version of my previous answer in the numpy_indexed package; it's nice to have this all wrapped up and tested in a neat interface, plus it has a lot more functionality as well:
import numpy_indexed as npi
group_id, group_max_data = npi.group_by(id).max(data)
And so on
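For instance, the min is analogous (a sketch, assuming the numpy_indexed package is installed):

group_id, group_min_data = npi.group_by(id).min(data)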
A slightly faster and more general answer than the already accepted one; like joeln's answer it avoids the more expensive lexsort, and it works for arbitrary ufuncs. Moreover, it only demands that the keys are sortable, rather than being ints in a specific range. The accepted answer may still be faster, though, considering the max/min isn't explicitly computed. The accepted solution's ability to ignore NaNs is neat, but one may also simply assign NaN values a dummy key.
import numpy as np
def group(key, value, operator=np.add):
    """
    group the values by key
    any ufunc operator can be supplied to perform the reduction (np.maximum, np.minimum, np.subtract, and so on)
    returns the unique keys, their corresponding per-key reduction over the operator, and the keycounts
    """
    # upcast to numpy arrays
    key = np.asarray(key)
    value = np.asarray(value)
    # first, sort by key
    I = np.argsort(key)
    key = key[I]
    value = value[I]
    # the slicing points of the bins to sum over
    slices = np.concatenate(([0], np.where(key[:-1] != key[1:])[0] + 1))
    # first entry of each bin is a unique key
    unique_keys = key[slices]
    # reduce over the slices specified by index
    per_key_sum = operator.reduceat(value, slices)
    # number of counts per key is the difference of our slice points; cap off with number of keys for the last bin
    key_count = np.diff(np.append(slices, len(key)))
    return unique_keys, per_key_sum, key_count
names = ["a", "b", "b", "c", "d", "e", "e"]
values = [1.2, 4.5, 4.3, 2.0, 5.67, 8.08, 9.01]
unique_keys, reduced_values, key_count = group(names, values)
print 'per group mean'
print reduced_values / key_count
unique_keys, reduced_values, key_count = group(names, values, np.minimum)
print 'per group min'
print reduced_values
unique_keys, reduced_values, key_count = group(names, values, np.maximum)
print 'per group max'
print reduced_values
I think this accomplishes what you're looking for:
[max([val for idx,val in enumerate(data) if id[idx] == k]) for k in sorted(set(id))]
For the outer list comprehension, from right to left, set(id) collects the unique ids, sorted() sorts them, for k ... iterates over them, and max takes the max of, in this case, another list comprehension. Moving to that inner list comprehension: enumerate(data) returns both the index and value from data, and if id[idx] == k picks out the data members corresponding to id k.
This iterates over the full data list for each id. With some preprocessing into sublists, it might be possible to speed it up, but it won't be a one-liner then.
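For the question's data, a quick check of the one-liner (a sketch):

id = [1, 1, 1, 2, 2, 2, 3, 3]
data = [2, 7, 3, 8, 9, 10, 1, -10]
[max([val for idx, val in enumerate(data) if id[idx] == k]) for k in sorted(set(id))]
# -> [7, 10, 1]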
The following solution only requires a sort on the data (not a lexsort) and does not require finding boundaries between groups. It relies on the fact that assignment through an index array keeps the last value written: if o is an array of indices into r, then r[o] = x fills r with the latest x for each value of o, so r[[0, 0]] = [1, 2] leaves r[0] == 2. It requires that your groups are integers from 0 to number of groups - 1 (as for numpy.bincount) and that there is a value for every group:
def group_min(groups, data):
    n_groups = np.max(groups) + 1
    result = np.empty(n_groups)
    # write in descending order; the last (smallest) write per group wins
    order = np.argsort(data)[::-1]
    result[groups.take(order)] = data.take(order)
    return result

def group_max(groups, data):
    n_groups = np.max(groups) + 1
    result = np.empty(n_groups)
    # write in ascending order; the last (largest) write per group wins
    order = np.argsort(data)
    result[groups.take(order)] = data.take(order)
    return result
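A usage sketch, with the groups re-encoded as integers 0..n_groups-1 as required:

import numpy as np

groups = np.array([0, 0, 0, 1, 1, 1, 2, 2])
data = np.array([2., 7., 3., 8., 9., 10., 1., -10.])
print(group_max(groups, data))  # [ 7. 10.  1.]
print(group_min(groups, data))  # [  2.   8. -10.]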
