Pandas median over grouped by binned data - python

I have a dataframe with users, score, times, where each user's different scores and the number of times they received it are listed:
user1, 1, 4
user1, 7, 2
user2, 3, 1
user2, 10, 2
and so on.
I'd like to calculate for each user the median of the scores.
For that I guess I should create a row-duplicated df, such as -
user1,1
user1,1
user1,1
user1,1
user1,7
user1,7
user2,3
user2,10
user2,10
and then use groupBy and apply to calculate the median somehow?
My questions -
Is this the correct approach? my df is very large so the solution has to be time efficient.
If this is indeed the way to go - can you please advise how? It keeps failing for me whatever I try to do.

I believe you need a weighted median. I used the weighted_median function from the Stack Overflow answer linked in the code below; you can also try wquantile's weighted.median, but it interpolates slightly differently, so you may get unexpected results:
import numpy as np
import pandas as pd
# from here: https://stackoverflow.com/a/32921444/3025981, CC BY-SA by Afshin # SE
def weighted_median(values, weights):
    ''' compute the weighted median of values list. The
    weighted median is computed as follows:
    1- sort both lists (values and weights) based on values.
    2- select the 0.5 point from the weights and return the corresponding values as results
    e.g. values = [1, 3, 0] and weights=[0.1, 0.3, 0.6] assuming weights are probabilities.
    sorted values = [0, 1, 3] and corresponding sorted weights = [0.6, 0.1, 0.3] the 0.5 point on
    weight corresponds to the first item which is 0. so the weighted median is 0.'''
    #convert the weights into probabilities
    sum_weights = sum(weights)
    weights = np.array([(w*1.0)/sum_weights for w in weights])
    #sort values and weights based on values
    values = np.array(values)
    sorted_indices = np.argsort(values)
    values_sorted = values[sorted_indices]
    weights_sorted = weights[sorted_indices]
    #select the median point
    it = np.nditer(weights_sorted, flags=['f_index'])
    accumulative_probability = 0
    median_index = -1
    while not it.finished:
        accumulative_probability += it[0]
        if accumulative_probability > 0.5:
            median_index = it.index
            return values_sorted[median_index]
        elif accumulative_probability == 0.5:
            median_index = it.index
            it.iternext()
            next_median_index = it.index
            return np.mean(values_sorted[[median_index, next_median_index]])
        it.iternext()
    return values_sorted[median_index]
# end from
def wmed(group):
    return weighted_median(group['score'], group['times'])
df = pd.DataFrame([
['user1', 1, 4],
['user1', 7, 2],
['user2', 3, 1],
['user2', 10, 2]
], columns = ['user', 'score', 'times'])
groups = df.groupby('user')
groups.apply(wmed)
# user
# user1 1
# user2 10
# dtype: int64

df = pd.DataFrame({'user': ['user1', 'user1', 'user2', 'user2'],
'score': [1, 7, 3, 10],
'times': [4, 2, 1, 2]})
# Create dictionary of empty lists keyed on user.
scores = {user: [] for user in df.user.unique()}
# Expand list of scores for each user using a list comprehension.
_ = [scores[row.user].extend([row.score] * row.times) for row in df.itertuples()]
>>> scores
{'user1': [1, 1, 1, 1, 7, 7], 'user2': [3, 10, 10]}
# Now you can use a dictionary comprehension to calculate the median score of each user.
>>> {user: np.median(scores[user]) for user in scores}
{'user1': 1.0, 'user2': 10.0}
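The row-duplication idea from the question can also be written directly in pandas, without a Python-level loop. The following is a minimal sketch, assuming the column names user, score and times from the example and that times holds non-negative integers. It still materialises one row per observation, so for very large counts the weighted-median approach may need less memory.

```python
import pandas as pd

df = pd.DataFrame({'user': ['user1', 'user1', 'user2', 'user2'],
                   'score': [1, 7, 3, 10],
                   'times': [4, 2, 1, 2]})

# Repeat each row `times` times, then take an ordinary grouped median.
expanded = df.loc[df.index.repeat(df['times'])]
medians = expanded.groupby('user')['score'].median()
print(medians)
```

Here expanded has one row per recorded score, which is exactly the duplicated frame described in the question.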

Related

perform numpy mean over matrix using labels as indicators

import numpy as np
arr = np.random.random((5, 3))
labels = [1, 1, 2, 2, 3]
arr
Out[136]:
array([[0.20349907, 0.1330621 , 0.78268978],
[0.71883378, 0.24783927, 0.35576746],
[0.17760916, 0.25003952, 0.29058267],
[0.90379712, 0.78134806, 0.49941208],
[0.08025936, 0.01712403, 0.53479622]])
labels
Out[137]: [1, 1, 2, 2, 3]
Assume I have this dataset.
Using the labels as indicators, I would like to perform np.mean over the rows that share a label.
(The labels indicate the class of each row.
They could also be something like [0, 1, 1, 0, 4, 1, 4], so make no assumptions about them.)
So the output here will be an average over the:
1st and 2nd row.
3rd and 4th row.
5th row.
in the most efficient way numpy offers. like so:
[np.mean(arr[:2], axis=0),
np.mean(arr[2:4], axis=0),
np.mean(arr[4:], axis=0)]
Out[180]:
[array([0.46116642, 0.19045069, 0.56922862]),
array([0.54070314, 0.51569379, 0.39499737]),
array([0.08025936, 0.01712403, 0.53479622])]
(in real life scenario the matrix dimensions could be (100000, 256))
First, sort the labels and the matrix by label:
labels = np.array(labels)
# Getting the indices of a sorted array
sorted_indices = np.argsort(labels)
# Use the indices to sort both labels and matrix
sorted_labels = labels[sorted_indices]
sorted_matrix = matrix[sorted_indices]
Then we compute the boundaries (start, end) of each label's run of rows in the sorted matrix, sum the rows within each run, and divide by the number of rows in it.
# Find the positions where the label changes in sorted_labels.
# In fact, this gives the start and end index of each label's run of rows.
label_indices = np.concatenate(([0], np.where(np.diff(sorted_labels) != 0)[0] + 1, [len(sorted_labels)]))
# using add + reduceat to add all rows with regard to the label indices
group_sums = np.add.reduceat(sorted_matrix, label_indices[:-1], axis=0)
# getting count for each group using the diff in label_indices
group_counts = np.diff(label_indices)
# Calculating the mean
group_means = group_sums / group_counts[:, np.newaxis]
Example:
matrix
Out[265]:
array([[0.69524902, 0.22105336, 0.65631557, 0.54823511, 0.25248685],
[0.61675048, 0.45973729, 0.22410694, 0.71403135, 0.02391662],
[0.02559926, 0.41640708, 0.27931808, 0.29139379, 0.76402121],
[0.27166955, 0.79121862, 0.23512671, 0.32568048, 0.38712154],
[0.94519182, 0.99834516, 0.23381289, 0.40722346, 0.95857389],
[0.01685432, 0.8395658 , 0.73460083, 0.08056013, 0.02522956],
[0.27274409, 0.64602305, 0.05698037, 0.23214598, 0.75130743],
[0.65069115, 0.32383729, 0.86316629, 0.69659358, 0.26667206],
[0.91971818, 0.02011127, 0.91776206, 0.79474582, 0.39678431],
[0.94645805, 0.18057829, 0.23292538, 0.93111373, 0.44815706]])
labels
Out[266]: array([3, 3, 2, 3, 1, 0, 2, 0, 2, 5])
group_means
Out[267]:
array([[0.33377274, 0.58170155, 0.79888356, 0.38857686, 0.14595081],
[0.94519182, 0.99834516, 0.23381289, 0.40722346, 0.95857389],
[0.40602051, 0.36084713, 0.41802017, 0.43942853, 0.63737099],
[0.52788969, 0.49066976, 0.37184974, 0.52931565, 0.221175 ],
[0.94645805, 0.18057829, 0.23292538, 0.93111373, 0.44815706]])
The rows of group_means correspond, in order, to the labels returned by np.unique(sorted_labels):
np.unique(sorted_labels)
Out[271]: array([0, 1, 2, 3, 5])
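As an alternative to sorting, the same per-label means can be accumulated with np.unique and np.add.at. This is only a sketch: the function name mean_by_label is made up for illustration, and it has not been benchmarked against the reduceat version, which may well be faster on large inputs.

```python
import numpy as np

def mean_by_label(arr, labels):
    """Return (unique_labels, per-label row means) for a 2-D array."""
    labels = np.asarray(labels)
    # inverse[i] is the position of labels[i] within unique_labels
    unique_labels, inverse = np.unique(labels, return_inverse=True)
    sums = np.zeros((len(unique_labels), arr.shape[1]), dtype=float)
    # Unbuffered add: rows that share a label accumulate into the same output row
    np.add.at(sums, inverse, arr)
    counts = np.bincount(inverse)
    return unique_labels, sums / counts[:, np.newaxis]

arr = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]])
labels = [4, 0, 4, 0]
unique_labels, means = mean_by_label(arr, labels)
print(unique_labels)
print(means)
```

Row k of means belongs to unique_labels[k], so no assumption about the label values or their order is needed.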
I did not understand the labels part of your question, but you can calculate the mean of each row of a matrix with np.mean(arr, axis=1).
If the labels are to be used, see the script below, which reorders the rows by label before taking the row means.
import numpy as np
arr = np.array([[1,2,3],
[4,5,6],
[7,8,9],
[1,2,3],
[4,5,6]])
labels = np.array([0, 1, 1, 0, 4])
# indices that sort the rows by label
order = labels.argsort()
sorted_labels = labels[order]
print(sorted_labels)
# rows reordered so that equal labels are adjacent
print(arr[order])
# mean of each reordered row
print(np.mean(arr[order], axis=1))

How to efficiently split nested list into left and right based on a specific condition for a decision tree function

I am trying to implement a decision tree algorithm in Python from scratch, which will include 3 parts: splitting the data, calculating entropy / information gain, and training the tree.
Currently, I am having trouble splitting the data into X_left, X_right, y_left, y_right based on a specific condition (split attribute is a column and split value is the value to split on). I've implemented the code below and it works fine, but my actual data is very large and it takes forever to execute. Is there a way to simplify my code and make it more efficient?
Fyi, I know there are multiple packages that I can use to split the data like sklearn, etc but I am trying to do it from scratch first. Appreciate your help in advance!
def parts(X, y, split_attribute, split_val):
    X_left = []
    X_right = []
    y_left = []
    y_right = []
    count = 0
    for x in X:
        count += len(x)
    attribute_count = count / len(X)
    # if split_attribute < len of list, then add to the X_left and X_right, else pass
    if split_attribute < attribute_count:
        X_left = [x for x in X if x[split_attribute] <= split_val]
        X_right = [x for x in X if x[split_attribute] > split_val]
    else:
        pass
    # get indices of left and right lists after split
    left_index = [X.index(item) for item in X_left]
    right_index = [X.index(item) for item in X_right]
    # get y values based on X_left and X_right indices
    y_left = [y[i] for i in left_index]
    y_right = [y[i] for i in right_index]
    #############################################
    return (X_left, X_right, y_left, y_right)
Inputs:
X = [[3, 10], [1, 22], [2, 28], [5, 32], [4, 32]]
y = [1, 1, 0, 0, 1]
split_attribute = 0
split_val = 1
parts(X, y, split_attribute, split_val)
Output:
([[1, 22]], [[3, 10], [2, 28], [5, 32], [4, 32]], [1], [1, 0, 0, 1])
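One likely source of the slowdown is X.index(item), which scans the list from the start for every row, so the lookup cost grows quadratically with the number of rows; it also returns the first match, which pairs the wrong label with a row whenever X contains duplicate rows. A minimal sketch of a single-pass alternative follows. The name parts_single_pass is made up for illustration, and the sketch drops the attribute-count check from the original, so an out-of-range split_attribute raises IndexError instead of returning empty lists.

```python
def parts_single_pass(X, y, split_attribute, split_val):
    """Split rows and labels together in one pass over the data."""
    X_left, X_right, y_left, y_right = [], [], [], []
    for row, label in zip(X, y):
        if row[split_attribute] <= split_val:
            X_left.append(row)
            y_left.append(label)
        else:
            X_right.append(row)
            y_right.append(label)
    return X_left, X_right, y_left, y_right

X = [[3, 10], [1, 22], [2, 28], [5, 32], [4, 32]]
y = [1, 1, 0, 0, 1]
result = parts_single_pass(X, y, split_attribute=0, split_val=1)
print(result)
```

Because each row is visited once together with its label, no index lookup is needed afterwards.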

returning elements in bins as arrays in python

I have x,y,v arrays of data points and I am binning v on x-y plane. I am trying to get the x,y,v values back after binning but I want them as arrays corresponding to each bin. My code can get them individually but that will not work for large data sets with many bins. Maybe I need to use loops of some kind but my understanding of loops is weak. Code:
from scipy import stats
import numpy as np
x=np.array([-10,-2,4,12,3,6,8,14,3])
y=np.array([5,5,-6,8,-20,10,2,2,8])
v=np.array([4,-6,-10,40,22,-14,20,8,-10])
ret = stats.binned_statistic_2d(x,
                                y,
                                v,
                                'count',
                                bins=2,
                                expand_binnumbers=True)
print('counts=',ret.statistic)
print('binnumber=', ret.binnumber)
binnumber = ret.binnumber
statistic = ret.statistic
# get the bin numbers according to some condition
idx_bin_x, idx_bin_y = np.where(statistic==statistic[1][1])#[0]
print('idx_binx=',idx_bin_x)
print('idx_bin_y=',idx_bin_y)
# A binnumber of i means the corresponding value is
# between (bin_edges[i-1], bin_edges[i]).
# -> increment the bin indices by one
idx_bin_x += 1
idx_bin_y += 1
print('idx_binx+1=',idx_bin_x)
print('idx_bin_y+1=',idx_bin_y)
# get the boolean mask and apply it
is_event_x = np.in1d(binnumber[0], idx_bin_x)
print('eventx=',is_event_x)
is_event_y = np.in1d(binnumber[1], idx_bin_y)
print('eventy=',is_event_y)
is_event_xy = np.logical_and(is_event_x, is_event_y)
print('event_xy=', is_event_xy)
events_x = x[is_event_xy]
events_y = y[is_event_xy]
event_v=v[is_event_xy]
print('x=', events_x)
print('y=', events_y)
print('v=',event_v)
This outputs x,y,v for the bin with count=5 but I want all 4 bins returning 4 arrays for each x,y,v. eg for bin1: x_bin1=[...], y_bin1=[...], v_bin1=[...] and so on for 4 bins.
Also, feel free to suggest if you think there are easier ways to bin 2d planes (x,y) with values (v) like mine and getting binned values. Thank you!
Boolean masks on the NumPy arrays give a compact way to recover the arrays you are after:
import numpy as np
from scipy import stats
# coordinates
x = np.array([-10,-2,4,12,3,6,8,14,3])
y = np.array([5,5,-6,8,-20,10,2,2,8])
v = np.array([4,-6,-10,40,22,-14,20,8,-10])
ret = stats.binned_statistic_2d(x, y, None, 'count', bins=2, expand_binnumbers=True)
b = ret.binnumber
for i in [1,2]:
    for j in [1,2]:
        m = (b[0] == i) & (b[1] == j) # mask
        print((list(x[m]),list(y[m]),list(v[m])))
which gives for each of the four bins a tuple of 3 lists corresponding to x, y and v values:
([], [], [])
([-10, -2], [5, 5], [4, -6])
([4, 3], [-6, -20], [-10, 22])
([12, 6, 8, 14, 3], [8, 10, 2, 2, 8], [40, -14, 20, 8, -10])
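To keep the per-bin arrays instead of printing them, the same masking can fill a dictionary keyed by bin. This is a sketch under assumptions: split_by_bin is a made-up helper name, and bx and by are the two rows of ret.binnumber for the example data, written out by hand (read off the output above) so that the snippet runs without SciPy. It returns only non-empty bins.

```python
import numpy as np

def split_by_bin(bx, by, *arrays):
    """Map each occupied (i, j) bin to a tuple of the arrays restricted to that bin."""
    bx, by = np.asarray(bx), np.asarray(by)
    groups = {}
    for i, j in sorted(set(zip(bx.tolist(), by.tolist()))):
        mask = (bx == i) & (by == j)
        groups[(i, j)] = tuple(np.asarray(a)[mask] for a in arrays)
    return groups

x = np.array([-10, -2, 4, 12, 3, 6, 8, 14, 3])
y = np.array([5, 5, -6, 8, -20, 10, 2, 2, 8])
v = np.array([4, -6, -10, 40, 22, -14, 20, 8, -10])
# Hand-written stand-ins for ret.binnumber[0] and ret.binnumber[1]
bx = np.array([1, 1, 2, 2, 2, 2, 2, 2, 2])
by = np.array([2, 2, 1, 2, 1, 2, 2, 2, 2])

groups = split_by_bin(bx, by, x, y, v)
x_21, y_21, v_21 = groups[(2, 1)]
print(x_21, y_21, v_21)
```

With real data you would pass ret.binnumber[0] and ret.binnumber[1] in place of the hand-written arrays.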

how to scale down the output of model in the data where we normalized the data based on a custom function

I have a data frame, and I normalized the data for training and testing an LSTM model as:
x_normalized = (x_unnormalized-x_min)/(x_max-x_min).
x_min and x_max are the minimum and maximum of each entire row.
As in the figure, I chose the last column as the test data.
The model works. However, the resulting y_prediction is normalized, and I don't know how to convert y_prediction back to real values.
Is there a simple solution for that?
Here is the simple code and the normalization:
import pandas as pd
df = pd.DataFrame()
df['x1'] = [ 1, 2,4]
df['x2'] = [ 5, 9, 5]
df['x3'] = [ 3, 21, 10 ]
df['x4'] = [ 8, 32,3 ]
df['x5'] = [ 8, 32,15 ]
df['x6'] = [ 2, 5,15 ]
def norm(df):
    MIN = df.min(1)
    MAX = df.max(1)
    return df.sub(MIN, 0).div(MAX-MIN, 0)
df_normalized = norm(df)
train = df_normalized.iloc[:, 0:5]
test = df_normalized.iloc[:, 5]
Following your formula:
x_unnormalized = x_normalized*(x_max-x_min) + x_min
So you just need to save the values of x_max and x_min of the original dataframe and then have a function executing the formula above
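A minimal sketch of that idea, using the data frame from the question: the helper name denormalize and the variable y_pred_normalized are made up for illustration, and the normalized x6 column stands in for a model prediction, since no model is shown here.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x1': [1, 2, 4], 'x2': [5, 9, 5], 'x3': [3, 21, 10],
                   'x4': [8, 32, 3], 'x5': [8, 32, 15], 'x6': [2, 5, 15]})

# Save the per-row minimum and maximum before normalizing.
row_min = df.min(axis=1)
row_max = df.max(axis=1)
df_normalized = df.sub(row_min, axis=0).div(row_max - row_min, axis=0)

def denormalize(y_normalized, row_min, row_max):
    """Invert x_normalized = (x - x_min) / (x_max - x_min), row by row."""
    return y_normalized * (row_max - row_min) + row_min

# Stand-in for the model output: one normalized value per row.
y_pred_normalized = df_normalized['x6']
y_pred = denormalize(y_pred_normalized, row_min, row_max)
print(y_pred)
```

The prediction must line up with the saved row_min and row_max (here through the shared index), because each row was scaled with its own minimum and maximum.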

Is there a vectorized way to sample multiple times with np.random.choice() with different p?

I'm trying to implement a variation ratio, and I need T samples from an array C, but each sample has different weights p_t.
I'm using this:
import numpy as np
from scipy import stats
batch_size = 1
T = 3
C = np.array(['A', 'B', 'C'])
# p_batch_T dimensions: (batch, sample, class)
p_batch_T = np.array([[[0.01, 0.98, 0.01],
[0.3, 0.15, 0.55],
[0.85, 0.1, 0.05]]])
def variation_ratio(C, p_T):
    # This function works only with one sample from the batch.
    Y_T = np.array([np.random.choice(C, size=1, p=p_t) for p_t in p_T]) # vectorize this
    C_mode, f = stats.mode(Y_T)
    T = len(Y_T)
    return 1.0 - (f/T)
def variation_ratio_batch(C, p_batch_T):
    return np.array([variation_ratio(C, p_T) for p_T in p_batch_T]) # and vectorize this
Is there a way to implement these functions without any for loops?
Instead of sampling with the given distribution p_T, we can sample uniformly from [0, 1) and compare the samples to the cumulative distribution.
Let's start with Y_T, say for p_T = p_batch_T[0]:
cum_dist = p_batch_T.cumsum(axis=-1)
# one uniform draw per sample, i.e. per row of cum_dist[0]
idx_T = (np.random.rand(len(cum_dist[0]), 1) < cum_dist[0]).argmax(-1)
Y_T = C[idx_T[...,None]]
_, f = stats.mode(Y_T) # here axis=0 is default
Now let's take that to variation_ratio_batch:
# one uniform draw per (batch, sample) pair
idx_T = (np.random.rand(*cum_dist.shape[:-1], 1) < cum_dist).argmax(-1)
Y = C[idx_T[...,None]]
_, f = stats.mode(Y, axis=1) # notice axis 0 is batch
out = 1 - (f/T)
You could do it this way:
First, create a 2D weights array of shape (T, len(C)) and take the cumulative sum:
n_rows = 5
n_cols = 3
weights = np.random.rand(n_rows, n_cols)
cum_weights = (weights / weights.sum(axis=1, keepdims=True)).cumsum(axis=1)
cum_weights might look like this:
array([[0.09048919, 0.58962127, 1. ],
[0.36333997, 0.58380885, 1. ],
[0.28761923, 0.63413879, 1. ],
[0.39446498, 0.98760834, 1. ],
[0.27862476, 0.79715149, 1. ]])
Next, we compare cum_weights to an appropriately shaped output of np.random.rand. Taking argmin of the boolean comparison gives, for each row, the first index at which the cumulative weight is no longer less than the random number:
indices = (cum_weights < np.random.rand(n_rows, 1)).argmin(axis=1)
We can then use indices to index an array of values of shape (n_cols,), which is len(C) in your original example.
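Putting the pieces together, the comparison against the cumulative weights can be applied to a whole (batch, sample, class) array at once, and the variation ratio can then be computed by counting class indices instead of calling stats.mode. This is a sketch under assumptions: the function names are made up, a NumPy Generator is passed in explicitly, and the comparison uses a count of `cum <= r` rather than argmin, with a clip to guard against a cumulative sum that ends slightly below 1 because of rounding.

```python
import numpy as np

def sample_indices(p, rng):
    """Draw one class index per distribution; p has shape (..., n_classes)."""
    cum = np.cumsum(p, axis=-1)
    r = rng.random(p.shape[:-1] + (1,))      # one uniform draw per distribution
    idx = (cum <= r).sum(axis=-1)            # number of cumulative weights passed
    return np.minimum(idx, p.shape[-1] - 1)  # guard against rounding in cum[..., -1]

def variation_ratio_batch(p_batch_T, rng):
    """1 - (frequency of the modal class) / T for each batch element."""
    n_classes = p_batch_T.shape[-1]
    idx = sample_indices(p_batch_T, rng)     # shape (batch, T)
    counts = (idx[..., None] == np.arange(n_classes)).sum(axis=-2)  # (batch, n_classes)
    f = counts.max(axis=-1)
    return 1.0 - f / idx.shape[-1]

rng = np.random.default_rng(0)
C = np.array(['A', 'B', 'C'])
p_batch_T = np.array([[[0.01, 0.98, 0.01],
                       [0.3, 0.15, 0.55],
                       [0.85, 0.1, 0.05]]])
Y = C[sample_indices(p_batch_T, rng)]        # sampled class names, shape (batch, T)
vr = variation_ratio_batch(p_batch_T, rng)
print(Y, vr)
```

Counting indices avoids relying on stats.mode for string arrays; the sampled names are only looked up through C when they are actually needed.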
np.vectorize should work:
from functools import partial
import numpy as np
@partial(np.vectorize, excluded=['rng'], signature='(),(k)->()')
def choice_batched(rng, probs):
    return rng.choice(a=probs.shape[-1], p=probs)
then
num_classes = 3
batch_size = 5
alpha = .5 # Dirichlet prior hyperparameter.
rng = np.random.default_rng()
probs = np.random.dirichlet(alpha=np.full(fill_value=alpha, shape=num_classes), size=batch_size)
# Check each row sums to 1.
assert np.allclose(probs.sum(axis=-1), 1)
print(choice_batched(rng, probs))
print(choice_batched(rng, probs))
print(choice_batched(rng, probs))
print(choice_batched(rng, probs))
gives
[2 0 0 0 1]
[1 0 0 0 1]
[2 0 2 0 1]
[1 0 0 0 0]
Here is my implementation of Quang's and gmds' solutions:
def sample(ws, k):
    """Weighted sample k elements along the last axis.
    ws -- Tensor of probabilities, shape (*, n)
    k -- Number of elements to sample.
    Returns tensor of shape (*, k) with values in {0, ..., n-1}.
    """
    assert np.allclose(ws.sum(-1), 1)
    cs = ws.cumsum(-1)
    ps = np.random.random(ws.shape[:-1] + (k,))
    return (cs[..., None, :] < ps[..., None]).sum(-1)
Say we have some stuff
>>> stuff = array([[0, 1, 2],
[3, 4, 5],
[6, 7, 8]])
And some weights / sampling probabilities.
>>> ws = array([[0.41296038, 0.36070229, 0.22633733],
[0.37576672, 0.14518771, 0.47904557],
[0.14742326, 0.29182459, 0.56075215]])
And we want to sample 2 elements along each row. Then we do
>>> ids = sample(ws, 2)
[[2, 0],
[1, 2],
[2, 2]]
And we can retrieve the sampled values from stuff using np.take_along_axis:
>>> np.take_along_axis(stuff, ids, axis=-1)
[[2, 0],
[4, 5],
[8, 8]]
The code could be generalized to sampling along an axis other than the last one, but I got confused about broadcasting, so somebody else should have a stab at it!
