Is there a concise way to produce this dataframe mask? - python

I have a pandas DataFrame with a model score and order_amount_bucket. There are 8 bins in the order amount bucket and I have a different threshold for each bin. I want to filter the frame and produce a boolean mask showing which rows pass.
I can do this by exhaustively listing the conditions but I feel like there must be a more pythonic way to do this.
A small example of how I have made this work so far (with only 3 bins for simplicity):
import pandas as pd
sc = 'score'
amt = 'order_amount_bucket'
example_data = {sc: [0.5, 0.8, 0.99, 0.95, 0.8, 0.8],
                amt: [1, 2, 2, 2, 3, 1]}
thresholds = [0.7, 0.8, 0.9]
df = pd.DataFrame(example_data)
# the exhaustive method to create the pass mask
# is there a better way to do this part?
pass_mask = (
    ((df[amt] == 1) & (df[sc] < thresholds[0]))
    | ((df[amt] == 2) & (df[sc] < thresholds[1]))
    | ((df[amt] == 3) & (df[sc] < thresholds[2]))
)
pass_mask.values
>> array([ True, False, False, False, True, False])

You could convert thresholds to a dict and use Series.map:
d = dict(enumerate(thresholds, 1))
# d: {1: 0.7, 2: 0.8, 3: 0.9}
pass_mask = df['order_amount_bucket'].map(d) > df['score']
print(pass_mask.values)
array([ True, False, False, False, True, False])
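An alternative sketch of the same idea builds a per-row threshold with np.select (this assumes, as above, that the bucket values are exactly 1..len(thresholds)):
import numpy as np
# one condition and one threshold per bucket value
conditions = [df[amt] == b for b in d]
choices = [d[b] for b in d]
# rows whose bucket is missing from d get NaN and therefore fail the comparison
pass_mask = df[sc] < np.select(conditions, choices, default=np.nan)
This produces the same boolean Series as the map-based version.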

Related

Pandas efficient uniform crossover

I am looking for an efficient way of implementing uniform crossover in numpy/pandas.
Every solution consists of a numpy array and a number:
population = pd.DataFrame({
    "mask": [get_random_genotype() for _ in range(pop_size)],
    "X": [np.random.random() for _ in range(pop_size)]})
I would like to do a parallel uniform crossover of a chosen sub-population, e.g.:
pairs = np.array([[0, 2], [1, 3]])
for male, female in pairs:
    mask = random_mask()  # [True, False, False, True]
    new_male.mask = where(mask, male, female)
    new_female.mask = where(mask, female, male)
but in a completely parallel manner. I have already tried:
temp: pd.DataFrame = population.copy()
draw: np.ndarray = np.random.choice(
    a=[True, False],
    size=np.stack(temp["mask"][pairs[X, 0]]).shape,
)
population.loc[pairs[X, 0], "mask"] = pd.Series(
    np.where(draw, np.stack(temp["mask"][pairs[X, 0]]), np.stack(temp["mask"][pairs[X, 1]])).tolist())
population.loc[pairs[X, 1], "mask"] = pd.Series(
    np.where(draw, np.stack(temp["mask"][pairs[X, 1]]), np.stack(temp["mask"][pairs[X, 0]])).tolist())
but it didn't work: some of my masks became NaNs. I have no idea whether I should solve it this way. I think a solution that works the same way on a float/int column instead of arrays would be sufficient as well:
X = pd.DataFrame({"x":[x//2 for x in range(10)]})
mask = [True, False, False, True]
X.loc[[1,2,3,4], "x"] = pd.Series(np.where(mask,X.loc[[1,2,3,4], "x"], X.loc[[5,6,7,8], "x"]).tolist(), dtype = int)
NaNs are still appearing.
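A likely cause, sketched here as an assumption rather than a confirmed answer: pandas aligns on the index when a Series is assigned through .loc, and the Series built from np.where gets a fresh 0..3 index while the target rows are labelled 1..4, so the unmatched label comes out as NaN. Passing a plain ndarray sidesteps the alignment; a minimal sketch on the small int example:
import numpy as np
import pandas as pd

X = pd.DataFrame({"x": [x // 2 for x in range(10)]})
mask = [True, False, False, True]
# assign a bare ndarray so no index alignment (and hence no NaN) happens
X.loc[[1, 2, 3, 4], "x"] = np.where(mask,
                                    X.loc[[1, 2, 3, 4], "x"],
                                    X.loc[[5, 6, 7, 8], "x"])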

Using numpy where functions

I am trying to understand the behavior of the following piece of code:
import numpy as np
theta = np.arange(0,1.1,0.1)
prior_theta = 0.7
prior_prob = np.where(theta == prior_theta)
print(prior_prob)
This prints an empty result -- no index matches. However, if I explicitly give the datatype, the where function works as expected:
import numpy as np
theta = np.arange(0,1.1,0.1,dtype = np.float32)
prior_theta = 0.7
prior_prob = np.where(theta == prior_theta)
print(prior_prob)
This seems to be a data type issue in the comparison. Any idea on this would be very helpful.
This is just how floating point numbers work: you can't rely on exact comparisons. The number 0.7 cannot be represented exactly in binary -- it is an infinitely repeating fraction. arange has to compute 0.1+0.1+0.1+... and the round-off errors accumulate, so the 7th value is not exactly the same as the literal 0.7. The rounding is different for float32, so there you happened to get lucky.
You need to get in the habit of using "close enough" comparisons, like where(np.abs(theta-prior_theta) < 0.0001).
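A quick check of both points (exact comparison failing, tolerant comparison succeeding) on the same arange:
import numpy as np

theta = np.arange(0, 1.1, 0.1)
print(theta[7])                               # 0.7000000000000001
print(np.where(theta == 0.7))                 # empty: no exact match
print(np.where(np.abs(theta - 0.7) < 1e-4))   # (array([7]),)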
np.isclose (and np.allclose) is useful when testing floats.
In [240]: theta = np.arange(0,1.1,0.1)
In [241]: theta
Out[241]: array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ])
In [242]: theta == 0.7
Out[242]:
array([False, False, False, False, False, False, False, False, False,
False, False])
np.arange warns us about using float increments - read the warnings section.
In [243]: theta.tolist()
Out[243]:
[0.0,
0.1,
0.2,
0.30000000000000004,
0.4,
0.5,
0.6000000000000001,
0.7000000000000001,
0.8,
0.9,
1.0]
In [244]: np.isclose(theta, 0.7)
Out[244]:
array([False, False, False, False, False, False, False, True, False,
False, False])
In [245]: np.nonzero(np.isclose(theta, 0.7))
Out[245]: (array([7]),)
arange's docs suggest using np.linspace, but that's more to address the end-point issue, which you've already handled with the 1.1 value. The 0.7 value is still inexact either way.

Numpy: find indices conditioned on values in two different arrays (coming from R)

I have a volume represented by a 3D ndarray, X, with values between, say, 0 and 255, and I have another 3D ndarray, Y, that is an arbitrary mask of the first array, with values of either 0 or 1.
I want to find the indices of a random sample of 50 voxels that are both greater than zero in X, the 'image', and equal to 1 in Y, the 'mask'.
My experience is with R, where the following would work:
idx <- sample(which(X>0 & Y==1), 50)
Maybe the advantage in R is that I can index 3D arrays linearly, whereas just using a single index in numpy gives me a 2D matrix, for example.
I guess it probably involves numpy.random.choice, but it doesn't seem like I can use that conditionally, let alone conditioned on two different arrays. Is there another approach I should be using instead?
Here's one way -
N = 50 # number of samples needed (50 for your actual case)
# Get mask based on conditionals
mask = (X>0) & (Y==1)
# Get corresponding linear indices (easier to random sample in next step)
idx = np.flatnonzero(mask)
# Get random sample
rand_idx = np.random.choice(idx, N)
# Format into three columnar output (each col for each dim/axis)
out = np.c_[np.unravel_index(rand_idx, X.shape)]
If you need a random sample without replacement, use np.random.choice() with the optional arg replace=False, as in the one-liner below.
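For instance (this assumes mask has at least N True entries, since sampling without replacement needs that many candidates):
rand_idx = np.random.choice(idx, N, replace=False)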
Sample run -
In [34]: np.random.seed(0)
...: X = np.random.randint(0,4,(2,3,4))
...: Y = np.random.randint(0,2,(2,3,4))
In [35]: N = 5 # number of samples needed (50 for your actual case)
...: mask = (X>0) & (Y==1)
...: idx = np.flatnonzero(mask)
...: rand_idx = np.random.choice(idx, N)
...: out = np.c_[np.unravel_index(rand_idx, X.shape)]
In [37]: mask
Out[37]:
array([[[False, True, True, False],
[ True, False, True, False],
[ True, False, True, True]],
[[False, True, True, False],
[False, False, False, True],
[ True, True, True, True]]], dtype=bool)
In [38]: out
Out[38]:
array([[1, 0, 1],
[0, 0, 1],
[0, 0, 2],
[1, 1, 3],
[1, 1, 3]])
Correlate the output out against the places of True values in mask for a quick verification.
If you don't want to flatten to get linear indices and would rather get the indices per dim/axis directly, we can do it like so -
i0,i1,i2 = np.where(mask)
rand_idx = np.random.choice(len(i0), N)
out = np.c_[i0,i1,i2][rand_idx]
For performance, index first and then concatenate with np.c_ at the last step -
out = np.c_[i0[rand_idx], i1[rand_idx], i2[rand_idx]]

Pandas median over grouped by binned data

I have a dataframe with users, score, times, where each user's different scores and the number of times they received each are listed:
user1, 1, 4
user1, 7, 2
user2, 3, 1
user2, 10, 2
and so on.
I'd like to calculate for each user the median of the scores.
For that I guess I should create a row-duplicated df, such as -
user1,1
user1,1
user1,1
user1,1
user1,7
user1,7
user2,3
user2,10
user2,10
and then use groupby and apply to calculate the median somehow?
My questions -
Is this the correct approach? My df is very large, so the solution has to be time efficient.
If this is indeed the way to go - can you please advise how? It keeps failing for me whatever I try to do.
I believe you need a weighted median. I used the function weighted_median from here; you could also try wquantiles' weighted.median, but it interpolates in a slightly different way, so you may get unexpected results:
import numpy as np
import pandas as pd
# from here: https://stackoverflow.com/a/32921444/3025981, CC BY-SA by Afshin # SE
def weighted_median(values, weights):
    '''Compute the weighted median of a values list. The
    weighted median is computed as follows:
    1- sort both lists (values and weights) based on values.
    2- select the 0.5 point from the weights and return the corresponding value as the result.
    e.g. values = [1, 3, 0] and weights = [0.1, 0.3, 0.6], assuming weights are probabilities:
    sorted values = [0, 1, 3] and corresponding sorted weights = [0.6, 0.1, 0.3]; the 0.5 point on
    the weights corresponds to the first item, which is 0, so the weighted median is 0.'''
    # convert the weights into probabilities
    sum_weights = sum(weights)
    weights = np.array([(w * 1.0) / sum_weights for w in weights])
    # sort values and weights based on values
    values = np.array(values)
    sorted_indices = np.argsort(values)
    values_sorted = values[sorted_indices]
    weights_sorted = weights[sorted_indices]
    # select the median point
    it = np.nditer(weights_sorted, flags=['f_index'])
    accumulative_probability = 0
    median_index = -1
    while not it.finished:
        accumulative_probability += it[0]
        if accumulative_probability > 0.5:
            median_index = it.index
            return values_sorted[median_index]
        elif accumulative_probability == 0.5:
            median_index = it.index
            it.iternext()
            next_median_index = it.index
            return np.mean(values_sorted[[median_index, next_median_index]])
        it.iternext()
    return values_sorted[median_index]
# end from

def wmed(group):
    return weighted_median(group['score'], group['times'])
df = pd.DataFrame([
    ['user1', 1, 4],
    ['user1', 7, 2],
    ['user2', 3, 1],
    ['user2', 10, 2]
], columns=['user', 'score', 'times'])
groups = df.groupby('user')
groups.apply(wmed)
# user
# user1 1
# user2 10
# dtype: int64
df = pd.DataFrame({'user': ['user1', 'user1', 'user2', 'user2'],
                   'score': [1, 7, 3, 10],
                   'times': [4, 2, 1, 2]})
# Create dictionary of empty lists keyed on user.
scores = {user: [] for user in df.user.unique()}
# Expand list of scores for each user using a list comprehension.
_ = [scores[row.user].extend([row.score] * row.times) for row in df.itertuples()]
>>> scores
{'user1': [1, 1, 1, 1, 7, 7], 'user2': [3, 10, 10]}
# Now you can use a dictionary comprehension to calculate the median score of each user.
>>> {user: np.median(scores[user]) for user in scores}
{'user1': 1.0, 'user2': 10.0}
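If the literal row-duplicated frame from the question is acceptable, a vectorized sketch of the same idea uses Index.repeat to expand the rows and then an ordinary groupby median:
expanded = df.loc[df.index.repeat(df['times'])]
medians = expanded.groupby('user')['score'].median()
# user1     1.0
# user2    10.0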

Numpy inverse mask

I want to invert the true/false values in my numpy masked array.
So in the example below I don't want to mask out the second value in the data array; I want to mask out the first and third values.
Below is just an example. My masked array is created by a longer process that runs before, so I cannot change the mask array itself. Is there another way to invert the values?
import numpy
data = numpy.array([[ 1, 2, 5 ]])
mask = numpy.array([[0,1,0]])
numpy.ma.masked_array(data, mask)
import numpy
data = numpy.array([[ 1, 2, 5 ]])
mask = numpy.array([[0,1,0]])
numpy.ma.masked_array(data, ~mask)  # note: this probably won't work right for non-boolean (True/False) values
#or
numpy.ma.masked_array(data, numpy.logical_not(mask))
for example
>>> a = numpy.array([False,True,False])
>>> ~a
array([ True, False, True], dtype=bool)
>>> numpy.logical_not(a)
array([ True, False, True], dtype=bool)
>>> a = numpy.array([0,1,0])
>>> ~a
array([-1, -2, -1])
>>> numpy.logical_not(a)
array([ True, False, True], dtype=bool)
numpy also supports the '~' operator as logical_not on boolean arrays. For example:
import numpy
data = numpy.array([[ 1, 2, 5 ]])
mask = numpy.array([[False,True,False]])
result = data[~mask]
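If the mask produced by the earlier process holds 0/1 integers rather than booleans (as in the question), casting before inverting avoids the bitwise-NOT pitfall shown above; a minimal sketch:
import numpy
data = numpy.array([[1, 2, 5]])
mask = numpy.array([[0, 1, 0]])
# cast to bool first so ~ flips True/False instead of doing bitwise NOT on ints
result = numpy.ma.masked_array(data, ~mask.astype(bool))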
