Pandas efficiently add new column true/false if between two other columns - python

Using Pandas, how can I efficiently add a new column that is true/false if the value in one column (x) is between the values in two other columns (low and high)?
The np.select approach (shown below) works perfectly, but I feel like there should be a one-liner way to do this.
Using Python 3.7
import numpy as np
import pandas as pd

fid = [0, 1, 2, 3, 4]
x = [0.18, 0.07, 0.11, 0.3, 0.33]
low = [0.1, 0.1, 0.1, 0.1, 0.1]
high = [0.2, 0.2, 0.2, 0.2, 0.2]
test = pd.DataFrame(data=zip(fid, x, low, high), columns=["fid", "x", "low", "high"])
conditions = [(test["x"] >= test["low"]) & (test["x"] <= test["high"])]
labels = ["True"]
test["between"] = np.select(conditions, labels, default="False")
display(test)

As mentioned by @Brebdan, you can use the built-in Series.between:
test["between"] = test["x"].between(test["low"], test["high"])
output:
fid x low high between
0 0 0.18 0.1 0.2 True
1 1 0.07 0.1 0.2 False
2 2 0.11 0.1 0.2 True
3 3 0.30 0.1 0.2 False
4 4 0.33 0.1 0.2 False
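Note that Series.between is inclusive of both endpoints by default, which matches the >= / <= logic in the question. If you need strict bounds, newer pandas versions (1.3+) accept a string inclusive argument; a small sketch:
test["between_strict"] = test["x"].between(test["low"], test["high"], inclusive="neither")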


How to filter for rows with close values across columns

I have columns of probabilities in a pandas dataframe as the output of a multiclass machine learning model.
I want to filter for rows where the model produced very close probabilities between the classes, and ideally I only care about values that are close to the highest value in that row, but I'm not sure where to start.
For example my data looks like this:
ID class1 class2 class3 class4 class5
row1 0.97 0.2 0.4 0.3 0.2
row2 0.97 0.96 0.4 0.3 0.2
row3 0.7 0.5 0.3 0.4 0.5
row4 0.97 0.98 0.99 0.3 0.2
row5 0.1 0.2 0.3 0.78 0.8
row6 0.1 0.11 0.3 0.9 0.2
I'd like to filter for rows where at least 2 (or more) probability class columns have a probability that is close to at least one other probability column in that row (e.g., maybe within 0.05). So an example output would filter to:
ID class1 class2 class3 class4 class5
row2 0.97 0.96 0.4 0.3 0.2
row4 0.97 0.98 0.99 0.3 0.2
row5 0.1 0.2 0.3 0.78 0.8
I don't mind if a filter includes row6, as it also meets the main <0.05 difference requirement, but ideally I'd prefer to ignore it too, because its small difference doesn't involve the largest probability in the row.
What can I do to develop a filter like this?
Example data:
Edit: I have increased the size of my example data, as I do not want specific pairs but any and all rows in which two or more probability columns have close values.
import pandas as pd

d = {'ID': ['row1', 'row2', 'row3', 'row4', 'row5', 'row6'],
'class1': [0.97, 0.97, 0.7, 0.97, 0.1, 0.1],
'class2': [0.2, 0.96, 0.5, 0.98, 0.2, 0.11],
'class3': [0.4, 0.4, 0.3, 0.2, 0.3, 0.3],
'class4': [0.3, 0.3, 0.4, 0.3, 0.78, 0.9],
'class5': [0.2, 0.2, 0.5, 0.2, 0.8, 0.2]}
df = pd.DataFrame(data=d)
Here is an example using numpy and itertools.combinations to get the pairs of similar rows that have at least N values within 0.05 of each other:
from itertools import combinations
import numpy as np
df2 = df.set_index('ID')
N = 2
out = [(a, b) for a,b in combinations(df2.index, r=2)
       if np.isclose(df2.loc[a], df2.loc[b], atol=0.05).sum() >= N]
Output:
[('row1', 'row2'), ('row1', 'row4'), ('row2', 'row4')]
Follow-up: My real data is 10,000 rows and I want to filter out all rows that have more than one column of probabilities that are close to each other. Is there a way to do this without specifying pairs?
from itertools import combinations
N = 2
df2 = df.set_index('ID')
keep = set()
seen = set()
for a, b in combinations(df2.index, r=2):
    if {a, b}.issubset(seen):
        continue
    if np.isclose(df2.loc[a], df2.loc[b], atol=0.05).sum() >= N:
        keep.update({a, b})
    seen.update({a, b})
print(keep)
# {'row1', 'row2', 'row4'}
You can do that as follows:
Transpose the dataframe so that each sample becomes a column and the class probabilities become rows.
Then we only need to check the minimal requirement: whether the difference between the two largest values in a column is less than or equal to 0.05.
df = pd.DataFrame(data=d).set_index("ID").T
result = [col for col in df.columns if np.isclose(*df[col].nlargest(2), atol=0.05)]
Output:
['row2', 'row4', 'row5']
Dataframe after the transpose:
ID row1 row2 row3 row4 row5 row6
class1 0.97 0.97 0.7 0.97 0.10 0.10
class2 0.20 0.96 0.5 0.98 0.20 0.11
class3 0.40 0.40 0.3 0.20 0.30 0.30
class4 0.30 0.30 0.4 0.30 0.78 0.90
class5 0.20 0.20 0.5 0.20 0.80 0.20
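For 10,000 rows, a fully vectorized variant of essentially the same "two largest values within 0.05" check avoids the per-column Python loop. This is only a sketch (not benchmarked against the nlargest version, and it uses a plain <= 0.05 threshold rather than np.isclose):
import numpy as np
import pandas as pd

df_rows = pd.DataFrame(data=d)              # rows are samples, as in the original data
vals = df_rows.set_index("ID").to_numpy()
top2 = np.sort(vals, axis=1)[:, -2:]        # two largest probabilities in each row
mask = (top2[:, 1] - top2[:, 0]) <= 0.05    # largest minus second largest
print(df_rows.loc[mask, "ID"].tolist())
# ['row2', 'row4', 'row5']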

program to auto round off and check whether two numbers are equal in dataframe [duplicate]

I have the following dataframe:
actual_credit min_required_credit
0 0.3 0.4
1 0.5 0.2
2 0.4 0.4
3 0.2 0.3
I need to add a column indicating where actual_credit >= min_required_credit. The result would be:
actual_credit min_required_credit result
0 0.3 0.4 False
1 0.5 0.2 True
2 0.4 0.4 True
3 0.2 0.3 False
I am doing the following:
df['result'] = abs(df['actual_credit']) >= abs(df['min_required_credit'])
However, the 3rd row (0.4 and 0.4) keeps coming out as False. After researching this issue in various places, including What is the best way to compare floats for almost-equality in Python?, I still can't get this to work. Whenever the two columns have an identical value, the result is False, which is not correct.
I am using Python 3.3
Due to imprecise float comparison, you can OR your comparison with np.isclose; isclose takes relative and absolute tolerance params, so the following should work:
df['result'] = df['actual_credit'].ge(df['min_required_credit']) | np.isclose(df['actual_credit'], df['min_required_credit'])
@EdChum's answer works great, but using the pandas.DataFrame.round function is another clean option that works well without the use of numpy.
df = pd.DataFrame(  # adding a small difference at the thousandths place to reproduce the issue
    data=[[0.3, 0.4], [0.5, 0.2], [0.400, 0.401], [0.2, 0.3]],
    columns=['actual_credit', 'min_required_credit'])
df['result'] = df['actual_credit'].round(1) >= df['min_required_credit'].round(1)
print(df)
actual_credit min_required_credit result
0 0.3 0.400 False
1 0.5 0.200 True
2 0.4 0.401 True
3 0.2 0.300 False
You might consider using round() to edit your dataframe more permanently, depending on whether you want to keep that precision or not. In this example, the OP suggests the extra precision is probably just noise and is only causing confusion.
df = pd.DataFrame(  # adding a small difference at the thousandths place to reproduce the issue
    data=[[0.3, 0.4], [0.5, 0.2], [0.400, 0.401], [0.2, 0.3]],
    columns=['actual_credit', 'min_required_credit'])
df = df.round(1)
df['result'] = df['actual_credit'] >= df['min_required_credit']
print(df)
actual_credit min_required_credit result
0 0.3 0.4 False
1 0.5 0.2 True
2 0.4 0.4 True
3 0.2 0.3 False
In general, numpy comparison functions work well with pd.Series and allow for element-wise comparisons:
isclose, allclose, greater, greater_equal, less, less_equal, etc.
In your case greater_equal would do:
df['result'] = np.greater_equal(df['actual_credit'], df['min_required_credit'])
or alternatively, as proposed, using pandas .ge() (similarly .le(), .gt(), etc.):
df['result'] = df['actual_credit'].ge(df['min_required_credit'])
The risk of OR-ing with ge (as mentioned above) is that e.g. comparing 3.999999999999 and 4.0 might return True, which might not necessarily be what you want.
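For illustration, np.isclose with its default tolerances (rtol=1e-05, atol=1e-08) does treat those two values as equal, so the OR-ed expression would report True; tightening the tolerances changes that:
import numpy as np

np.isclose(3.999999999999, 4.0)                       # True with the defaults
np.isclose(3.999999999999, 4.0, rtol=0, atol=1e-15)   # False with a much tighter tolerance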
Use pandas.DataFrame.abs() instead of the built-in abs():
df['result'] = df['actual_credit'].abs() >= df['min_required_credit'].abs()

How to efficiently clip subsets of pandas series/data-frames to range

I need to take a large number of data-frame slices and update the value of a column in each slice to the minimum of the existing value and a constant.
My current code looks like this
for indices, value in list_of_slices:
    df.loc[indices, 'SCORE'] = df.loc[indices, 'SCORE'].clip(upper=value)
This is quite efficient and much faster than the apply method I used at first; however, it is still somewhat too slow for a large list.
I expected to be able to write
df.loc[indices,'SCORE'].clip(upper=value, inplace=True)
to save on slicing twice, but that doesn't work.
Also saving the slice to a tmp variable seems to create a copy, thus not changing the original df.
Is there a better way to do this loop and/or set the value without slicing the data-frame twice?
You could generate a dictionary whose (key, value) pairs map each index to the value to clip at. For example, consider the following dataframe:
import pandas as pd
import numpy as np
d = {
    'categorical_identifier': [1, 2, 3, 1, 2, 3, 1, 2, 3],
    'SCORE': [0.02, 0.04, 0.67, 0.01, 0.45, 0.89, 0.39, 0.25, 0.47]
}
df = pd.DataFrame(d)
df
>>>
categorical_identifier SCORE
0 1 0.02
1 2 0.04
2 3 0.67
3 1 0.01
4 2 0.45
5 3 0.89
6 1 0.39
7 2 0.25
8 3 0.47
and I generate a dictionary mapping each index to the value to clip to, like the following:
indices_max_values = {
0: 0.10,
1: 0.3,
2: 0.9,
3: 0.10,
4: 0.3,
5: 0.9,
6: 0.10,
7: 0.3,
8: 0.9,
}
Notice that if you have a set of slices you can generate this dictionary by filtering the True values of each condition.
from collections import ChainMap
list_of_slice = [
df.categorical_identifier == 1,
df.categorical_identifier == 2,
df.categorical_identifier == 3
]
dict_of_slice = [{k:v for k, v in dict(s).items() if v} for s in list_of_slice]
dict_of_slice = dict(ChainMap(*dict_of_slice))
dict_of_slice
>>>
{2: True,
5: True,
8: True,
1: True,
4: True,
7: True,
0: True,
3: True,
6: True}
just replace v with the value you want to clip to when creating dict_of_slice.
Then you can apply np.clip() to each row, looking up the value to clip at by the row's index.
df.reset_index(inplace=True)
df.rename(columns={'index':'Index'}, inplace=True)
existing_value = 0
df[['Index', 'SCORE']].transform(
    lambda x: np.clip(x, a_min=existing_value, a_max=indices_max_values[x.Index]),
    axis=1
)
>>>
Index SCORE
0 0.0 0.02
1 0.3 0.04
2 0.9 0.67
3 0.1 0.01
4 0.3 0.30
5 0.9 0.89
6 0.1 0.10
7 0.3 0.25
8 0.9 0.47
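A loop-free alternative (my suggestion, not part of the answer above): clip accepts an array-like upper bound, so once the per-row bounds are collected into an array aligned with the rows, the whole column can be clipped in one call. This assumes indices_max_values covers every row index:
import numpy as np

upper = np.array([indices_max_values[i] for i in df.index])
df['SCORE_clipped'] = df['SCORE'].clip(lower=existing_value, upper=upper)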

How to update a matrix of probabilities

I am trying to find/figure out a function that can update probabilities.
Suppose there are three players and each of them gets a fruit out of a basket: ["apple", "orange", "banana"].
I store the probabilities of each player having each fruit in a matrix (like this table):
          apple   orange  banana
Player 1  0.3333  0.3333  0.3333
Player 2  0.3333  0.3333  0.3333
Player 3  0.3333  0.3333  0.3333
The table can be interpreted as the belief of someone (S) who doesn't know who has what. Each row and column sums to 1.0 because each player has exactly one of the fruits and each fruit is held by exactly one of the players.
I want to update these probabilities based on some knowledge that S gains. Example information:
Player 1 did X. We know that Player 1 does X with 80% probability if he has an apple. With 50% if he has an orange. With 10% if he has a banana.
This can be written more concisely as [0.8, 0.5, 0.1] and let us call it reach_probability.
A fairly easy to comprehend example is:
probabilities = [
[0.5, 0.5, 0.0],
[0.0, 0.5, 0.5],
[0.5, 0.0, 0.5],
]
# Player 1's
reach_probability = [1.0, 0.0, 1.0]
new_probabilities = [
[1.0, 0.0, 0.0],
[0.0, 1.0, 0.0],
[0.0, 0.0, 1.0],
]
The above example can be fairly easily thought through.
Another example:
probabilities = [
[0.25, 0.25, 0.50],
[0.25, 0.50, 0.25],
[0.50, 0.25, 0.25],
]
# Player 1's
reach_probability = [1.0, 0.5, 0.5]
new_probabilities = [
[0.4, 0.2, 0.4],
[0.2, 0.5, 0.3],
[0.4, 0.3, 0.3],
]
In my use case using a simulation is not an option. My probabilities matrix is big. Not sure if the only way to calculate this is using an iterative algorithm or if there is a better way.
I looked at Bayesian approaches but am not sure how to apply them in this case. Updating the matrix row by row and then spreading the difference proportionally to the previous probabilities seems promising, but I haven't managed to make it work correctly; maybe it isn't even possible that way.
Initial condition: p(apple) = p(orange) = p(banana) = 1/3.
Player 1 did X. We know that Player 1 does X with 80% probability if he has an apple. With 50% if he has an orange. With 10% if he has a banana.
p(X | apple) = 0.8
p(X | orange) = 0.5
p(X | banana) = 0.1
Since apple, orange, and banana are all equally likely at 1/3, we have p(X) = (1/3) * (0.8 + 0.5 + 0.1) = 1.4/3 ≈ 0.46667.
Recall Bayes' theorem: p(a | b) = p(b | a) * p(a) / p(b).
So p(apple | X) = p(X | apple) * p(apple) / p(X) = 0.8 * (1/3) / 0.46667 ≈ 57.14%,
similarly p(orange | X) = 0.5 * (1/3) / 0.46667 ≈ 35.71%,
and p(banana | X) = 0.1 * (1/3) / 0.46667 ≈ 7.14%.
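A quick numeric check of that first update (this snippet is mine, just verifying the numbers above):
import numpy as np

prior = np.array([1/3, 1/3, 1/3])        # p(apple), p(orange), p(banana)
likelihood = np.array([0.8, 0.5, 0.1])   # p(X | fruit)
posterior = prior * likelihood
posterior /= posterior.sum()             # divide by p(X)
print(posterior)                         # approximately [0.5714, 0.3571, 0.0714]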
Taking your example:
probabilities = [
[0.25, 0.25, 0.50],
[0.25, 0.50, 0.25],
[0.50, 0.25, 0.25],
]
# Player 1's
reach_probability = [1.0, 0.5, 0.5]
new_probabilities = [
[0.4, 0.2, 0.4],
[0.2, 0.5, 0.3],
[0.4, 0.3, 0.3],
]
p(x) = 0.25 * 1.0 + 0.25 * 0.5 + 0.5 * 0.5 = 0.625
p(a|x) = p(x|a) * p(a) / p(x) = 1.0 * 0.25 / 0.625 = 0.4
p(b|x) = p(x|b) * p(b) / p(x) = 0.5 * 0.25 / 0.625 = 0.2
p(c|x) = p(x|c) * p(c) / p(x) = 0.5 * 0.50 / 0.625 = 0.4
As desired. The other entries of each column can then be scaled so that each column sums to 1.0.
E.g. in column 1 we multiply the other entries by (1 - 0.4) / (1 - 0.25). This takes 0.25 -> 0.2 and 0.50 -> 0.40. Similarly for the other columns.
new_probabilities = [
[0.4, 0.200, 0.4],
[0.2, 0.533, 0.3],
[0.4, 0.266, 0.3],
]
If then player 2 does y with the same conditional probabilities we get:
p(y) = 0.2 * 1.0 + 0.533 * 0.5 + 0.3 * 0.5 = 0.6165
p(a|y) = p(y|a) * p(a) / p(y) = 1.0 * 0.2 / 0.6165 = 0.3244
p(b|y) = p(y|b) * p(b) / p(y) = 0.5 * 0.533 / 0.6165 = 0.4323
p(c|y) = p(y|c) * p(c) / p(y) = 0.5 * 0.3 / 0.6165 = 0.2433
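A sketch of that two-step update in numpy (Bayes update of the observed player's row, then rescaling the remaining entries of each column so columns still sum to 1); the function name and structure are mine, not the answerer's:
import numpy as np

def update_matrix(P, player, reach_probability):
    # P[i, j] = belief that player i holds fruit j; reach_probability[j] = p(observation | fruit j)
    P = P.astype(float).copy()
    r = np.asarray(reach_probability, dtype=float)
    new_row = P[player] * r
    new_row /= new_row.sum()                  # Bayes update of the observed player's row
    P[player] = new_row
    others = [i for i in range(P.shape[0]) if i != player]
    for j in range(P.shape[1]):               # rescale the other entries of each column
        rest_sum = P[others, j].sum()
        if rest_sum > 0:
            P[others, j] *= (1 - new_row[j]) / rest_sum
    return P

P = np.array([[0.25, 0.25, 0.50],
              [0.25, 0.50, 0.25],
              [0.50, 0.25, 0.25]])
print(update_matrix(P, 0, [1.0, 0.5, 0.5]))
# approximately [[0.4, 0.2, 0.4], [0.2, 0.533, 0.3], [0.4, 0.267, 0.3]]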
Check this document:
Endgame Solving in Large Imperfect-Information Games
(S. Ganzfried, T. Sandholm, in International Conference on Autonomous Agents and MultiAgent Systems (AAMAS) (2015), pp. 37–45.)
Here is how I would approach this. I have not worked through whether it has problems too, but it seems alright in your examples.
Assume each update is of the form "X,Y has probability p'". Mark element X,Y dirty with delta p - p', where p was the old probability. Now redistribute the delta proportionally to all unmarked elements in the row, then the column, marking each dirty with its own delta, and marking the first one clean. Continue until no dirty entry remains.
0.5 0.5 0.0
0.0 0.5 0.5
0.5 0.0 0.5
Belief: 2,1 has probability zero.
0.5 0.0* 0.0 update 2,1 and mark dirty
0.0 0.5 0.5 delta is 0.5
0.5 0.0 0.5
1.0* 0.0' 0.0 distribute 0.5 to row & col
0.0 1.0* 0.5 update as dirty, both deltas -0.5
0.5 0.0 0.5
1.0' 0.0' 0.0 distribute -0.5 to rows & cols
0.0 1.0' 0.0* update as dirty, both deltas 0.5
0.0* 0.0 0.5
1.0' 0.0' 0.0 distribute 0.5 to row & col
0.0 1.0' 0.0' update as dirty, delta is -0.5
0.0' 0.0 1.0*
1.0' 0.0' 0.0 distribute on row/col
0.0 1.0' 0.0' no new dirty elements, complete
0.0' 0.0 1.0'
In your first example:
1/3 1/3 1/3
1/3 1/3 1/3
1/3 1/3 1/3
Belief: 3,1 has probability 0
1/3 1/3 0* update 3,1 to zero, mark dirty
1/3 1/3 1/3 delta is 1/3
1/3 1/3 1/3
1/2* 1/2* 0' distribute 1/3 proportionally across row then col
1/3 1/3 1/2* delta is -1/6
1/3 1/3 1/2*
1/2' 1/2' 0' distribute -1/6 proportionally across row then col
1/4* 1/4* 1/2' delta is 1/12
1/4* 1/4* 1/2'
1/2' 1/2' 0' distribute proportionally to unmarked entries
1/4' 1/4' 1/2' no new dirty entries, terminate
1/4' 1/4' 1/2'
You can mark entries dirty by inserting them, with their associated deltas, into a queue and a hash set. Entries in both the queue and the hash set are dirty. Entries in the hash set only are clean. Process the queue until you run out of entries.
I do not show an example where distribution is uneven, but the key is to distribute proportionally. Entries with 0 can never become non-zero except by a new belief.
Unfortunately there’s no known nice solution.
The way that I would apply Bayesian reasoning is to store a likelihood
matrix instead of a probability matrix. (Actually I’d store
log-likelihoods to prevent underflow, but that’s an implementation
detail.) We can start with the matrix
   Apple  Orange  Banana
1    1      1       1
2    1      1       1
3    1      1       1
representing no knowledge. You could use the all-1/3 matrix instead, but
I’ve used 1 to emphasize that normalization is not required. To apply an
update like Player 1 doing X with conditional probabilities [0.8, 0.5,
0.1], we just multiply the row element-wise:
   Apple  Orange  Banana
1   0.8    0.5     0.1
2   1      1       1
3   1      1       1
If Player 1 does Y independently with the same conditional
probabilities, then we get
   Apple  Orange  Banana
1   0.64   0.25    0.01
2   1      1       1
3   1      1       1
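In code, those two updates are just element-wise multiplications of Player 1's row of the likelihood matrix (a sketch; the log-likelihood variant mentioned above would add logs instead):
import numpy as np

L = np.ones((3, 3))                   # rows: players, columns: apple, orange, banana
reach = np.array([0.8, 0.5, 0.1])     # p(observation | fruit) for Player 1

L[0] *= reach                         # Player 1 did X
L[0] *= reach                         # Player 1 independently did Y with the same probabilities
print(L[0])                           # [0.64 0.25 0.01]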
Now, the rub is that these likelihoods don’t have a nice relationship to
probabilities of specific outcomes. All we know is that the probability
of a specific matching is proportional to the product of its matrix
entries. As a simple example, with a matrix like
Apple
Orange
Banana
1
1
0
0
2
0
1
0
3
0
1
1
the entry for Player 3 having Orange is 1, yet this assignment has
probability 0 because both possibilities for completing the matching
have probability 0.
What we need is the permanent, which sums the likelihood of every matching, and, for each matrix entry, that entry times the permanent of its minor, which sums the likelihood of every matching that makes the corresponding assignment. Unfortunately we don't know a good exact algorithm for computing the permanent, and experts are skeptical that one exists (the problem is NP-hard, and in fact #P-complete). The known approximation employs sampling via Markov chains.
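For small matrices you can compute these sums by brute force over all matchings; this sketch (my code, exponential in the number of players, so it does not scale) just makes the definition concrete and reproduces the Player 3 / Orange example above:
from itertools import permutations
import numpy as np

def assignment_probabilities(L):
    # posterior probability that player i holds item j, given likelihood matrix L
    n = L.shape[0]
    post = np.zeros_like(L, dtype=float)
    total = 0.0                                    # the permanent of L
    for perm in permutations(range(n)):            # perm[i] = item assigned to player i
        w = np.prod([L[i, perm[i]] for i in range(n)])
        total += w
        for i in range(n):
            post[i, perm[i]] += w
    return post / total

L = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])
print(assignment_probabilities(L))                 # the Player 3 / Orange entry comes out 0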

Efficiently combine min/max on different columns of a pandas dataframe

I have a pandas dataframe that contains the results of a computation, and I need to:
take the maximum value of a column and, for that value, find the maximum value of another column;
take the minimum value of a column and, for that value, find the maximum value of another column.
Is there a more efficient way to do it?
Setup
from collections import namedtuple
import pandas as pd

metrictuple = namedtuple('metrics', 'prob m1 m2')
l1 =[metrictuple(0.1, 0.4, 0.04),metrictuple(0.2, 0.4, 0.04),metrictuple(0.4, 0.4, 0.1),metrictuple(0.7, 0.2, 0.3),metrictuple(1.0, 0.1, 0.5)]
df = pd.DataFrame(l1)
# df
# prob m1 m2
#0 0.1 0.4 0.04
#1 0.2 0.4 0.04
#2 0.4 0.4 0.10
#3 0.7 0.2 0.30
#4 1.0 0.1 0.50
tmp = df.loc[(df.m1.max() == df.m1), ['prob','m1']]
res1 = tmp.loc[tmp.prob.max() == tmp.prob, :].to_records(index=False)[0]
#(0.4, 0.4)
tmp = df.loc[(df.m2.min() == df.m2), ['prob','m2']]
res2 = tmp.loc[tmp.prob.max() == tmp.prob, :].to_records(index=False)[0]
#(0.2, 0.04)
Pandas isn't ideal for numerical computations, because there is significant overhead in slicing and selecting data, in this example via df.loc.
The good news is that pandas interacts well with numpy, so you can easily drop down to the underlying numpy arrays.
Below I've defined some helper functions which make the code more readable. Note that numpy slicing uses row and column numbers starting from 0.
arr = df.values
def arr_max(x, col):
    return x[x[:, col] == x[:, col].max()]

def arr_min(x, col):
    return x[x[:, col] == x[:, col].min()]
res1 = arr_max(arr_max(arr, 1), 0)[:,:2] # array([[ 0.4, 0.4]])
res2 = arr_max(arr_min(arr, 2), 0)[:,[0,2]] # array([[ 0.2 , 0.04]])
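If you prefer to stay in pandas, a roughly equivalent version (my phrasing, not the answerer's) chains a boolean mask with nlargest, which also handles ties in the first column; it will still carry more overhead than the numpy helpers above:
res1 = df.loc[df['m1'] == df['m1'].max()].nlargest(1, 'prob')[['prob', 'm1']].iloc[0]  # prob 0.4, m1 0.4
res2 = df.loc[df['m2'] == df['m2'].min()].nlargest(1, 'prob')[['prob', 'm2']].iloc[0]  # prob 0.2, m2 0.04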
