Efficient chaining of boolean indexers in pandas DataFrames - python

I am trying to chain a variable number of boolean pandas Series as efficiently as possible, to be used as a filter on a DataFrame through boolean indexing.
Normally when dealing with multiple boolean conditions, one chains them like this:
condition_1 = (df.A > some_value)
condition_2 = (df.B <= other_value)
condition_3 = (df.C == another_value)
full_indexer = condition_1 & condition_2 & condition_3
but this becomes a problem with a variable number of conditions.
bool_indexers = [
    condition_1,
    condition_2,
    ...,
    condition_N,
]
I have tried out some possible solutions, but I am convinced it can be done more efficiently.
Option 1
Loop over the indexers and apply consecutively.
full_indexer = bool_indexers[0]
for indexer in bool_indexers[1:]:
    full_indexer &= indexer
Option 2
Put into a DataFrame and calculate the row product.
full_indexer = pd.DataFrame(bool_indexers).product(axis=0)
Option 3
Use numpy.prod (like in this answer) and create a new Series out of the result.
full_indexer = pd.Series(np.prod(np.vstack(bool_indexers), axis=0))
All three solutions are somewhat inefficient because they rely on looping or force you to create a new object (which can be slow if repeated many times).
Can it be done more efficiently or is this it?

Use np.logical_and.reduce:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [0, 1, 2], 'B': [0, 1, 2], 'C': [0, 1, 2]})
m1 = df.A > 0
m2 = df.B <= 1
m3 = df.C == 1
m = np.logical_and.reduce([m1, m2, m3])
# OR m = np.all([m1, m2, m3], axis=0)
out = df[m]
Output:
>>> pd.concat([m1, m2, m3], axis=1)
       A      B      C
0  False   True  False
1   True   True   True
2   True  False  False
>>> m
array([False, True, False])
>>> out
   A  B  C
1  1  1  1
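For a variable-length list of masks, functools.reduce gives the same collapse without writing the loop out; a minimal sketch, assuming bool_indexers is the list from the question:
import functools
import operator

# fold the list pairwise with &, equivalent to m1 & m2 & ... & mN
full_indexer = functools.reduce(operator.and_, bool_indexers)
out = df[full_indexer]
Unlike np.logical_and.reduce, which returns a plain numpy array, this keeps the result as a pandas Series with its original index.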

Related

How to resolve Pandas performance warning "highly fragmented" after using many custom np.where statements?

I have a project where I am converting code from SQL to Pandas. I have 80 custom elements in my dataset / dataframe - each requires custom logic. In SQL, I use multiple case statements within a single Select like this:
Select x, y, z,
    (case when statement1 then 0
          when statement2 then 0
          else 1 end) as custom_element1,
    next case statement... as custom_element2,
    next case statement... as custom_element3,
    etc...
Now in Pandas, I am hoping for some advice on the most efficient way to accomplish the same goal. To make it easier to reproduce, here is an example that does the same thing that I want to do. I need to create 80 custom output variables. In this example, I am just adding one custom element at a time using different np.where statements.
df = pd.DataFrame({'num_legs': [2, 4, 8, 0],
                   'num_wings': [2, 0, 0, 0]},
                  index=['falcon', 'dog', 'spider', 'fish'])
df['custom1'] = np.where(df['num_legs'].values > 2, 1, 0)
df['custom2'] = np.where(df['num_wings'] == df['num_legs'], 1, 0)
df['custom3'] = np.where((df['num_wings'].values == 0) | (df['num_legs'].values == 0), 1, 0)
I can get the output from consecutive np.where statements to match my output from original SQL exactly, so no problems there.
BUT I saw this warning:
DataFrame is highly fragmented...poor performance...Consider using pd.concat
instead...or use copy().
So my question is, for my example, how do I improve performance? How would I use pd.concat here? What is a better way to structure the code than what I am showing above? I have tried searching for an answer in this forum but did not find anything. I appreciate your time in responding.
So, np.where is totally unnecessary here. For example, you can just use:
In [6]: df.num_legs > 2
Out[6]:
falcon    False
dog        True
spider     True
fish      False
Name: num_legs, dtype: bool
Instead of:
In [9]: np.where(df.num_legs > 2, 1, 0)
Out[9]: array([0, 1, 1, 0])
These probably should be bool dtype columns, but if you insist on using int, just add an .astype(int).
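For example, a one-line sketch of the int version, keeping the column names from the question:
df['custom1'] = (df['num_legs'] > 2).astype(int)  # 1/0 instead of True/False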
In any case, here is how you might use pd.concat:
df = pd.concat(
    [
        df,
        (df["num_legs"] > 2).rename("custom1"),
        (df["num_wings"] == df["num_legs"]).rename("custom2"),
        ((df["num_wings"] == 0) | (df["num_legs"] == 0)).rename("custom3"),
    ],
    axis=1,
)
Example:
In [10]: df
Out[10]:
        num_legs  num_wings
falcon         2          2
dog            4          0
spider         8          0
fish           0          0
In [11]: pd.concat(
    ...:     [
    ...:         df,
    ...:         (df["num_legs"] > 2).rename("custom1"),
    ...:         (df["num_wings"] == df["num_legs"]).rename("custom2"),
    ...:         ((df["num_wings"] == 0) | (df["num_legs"] == 0)).rename("custom3"),
    ...:     ],
    ...:     axis=1,
    ...: )
Out[11]:
        num_legs  num_wings  custom1  custom2  custom3
falcon         2          2    False     True    False
dog            4          0     True    False     True
spider         8          0     True    False     True
fish           0          0    False     True     True
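To scale this to all 80 custom elements without triggering the fragmentation warning, one option is to collect the new columns in a dict and concatenate once at the end; a sketch, assuming each element can be written as a vectorized expression:
new_cols = {
    'custom1': df['num_legs'] > 2,
    'custom2': df['num_wings'] == df['num_legs'],
    'custom3': (df['num_wings'] == 0) | (df['num_legs'] == 0),
    # ... custom4 through custom80 ...
}
# one concat instead of 80 single-column insertions
df = pd.concat([df, pd.DataFrame(new_cols, index=df.index)], axis=1)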

Pandas: Determine if a string in one column is a substring of a string in another column

Consider these series:
>>> a = pd.Series('abc a abc c'.split())
>>> b = pd.Series('a abc abc a'.split())
>>> pd.concat((a, b), axis=1)
     0    1
0  abc    a
1    a  abc
2  abc  abc
3    c    a
>>> unknown_operation(a, b)
0    False
1     True
2     True
3    False
The desired logic is to determine if the string in the left column is a substring of the string in the right column. pd.Series.str.contains does not accept another Series, and pd.Series.isin checks if the value exists in the other series (not in the same row specifically). I'm interested to know if there's a vectorized solution (not using .apply or a loop), but it may be that there isn't one.
Let us try numpy's defchararray.find, which is vectorized:
from numpy.core.defchararray import find
find(df[1].values.astype(str), df[0].values.astype(str)) != -1
Out[740]: array([False,  True,  True, False])
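The same function is exposed publicly as np.char.find, which avoids the import from numpy.core:
import numpy as np
np.char.find(df[1].values.astype(str), df[0].values.astype(str)) != -1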
IIUC,
df[1].str.split('', expand=True).eq(df[0], axis=0).any(axis=1) | df[1].eq(df[0])
Output:
0    False
1     True
2     True
3    False
dtype: bool
I tested various functions with a randomly generated Dataframe of 1,000,000 5 letter entries.
Running on my machine, the averages of 3 tests showed:
zip > v_find > to_list > any > apply
0.21s > 0.79s > 1s > 3.55s > 8.6s
Hence, I would recommend using zip:
[x[0] in x[1] for x in zip(df['A'], df['B'])]
or vectorized find (as proposed by BENY)
np.char.find(df['B'].values.astype(str), df['A'].values.astype(str)) != -1
My test-setup:
import random
import string
import numpy as np
import pandas as pd

n = 1_000_000  # number of rows

def generate_string(length):
    return ''.join(random.choices(string.ascii_uppercase + string.digits, k=length))

A = [generate_string(5) for x in range(n)]
B = [generate_string(5) for y in range(n)]
df = pd.DataFrame({"A": A, "B": B})

# the candidates (note: the last two assignments shadow the builtins any and zip)
to_list = pd.Series([a in b for a, b in df[['A', 'B']].values.tolist()])
apply = df.apply(lambda s: s["A"] in s["B"], axis=1)
v_find = np.char.find(df['B'].values.astype(str), df['A'].values.astype(str)) != -1
any = df["B"].str.split('', expand=True).eq(df["A"], axis=0).any(axis=1) | df["B"].eq(df["A"])
zip = [x[0] in x[1] for x in zip(df['A'], df['B'])]
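A hypothetical timing harness in the same spirit (not the exact script used for the numbers above); note it must run before the builtin zip is shadowed:
import timeit
# average of 3 runs for the zip-based candidate
t = timeit.timeit(lambda: [a in b for a, b in zip(df['A'], df['B'])], number=3) / 3
print(f'zip: {t:.2f}s')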

Choose the best of three columns

I have a dataset with three columns A, B and C. I want to create a column where I select the two columns closest to each other and take the average. Take the table below as an example:
A  B  C  Best of Three
3  2  5  2.5
4  3  1  3.5
1  5  2  1.5
For the first row, A and B are the closest pair, so the Best of Three column is (3+2)/2 = 2.5; for the third row, A and C are the closest pair, so the Best of Three column is (1+2)/2 = 1.5. Below is my code. It is quite unwieldy and quickly becomes too long when there are more columns. I look forward to suggestions!
import pandas as pd

data = {'A': [3, 4, 1],
        'B': [2, 3, 5],
        'C': [5, 1, 2]}
df = pd.DataFrame(data)

df['D'] = abs(df['A'] - df['B'])
df['E'] = abs(df['A'] - df['C'])
df['F'] = abs(df['C'] - df['B'])
df['G'] = df[['D', 'E', 'F']].min(axis=1)

df.loc[df['G'] == df['D'], 'Best of Three'] = (df['A'] + df['B']) / 2
df.loc[df['G'] == df['E'], 'Best of Three'] = (df['A'] + df['C']) / 2
df.loc[df['G'] == df['F'], 'Best of Three'] = (df['B'] + df['C']) / 2
First you need a function that finds the minimum difference between two elements in a list. The function also returns the average of those two values, so the result is a tuple (diff, average):
def min_list(values):
    return min((abs(x - y), (x + y) / 2)
               for i, x in enumerate(values)
               for y in values[i + 1:])
Then apply it to each row:
df = pd.DataFrame([[3, 2, 5, 6], [4, 3, 1, 10], [1, 5, 10, 20]],
                  columns=['A', 'B', 'C', 'D'])
df['best'] = df.apply(lambda x: min_list(x)[1], axis=1)
print(df)
Functions are your friends. You want to write a function that finds the two closest integers in a list, then pass it the list of the row's values. Store those results and pass them to a second function that returns the average of two values.
(Also, your code would be much more readable if you replaced D, E, F, and G with descriptively named variables.)
Solve by using the itertools combinations generator:
import itertools

def get_closest_avg(s):
    # all unordered pairs of values in the row
    c = list(itertools.combinations(s, 2))
    # pick the pair with the smallest absolute difference and average it
    return sum(c[pd.Series(c).apply(lambda x: abs(x[0] - x[1])).idxmin()]) / 2

df['B3'] = df.apply(get_closest_avg, axis=1)
df:
   A  B  C   B3
0  3  2  5  2.5
1  4  3  1  3.5
2  1  5  2  1.5
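A vectorized alternative without apply is also possible; a sketch, assuming exactly the three columns A, B and C:
import numpy as np

pairs = [('A', 'B'), ('A', 'C'), ('B', 'C')]
# per-row absolute difference and average for each pair of columns
diffs = np.column_stack([(df[a] - df[b]).abs() for a, b in pairs])
avgs = np.column_stack([(df[a] + df[b]) / 2 for a, b in pairs])
# for each row, take the average belonging to the smallest difference
df['Best of Three'] = avgs[np.arange(len(df)), diffs.argmin(axis=1)]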

Comparing two columns in pandas dataframe

I have a pandas dataframe where I would like to verify that column A is greater than column B (row wise). I am doing something like
tmp = df['B'] - df['A']
if any([v for v in tmp if v > 0]):
    ....
I was wondering if there was a better (more concise) way of doing it, or if pandas DataFrames have any built-in routines to accomplish this.
df = pd.DataFrame({'A': [1, 2, 3], 'B': [3, 1, 1]})
temp = df['B'] - df['A']
print(temp)
0    2
1   -1
2   -2
dtype: int64
Now you can create a Boolean series using temp > 0:
print(temp > 0)
0     True
1    False
2    False
dtype: bool
This boolean Series can be fed to any(), so you can use:
if any(temp > 0):
    print('juhu!')
Or simply (which avoids temp):
if any(df['B'] > df['A']):
    print('juhu')
using the same logic of creating a Boolean series first:
print(df['B'] > df['A'])
0     True
1    False
2    False
dtype: bool
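For completeness, a pandas Series also has its own .any() method, which avoids the Python-level iteration of the builtin any(); a minimal sketch:
if (df['B'] > df['A']).any():
    print('juhu')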
df['B'] > df['A'] will be a pandas Series of boolean dtype.
>>> (df['B']>df['A']).dtype
dtype('bool')
For example
>>> df['B']>df['A']
0     True
1    False
2    False
3     True
4     True
dtype: bool
The any() function returns True if any item in an iterable is true:
>>> if any(df['B']>df['A']):
... print(True)
...
True
I guess you wanted to check if any df['B'] > df['A'] and then do something.
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [2, 0, 6, 3]})
if np.where(df['B'] > df['A'], 1, 0).sum():
    print('do something')

Assign values in Pandas series based on condition?

I have a dataframe df like
A  B
1  2
3  4
I then want to create 2 new series
t = pd.Series()
r = pd.Series()
I was able to assign values to t using the condition cond as below:
t = "1+" + df.A.astype(str) + '+' + df.B.astype(str)
cond = df['A'] < df['B']
t[cond] = "1+" + df.loc[cond, 'B'].astype(str) + '+' + df.loc[cond, 'A'].astype(str)
But I'm having problems with r. I just want r to contain values of 2 when cond is satisfied and 1 otherwise.
If I just try
r = 1
r[cond] = 2
Then I get TypeError: 'int' object does not support item assignment
I figure I could just run a for loop through df and check the cases in cond through each row of df, but I was wondering if Pandas offers a more efficient way instead?
You will laugh at how easy this is:
r = cond + 1
The reason is that cond is a boolean Series (True and False values), which evaluate to 1 and 0. Adding one coerces the booleans to int, so True maps to 2 and False maps to 1.
df = pd.DataFrame({'A': [1, 3, 4],
                   'B': [2, 4, 3]})
cond = df['A'] < df['B']

>>> cond + 1
0    2
1    2
2    1
dtype: int64
When you assign 1 to r as in
r = 1
r now references the integer 1. So when you call r[cond] you're treating an integer like a series.
You first want to create a Series of ones for r, the same size as cond. Something like:
r = pd.Series(np.ones(cond.shape))
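An alternative sketch using np.where, which builds the 2/1 values in one step (assuming cond from above):
import numpy as np
r = pd.Series(np.where(cond, 2, 1), index=df.index)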
