Pandas boolean algebra: True if True in both columns

I would like to make a boolean vector that is created by the comparison of two input boolean vectors. I can use a for loop, but is there a better way to do this?
My ideal solution would look like this:
df['A'] = [True, False, False, True]
df['B'] = [True, False, False, False]
C = ((df['A']==True) or (df['B']==True)).as_matrix()
print C
>>> True, False, False, True

I think this is what you are looking for:
C = (df['A']) | (df['B'])
C
0 True
1 False
2 False
3 True
dtype: bool
You could then leave this as a Series or convert it to a list or array.
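A minimal sketch of why the `or` in the question fails and how the element-wise operators behave, assuming the two columns from the question; the variable names are illustrative only:

```python
import pandas as pd

df = pd.DataFrame({
    "A": [True, False, False, True],
    "B": [True, False, False, False],
})

# Python's `or` needs a single truth value for the whole Series,
# which pandas refuses to guess, so it raises ValueError.
try:
    df["A"] or df["B"]
    or_failed = False
except ValueError:
    or_failed = True

either = df["A"] | df["B"]   # element-wise OR
both = df["A"] & df["B"]     # element-wise AND

either_list = either.tolist()      # plain Python list
either_array = either.to_numpy()   # NumPy boolean array

print(either_list)
print(both.tolist())
```

The parentheses around each operand matter once comparisons are involved, because `|` and `&` bind more tightly than `==`.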

Alternatively, you could use the any method with axis=1, which checks across the columns of each row. It also works for any number of columns that hold boolean values:
In [1105]: df
Out[1105]:
B A
0 True True
1 False False
2 False False
3 False True
In [1106]: df.any(axis=1)
Out[1106]:
0 True
1 False
2 False
3 True
dtype: bool
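If the frame also holds non-boolean columns, one option is to restrict the check to the boolean ones first. The sketch below assumes a hypothetical extra `label` column and also shows `all`, which corresponds to "True in both columns":

```python
import pandas as pd

df = pd.DataFrame({
    "A": [True, False, False, True],
    "B": [True, False, False, False],
    "label": list("wxyz"),   # hypothetical non-boolean column
})

bool_cols = ["A", "B"]

# True where at least one of the selected columns is True in that row
any_true = df[bool_cols].any(axis=1)

# True only where every selected column is True in that row
all_true = df[bool_cols].all(axis=1)

print(any_true.tolist())
print(all_true.tolist())
```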

Related

How to Calculate Dropoff by Unique Field in Pandas DataFrame with Duplicates

import numpy as np
import pandas as pd
df = pd.DataFrame({
'user' : ['A', 'A', 'B', 'B', 'B', 'C', 'C'],
'step_1' : [True, True, True, True, True, True, True],
'step_2' : [True, False, False, True, False, True, True],
'step_3' : [False, False, False, False, False, True, True]
})
print(df)
user step_1 step_2 step_3
0 A True True False
1 A True False False
2 B True False False
3 B True True False
4 B True False False
5 C True True True
6 C True True True
I would like to run the calculation to see what fraction of users get to each step. I have multiple observations of some users, and the order cannot be counted on to simply do a df.drop_duplicates( subset = ['user'] ).
In this case, the answer should be:
Step 1 = 1.00 (because A, B, and C all have a True in Step 1)
Step 2 = 1.00 (A, B, C)
Step 3 = 0.33 (C)
(I do not need to worry about any edge case in which a user goes from False in one step to True in a subsequent step within the same row.)
In your case you can do
df.groupby('user').any().mean()
Out[11]:
step_1 1.000000
step_2 1.000000
step_3 0.333333
dtype: float64
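The one-liner can be unpacked into two steps, sketched here with the data from the question (the intermediate names `reached` and `fractions` are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "user": ["A", "A", "B", "B", "B", "C", "C"],
    "step_1": [True, True, True, True, True, True, True],
    "step_2": [True, False, False, True, False, True, True],
    "step_3": [False, False, False, False, False, True, True],
})

# One row per user: True if the user reached the step in any observation
reached = df.groupby("user").any()

# The mean of a boolean column is the fraction of True values
fractions = reached.mean()

print(reached)
print(fractions)
```

Because the duplicates are collapsed per user before averaging, the row order and the number of observations per user do not affect the result.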

Pandas rolling: aggregate boolean values

Is there any rolling "any" function in a pandas.DataFrame? Or is there any other way to aggregate boolean values in a rolling function?
Consider:
import pandas as pd
import numpy as np
s = pd.Series([True, True, False, True, False, False, False, True])
# this works but I don't think it is clear enough - I am not
# interested in the sum but a logical or!
s.rolling(2).sum() > 0
# What I would like to have:
s.rolling(2).any()
# AttributeError: 'Rolling' object has no attribute 'any'
s.rolling(2).agg(np.any)
# Same error! AttributeError: 'Rolling' object has no attribute 'any'
So which functions can I use when aggregating booleans? (if numpy.any does not work)
The rolling documentation at https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.DataFrame.rolling.html states that "a Window or Rolling sub-classed for the particular operation" is returned, which doesn't really help.
You aggregate boolean values like this:
# logical or
s.rolling(2).max().astype(bool)
# logical and
s.rolling(2).min().astype(bool)
To deal with the NaN values from incomplete windows, you can use an appropriate fillna before the type conversion, or the min_periods argument of rolling, depending on the logic you want to implement.
It is a pity this cannot be done in pandas without creating intermediate values as floats.
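A short sketch of the two ways of handling the incomplete first window. It casts to int first, which is an extra precaution not in the answer, and the variable names are illustrative:

```python
import pandas as pd

s = pd.Series([True, True, False, True, False, False, False, True])
as_int = s.astype(int)  # explicit cast so max/min operate on numbers

# min_periods=1: an incomplete window is evaluated on the values it has
rolling_or = as_int.rolling(2, min_periods=1).max().astype(bool)
rolling_and = as_int.rolling(2, min_periods=1).min().astype(bool)

# default min_periods: the first window is NaN, treated here as False
rolling_or_strict = as_int.rolling(2).max().fillna(0).astype(bool)

print(rolling_or.tolist())
print(rolling_and.tolist())
print(rolling_or_strict.tolist())
```

Which variant is right depends on whether a partial window should count as a valid observation.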
This method is not implemented. The closest alternative is Rolling.apply:
out = s.rolling(2).apply(lambda x: x.any(), raw=False)
print(out)
0 NaN
1 1.0
2 1.0
3 1.0
4 1.0
5 0.0
6 0.0
7 1.0
dtype: float64
To get booleans back, fill the NaN from the incomplete first window and convert the type:
out = s.rolling(2).apply(lambda x: x.any(), raw=False).fillna(0).astype(bool)
print(out)
0 False
1 True
2 True
3 True
4 True
5 False
6 False
7 True
dtype: bool
A better option here is to use strides: generate a 2D NumPy array of windows and process it afterwards:
s = pd.Series([True, True, False, True, False, False, False, True])
def rolling_window(a, window):
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
a = rolling_window(s.to_numpy(), 2)
print (a)
[[ True True]
[ True False]
[False True]
[ True False]
[False False]
[False False]
[False True]]
print (np.any(a, axis=1))
[ True True True True False False True]
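Here the leading values that pandas would report as NaN are omitted. You can prepend padding values before processing, here False. Note that padding with n values yields len(s) + 1 windows; pad with n - 1 values to match the length of s: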
n = 2
x = np.concatenate([[False] * (n), s])
def rolling_window(a, window):
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
a = rolling_window(x, n)
print (a)
[[False False]
[False True]
[ True True]
[ True False]
[False True]
[ True False]
[False False]
[False False]
[False True]]
print (np.any(a, axis=1))
[False True True True True True False False True]
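As an alternative to a hand-written `as_strided` helper, NumPy 1.20 and later ship `sliding_window_view`, which builds the same 2D window array with bounds checking. The sketch below is a substitute for the helper above, not the answer's own code, and it pads with `window - 1` False values so the result lines up with the original index:

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

s = pd.Series([True, True, False, True, False, False, False, True])
window = 2

# Each row is one window; shape is (len(s) - window + 1, window)
windows = sliding_window_view(s.to_numpy(), window)
rolling_any = windows.any(axis=1)

# Prepend window - 1 False values for the incomplete leading windows
padded = np.concatenate([np.zeros(window - 1, dtype=bool), rolling_any])
result = pd.Series(padded, index=s.index)

print(rolling_any)
print(result.tolist())
```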
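Another approach writes each window's result back into s as a side effect of Rolling.apply and then returns s:
def function1(ss:pd.Series):
    # store the logical OR of the window at the window's last index
    s[ss.index.max()]=any(ss)
    return 0
s.rolling(2).apply(function1).pipe(lambda ss:s)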
0 True
1 True
2 True
3 True
4 True
5 True
6 False
7 True

Return multiple columns based on date range using pandas

I'm basically trying to calculate revenue to date using pandas. I would like to return N columns consisting of each quarter end. Each column would calculate total revenue to date as of that quarter end. I have:
df['Amortization_per_Day'] = (2.5, 3.2, 5.5, 6.5, 9.2)
df['Start_Date'] = ('1/1/2018', '2/27/2018', '3/31/2018', '5/23/2018', '6/30/2018')
Date_Range = pd.date_range('10/31/2017', periods=75, freq='Q-Jan')
and want to do something like:
df['Amortization_per_Day'] * (('Date_Range' - df['Start_Date']).dt.days + 1)
for each date within the Date_Range. I'm not sure how to pass the Date_Range through the function and to return N columns. I've been reading about zip(*df) and shift but not fully grasping it. Thank you so much for your help.
Solution
Here's a complete solution:
from datetime import datetime
import numpy as np
import pandas as pd
df = pd.DataFrame()
df['Amortization_per_Day'] = (2.5, 3.2, 5.5, 6.5, 9.2)
df['Start_Date'] = ('1/1/18', '2/27/18', '3/31/18', '5/23/2018', '6/30/2018')
df['Start_Date'] = pd.to_datetime(df['Start_Date'])
dr = pd.date_range('10/31/2017', periods=75, freq='Q-Jan')
def betweendates(x, y):
    # pad the start dates with sentinel minimum/maximum dates at both ends
    xv = x.values.astype('datetime64[D]')
    xpad = np.zeros(xv.size + 2, dtype=xv.dtype)
    xpad[1:-1] = xv
    xpad[0],xpad[-1] = np.datetime64(datetime.min), np.datetime64(datetime.max)
    yv = y.values.astype('datetime64[D]')
    # compare every date in y against each pair of consecutive padded dates
    return (xpad[:-1] <= yv[:,None]) & (xpad[1:] >= yv[:,None])
# get a boolean array that indicates which dates in dr are in between which dates in df['Start_Date']
btwn = betweendates(df['Start_Date'], dr)
# based on the boolean array btwn, select out the salient rows from df and dates from dr
dfsel = df[btwn[:, 1:].T]
drsel = dr[btwn[:, 1:].sum(axis=1, dtype=bool)]
# do the actual calculation the OP wanted
dfsel['Amortization_per_Day'] * ((drsel - dfsel['Start_Date']).dt.days + 1)
Output:
0 77.5
2 170.5
4 294.4
4 1140.8
4 1987.2
4 2806.0
4 3652.4
4 4498.8
4 5345.2
4 6173.2
...
4 52394.0
4 53212.8
4 54059.2
4 54905.6
4 55752.0
4 56570.8
4 57417.2
4 58263.6
4 59110.0
4 59938.0
Length: 74, dtype: float64
Explanation
The boolean btwn array looks like this:
[[ True False False False False False]
[False True False False False False]
[False False False True False False]
[False False False False False True]
[False False False False False True]
[False False False False False True]
[False False False False False True]
[False False False False False True]
[False False False False False True]
[False False False False False True]
[False False False False False True]
[False False False False False True]
[False False False False False True]
...
The ith row of btwn corresponds to the ith datetime in your date range. In each row, exactly one value will be True, and the others will be False. A True value in the 0th column indicates that the datetime is before all of the Start_Date values, a True value in the 1st column indicates that the datetime is between the 0th and the 1st dates in Start_Date, and so forth. A True value in the last column indicates that the datetime is after all of the Start_Date values.
By slicing btwn like this:
btwn[:, 1:]
it can be used to match up datetimes in your date range with the immediately preceding Start_Date. If you instead change the slices of btwn to be like this:
btwn[:, :-1]
you would end up matching each datetime to the next Start_Date instead.
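The answer above selects one matched row per date. If the goal is literally one column per quarter end, as the question asks, NumPy broadcasting may be a simpler route. The sketch below is a different technique, not the answer's method. It assumes three hand-picked quarter-end dates instead of the full date range, and it assumes revenue is zero before a row's Start_Date:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Amortization_per_Day": [2.5, 3.2, 5.5, 6.5, 9.2],
    "Start_Date": pd.to_datetime(
        ["2018-01-01", "2018-02-27", "2018-03-31", "2018-05-23", "2018-06-30"]
    ),
})

# Assumption: a short, explicit list of quarter ends for illustration
quarter_ends = pd.to_datetime(["2018-01-31", "2018-04-30", "2018-07-31"])

starts = df["Start_Date"].to_numpy()[:, None]   # shape (rows, 1)
ends = quarter_ends.to_numpy()[None, :]         # shape (1, quarters)

# Inclusive day count for every (row, quarter end) pair
days = (ends - starts) / np.timedelta64(1, "D") + 1

# Assumption: nothing is earned before the start date
days = np.clip(days, 0, None)

revenue = pd.DataFrame(
    df["Amortization_per_Day"].to_numpy()[:, None] * days,
    columns=quarter_ends.strftime("%Y-%m-%d"),
    index=df.index,
)
print(revenue)
```

Each column can then be joined back onto df, for example with `pd.concat([df, revenue], axis=1)`.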

Python | count number of False statements in 3 rows

Columns L, M and N of my dataframe are populated with 'true' and 'false' statements (1000 rows). I would like to create a new column 'count_false' that returns the number of times a 'false' statement occurred in columns L, M and N.
Any tips appreciated!
Thank you.
You can negate your dataframe and sum over axis=1:
df = pd.DataFrame(np.random.randint(0, 2, (5, 3)), columns=list('LMN')).astype(bool)
df['Falses'] = (~df).sum(1)
print(df)
L M N Falses
0 True False True 1
1 True False False 2
2 True True True 0
3 False True False 2
4 False False True 2
If you have additional columns, you can filter accordingly:
df['Falses'] = (~df[list('LMN')]).sum(1)
Try this: df[df == False].count(axis=1)
As explained in this Stack Overflow question, True and False are equal to 1 and 0 in Python, so something like the third line of the following example should solve your problem:
import pandas as pd
df = pd.DataFrame([[True, False, True],[False, False, False],[True, False, True],[False, False, True]], columns=['L','M','N'])
df['count_false'] = 3 - (df['L']*1 + df['M']*1 + df['N']*1)
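The same idea can be written without hard-coding the number of columns. The sketch below reuses the example data. The second half assumes, hypothetically, that the columns hold the strings 'true' and 'false' rather than real booleans, which the question leaves ambiguous:

```python
import pandas as pd

df = pd.DataFrame(
    [[True, False, True], [False, False, False],
     [True, False, True], [False, False, True]],
    columns=["L", "M", "N"],
)
cols = ["L", "M", "N"]

# Negate, then sum across each row: every False counts as 1
df["count_false"] = (~df[cols]).sum(axis=1)

# Equivalent: number of columns minus the number of True values
alt = len(cols) - df[cols].sum(axis=1)

# Assumption: string-valued columns instead of booleans
str_df = pd.DataFrame({"L": ["true", "false"], "M": ["false", "false"]})
str_count = str_df.eq("false").sum(axis=1)

print(df)
print(str_count.tolist())
```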

Numpy/Pandas clean way to check if a specific value is NaN

How can I check if a given value is NaN?
e.g. if (a == np.NaN) (doesn't work)
Please note that:
Numpy's isnan method throws errors with data types like string
Pandas docs only provide methods to drop rows containing NaNs, or ways to check if/when DataFrame contains NaNs. I'm asking about checking if a specific value is NaN.
Relevant Stackoverflow questions and Google search results seem to be about checking "if any value is NaN" or "which values in a DataFrame"
There must be a clean way to check if a given value is NaN?
You can use the innate property that NaN != NaN, so a == a will return False if a is NaN.
This works even for strings.
Example:
In[52]:
s = pd.Series([1, np.nan, '', 1.0])
s
Out[52]:
0 1
1 NaN
2
3 1
dtype: object
for val in s:
print(val==val)
True
False
True
True
This can be done in a vectorised manner:
In[54]:
s==s
Out[54]:
0 True
1 False
2 True
3 True
dtype: bool
but you can still use the method isnull on the whole series:
In[55]:
s.isnull()
Out[55]:
0 False
1 True
2 False
3 False
dtype: bool
UPDATE
As noted by @piRSquared, None == None returns True, so the a == a check does not flag None as missing, whereas pd.isnull(None) returns True. Use the == comparison if you do not want to treat None as NaN, and pd.isnull if you do.
Pandas has isnull, notnull, isna, and notna
These functions work for arrays or scalars.
Setup
a = np.array([[1, np.nan],
[None, '2']])
Pandas functions
pd.isna(a)
# same as
# pd.isnull(a)
array([[False, True],
[ True, False]])
pd.notnull(a)
# same as
# pd.notna(a)
array([[ True, False],
[False, True]])
DataFrame (or Series) methods
b = pd.DataFrame(a)
b.isnull()
# same as
# b.isna()
0 1
0 False True
1 True False
b.notna()
# same as
# b.notnull()
0 1
0 True False
1 False True
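For the single-value case the question asks about, the two approaches above can be sketched as small helpers. The names `is_nan` and `is_nan_strict` are hypothetical, not pandas functions:

```python
import numpy as np
import pandas as pd

def is_nan(value):
    """True for NaN and None; safe for strings and other scalar types."""
    return bool(pd.isna(value))

def is_nan_strict(value):
    """True only for NaN, using the NaN != NaN property; None gives False."""
    return value != value

samples = [np.nan, float("nan"), None, "text", "", 1.0]
loose = [is_nan(v) for v in samples]
strict = [is_nan_strict(v) for v in samples]

print(loose)
print(strict)
```

The only difference between the two helpers on these samples is how None is treated.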
