Pandas rolling: aggregate boolean values - python

Is there any rolling "any" function in a pandas.DataFrame? Or is there any other way to aggregate boolean values in a rolling function?
Consider:
import pandas as pd
import numpy as np
s = pd.Series([True, True, False, True, False, False, False, True])
# this works but I don't think it is clear enough - I am not
# interested in the sum but a logical or!
s.rolling(2).sum() > 0
# What I would like to have:
s.rolling(2).any()
# AttributeError: 'Rolling' object has no attribute 'any'
s.rolling(2).agg(np.any)
# Same error! AttributeError: 'Rolling' object has no attribute 'any'
So which functions can I use when aggregating booleans? (if numpy.any does not work)
The rolling documentation at https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.DataFrame.rolling.html states that "a Window or Rolling sub-classed for the particular operation" is returned, which doesn't really help.

You aggregate boolean values like this:
# logical or
s.rolling(2).max().astype(bool)
# logical and
s.rolling(2).min().astype(bool)
To deal with the NaN values from incomplete windows, you can use an appropriate fillna before the type conversion, or the min_periods argument of rolling. Which one to choose depends on the logic you want to implement.
It is a pity this cannot be done in pandas without creating intermediate values as floats.
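As a minimal sketch of the two NaN-handling options mentioned above (the variable names are illustrative, not from the original answer), the rolling "or" and "and" could look like this:

```python
import pandas as pd

s = pd.Series([True, True, False, True, False, False, False, True])

# Option 1: min_periods=1 lets the incomplete first window produce a value,
# so no NaN appears and the cast to bool is safe.
rolling_or = s.rolling(2, min_periods=1).max().astype(bool)
rolling_and = s.rolling(2, min_periods=1).min().astype(bool)

# Option 2: keep the default min_periods and decide explicitly what an
# incomplete window should mean (here: False) before casting.
rolling_or_strict = s.rolling(2).max().fillna(0).astype(bool)

print(rolling_or.tolist())
print(rolling_and.tolist())
print(rolling_or_strict.tolist())
```

The two options differ only in the first position: with min_periods=1 the first window contains the single value True, while the fillna(0) variant treats the incomplete window as False.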

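This method is not implemented. The closest alternative is Rolling.apply:
# store the result under a new name so the original s stays boolean
result = s.rolling(2).apply(lambda x: x.any(), raw=False)
print(result)
0 NaN
1 1.0
2 1.0
3 1.0
4 1.0
5 0.0
6 0.0
7 1.0
dtype: float64
# fill the NaN from the incomplete first window, then convert back to bool
result = s.rolling(2).apply(lambda x: x.any(), raw=False).fillna(0).astype(bool)
print(result)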
0 False
1 True
2 True
3 True
4 True
5 False
6 False
7 True
dtype: bool
A better option here is to use strides: generate a 2D NumPy array of windows and reduce it afterwards:
s = pd.Series([True, True, False, True, False, False, False, True])
def rolling_window(a, window):
    # build a read-only style view in which each row is one window of a
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
a = rolling_window(s.to_numpy(), 2)
print(a)
[[ True True]
[ True False]
[False True]
[ True False]
[False False]
[False False]
[False True]]
print (np.any(a, axis=1))
[ True True True True False False True]
The leading positions that pandas reports as NaN are omitted here. To include them, pad the start of the data before windowing, here with False values. (Padding with n values, as below, gives one more result than len(s); padding with n - 1 values keeps the lengths equal.)
n = 2
x = np.concatenate([[False] * (n), s])
def rolling_window(a, window):
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
a = rolling_window(x, n)
print(a)
[[False False]
[False True]
[ True True]
[ True False]
[False True]
[ True False]
[False False]
[False False]
[False True]]
print (np.any(a, axis=1))
[False True True True True True False False True]
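As an alternative sketch, assuming NumPy 1.20 or newer, numpy.lib.stride_tricks.sliding_window_view builds the same window array without computing shapes and strides by hand. The variable names below are illustrative only:

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

s = pd.Series([True, True, False, True, False, False, False, True])
window = 2

# Each row of `windows` is one window; shape is (len(s) - window + 1, window).
windows = sliding_window_view(s.to_numpy(), window)
any_valid = windows.any(axis=1)

# Pad with window - 1 False values so the result has the same length as s.
padded = np.concatenate([np.zeros(window - 1, dtype=bool), s.to_numpy()])
any_full = pd.Series(sliding_window_view(padded, window).any(axis=1), index=s.index)

print(any_valid)
print(any_full.tolist())
```

Padding with False makes the first window behave like min_periods=1 for a logical "or"; for a logical "and" (all), padding with True would be the neutral choice.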

def function1(ss: pd.Series):
    s[ss.index.max()] = any(ss)
    return 0
s.rolling(2).apply(function1).pipe(lambda ss: s)
0 True
1 True
2 True
3 True
4 True
5 True
6 False
7 True

Related

How to Calculate Dropoff by Unique Field in Pandas DataFrame with Duplicates

import numpy as np
import pandas as pd
df = pd.DataFrame({
'user' : ['A', 'A', 'B', 'B', 'B', 'C', 'C'],
'step_1' : [True, True, True, True, True, True, True],
'step_2' : [True, False, False, True, False, True, True],
'step_3' : [False, False, False, False, False, True, True]
})
print(df)
user step_1 step_2 step_3
0 A True True False
1 A True False False
2 B True False False
3 B True True False
4 B True False False
5 C True True True
6 C True True True
I would like to run the calculation to see what fraction of users get to each step. I have multiple observations of some users, and the order cannot be counted on to simply do a df.drop_duplicates( subset = ['user'] ).
In this case, the answer should be:
Step 1 = 1.00 (because A, B, and C all have a True in Step 1)
Step 2 = 1.00 (A, B, C)
Step 3 = 0.33 (C)
(I do not need to worry about any edge case in which a user goes from False in one step to True in a subsequent step within the same row.)
In your case you can do
df.groupby('user').any().mean()
Out[11]:
step_1 1.000000
step_2 1.000000
step_3 0.333333
dtype: float64
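To make the one-liner above easier to follow, here is the same computation split into two steps (the intermediate names are illustrative): first collapse each user's rows with any, then average the resulting booleans across users.

```python
import pandas as pd

df = pd.DataFrame({
    'user':   ['A', 'A', 'B', 'B', 'B', 'C', 'C'],
    'step_1': [True, True, True, True, True, True, True],
    'step_2': [True, False, False, True, False, True, True],
    'step_3': [False, False, False, False, False, True, True],
})

# One row per user: True if the user reached the step in any observation.
reached = df.groupby('user').any()

# The mean of a boolean column is the fraction of True values.
fraction = reached.mean()

print(reached)
print(fraction)
```

Because any is applied per user, the order and number of duplicate rows do not affect the result.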

compare rows of a numpy array to all rows

I'm trying to compare each row of a numpy array with the whole numpy array without using iteration.
>>> sample = np.array([[1,2,3],[4,5,6]])
>>> sample
array([[1, 2, 3],
[4, 5, 6]])
First I reshape the 2D-array to a 3D-array:
>>> sample2=sample.reshape(sample.shape[0],1,sample.shape[1])
And then with the following line of code I can compare the rows:
>>> sample2 == sample
array([[[ True, True, True],
[False, False, False]],
[[False, False, False],
[ True, True, True]]])
...which is the result that I'm looking for.
But this does not work with large numpy arrays:
>>> sample3 = np.random.randint(low= 0, high = 2, size = 30000000).reshape(30000,1000)
>>> sample4 = sample3.reshape(sample3.shape[0],1,sample3.shape[1])
>>> sample4 == sample3
<ipython-input-229-e1d55c6bb1ca>:1: DeprecationWarning: elementwise
comparison failed; this will raise an error in the future.
False
How can I solve this?
This may shed some light on your question. Here is my code sample, based on yours:
import numpy as np
n=30000000
ny = 1000
sample3 = np.random.randint(low= 0, high = 2, size = n).reshape(n // ny, ny)
sample4 = sample3.reshape(sample3.shape[0],1,sample3.shape[1])
print(sample3.shape, sample4.shape)
test = sample4 == sample3
print(test)
test = np.equal(sample4, sample3)
print(test)
Its output is:
(30000, 1000) (30000, 1, 1000)
C:\Users\XYZ\python\code_sample.py:7: DeprecationWarning: elementwise comparison failed; this will raise an error in the future.
test = sample4 == sample3
False
Traceback (most recent call last):
File "code_sample.py", line 9, in <module>
test = np.equal(sample4, sample3)
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 838. GiB for an array with shape (30000, 30000, 1000) and data type bool
Also, here are the docs for numpy.equal(), which is presumably used by the == operator for numpy arrays. They state:
Input arrays. If x1.shape != x2.shape, they must be broadcastable to a common shape (which becomes the shape of the output).
So it looks like equal() may be attempting to use a substantial amount of memory (838 GB in the example above). Perhaps == decides to fail and give the deprecation warning (rather than something more apt, such as an out-of-memory error) when it realizes there's not enough memory?
Also, if I reduce n from 30000000 to 3000000 and comment out the call to equal(), execution of the == statement takes 10 or 20 seconds before the following result is printed:
(3000, 1000) (3000, 1, 1000)
[[[ True True True ... True True True]
[False True True ... True True True]
[ True True True ... True True True]
...
[False True True ... True False True]
[False False True ... True True False]
[ True False False ... False False False]]
[[False True True ... True True True]
[ True True True ... True True True]
[False True True ... True True True]
...
[ True True True ... True False True]
[ True False True ... True True False]
[False False False ... False False False]]
[[ True True True ... True True True]
[False True True ... True True True]
[ True True True ... True True True]
...
[False True True ... True False True]
[False False True ... True True False]
[ True False False ... False False False]]
...
[[False True True ... True False True]
[ True True True ... True False True]
[False True True ... True False True]
...
[ True True True ... True True True]
[ True False True ... True False False]
[False False False ... False True False]]
[[False False True ... True True False]
[ True False True ... True True False]
[False False True ... True True False]
...
[ True False True ... True False False]
[ True True True ... True True True]
[False True False ... False False True]]
[[ True False False ... False False False]
[False False False ... False False False]
[ True False False ... False False False]
...
[False False False ... False True False]
[False True False ... False False True]
[ True True True ... True True True]]
So it looks like the issue you've encountered is probably related to running out of memory.
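The 838 GiB figure can be checked by hand: broadcasting (30000, 1, 1000) against (30000, 1000) produces 30000 × 30000 × 1000 booleans at one byte each. If only whole-row equality is needed (an assumption, since the question asks for the full element-wise result), a chunked comparison keeps memory bounded. The helper below is a hypothetical sketch, not a NumPy function:

```python
import numpy as np

rows, cols = 30000, 1000
n_bytes = rows * rows * cols          # one byte per boolean in the broadcast result
print(n_bytes / 2**30)                # roughly 838 GiB

def rows_equal_chunked(a, chunk=16):
    """Return an (n, n) boolean matrix where out[i, j] is True if row i equals row j.

    Only `chunk` rows are broadcast at a time, so the temporary array holds
    chunk * n * cols booleans instead of n * n * cols.
    """
    n = a.shape[0]
    out = np.empty((n, n), dtype=bool)
    for start in range(0, n, chunk):
        block = a[start:start + chunk]                     # shape (c, cols)
        out[start:start + chunk] = (block[:, None, :] == a[None, :, :]).all(axis=2)
    return out

small = np.array([[1, 2, 3], [4, 5, 6], [1, 2, 3]])
print(rows_equal_chunked(small, chunk=2))
```

Note that the (n, n) result itself still takes about 0.9 GB for 30000 rows, so the chunk size and the output size both need to fit in available memory.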

PANDAS: AND and OR between two dataframes

I have two dataframes of the same size with boolean values. Is there a way to perform the AND, OR or XOR functions between the two dataframes?
for example
df1:
[False, True, False]
[True, False, True]
df2:
[True, False, False]
[True, False, False]
df1 OR df2
[True, True, False]
[True, False,True ]
You might use numpy.logical_and and numpy.logical_or for this task, i.e.:
import numpy as np
import pandas as pd
df1 = pd.DataFrame([[False,True,False],[True,False,True]])
df2 = pd.DataFrame([[True,False,False],[True,False,False]])
dfor = np.logical_or(df1,df2)
print(dfor)
output
0 1 2
0 True True False
1 True False True
import pandas
df1 = pandas.DataFrame({1: [False, True, False], 2: [True, False, True]})
df2 = pandas.DataFrame({1: [True, False, False], 2: [True, False, False]})
Using OR:
resulting_dataframe_using_OR = pandas.DataFrame({1: (df1[1]|df2[1]), 2: (df1[2]|df2[2])})
Output using OR:
1 2
0 True True
1 True False
2 False True
Using AND:
resulting_dataframe_using_AND = pandas.DataFrame({1: (df1[1]&df2[1]), 2: (df1[2]&df2[2])})
Output using AND:
1 2
0 False True
1 False False
2 False False
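As a shorter sketch, the |, & and ^ operators also work directly on two boolean DataFrames with matching index and columns, so the column-by-column construction is not required. The data below is taken from the question; the result names are illustrative:

```python
import pandas as pd

df1 = pd.DataFrame([[False, True, False], [True, False, True]])
df2 = pd.DataFrame([[True, False, False], [True, False, False]])

df_or = df1 | df2     # element-wise logical OR
df_and = df1 & df2    # element-wise logical AND
df_xor = df1 ^ df2    # element-wise logical XOR

print(df_or)
print(df_and)
print(df_xor)
```

These operators align on index and column labels, so both frames should share the same labels for the result to be meaningful.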

Return multiple columns based on date range using pandas

I'm basically trying to calculate revenue to date using pandas. I would like to return N columns consisting of each quarter end. Each column would calculate total revenue to date as of that quarter end. I have:
df['Amortization_per_Day'] = (2.5, 3.2, 5.5, 6.5, 9.2)
df['Start_Date'] = ('1/1/2018', '2/27/2018', '3/31/2018', '5/23/2018', '6/30/2018')
Date_Range = pd.date_range('10/31/2017', periods=75, freq='Q-Jan')
and want to do something like:
df['Amortization_per_Day'] * (('Date_Range' - df['Start_Date']).dt.days + 1)
for each date within the Date_Range. I'm not sure how to pass the Date_Range through the function and to return N columns. I've been reading about zip(*df) and shift but not fully grasping it. Thank you so much for your help.
Solution
Here's a complete solution:
from datetime import datetime
import numpy as np
import pandas as pd
df = pd.DataFrame()
df['Amortization_per_Day'] = (2.5, 3.2, 5.5, 6.5, 9.2)
df['Start_Date'] = ('1/1/18', '2/27/18', '3/31/18', '5/23/2018', '6/30/2018')
df['Start_Date'] = pd.to_datetime(df['Start_Date'])
dr = pd.date_range('10/31/2017', periods=75, freq='Q-Jan')
def betweendates(x, y):
    # pad the start dates with the minimum and maximum datetimes so that
    # every date in y falls into some interval
    xv = x.values.astype('datetime64[D]')
    xpad = np.zeros(xv.size + 2, dtype=xv.dtype)
    xpad[1:-1] = xv
    xpad[0],xpad[-1] = np.datetime64(datetime.min), np.datetime64(datetime.max)
    yv = y.values.astype('datetime64[D]')
    return (xpad[:-1] <= yv[:,None]) & (xpad[1:] >= yv[:,None])
# get a boolean array that indicates which dates in dr are in between which dates in df['Start_Date']
btwn = betweendates(df['Start_Date'], dr)
# based on the boolean array btwn, select out the salient rows from df and dates from dr
dfsel = df[btwn[:, 1:].T]
drsel = dr[btwn[:, 1:].sum(axis=1, dtype=bool)]
# do the actual calculation the OP wanted
dfsel['Amortization_per_Day'] * ((drsel - dfsel['Start_Date']).dt.days + 1)
Output:
0 77.5
2 170.5
4 294.4
4 1140.8
4 1987.2
4 2806.0
4 3652.4
4 4498.8
4 5345.2
4 6173.2
...
4 52394.0
4 53212.8
4 54059.2
4 54905.6
4 55752.0
4 56570.8
4 57417.2
4 58263.6
4 59110.0
4 59938.0
Length: 74, dtype: float64
Explanation
The boolean btwn array looks like this:
[[ True False False False False False]
[False True False False False False]
[False False False True False False]
[False False False False False True]
[False False False False False True]
[False False False False False True]
[False False False False False True]
[False False False False False True]
[False False False False False True]
[False False False False False True]
[False False False False False True]
[False False False False False True]
[False False False False False True]
...
The ith row of btwn corresponds to the ith datetime in your date range. In each row, exactly one value will be True, and the others will be False. A True value in the 0th column indicates that the datetime is before all of the Start_Date values, a True value in the 1st column indicates that the datetime is in between the 0th and the 1st dates in Start_Date, and so forth. A True value in the last column indicates that the datetime is after all of the Start_Date values.
By slicing btwn like this:
btwn[:, 1:]
it can be used to match up datetimes in your date range with the immediately preceding Start_Date. If you instead change the slices of btwn to be like this:
btwn[:, :-1]
you would end up matching each datetime to the next Start_Date instead.
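The padding and slicing logic can be illustrated on a toy example. The sketch below uses plain floats as stand-ins for dates (an assumption made for readability) and assumes the start values are sorted and that no query equals a start value exactly:

```python
import numpy as np

starts = np.array([10.0, 20.0, 30.0])    # stand-ins for the sorted Start_Date values
queries = np.array([5.0, 15.0, 35.0])    # stand-ins for the quarter-end dates

# Pad with -inf/+inf so every query falls into exactly one interval.
padded = np.concatenate([[-np.inf], starts, [np.inf]])
btwn = (padded[:-1] <= queries[:, None]) & (padded[1:] >= queries[:, None])
print(btwn)

# Dropping column 0 leaves one column per start value: a True in column j
# means starts[j] is the start immediately preceding that query.
preceding = btwn[:, 1:]
print(preceding)
```

Here the query 5.0 precedes every start, so its row in btwn[:, 1:] is all False, while 15.0 is matched to 10.0 and 35.0 to 30.0.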

Pandas boolean algebra: True if True in both columns

I would like to make a boolean vector that is created by the comparison of two input boolean vectors. I can use a for loop, but is there a better way to do this?
My ideal solution would look like this:
df['A'] = [True, False, False, True]
df['B'] = [True, False, False, False]
C = ((df['A']==True) or (df['B']==True)).as_matrix()
print C
>>> True, False, False, True
I think this is what you are looking for:
C = (df['A']) | (df['B'])
C
0 True
1 False
2 False
3 True
dtype: bool
You could then leave this as a series or convert it to a list or array
Alternatively, you could use the any method with axis=1 to check across the columns of each row. It also works for any number of columns containing True values:
In [1105]: df
Out[1105]:
B A
0 True True
1 False False
2 False False
3 False True
In [1106]: df.any(axis=1)
Out[1106]:
0 True
1 False
2 False
3 True
dtype: bool
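For reference, a small sketch comparing the element-wise operators with the row-wise reductions, using the data from the question (the result names are illustrative). Python's or and and keywords do not work on a Series and raise a ValueError, which is why | and & are used instead:

```python
import pandas as pd

df = pd.DataFrame({
    'A': [True, False, False, True],
    'B': [True, False, False, False],
})

either = df['A'] | df['B']    # True if True in at least one column
both = df['A'] & df['B']      # True only if True in both columns

# The row-wise reductions give the same answers for any number of columns.
either_any = df.any(axis=1)
both_all = df.all(axis=1)

print(either.tolist())
print(both.tolist())
```

Calling .to_numpy() or .tolist() on either result converts it to an array or a list.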
