import numpy as np
import pandas as pd
df = pd.DataFrame({
'user' : ['A', 'A', 'B', 'B', 'B', 'C', 'C'],
'step_1' : [True, True, True, True, True, True, True],
'step_2' : [True, False, False, True, False, True, True],
'step_3' : [False, False, False, False, False, True, True]
})
print(df)
user step_1 step_2 step_3
0 A True True False
1 A True False False
2 B True False False
3 B True True False
4 B True False False
5 C True True True
6 C True True True
I would like to calculate what fraction of users reach each step. Some users have multiple observations, and the row order cannot be relied on, so I cannot simply call df.drop_duplicates(subset=['user']).
In this case, the answer should be:
Step 1 = 1.00 (because A, B, and C all have a True in Step 1)
Step 2 = 1.00 (A, B, C)
Step 3 = 0.33 (C)
(I do not need to worry about any edge case in which a user goes from False in one step to True in a subsequent step within the same row.)
In your case you can do
df.groupby('user').any().mean()
Out[11]:
step_1 1.000000
step_2 1.000000
step_3 0.333333
dtype: float64
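To see why this one-liner works, here is a small sketch that splits it into its two steps. The intermediate names reached and fractions are only for illustration and are not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'user':   ['A', 'A', 'B', 'B', 'B', 'C', 'C'],
    'step_1': [True, True, True, True, True, True, True],
    'step_2': [True, False, False, True, False, True, True],
    'step_3': [False, False, False, False, False, True, True],
})

# One row per user: True if any of that user's rows reached the step.
reached = df.groupby('user').any()

# The mean of a boolean column is the fraction of True values,
# i.e. the fraction of users who reached each step.
fractions = reached.mean()

print(reached)
print(fractions)
```

Because any() collapses each user to a single row first, repeated observations of the same user do not inflate the fraction.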
I have a DataFrame with 700 rows and 5100 columns. Every cell contains True or False. With this df, I want to test all possible combinations of the columns and, for each combination, check row by row whether the result is True.
I received excellent help from a fellow user the other day in this thread: How to test all possible combinations with True/False Statement in python?, suggesting that I use "combinations" from itertools and "product".
This works fine with a small dataset. However, when I apply this method to my (much larger) dataset, I run out of memory when testing combinations of more than 2 columns.
My desired output would be like the example below, but produced in a way that does not run out of memory.
Thank you for any help.
Suggested method with small dataset:
import pandas as pd
from itertools import combinations
df1 = pd.DataFrame({"Main1": [True, False, False, False, False, True, True],
"Main2": [False, False, True, False, True, True, False],
"Main3": [True, False, True, True, True, True, False]})
df2 = pd.DataFrame({"Sub1": [False, False, True, False, True, False, True],
"Sub2": [False, True, False, False, True, False, True],
"Sub3": [True, False, True, False, False, False, True]})
df3 = df1.join(df2)
all_combinations = list(combinations(df3.columns, 2)) + \
                   list(combinations(df3.columns, 3))
for combination in all_combinations:
    df3["".join(list(combination))] = df3[list(combination)].product(axis=1).astype(bool)
df3.drop(labels=["Main1", "Main2", "Main3", "Sub1", "Sub2", "Sub3"], axis=1, inplace=True)
df3
Main1Main2 Main1Main3 ... Main3Sub2Sub3 Sub1Sub2Sub3
0 False True ... False False
1 False False ... False False
2 False False ... False False
3 False False ... False False
4 False False ... False False
5 True True ... False False
6 False False ... False True
So, I'm not real proud of this one, but maybe it has a chance of victory... :)
I think you need to get out of the data frame, because it cannot grow large enough to hold your results properly. If your results are predictably sparse, you could use an alternate structure, like below.
Note that this will be a long loop for what you are doing: about 22 billion combinations of 3 columns times the 700 rows of the data frame, so roughly 15 trillion checks. But if you only have to do it once, who cares. The combinations function in itertools returns a lazy iterator, so it will be memory efficient.
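The size of the loop can be checked with a couple of lines. This is just the arithmetic for the 5100 columns and 700 rows mentioned in the question, using math.comb (Python 3.8+).

```python
import math

n_columns = 5100  # number of columns, from the question
n_rows = 700      # number of rows, from the question

# Number of distinct 3-column combinations.
triples = math.comb(n_columns, 3)

# Each combination is checked against every row.
checks = triples * n_rows

print(f"{triples:,} combinations of 3 columns")
print(f"{checks:,} row checks in total")
```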
I think you are looking for results that are "all True" above as you are using the product operator. I mis-stated in comments.
You could add to this below with a second loop to cover the combinations of size 2 if it ever completes! :)
import pandas as pd
from itertools import combinations
df = pd.DataFrame({ "Main1": [True, False, False, False, False, True, True],
"Main2": [False, False, True, False, True, True, False],
"Main3": [True, False, True, True, True, True, False],
"Sub1": [False, False, True, False, True, False, True],
"Sub2": [False, True, False, False, True, False, True],
"Sub3": [True, False, True, False, False, False, True]})
print(df)
data = df.to_dict('index')
# test to see if it looks right for row 0
print(data[0])
# now the data is in a nested dictionary, which should be more "iterable"
results = []
for combo in combinations(df.columns, 3):
    for key in data:  # iterate through the rows in the data... index is key.
        values = set(data[key][col] for col in combo)
        if all(values):
            results.append((key, combo))
# inspect results...
for result in results:
    print(f'row: {result[0]} columns: {result[1]} product is TRUE')
Yields:
Main1 Main2 Main3 Sub1 Sub2 Sub3
0 True False True False False True
1 False False False False True False
2 False True True True False True
3 False False True False False False
4 False True True True True False
5 True True True False False False
6 True False False True True True
{'Main1': True, 'Main2': False, 'Main3': True, 'Sub1': False, 'Sub2': False, 'Sub3': True}
row: 5 columns: ('Main1', 'Main2', 'Main3') product is TRUE
row: 0 columns: ('Main1', 'Main3', 'Sub3') product is TRUE
row: 6 columns: ('Main1', 'Sub1', 'Sub2') product is TRUE
row: 6 columns: ('Main1', 'Sub1', 'Sub3') product is TRUE
row: 6 columns: ('Main1', 'Sub2', 'Sub3') product is TRUE
row: 2 columns: ('Main2', 'Main3', 'Sub1') product is TRUE
row: 4 columns: ('Main2', 'Main3', 'Sub1') product is TRUE
row: 4 columns: ('Main2', 'Main3', 'Sub2') product is TRUE
row: 2 columns: ('Main2', 'Main3', 'Sub3') product is TRUE
row: 4 columns: ('Main2', 'Sub1', 'Sub2') product is TRUE
row: 2 columns: ('Main2', 'Sub1', 'Sub3') product is TRUE
row: 4 columns: ('Main3', 'Sub1', 'Sub2') product is TRUE
row: 2 columns: ('Main3', 'Sub1', 'Sub3') product is TRUE
row: 6 columns: ('Sub1', 'Sub2', 'Sub3') product is TRUE
[Finished in 0.6s]
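A possible variation on the same idea, offered as a sketch rather than a tested solution for the full 5100-column case: keep the lazy combinations iterator, but let NumPy check all rows of a combination at once and yield the hits one by one instead of collecting them. The helper name iter_all_true is made up for this example.

```python
from itertools import combinations

import numpy as np
import pandas as pd

df = pd.DataFrame({"Main1": [True, False, False, False, False, True, True],
                   "Main2": [False, False, True, False, True, True, False],
                   "Main3": [True, False, True, True, True, True, False],
                   "Sub1": [False, False, True, False, True, False, True],
                   "Sub2": [False, True, False, False, True, False, True],
                   "Sub3": [True, False, True, False, False, False, True]})


def iter_all_true(frame, size):
    """Yield (row_label, column_names) wherever all chosen columns are True."""
    values = frame.to_numpy(dtype=bool)
    columns = list(frame.columns)
    for idx in combinations(range(len(columns)), size):
        # Logical AND of the selected columns, evaluated for every row at once.
        mask = values[:, list(idx)].all(axis=1)
        names = tuple(columns[i] for i in idx)
        for row in frame.index[mask]:
            yield row, names


# On the small example it is fine to materialise the hits in a list.
hits = list(iter_all_true(df, 3))
for row, names in hits:
    print(f'row: {row} columns: {names} product is TRUE')
```

On the real data you would consume the generator directly (for example, write each hit to a file) rather than building a list. The number of combinations is unchanged, so the run time will still be long.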
Is there any rolling "any" function in a pandas.DataFrame? Or is there any other way to aggregate boolean values in a rolling function?
Consider:
import pandas as pd
import numpy as np
s = pd.Series([True, True, False, True, False, False, False, True])
# this works but I don't think it is clear enough - I am not
# interested in the sum but a logical or!
s.rolling(2).sum() > 0
# What I would like to have:
s.rolling(2).any()
# AttributeError: 'Rolling' object has no attribute 'any'
s.rolling(2).agg(np.any)
# Same error! AttributeError: 'Rolling' object has no attribute 'any'
So which functions can I use when aggregating booleans? (if numpy.any does not work)
The rolling documentation at https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.DataFrame.rolling.html states that "a Window or Rolling sub-classed for the particular operation" is returned, which doesn't really help.
You aggregate boolean values like this:
# logical or
s.rolling(2).max().astype(bool)
# logical and
s.rolling(2).min().astype(bool)
To deal with the NaN values from incomplete windows, you can use an appropriate fillna before the type conversion, or the min_periods argument of rolling, depending on the logic you want to implement.
It is a pity this cannot be done in pandas without creating intermediate values as floats.
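As a small sketch of the two ways of handling incomplete windows mentioned above (the variable names or_fill and or_partial are only for illustration):

```python
import pandas as pd

s = pd.Series([True, True, False, True, False, False, False, True])

# Option 1: treat an incomplete window as False by filling the NaN with 0
# before converting back to bool.
or_fill = s.rolling(2).max().fillna(0).astype(bool)

# Option 2: evaluate an incomplete window on the values it does have,
# so no NaN is produced in the first place.
or_partial = s.rolling(2, min_periods=1).max().astype(bool)

print(or_fill.tolist())
print(or_partial.tolist())
```

The two options differ only at the first position, where the window holds a single value.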
This method is not implemented; the closest alternative is Rolling.apply:
out = s.rolling(2).apply(lambda x: x.any(), raw=False)
print (out)
0 NaN
1 1.0
2 1.0
3 1.0
4 1.0
5 0.0
6 0.0
7 1.0
dtype: float64
out = s.rolling(2).apply(lambda x: x.any(), raw=False).fillna(0).astype(bool)
print (out)
0 False
1 True
2 True
3 True
4 True
5 False
6 False
7 True
dtype: bool
A better option here is to use strides: generate a 2D NumPy array of the windows and reduce it afterwards:
s = pd.Series([True, True, False, True, False, False, False, True])
def rolling_window(a, window):
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
a = rolling_window(s.to_numpy(), 2)
print (a)
[[ True True]
[ True False]
[False True]
[ True False]
[False False]
[False False]
[False True]]
print (np.any(a, axis=1))
[ True True True True False False True]
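Here the leading positions that pandas would fill with NaN are omitted. To get a result for them as well, you can pad the front of the array before processing, here with False values (note that padding with n values, as below, gives one more result than the series has elements):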
n = 2
x = np.concatenate([[False] * (n), s])
def rolling_window(a, window):
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
a = rolling_window(x, n)
print (a)
[[False False]
[False True]
[ True True]
[ True False]
[False True]
[ True False]
[False False]
[False False]
[False True]]
print (np.any(a, axis=1))
[False True True True True True False False True]
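If your NumPy is version 1.20 or newer, the hand-written strides helper can probably be replaced by sliding_window_view, which builds the same kind of windowed view with bounds checking. A minimal sketch, assuming that version is available:

```python
import numpy as np
import pandas as pd

s = pd.Series([True, True, False, True, False, False, False, True])

# Read-only 2D view with one row per window of length 2.
windows = np.lib.stride_tricks.sliding_window_view(s.to_numpy(), 2)

# Logical OR across each window.
rolling_or = windows.any(axis=1)

print(windows.shape)
print(rolling_or)
```

As with the as_strided version, the result has len(s) - window + 1 entries, so it lines up with s from index 1 onwards.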
def function1(ss: pd.Series):
    s[ss.index.max()] = any(ss)
    return 0
s.rolling(2).apply(function1).pipe(lambda ss:s)
0 True
1 True
2 True
3 True
4 True
5 True
6 False
7 True
I would like to drop all duplicates within my df and add their number of occurrences to a pre-existing column, e.g. 'four'.
df = pd.DataFrame({'one': pd.Series([True, True, True, False]),
'two': pd.Series([True, False, False, True]),
'three': pd.Series([True, False, False, False]),
'four': pd.Series([1,1,1,1])})
one two three four
0 True True True 1
1 True False False 1
2 True False False 1
3 False True False 1
Should look like this:
one two three four
0 True True True 1
1 True False False 2
2 False True False 1
You can use groupby with sum as the aggregation function:
df = pd.DataFrame({'one': pd.Series([True, True, True, False]),
'two': pd.Series([True, False, False, True]),
'three': pd.Series([True, False, False, False]),
'four': pd.Series([1, 1, 1, 1])})
print(df.groupby(['one', 'two', 'three'], sort=False).sum().reset_index())
Outputs
one two three four
0 True True True 1
1 True False False 2
2 False True False 1
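Summing works here because every row starts with 1 in 'four'. If that column could hold other values and you want a plain count of duplicate rows, one possible variant (an assumption about what is wanted, not part of the answer above) is to count the group sizes instead:

```python
import pandas as pd

df = pd.DataFrame({'one': pd.Series([True, True, True, False]),
                   'two': pd.Series([True, False, False, True]),
                   'three': pd.Series([True, False, False, False]),
                   'four': pd.Series([1, 1, 1, 1])})

# size() counts the rows in each group and ignores the existing 'four' values;
# the counts are then written back under the same column name.
counts = (df.groupby(['one', 'two', 'three'], sort=False)
            .size()
            .reset_index(name='four'))

print(counts)
```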
I would like to make a boolean vector that is created by the comparison of two input boolean vectors. I can use a for loop, but is there a better way to do this?
My ideal solution would look like this:
df['A'] = [True, False, False, True]
df['B'] = [True, False, False, False]
C = ((df['A']==True) or (df['B']==True)).as_matrix()
print C
>>> True, False, False, True
I think this is what you are looking for:
C = (df['A']) | (df['B'])
C
0 True
1 False
2 False
3 True
dtype: bool
You could then leave this as a Series or convert it to a list or array.
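For the conversion, a short sketch: tolist gives a plain Python list and to_numpy (pandas 0.24+) gives a NumPy array. The as_matrix method used in the question was removed in pandas 1.0, so to_numpy is the usual replacement.

```python
import pandas as pd

df = pd.DataFrame({'A': [True, False, False, True],
                   'B': [True, False, False, False]})

# Element-wise logical OR of the two boolean columns.
C = df['A'] | df['B']

as_list = C.tolist()     # plain Python list of bools
as_array = C.to_numpy()  # NumPy array with dtype bool

print(as_list)
print(as_array)
```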
Alternatively, you could use the any method with axis=1 to reduce across the columns of each row. It also works for any number of columns:
In [1105]: df
Out[1105]:
B A
0 True True
1 False False
2 False False
3 False True
In [1106]: df.any(axis=1)
Out[1106]:
0 True
1 False
2 False
3 True
dtype: bool