How to Calculate Dropoff by Unique Field in Pandas DataFrame with Duplicates - python

import numpy as np
import pandas as pd
df = pd.DataFrame({
    'user':   ['A', 'A', 'B', 'B', 'B', 'C', 'C'],
    'step_1': [True, True, True, True, True, True, True],
    'step_2': [True, False, False, True, False, True, True],
    'step_3': [False, False, False, False, False, True, True]
})
print(df)
  user  step_1  step_2  step_3
0    A    True    True   False
1    A    True   False   False
2    B    True   False   False
3    B    True    True   False
4    B    True   False   False
5    C    True    True    True
6    C    True    True    True
I would like to calculate what fraction of users get to each step. I have multiple observations of some users, and the row order cannot be counted on, so I can't simply use df.drop_duplicates(subset=['user']).
In this case, the answer should be:
Step 1 = 1.00 (because A, B, and C all have a True in Step 1)
Step 2 = 1.00 (A, B, C)
Step 3 = 0.33 (C)
(I do not need to worry about any edge case in which a user goes from False in one step to True in a subsequent step within the same row.)

In your case you can do
df.groupby('user').any().mean()
Out[11]:
step_1    1.000000
step_2    1.000000
step_3    0.333333
dtype: float64
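To see why this works: groupby('user').any() collapses each user's rows into a single row that is True for a step if any of that user's observations is True, and mean() then averages those booleans as 0/1 per column. A minimal walk-through of the intermediate result:
# one row per user, True if the user ever reached the step
per_user = df.groupby('user').any()
print(per_user)
#       step_1  step_2  step_3
# user
# A       True    True   False
# B       True    True   False
# C       True    True    True

# booleans average as 0/1, giving the fraction of users per step
print(per_user.mean())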

Related

PANDAS: AND and OR between two dataframes

I have two dataframes of the same size with boolean values. Is there a way to perform the AND, OR or XOR functions between the two dataframes?
For example:
df1:
[False, True, False]
[True, False, True]
df2:
[True, False, False]
[True, False, False]
df1 OR df2:
[True, True, False]
[True, False, True]
You might use numpy.logical_and and numpy.logical_or for this task, e.g.:
import numpy as np
import pandas as pd
df1 = pd.DataFrame([[False, True, False], [True, False, True]])
df2 = pd.DataFrame([[True, False, False], [True, False, False]])
dfor = np.logical_or(df1, df2)
print(dfor)
Output:
       0      1      2
0   True   True  False
1   True  False   True
Alternatively, the element-wise | and & operators can be applied column by column (note this answer assumes a plain import pandas):
import pandas
df1 = pandas.DataFrame({1: [False, True, False], 2: [True, False, True]})
df2 = pandas.DataFrame({1: [True, False, False], 2: [True, False, False]})
Using OR:
resulting_dataframe_using_OR = pandas.DataFrame({1: (df1[1]|df2[1]), 2: (df1[2]|df2[2])})
Output using OR:
       1      2
0   True   True
1   True  False
2  False   True
Using AND:
resulting_dataframe_using_AND = pandas.DataFrame({1: (df1[1]&df2[1]), 2: (df1[2]&df2[2])})
Output using AND:
       1      2
0  False   True
1  False  False
2  False  False
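Neither answer covers the XOR the question also asks about, but pandas overloads the element-wise operators on whole aligned frames, so no column-by-column work is needed. A minimal sketch using df1 and df2 from the first answer:
dfor  = df1 | df2   # same result as np.logical_or(df1, df2)
dfand = df1 & df2   # same result as np.logical_and(df1, df2)
dfxor = df1 ^ df2   # element-wise XOR; np.logical_xor(df1, df2) also works
print(dfxor)
#        0      1      2
# 0   True   True  False
# 1  False  False   True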

All possible outcomes with large dataset in Pandas and sorting the results

I have a DataFrame with 700 rows and 5100 columns. Each cell contains True or False. With this df, I want to test all possible combinations of the columns and, for each combination, test whether every value in a row is True.
I received excellent help from a fellow user the other day in this thread: How to test all possible combinations with True/False Statement in python?, suggesting that I use "combinations" from itertools and "product".
This works fine with a small dataset. However, when applying this method to my (much larger) dataset, I run out of memory when testing combinations of more than 2.
My desired output would be like the example below, but computed in a way that doesn't run out of memory.
Thank you for any help.
Suggested method with small dataset:
import pandas as pd
from itertools import combinations

df1 = pd.DataFrame({"Main1": [True, False, False, False, False, True, True],
                    "Main2": [False, False, True, False, True, True, False],
                    "Main3": [True, False, True, True, True, True, False]})
df2 = pd.DataFrame({"Sub1": [False, False, True, False, True, False, True],
                    "Sub2": [False, True, False, False, True, False, True],
                    "Sub3": [True, False, True, False, False, False, True]})
df3 = df1.join(df2)
all_combinations = list(combinations(df3.columns, 2)) + \
                   list(combinations(df3.columns, 3))
for combination in all_combinations:
    df3["".join(list(combination))] = df3[list(combination)].product(axis=1).astype(bool)
df3.drop(labels=["Main1", "Main2", "Main3", "Sub1", "Sub2", "Sub3"], axis=1, inplace=True)
df3
   Main1Main2  Main1Main3  ...  Main3Sub2Sub3  Sub1Sub2Sub3
0       False        True  ...          False         False
1       False       False  ...          False         False
2       False       False  ...          False         False
3       False       False  ...          False         False
4       False       False  ...          False         False
5        True        True  ...          False         False
6       False       False  ...          False          True
So, I'm not real proud of this one, but maybe it has a chance of victory... :)
I think you need to get out of the data frame, because it cannot grow large enough to hold your results properly. If your results are predictably sparse, you could use an alternate structure, like below.
Note this will be a long loop for what you are doing: roughly 22 billion 3-column combinations times the length of the data frame, so over a trillion checks. But if you only have to do it once, who cares. The combinations function in itertools is a generator, so it is memory efficient.
I think you are looking for rows that are "all True", since you are using the product operator. (I mis-stated this in the comments.)
You could extend the code below with a second loop to cover the combinations of size 2, if it ever completes! :)
import pandas as pd
from itertools import combinations

df = pd.DataFrame({"Main1": [True, False, False, False, False, True, True],
                   "Main2": [False, False, True, False, True, True, False],
                   "Main3": [True, False, True, True, True, True, False],
                   "Sub1": [False, False, True, False, True, False, True],
                   "Sub2": [False, True, False, False, True, False, True],
                   "Sub3": [True, False, True, False, False, False, True]})
print(df)

data = df.to_dict('index')
# test to see if it looks right for row 0
print(data[0])

# now the data is in a nested dictionary, which should be more "iterable"
results = []
for combo in combinations(df.columns, 3):
    for key in data:  # iterate through the rows in the data... index is key
        values = set(data[key][col] for col in combo)
        if all(values):
            results.append((key, combo))

# inspect results...
for result in results:
    print(f'row: {result[0]} columns: {result[1]} product is TRUE')
Yields:
   Main1  Main2  Main3   Sub1   Sub2   Sub3
0   True  False   True  False  False   True
1  False  False  False  False   True  False
2  False   True   True   True  False   True
3  False  False   True  False  False  False
4  False   True   True   True   True  False
5   True   True   True  False  False  False
6   True  False  False   True   True   True
{'Main1': True, 'Main2': False, 'Main3': True, 'Sub1': False, 'Sub2': False, 'Sub3': True}
row: 5 columns: ('Main1', 'Main2', 'Main3') product is TRUE
row: 0 columns: ('Main1', 'Main3', 'Sub3') product is TRUE
row: 6 columns: ('Main1', 'Sub1', 'Sub2') product is TRUE
row: 6 columns: ('Main1', 'Sub1', 'Sub3') product is TRUE
row: 6 columns: ('Main1', 'Sub2', 'Sub3') product is TRUE
row: 2 columns: ('Main2', 'Main3', 'Sub1') product is TRUE
row: 4 columns: ('Main2', 'Main3', 'Sub1') product is TRUE
row: 4 columns: ('Main2', 'Main3', 'Sub2') product is TRUE
row: 2 columns: ('Main2', 'Main3', 'Sub3') product is TRUE
row: 4 columns: ('Main2', 'Sub1', 'Sub2') product is TRUE
row: 2 columns: ('Main2', 'Sub1', 'Sub3') product is TRUE
row: 4 columns: ('Main3', 'Sub1', 'Sub2') product is TRUE
row: 2 columns: ('Main3', 'Sub1', 'Sub3') product is TRUE
row: 6 columns: ('Sub1', 'Sub2', 'Sub3') product is TRUE
[Finished in 0.6s]
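If the pure-Python inner loop proves too slow, one way to speed it up without materializing all result columns is to test each combination against the whole boolean matrix at once with NumPy. A sketch of that idea, reusing df from the code above and assuming the raw frame fits in memory as a single bool array (variable names here are illustrative):
import numpy as np
from itertools import combinations

arr = np.asarray(df, dtype=bool)  # rows x columns boolean matrix
cols = list(df.columns)

results = []
for combo in combinations(range(arr.shape[1]), 3):
    # one vectorized check per combination: which rows are all True?
    hit_rows = np.flatnonzero(arr[:, list(combo)].all(axis=1))
    results.extend((int(row), tuple(cols[i] for i in combo)) for row in hit_rows)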

Is there a way to drop all duplicates in a df and add their occurrence in a pre-existing column?

I would like to drop all duplicate rows within my df and record each row's number of occurrences in a pre-existing column, e.g. 'four'.
df = pd.DataFrame({'one': pd.Series([True, True, True, False]),
                   'two': pd.Series([True, False, False, True]),
                   'three': pd.Series([True, False, False, False]),
                   'four': pd.Series([1, 1, 1, 1])})
     one    two  three  four
0   True   True   True     1
1   True  False  False     1
2   True  False  False     1
3  False   True  False     1
Should look like this:
     one    two  three  four
0   True   True   True     1
1   True  False  False     2
2  False   True  False     1
You can use groupby with sum as the aggregation function:
df = pd.DataFrame({'one': pd.Series([True, True, True, False]),
                   'two': pd.Series([True, False, False, True]),
                   'three': pd.Series([True, False, False, False]),
                   'four': pd.Series([1, 1, 1, 1])})
print(df.groupby(['one', 'two', 'three'], sort=False).sum().reset_index())
Output:
     one    two  three  four
0   True   True   True     1
1   True  False  False     2
2  False   True  False     1
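This works because 'four' starts at 1 in every row, so summing it within each group of identical boolean rows yields that row's duplicate count. If no such column existed yet, one sketch of an alternative (assuming the same df minus 'four') is to count group sizes directly:
# hypothetical variant when there is no pre-existing count column:
counts = (df.groupby(['one', 'two', 'three'], sort=False)
            .size()
            .reset_index(name='four'))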

Comparing values of all columns except one

Shown below is the code that compares column values to a constant.
My questions:
Why does the ">=" comparison show "False" for 0.005000 in row "a"? I expect it to be True.
Is it possible to repeat the comparison for all columns except the first and "AND" the results?
Sorry could not format the code properly.
import numpy as np
import pandas as pd

def test_pct_change():
    MIN_CHANGE = 0.0050  # .5%  For some reason 0.0050 does not work in comparison
    data = {'c1': pd.Series([100, 110], index=['a', 'b']),
            'c2': pd.Series([100.5, 105, 3.], index=['a', 'b', 'c']),
            'c3': pd.Series([102, 100, 3.], index=['a', 'b', 'c'])}
    df = pd.DataFrame(data)
    print(df.to_string())
    dft_pct = df.pct_change(axis=1)  # 1: columns
    dft_pct['Has_Min_Change'] = (dft_pct.iloc[:, -2] >= MIN_CHANGE)  # (dft_pct.iloc[:, -1] >= MIN_CHANGE) &
    print('Percent Change')
    print(dft_pct.to_string())
This is why numpy has isclose
Consider the dataframe df
df = pd.DataFrame(np.random.rand(5, 5))
print(df)
          0         1         2         3         4
0  0.362368  0.201145  0.340571  0.733402  0.816436
1  0.216386  0.105877  0.565318  0.102514  0.451794
2  0.221733  0.216303  0.039209  0.482731  0.800290
3  0.200427  0.154020  0.612884  0.695920  0.122780
4  0.986003  0.059244  0.291480  0.270779  0.526996
Evaluate an equality we know to be mathematically true
((100 + df) / 100 - 1) == (df / 100)
       0      1      2      3      4
0  False  False  False  False  False
1  False  False  False  False  False
2  False  False  False  False  False
3  False  False  False  False  False
4  False  False  False  False  False
Let's look at the difference.
We can round to 15 decimal places and it still comes back all zeros.
These are really close.
print(((100 + df) / 100 - 1).sub(df / 100).round(15))
     0    1    2    3    4
0 -0.0  0.0  0.0  0.0  0.0
1 -0.0  0.0  0.0  0.0  0.0
2 -0.0 -0.0  0.0 -0.0 -0.0
3 -0.0  0.0  0.0 -0.0  0.0
4  0.0 -0.0 -0.0  0.0  0.0
This is why numpy has isclose
np.isclose(((100 + df) / 100 - 1), (df / 100))
array([[ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True]], dtype=bool)
This is a consequence of using binary representations to perform decimal math, and we have a workaround.
When computations are performed in double precision, 100.5/100 - 1 comes out slightly less than 0.005, and so (100.5/100 - 1) >= 0.005 evaluates to False. This is why you don't get "Min Change" for the change from 100 to 100.5.
If it's really important that such edge cases be included, you can fudge the inequality slightly, e.g. >= MIN_CHANGE - 1e-15.
One way to represent the condition that all columns satisfy >= MIN_CHANGE is to take the minimum over columns and require that to be >= MIN_CHANGE. Example:
dft_pct['Has_Min_Change'] = dft_pct.min(axis=1) >= MIN_CHANGE
By default, min ignores NaN entries. (Watch out for implicit conversion of booleans to ints, however: min treats False as 0.)
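Putting both ideas together, a sketch that ANDs the comparison across every column except the always-NaN first one, with the floating-point fudge applied (this reuses dft_pct and MIN_CHANGE from the question's code; TOL is an assumed tolerance):
TOL = 1e-15  # fudge factor for floating-point edge cases

# True only when every per-column change meets the threshold;
# note NaN >= x is False, so rows with missing changes fail here,
# unlike the min-based version above, which skips NaNs
dft_pct['Has_Min_Change'] = (dft_pct[['c2', 'c3']] >= MIN_CHANGE - TOL).all(axis=1)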

Pandas boolean algebra: True if True in both columns

I would like to make a boolean vector from the element-wise comparison of two input boolean vectors. I can use a for loop, but is there a better way to do this?
My ideal solution would look like this:
df['A'] = [True, False, False, True]
df['B'] = [True, False, False, False]
C = ((df['A']==True) or (df['B']==True)).as_matrix()
print C
>>> True, False, False, True
I think this is what you are looking for:
C = (df['A']) | (df['B'])
C
0 True
1 False
2 False
3 True
dtype: bool
You could then leave this as a Series or convert it to a list or array.
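For instance, to get the plain array the question's as_matrix() call was after (as_matrix has been removed from modern pandas; to_numpy is the replacement), a minimal sketch:
C = (df['A'] | df['B']).to_numpy()  # array([ True, False, False,  True])
C_list = (df['A'] | df['B']).tolist()  # plain Python list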
Alternatively, you could use the any method with axis=1 to check across each row. This also works for any number of columns containing True values:
In [1105]: df
Out[1105]:
       B      A
0   True   True
1  False  False
2  False  False
3  False   True
In [1106]: df.any(axis=1)
Out[1106]:
0 True
1 False
2 False
3 True
dtype: bool
