2x2 difference table in pandas - python

I'd like to make a 2x2 table with a difference row and a difference column in pandas.
I started with a pivot table like this:
import pandas as pd

# Create data.
df = pd.DataFrame({
    'treatment': [True, False, True, False],
    'young': [True, True, False, False],
    'val': [10, 5, 8, 12]
})
# Pivot.
df.pivot_table('val', 'treatment', 'young')
# young      False  True
# treatment
# False         12     5
# True           8    10
But had trouble adding a difference row and column. Is there a direct way to add differences as margins to pivot tables?

diff can help here:
# Save the pivot table as we'll use it later.
p = df.pivot_table('val', 'treatment', 'young')
# Add the diff row.
p.loc['diff'] = p.diff().iloc[1]
# Add the diff column.
p['diff'] = p.diff(axis=1).iloc[:, 1]
p
# young      False  True  diff
# treatment
# False       12.0   5.0  -7.0
# True         8.0  10.0   2.0
# diff        -4.0   5.0   9.0
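There's no built-in margins option for differences (pivot_table's margins=True only adds aggregate margins via aggfunc), but the two steps above generalize into a small helper. A minimal sketch, where add_diff_margins is a hypothetical name:
def add_diff_margins(pivot):
    # Append a diff row and a diff column (second level minus first) to a 2x2 pivot.
    out = pivot.astype(float)
    out.loc['diff'] = out.diff().iloc[-1]       # row margin: bottom row minus top row
    out['diff'] = out.diff(axis=1).iloc[:, -1]  # column margin, covering the diff row too
    return out

add_diff_margins(df.pivot_table('val', 'treatment', 'young'))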


How to Calculate Dropoff by Unique Field in Pandas DataFrame with Duplicates

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'user': ['A', 'A', 'B', 'B', 'B', 'C', 'C'],
    'step_1': [True, True, True, True, True, True, True],
    'step_2': [True, False, False, True, False, True, True],
    'step_3': [False, False, False, False, False, True, True]
})
print(df)
  user  step_1  step_2  step_3
0    A    True    True   False
1    A    True   False   False
2    B    True   False   False
3    B    True    True   False
4    B    True   False   False
5    C    True    True    True
6    C    True    True    True
I would like to calculate what fraction of users get to each step. I have multiple observations of some users, and the row order cannot be counted on, so a simple df.drop_duplicates(subset=['user']) won't do.
In this case, the answer should be:
Step 1 = 1.00 (because A, B, and C all have a True in Step 1)
Step 2 = 1.00 (A, B, C)
Step 3 = 0.33 (C)
(I do not need to worry about any edge case in which a user goes from False in one step to True in a subsequent step within the same row.)
In your case you can do:
df.groupby('user').any().mean()
Out[11]:
step_1    1.000000
step_2    1.000000
step_3    0.333333
dtype: float64
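To see why this works, run it in two steps. any() collapses each user's rows into one row that is True wherever the user ever reached the step, and the mean of a boolean column is the fraction of True values:
reached = df.groupby('user').any()
print(reached)
#       step_1  step_2  step_3
# user
# A       True    True   False
# B       True    True   False
# C       True    True    True
print(reached.mean())  # same result as above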

merge two dataframes with some common columns where the combining of the common columns needs to be a custom function

My question is very similar to "Merge pandas dataframe, with column operation", but it doesn't answer my needs.
Let's say I have two dataframes such as (note that the dataframe content could be float numbers instead of booleans):
left = pd.DataFrame({0: [True, True, False], 0.5: [False, True, True]}, index=[12.5, 14, 15.5])
right = pd.DataFrame({0.7: [True, False, False], 0.5: [True, False, True]}, index=[12.5, 14, 15.5])
right
        0.5    0.7
12.5   True   True
14.0  False  False
15.5   True  False
left
        0.0    0.5
12.5   True  False
14.0   True   True
15.5  False   True
As you can see they have the same indexes, and one of the columns is common. In real life there might be more common columns, such as one more at 1.0 or at other numbers not yet defined, and more unique columns on each side.
I need to combine the two dataframes such that all unique columns are kept and the common columns are combined using a specific function, e.g. a boolean OR for this example; the indexes are always identical for both dataframes.
So the result should be:
result
        0.0   0.5    0.7
12.5   True  True   True
14.0   True  True  False
15.5  False  True  False
In real life there will be more than two dataframes that need to be combined, but they can be combined sequentially one after the other to an empty first dataframe.
I feel pandas.combine might do the trick, but I can't figure it out from the documentation. Would anybody have a suggestion on how to do it, in one or more steps?
You can concatenate the dataframes, and then group by the column names to apply an operation to the similarly named columns. In this case you can get away with taking the sum and then casting back to bool to get the OR operation.
import pandas as pd

df = pd.concat([left, right], axis=1)
df.groupby(df.columns, axis=1).sum().astype(bool)
Output:
        0.0   0.5    0.7
12.5   True  True   True
14.0   True  True  False
15.5  False  True  False
If you need to do this in a less case-specific manner, then again just group by the columns and apply something to the grouped object over axis=1:
df = pd.concat([left, right], axis=1)
df.groupby(df.columns, axis=1).apply(lambda x: x.any(axis=1))
#         0.0   0.5    0.7
# 12.5   True  True   True
# 14.0   True  True  False
# 15.5  False  True  False
Further, you can define a custom combining function. Here's one which adds twice the left frame to four times the right frame. If a group contains only one column, it returns 2x that column.
Sample Data
left:
      0.0  0.5
12.5    1   11
14.0    2   17
15.5    3   17
right:
      0.7  0.5
12.5    4    2
14.0    4   -1
15.5    5    5
Code
def my_func(x):
    try:
        res = x.iloc[:, 0]*2 + x.iloc[:, 1]*4
    except IndexError:
        res = x.iloc[:, 0]*2
    return res

df = pd.concat([left, right], axis=1)
df.groupby(df.columns, axis=1).apply(lambda x: my_func(x))
Output:
      0.0  0.5  0.7
12.5    2   30    8
14.0    4   30    8
15.5    6   54   10
Finally, if you wanted to do this in a consecutive manner, then you should make use of reduce. Here I'll combine 5 DataFrames with the above function. (I'll just repeat the right Frame 4x for the example)
from functools import reduce

def my_comb(df_l, df_r, func):
    """Concatenate df_l and df_r along axis=1, then apply the
    specified function to each group of like-named columns.
    """
    df = pd.concat([df_l, df_r], axis=1)
    return df.groupby(df.columns, axis=1).apply(lambda x: func(x))

reduce(lambda dfl, dfr: my_comb(dfl, dfr, func=my_func), [left, right, right, right, right])
#       0.0  0.5  0.7
# 12.5   16  296  176
# 14.0   32  212  176
# 15.5   48  572  220
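Note that in recent pandas versions groupby(..., axis=1) is deprecated. If that affects you, the same grouping can be done on the transpose instead; a sketch, assuming the boolean frames from the first example:
df = pd.concat([left, right], axis=1)
# Transpose so columns become rows, group those rows by label,
# OR each group together, then transpose back.
result = df.T.groupby(level=0).any().T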

Numpy/Pandas clean way to check if a specific value is NaN

How can I check if a given value is NaN?
e.g. if (a == np.NaN) (doesn't work)
Please note that:
Numpy's isnan method throws errors with data types like strings
Pandas docs only provide methods to drop rows containing NaNs, or ways to check if/when a DataFrame contains NaNs. I'm asking about checking if a specific value is NaN.
Relevant Stackoverflow questions and Google search results seem to be about checking "if any value is NaN" or "which values in a DataFrame"
There must be a clean way to check if a given value is NaN?
You can use the innate property that NaN != NaN, so a == a will return False if a is NaN. This will work even for strings.
Example:
In[52]:
s = pd.Series([1, np.NaN, '', 1.0])
s
Out[52]:
0      1
1    NaN
2
3      1
dtype: object
for val in s:
    print(val == val)
True
False
True
True
This can be done in a vectorised manner:
In[54]:
s == s
Out[54]:
0     True
1    False
2     True
3     True
dtype: bool
but you can still use the method isnull on the whole series:
In[55]:
s.isnull()
Out[55]:
0    False
1     True
2    False
3    False
dtype: bool
UPDATE
As noted by @piRSquared, None == None returns True, so the a == a trick will not flag None as missing, whereas pd.isnull(None) returns True. So depending on whether you want to treat None as NaN, you can use == for the comparison, or pd.isnull if None should also count as NaN.
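If you just want a scalar test, the trick wraps up into a one-liner; a minimal sketch, where is_nan is a hypothetical name:
def is_nan(x):
    # NaN is the only value that is not equal to itself.
    return x != x

is_nan(np.nan)  # True
is_nan('abc')   # False; works for strings
is_nan(None)    # False; use pd.isnull(None) if None should count as missing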
Pandas has isnull, notnull, isna, and notna
These functions work for arrays or scalars.
Setup
a = np.array([[1, np.nan],
              [None, '2']])
Pandas functions
pd.isna(a)
# same as
# pd.isnull(a)
array([[False,  True],
       [ True, False]])
pd.notnull(a)
# same as
# pd.notna(a)
array([[ True, False],
       [False,  True]])
DataFrame (or Series) methods
b = pd.DataFrame(a)
b.isnull()
# same as
# b.isna()
       0      1
0  False   True
1   True  False
b.notna()
# same as
# b.notnull()
       0      1
0   True  False
1  False   True
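These also answer the original question directly, because they accept scalars:
pd.isna(np.nan)  # True
pd.isna(None)    # True; None is treated as missing
pd.isna('')      # False; an empty string is not NaN
pd.isna(2.0)     # False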

Pandas not dropping columns

Hi, I've tried to drop columns based on a boolean array, but for some odd reason pandas does not seem to be dropping the columns at all.
The boolean array has shape (376,) and only contains True and False values.
for x in range(0, len(analysis)-1):
    if analysis[x] == False:
        col = dtest.columns[x]
        dtest.drop(dtest.columns[x], 1)
This is my code for dropping the columns; essentially the length of the analysis array equals the number of columns in dtest.
dtest has shape (4209, 376) and is a pandas.core.frame.DataFrame.
I have tried debugging: it does detect the Falses in analysis and is able to print out the col variable accurately, but it just won't drop the columns for some reason.
Would greatly appreciate any help! Thanks :)
IIUC you don't need a loop:
dtest = dtest.loc[:, analysis]
Demo:
In [320]: df = pd.DataFrame(np.random.rand(5, 10), columns=list(range(1, 11)))
In [321]: df
Out[321]:
          1         2         3         4         5         6         7         8         9        10
0  0.332792  0.927047  0.899874  0.294391  0.762800  0.861521  0.988783  0.475127  0.033096  0.980141
1  0.447273  0.268828  0.951633  0.947425  0.020006  0.808608  0.607091  0.712309  0.383256  0.248582
2  0.169946  0.951702  0.671014  0.514326  0.607129  0.227021  0.831474  0.696117  0.799418  0.224851
3  0.724165  0.748455  0.452430  0.941572  0.873344  0.877872  0.925788  0.183115  0.113217  0.072717
4  0.303488  0.426459  0.750076  0.225662  0.298983  0.729585  0.692489  0.934778  0.124634  0.274208
In [322]: analysis = np.random.choice([True, False], 10)
In [323]: analysis
Out[323]: array([ True, True, True, False, True, True, True, False, False, True], dtype=bool)
In [324]: df = df.loc[:, analysis]
In [325]: df
Out[325]:
          1         2         3         5         6         7        10
0  0.332792  0.927047  0.899874  0.762800  0.861521  0.988783  0.980141
1  0.447273  0.268828  0.951633  0.020006  0.808608  0.607091  0.248582
2  0.169946  0.951702  0.671014  0.607129  0.227021  0.831474  0.224851
3  0.724165  0.748455  0.452430  0.873344  0.877872  0.925788  0.072717
4  0.303488  0.426459  0.750076  0.298983  0.729585  0.692489  0.274208
You need to assign the output back:
cols = []
for x in range(0, len(analysis)):
    if analysis[x] == False:
        col = dtest.columns[x]
        cols.append(col)
dtest = dtest.drop(cols, axis=1)
but it is better to select only the columns matching the True mask, as in the other answer.
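For completeness, both approaches fit on one line; assuming analysis is a boolean NumPy array with one entry per column of dtest:
kept = dtest.loc[:, analysis]                        # keep the columns where the mask is True
kept = dtest.drop(columns=dtest.columns[~analysis])  # equivalent: drop the columns where it is False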

Comparing values of all columns except one

Shown below is the code that compares column values to a constant.
My questions:
Why does the ">=" comparison show False for 0.005000 in row "a"? I expect it to be True.
Is it possible to repeat the comparison for all columns except the first and "AND" the results?
Sorry, could not format the code properly.
import numpy as np
import pandas as pd

def test_pct_change():
    MIN_CHANGE = 0.0050  # .5%; for some reason 0.0050 does not work in the comparison
    data = {'c1': pd.Series([100, 110], index=['a', 'b']),
            'c2': pd.Series([100.5, 105, 3.], index=['a', 'b', 'c']),
            'c3': pd.Series([102, 100, 3.], index=['a', 'b', 'c'])}
    df = pd.DataFrame(data)
    print(df.to_string())
    dft_pct = df.pct_change(axis=1)  # 1: columns
    dft_pct['Has_Min_Change'] = (dft_pct.iloc[:, -2] >= MIN_CHANGE)  # (dft_pct.iloc[:, -1] >= MIN_CHANGE) &
    print('Percent Change')
    print(dft_pct.to_string())
This is why numpy has isclose
Consider the dataframe df
df = pd.DataFrame(np.random.rand(5, 5))
print(df)
          0         1         2         3         4
0  0.362368  0.201145  0.340571  0.733402  0.816436
1  0.216386  0.105877  0.565318  0.102514  0.451794
2  0.221733  0.216303  0.039209  0.482731  0.800290
3  0.200427  0.154020  0.612884  0.695920  0.122780
4  0.986003  0.059244  0.291480  0.270779  0.526996
Evaluate an equality we know to be mathematically true
((100 + df) / 100 - 1) == (df / 100)
       0      1      2      3      4
0  False  False  False  False  False
1  False  False  False  False  False
2  False  False  False  False  False
3  False  False  False  False  False
4  False  False  False  False  False
Let's look at the difference.
We can round to 15 decimal places and it still comes back all zeros.
These are really close.
print(((100 + df) / 100 - 1).sub(df / 100).round(15))
     0    1    2    3    4
0 -0.0  0.0  0.0  0.0  0.0
1 -0.0  0.0  0.0  0.0  0.0
2 -0.0 -0.0  0.0 -0.0 -0.0
3 -0.0  0.0  0.0 -0.0  0.0
4  0.0 -0.0 -0.0  0.0  0.0
This is why numpy has isclose
np.isclose(((100 + df) / 100 - 1), (df / 100))
array([[ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True]], dtype=bool)
This is the consequence of using binary representations to perform decimal math, and we have a workaround.
When computations are performed in double precision, 100.5/100 - 1 is slightly less than 0.005, and so (100.5/100 - 1) >= 0.005 evaluates to False. This is why you don't get "Min Change" for the change from 100 to 100.5.
If it's really important that such edge cases be included, you can fudge the inequality slightly, like >= MIN_CHANGE - 1e-15.
One way to represent the condition that all columns satisfy >= MIN_CHANGE is to take the minimum over columns and require that to be >= MIN_CHANGE. Example:
dft_pct['Has_Min_Change'] = dft_pct.min(axis=1) >= MIN_CHANGE
By default, min ignores NaN entries. (Watch out for implicit conversion of Booleans to ints, however: False is treated by it as 0).
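Putting the two answers together, a sketch of the full pipeline; this reuses df and MIN_CHANGE from the question and treats values within 1e-15 of the threshold as passing:
pct = df.pct_change(axis=1).iloc[:, 1:]  # drop the first column, which is all NaN
# AND across the remaining columns, with a small tolerance so 100 -> 100.5 counts as a 0.5% change.
df['Has_Min_Change'] = (pct >= MIN_CHANGE - 1e-15).all(axis=1)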
