Pandas comparing multiindex dataframes without looping - python

I want to compare two MultiIndex dataframes and add another column to show the difference in values (where the index values match between the first and second dataframe), without using loops.
import numpy as np
import pandas as pd

index_a = [1,2,2,3,3,3]
index_b = [0,0,1,0,1,2]
index_c = [1,2,2,4,4,4]
index = pd.MultiIndex.from_arrays([index_a,index_b], names=('a','b'))
index_1 = pd.MultiIndex.from_arrays([index_c,index_b], names=('a','b'))
df1 = pd.DataFrame(np.random.rand(6,), index=index, columns=['p'])
df2 = pd.DataFrame(np.random.rand(6,), index=index_1, columns=['q'])
df1
          p
a b
1 0  0.4655
2 0  0.8600
  1  0.9010
3 0  0.0652
  1  0.5686
  2  0.8965
df2
          q
a b
1 0  0.6591
2 0  0.5684
  1  0.5689
4 0  0.9898
  1  0.3656
  2  0.6989
The resulting DataFrame (df1 - df2) should look like:
          p      diff
a b
1 0  0.4655   -0.1936
2 0  0.8600    0.2916
  1  0.9010    0.3321
3 0  0.0652  No Match
  1  0.5686  No Match
  2  0.8965  No Match

Use reindex_like or reindex to align df2 to df1's index:
df1['new'] = (df1['p'] - df2['q'].reindex_like(df1)).fillna('No Match')
#alternative
#df1['new'] = (df1['p'] - df2['q'].reindex(df1.index)).fillna('No Match')
print (df1)
            p       new
a b
1 0  0.955587  0.924466
2 0  0.312497 -0.310224
  1  0.306256  0.231646
3 0  0.575613  No Match
  1  0.674605  No Match
  2  0.462807  No Match
Another idea with Index.intersection and DataFrame.loc:
df1['new'] = (df1['p'] - df2.loc[df2.index.intersection(df1.index), 'q']).fillna('No Match')
Or use merge with a left join:
df = pd.merge(df1, df2, how='left', left_index=True, right_index=True)
df['new'] = (df['p'] - df['q']).fillna('No Match')
print (df)
            p         q       new
a b
1 0  0.789693  0.665148  0.124544
2 0  0.082677  0.814190 -0.731513
  1  0.762339  0.235435  0.526905
3 0  0.727695       NaN  No Match
  1  0.903596       NaN  No Match
  2  0.315999       NaN  No Match

Use the following to get the difference for matched indices; unmatched indices will be NaN. Note that the result is aligned on the union of both indexes:
diff = df1['p'] - df2['q']
#Output
a  b
1  0   -0.666542
2  0   -0.389033
   1    0.064986
3  0         NaN
   1         NaN
   2         NaN
4  0         NaN
   1         NaN
   2         NaN
dtype: float64
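If only the overlapping rows are wanted, a minimal follow-up sketch (assuming p and q contain no NaNs of their own) drops the unmatched entries:
# keep only index combinations present in both df1 and df2
matched = diff.dropna()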

Related

select multiple nth values in grouping with conditional aggregate - pandas

I've got a pd.DataFrame with four columns:
df = pd.DataFrame({'id': [1,1,1,1,1,2,2,2,2],
                   'A': ['H','H','E','E','H','E','E','H','H'],
                   'B': [4,5,2,7,6,1,3,1,0],
                   'C': ['M','D','M','D','M','M','M','D','D']})
id A B C
0 1 H 4 M
1 1 H 5 D
2 1 E 2 M
3 1 E 7 D
4 1 H 6 M
5 2 E 1 M
6 2 E 3 M
7 2 H 1 D
8 2 H 0 D
I'd like to group by id and get the value of B for the nth (say, second) occurrence of A == 'H' for each id in agg_B1, and the value of B for the nth (say, first) occurrence of C == 'M' in agg_B2.
desired output:
id agg_B1 agg_B2
0 1 5 4
1 2 0 1
desired_output = df.groupby('id').agg(
    agg_B1=('B', lambda x: x[df.loc[x.index].loc[df.A == 'H'][1]]),
    agg_B2=('B', lambda x: x[df.loc[x.index].loc[df.C == 'M'][0]])
).reset_index()
TypeError: Indexing a Series with DataFrame is not supported, use the appropriate DataFrame column
Obviously, I'm doing something wrong with the indexing.
Edit: if possible, I'd like to use aggregate with lambda function, because there are multiple aggregate outputs of other sorts that I'd like to extract at the same time.
Your solution is possible with a small change if you need GroupBy.agg:
desired_output = df.groupby('id').agg(
    agg_B1=('B', lambda x: x[df.loc[x.index, 'A'] == 'H'].iat[1]),
    agg_B2=('B', lambda x: x[df.loc[x.index, 'C'] == 'M'].iat[0])
).reset_index()
print (desired_output)
id agg_B1 agg_B2
0 1 5 4
1 2 0 1
But if performance is important, or a second value matching H may not always exist for the first condition, I suggest processing each condition separately and then adding the results to the original aggregated values:
#some sample aggregations
df0 = df.groupby('id').agg({'B':'sum', 'C':'last'})
df1 = df[df['A'].eq('H')].groupby("id")['B'].nth(1).rename('agg_B1')
df2 = df[df['C'].eq('M')].groupby("id")['B'].first().rename('agg_B2')
desired_output = pd.concat([df0, df1, df2], axis=1)
print (desired_output)
     B  C  agg_B1  agg_B2
id
1   24  M       5       4
2    5  D       0       1
EDIT1: If you need GroupBy.agg, it is possible to catch the failed indexing and add a missing value instead:
# for the second value, the sample works fine
def f1(x):
    try:
        return x[df.loc[x.index, 'A'] == 'H'].iat[1]
    except IndexError:
        return np.nan
desired_output = df.groupby('id').agg(
    agg_B1=('B', f1),
    agg_B2=('B', lambda x: x[df.loc[x.index, 'C'] == 'M'].iat[0])
).reset_index()
print (desired_output)
id agg_B1 agg_B2
0 1 5 4
1 2 0 1
# a third value does not exist, so the missing value NaN is added
def f1(x):
    try:
        return x[df.loc[x.index, 'A'] == 'H'].iat[2]
    except IndexError:
        return np.nan
desired_output = df.groupby('id').agg(
    agg_B1=('B', f1),
    agg_B2=('B', lambda x: x[df.loc[x.index, 'C'] == 'M'].iat[0])
).reset_index()
print (desired_output)
id agg_B1 agg_B2
0 1 6.0 4
1 2 NaN 1
This works the same as:
df1 = df[df['A'].eq('H')].groupby("id")['B'].nth(2).rename('agg_B1')
df2 = df[df['C'].eq('M')].groupby("id")['B'].first().rename('agg_B2')
desired_output = pd.concat([df1, df2], axis=1)
print (desired_output)
    agg_B1  agg_B2
id
1      6.0       4
2      NaN       1
Filter for rows where A equals H, then grab the second row with the nth function:
df.query("A=='H'").groupby("id").nth(1)
    A  B
id
1   H  5
2   H  0
Python uses zero-based indexing, so the second row is nth(1).
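To return just the B column, as agg_B1 does above, the same idea works with a column selection first (a sketch reusing the sample df):
# B value of the second occurrence of A == 'H' per id
df.query("A=='H'").groupby("id")['B'].nth(1)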

Pandas: obtaining frequency of a specified value in a row across multiple columns

I have a large dataset with many columns of numeric data and want to be able to count all the zeros in each of the rows. The following will generate a small sample of the data.
df = pd.DataFrame(np.random.randint(0, 3, size=(8,3)),columns=list('abc'))
df
While I can sum all the values in each row with the following code:
df2 = df.sum(axis=1)
df2
And I can count the rows where a single column matches a value:
df.loc[df.a == 1].count()
I haven't been able to figure out how to get a count of the zeros across each of the rows. Any assistance would be greatly appreciated.
To count matched values, sum the Trues of a boolean mask.
If need new column:
df['sum of 1'] = df.eq(1).sum(axis=1)
#alternative
#df['sum of 1'] = (df == 1).sum(axis=1)
Sample:
np.random.seed(2020)
df = pd.DataFrame(np.random.randint(0, 3, size=(8,3)),columns=list('abc'))
df['sum of 1'] = df.eq(1).sum(axis=1)
print (df)
   a  b  c  sum of 1
0  0  0  2         0
1  1  0  1         2
2  0  0  0         0
3  2  1  2         1
4  2  2  1         1
5  0  0  0         0
6  0  2  0         0
7  1  1  1         3
If need new row:
df.loc['sum of 1'] = df.eq(1).sum()
#alternative
#df.loc['sum of 1'] = (df == 1).sum()
Sample:
np.random.seed(2020)
df = pd.DataFrame(np.random.randint(0, 3, size=(8,3)),columns=list('abc'))
df.loc['sum of 1'] = df.eq(1).sum()
print (df)
          a  b  c
0         0  0  2
1         1  0  1
2         0  0  0
3         2  1  2
4         2  2  1
5         0  0  0
6         0  2  0
7         1  1  1
sum of 1  2  2  3
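Since the question asks for zeros specifically, the same pattern with eq(0) answers it directly (a sketch on a fresh copy of the sample df):
np.random.seed(2020)
df = pd.DataFrame(np.random.randint(0, 3, size=(8,3)), columns=list('abc'))
# count the zeros in each row
df['zero count'] = df.eq(0).sum(axis=1)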

How can I operate with the output of a DataFrame?

I have a DataFrame object and I'm grouping by some keys and counting the results. The problem is that I want to replace one of the DataFrame's columns with a ratio between the counts.
df.groupby(['A','B', 'C'])['C'].count().apply(f).reset_index()
I'm looking for an f that replaces column C with the value of #(C == 1) / #(C == 0) for each combination of A and B.
Is this what you want?
import pandas as pd
import numpy as np
df = pd.DataFrame(
    {'A': [1,2,3,1,2,3],
     'B': [2,0,1,2,0,1],
     'C': [1,1,0,1,1,1]})
print(df)
def f(x):
    if np.count_nonzero(x == 0) == 0:
        return np.nan
    else:
        return np.count_nonzero(x == 1) / np.count_nonzero(x == 0)
result = df.groupby(['A','B'])['C'].apply(f).reset_index()
print(result)
Result:
#df
A B C
0 1 2 1
1 2 0 1
2 3 1 0
3 1 2 1
4 2 0 1
5 3 1 1
#result
A B C
0 1 2 NaN
1 2 0 NaN
2 3 1 1.0
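A vectorized alternative (a sketch; it assumes C holds only 0s and 1s, as in the sample) counts both values per group with value_counts and divides:
# per-group counts of each C value, filling 0 where a value is absent
counts = df.groupby(['A', 'B'])['C'].value_counts().unstack(fill_value=0)
# ratio #(C==1) / #(C==0); groups with no zeros become NaN instead of inf
result = (counts[1] / counts[0].replace(0, np.nan)).reset_index(name='C')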

Checking multiple columns condition in pandas

I want to create a new column that holds the name of the column if exactly one column in that row has a value of 8; otherwise the new column's value for that row is "NONE". For the DataFrame df below, the new column is df["New_Column"] = ["NONE","NONE","A","NONE"].
df = pd.DataFrame({"A": [1, 2,8,3], "B": [0, 2,4,8], "C": [0, 0,7,8]})
Cool problem.
Find the 8-fields in each row: df==8
Count them: (df==8).sum(axis=1)
Find the rows where the count is 1: (df==8).sum(axis=1)==1
Select just those rows from the original dataframe: df[(df==8).sum(axis=1)==1]
Find the 8-fields again: df[(df==8).sum(axis=1)==1]==8
Find the columns that hold the True values with idxmax (because True>False): (df[(df==8).sum(axis=1)==1]==8).idxmax(axis=1)
Fill in the gaps with "NONE"
To summarize:
df["New_Column"] = (df[(df==8).sum(axis=1)==1]==8).idxmax(axis=1)
df["New_Column"] = df["New_Column"].fillna("NONE")
# A B C New_Column
#0 1 0 0 NONE
#1 2 2 0 NONE
#2 8 4 7 A
#3 3 8 8 NONE
# I added another line as a proof of concept
#4 0 8 0 B
You can accomplish this using idxmax and a mask:
out = (df==8).idxmax(axis=1)
m = ~(df==8).any(axis=1) | ((df==8).sum(axis=1) > 1)
df.assign(col=out.mask(m))
A B C col
0 1 0 0 NaN
1 2 2 0 NaN
2 8 4 7 A
3 3 8 8 NaN
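To match the asked-for "NONE" sentinel rather than NaN, the same result can be filled (a small follow-up sketch using out and m from above):
df.assign(New_Column=out.mask(m).fillna('NONE'))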
Or do:
df2 = df[df==8]
df['New_Column'] = (df2[(df2 != df2.dropna(thresh=2).values[0]).all(axis=1)].dropna(how='all')).idxmax(axis=1)
df['New_Column'] = df['New_Column'].fillna('NONE')
print(df)
dropna + dropna again + idxmax + fillna: that's all you need for this.
Output:
A B C New_Column
0 1 0 0 NONE
1 2 2 0 NONE
2 8 4 7 A
3 3 8 8 NONE

Decompose cell with multiple values in a DataFrame

I have a pandas.DataFrame in the following format (working example):
df = pd.DataFrame({'foo1':[1,2,3], 'foo2': ["a:1, b:2", "d:4", "a:6, d:5"]})
df
foo1 foo2
0 1 a:1, b:2
1 2 d:4
2 3 a:6, d:5
I would like to decompose the foo2 cell values into columns (desired output):
foo1 foo2_a foo2_b foo2_d
0 1 1 2 0
1 2 0 0 4
2 3 6 0 5
I could iterate over the whole dataframe by index and store the value for each row, but that doesn't seem elegant.
Is there some elegant, pythonic pandas trick for this problem?
Thanks!
If you use
df.foo2.str.split(', ').apply(lambda l: pd.Series({e.split(':')[0]: int(e.split(':')[1]) for e in l})).fillna(0)
You get
a b d
0 1.0 2.0 0.0
1 0.0 0.0 4.0
2 6.0 0.0 5.0
Note that once you get each row into a dictionary, you can transform it into a pandas Series, and this will be the result.
From this point, it is just a question of renaming the columns and concatenating the result; a sketch of those steps follows.
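A possible sketch of those finishing steps (add_prefix for the renaming, concat against foo1, mirroring the desired output above):
out = (df.foo2.str.split(', ')
         .apply(lambda l: pd.Series({e.split(':')[0]: int(e.split(':')[1]) for e in l}))
         .fillna(0).astype(int)
         .add_prefix('foo2_'))
result = pd.concat([df[['foo1']], out], axis=1)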
Use split + apply with a list comprehension to build dicts. Then convert the column to a list of dicts with values + tolist, create a DataFrame with add_prefix, and finally join back column foo1:
s = df['foo2'].str.split(', ').apply(lambda x: dict([y.split(':') for y in x]))
df1 = pd.DataFrame(s.values.tolist()).fillna(0).add_prefix('foo2_').astype(int)
df = df[['foo1']].join(df1)
print (df)
foo1 foo2_a foo2_b foo2_d
0 1 1 2 0
1 2 0 0 4
2 3 6 0 5
#find all the keys ('a','b','d',...)
d = {k:0 for k in df.foo2.str.extractall('([a-z]+)(?=:)').iloc[:,0].unique()}
#split foo2 and build a new DF then merge it into the existing DF.
pd.concat([df['foo1'].to_frame(),
           df.foo2.str.split(', ')
             .apply(lambda x: pd.Series(dict(d, **dict([e.split(':') for e in x]))))
             .add_prefix('foo2_')], axis=1)
Out[149]:
foo1 foo2_a foo2_b foo2_d
0 1 1 2 0
1 2 0 0 4
2 3 6 0 5
