Pandas Multiindex Groupby aggregate column with value from another column

I have a pandas dataframe with multiindex where I want to aggregate the duplicate key rows as follows:
import numpy as np
import pandas as pd
df = pd.DataFrame({'S': [0, 5, 0, 5, 0, 3, 5, 0],
                   'Q': [6, 4, 10, 6, 2, 5, 17, 4],
                   'A': ['A1', 'A1', 'A1', 'A1', 'A2', 'A2', 'A2', 'A2'],
                   'B': ['B1', 'B1', 'B2', 'B2', 'B1', 'B1', 'B1', 'B2']})
df.set_index(['A','B'])
        Q  S
A  B
A1 B1   6  0
   B1   4  5
   B2  10  0
   B2   6  5
A2 B1   2  0
   B1   5  3
   B1  17  5
   B2   4  0
I would like to group this dataframe, summing the Q values and keeping the S value from the row where Q is maximal within each group, yielding this:
df2 = pd.DataFrame({'S': [0, 0, 5, 0],
                    'Q': [10, 16, 24, 4],
                    'A': ['A1', 'A1', 'A2', 'A2'],
                    'B': ['B1', 'B2', 'B1', 'B2']})
df2.set_index(['A','B'])
        Q  S
A  B
A1 B1  10  0
   B2  16  0
A2 B1  24  5
   B2   4  0
I tried the following, but it didn't work:
df.groupby(by=['A','B']).agg({'Q':'sum','S':df.S[df.Q.idxmax()]})
Any hints?
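A note on why the attempt above fails, as a minimal sketch: the expression df.S[df.Q.idxmax()] is evaluated once on the whole frame, before groupby runs, so agg receives a single number for 'S' instead of a function it can call on each group. The snippet rebuilds the example frame and prints what is actually passed; the name passed_to_agg is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'S': [0, 5, 0, 5, 0, 3, 5, 0],
                   'Q': [6, 4, 10, 6, 2, 5, 17, 4],
                   'A': ['A1', 'A1', 'A1', 'A1', 'A2', 'A2', 'A2', 'A2'],
                   'B': ['B1', 'B1', 'B2', 'B2', 'B1', 'B1', 'B1', 'B2']})

# Evaluated eagerly: the S value at the global maximum of Q (row 6, Q == 17).
passed_to_agg = df.S[df.Q.idxmax()]

print(passed_to_agg)            # one scalar for the whole frame
print(callable(passed_to_agg))  # agg expects a function or a function name
```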

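One way is to use agg, apply, and join (this relies on the index having been set, i.e. df = df.set_index(['A','B']), so that the rows returned by apply carry the (A, B) index that join aligns on):
g = df.groupby(['A','B'], group_keys=False)
g.apply(lambda x: x.loc[x.Q == x.Q.max(),['S']]).join(g.agg({'Q':'sum'}))
Output:
       S   Q
A  B
A1 B1  0  10
   B2  0  16
A2 B1  5  24
   B2  0   4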

Here's one way:
In [1800]: def agg(x):
      ...:     m = x.S.iloc[np.argmax(x.Q.values)]
      ...:     return pd.Series({'Q': x.Q.sum(), 'S': m})
      ...:
In [1801]: df.groupby(['A', 'B']).apply(agg)
Out[1801]:
        Q  S
A  B
A1 B1  10  0
   B2  16  0
A2 B1  24  5
   B2   4  0
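Another possibility, sketched here under the assumption that A and B are still ordinary columns (the set_index call in the question is not assigned back): let groupby find the row label of the maximal Q in each group with idxmax, then look up S at those labels. The names grouped, idx and result are illustrative. On ties, idxmax keeps the first row that reaches the maximum.

```python
import pandas as pd

df = pd.DataFrame({'S': [0, 5, 0, 5, 0, 3, 5, 0],
                   'Q': [6, 4, 10, 6, 2, 5, 17, 4],
                   'A': ['A1', 'A1', 'A1', 'A1', 'A2', 'A2', 'A2', 'A2'],
                   'B': ['B1', 'B1', 'B2', 'B2', 'B1', 'B1', 'B1', 'B2']})

grouped = df.groupby(['A', 'B'])

# Row label of the largest Q inside each (A, B) group.
idx = grouped['Q'].idxmax()

# Sum of Q per group; idx and this sum share the same sorted group order.
result = grouped['Q'].sum().to_frame()

# Pick S from the rows found above and attach it positionally.
result['S'] = df.loc[idx.to_numpy(), 'S'].to_numpy()

print(result)
```

This avoids calling a Python function once per group, which may matter on larger frames.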

Related

How to slice/chop a string using multiple indexes in a pandas DataFrame

I'm in need of some advice on the following issue:
I have a DataFrame that looks like this:
ID SEQ LEN BEG_GAP END_GAP
0 A1 AABBCCDDEEFFGG 14 2 4
1 A1 AABBCCDDEEFFGG 14 10 12
2 B1 YYUUUUAAAAMMNN 14 4 6
3 B1 YYUUUUAAAAMMNN 14 8 12
4 C1 LLKKHHUUTTYYYYYYYYAA 20 7 9
5 C1 LLKKHHUUTTYYYYYYYYAA 20 12 15
6 C1 LLKKHHUUTTYYYYYYYYAA 20 17 18
What I need is each SEQ split at its gaps, that is, the pieces that remain once every BEG_GAP to END_GAP stretch is cut out. I have already worked it out (thanks to a previous question) for sequences that have only one pair of gaps, but here they have several.
This is what the sequences should look like:
ID SEQ
0 A1 AA---CDDEE---GG
1 B1 YYUU---A-----NN
2 C1 LLKKHHU---YY----Y--A
Or in an exploded DF:
ID Seq_slice
0 A1 AA
1 A1 CDDEE
2 A1 GG
3 B1 YYUU
4 B1 A
5 B1 NN
6 C1 LLKKHHU
7 C1 YY
8 C1 Y
9 C1 A
At the moment, I'm using a piece of code (that I got thanks to a previous question) that works only if there's one gap, and it looks like this:
import pandas as pd
df = pd.read_csv(r"..\path_to_the_csv.csv")  # raw string, so "\p" is not read as an escape sequence
df["BEG_GAP"] = df["BEG_GAP"].astype(int)
df["END_GAP"]= df["END_GAP"].astype(int)
df['SEQ'] = df.apply(lambda x: [x.SEQ[:x.BEG_GAP], x.SEQ[x.END_GAP+1:]], axis=1)
output = df.explode('SEQ').query('SEQ!=""')
But this has the problem that it generates a bunch of sequences that don't really exist because they actually have another gap in the middle.
That is, it would generate:
ID Seq_slice
0 A1 AA
1 A1 CDDEEFFG #<- this one shouldn't exist! Because there's another gap in 10-12
2 A1 AABBCCDDEE #<- Also, this one shouldn't exist, it's missing the previous gap.
3 A1 GG
And so on, with the other sequences. As you can see, there are some slices that are not being generated and some that are wrong, because I don't know how to tell the code to have in mind all the gaps while analyzing the sequence.
All advice is appreciated, I hope I was clear!

Let's try defining a function and apply:
def truncate(data):
    seq = data.SEQ.iloc[0]
    ll = data.LEN.iloc[0]
    return [seq[x:y] for x,y in zip([0]+list(data.END_GAP),
                                    list(data.BEG_GAP)+[ll])]
(df.groupby('ID').apply(truncate)
   .explode().reset_index(name='Seq_slice')
)
Output:
ID Seq_slice
0 A1 AA
1 A1 CCDDEE
2 A1 GG
3 B1 YYUU
4 B1 AA
5 B1 NN
6 C1 LLKKHHU
7 C1 TYY
8 C1 YY
9 C1 AA

In one line:
df.groupby('ID').agg({'BEG_GAP': list, 'END_GAP': list, 'SEQ': max, 'LEN': max}).apply(lambda x: [x['SEQ'][b: e] for b, e in zip([0] + x['END_GAP'], x['BEG_GAP'] + [x['LEN']])], axis=1).explode()
ID
A1 AA
A1 CCDDEE
A1 GG
B1 YYUU
B1 AA
B1 NN
C1 LLKKHHU
C1 TYY
C1 YY
C1 AA
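The same slicing can also be written as a small plain-Python helper, which may be easier to test on its own. This is only a sketch: split_on_gaps is a hypothetical name, and it follows the convention of the two answers above (a piece runs from the previous END_GAP up to, but not including, the next BEG_GAP). Sorting the gaps is an added assumption so that unordered rows still work.

```python
import pandas as pd

def split_on_gaps(seq, gaps):
    """Return the non-empty pieces of seq that lie outside the (begin, end) gaps."""
    pieces = []
    start = 0
    for beg, end in sorted(gaps):
        pieces.append(seq[start:beg])  # text before this gap
        start = end                    # resume after the gap
    pieces.append(seq[start:])         # tail after the last gap
    return [p for p in pieces if p]

df = pd.DataFrame({
    'ID': ['A1', 'A1', 'B1', 'B1', 'C1', 'C1', 'C1'],
    'SEQ': ['AABBCCDDEEFFGG'] * 2 + ['YYUUUUAAAAMMNN'] * 2
           + ['LLKKHHUUTTYYYYYYYYAA'] * 3,
    'BEG_GAP': [2, 10, 4, 8, 7, 12, 17],
    'END_GAP': [4, 12, 6, 12, 9, 15, 18],
})

rows = []
for seq_id, grp in df.groupby('ID'):
    gaps = zip(grp['BEG_GAP'], grp['END_GAP'])
    for piece in split_on_gaps(grp['SEQ'].iloc[0], gaps):
        rows.append((seq_id, piece))

out = pd.DataFrame(rows, columns=['ID', 'Seq_slice'])
print(out)
```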

Pandas: pivoting rows to columns with columns as column-row

I have a data frame that looks like this
df = pd.DataFrame({'A': [1,2,3], 'B': [11,12,13]})
df
A B
0 1 11
1 2 12
2 3 13
I would like to create the following data frame where the columns are a combination of each column-row
A0 A1 A2 B0 B1 B2
0 1 2 3 11 12 13
It seems that the pivot and transpose functions will switch columns and rows but I actually want to flatten the data frame to a single row. How can I achieve this?

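If I understand correctly, you can stack the frame, sort by column name, and build the new labels from the (row, column) pairs:
s = df.stack().sort_index(level=1).to_frame(0).T
s.columns = s.columns.map('{0[1]}{0[0]}'.format)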
s
A0 A1 A2 B0 B1 B2
0 1 2 3 11 12 13

One option, with pivot_wider:
# pip install pyjanitor
import janitor
import pandas as pd
df.index = [0] * len(df)
df = df.assign(num=range(len(df)))
df.pivot_wider(names_from="num", names_sep = "")
A0 A1 A2 B0 B1 B2
0 1 2 3 11 12 13
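If an extra dependency is not wanted, a rough alternative is to flatten the values column by column with NumPy and build the labels by hand. This is a sketch, not a drop-in replacement: the name flat is illustrative, and with mixed column types to_numpy would fall back to an object array.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [11, 12, 13]})

# order='F' walks the 2-D array column by column: all of A, then all of B.
values = df.to_numpy().ravel(order='F')

# Labels follow the same order: column name first, then row label.
labels = [f'{col}{row}' for col in df.columns for row in df.index]

flat = pd.DataFrame([values], columns=labels)
print(flat)
```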

How to compare two data frames with same columns but different number of rows?

df1=
A B C D
a1 b1 c1 1
a2 b2 c2 2
a3 b3 c3 4
df2=
A B C D
a1 b1 c1 2
a2 b2 c2 1
I want to compare the values of column 'D' in the two dataframes. If both dataframes had the same number of rows, I would just do this:
newDF = df1['D']-df2['D']
However, sometimes the number of rows differs. I want a result dataframe like this:
resultDF=
A B C D_df1 D_df2 Diff
a1 b1 c1 1 2 -1
a2 b2 c2 2 1 1
EDIT: column D of a row should be compared only if that row's A, B and C values are the same in df1 and df2. The same applies to every row.

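Use merge and df.eval. merge performs an inner join on the key columns by default, so rows of df1 with no match in df2 (such as a3) are dropped: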
df1.merge(df2, on=['A','B','C'], suffixes=['_df1','_df2']).eval('Diff=D_df1 - D_df2')
Out[314]:
A B C D_df1 D_df2 Diff
0 a1 b1 c1 1 2 -1
1 a2 b2 c2 2 1 1
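For reference, a self-contained version of the merge approach, with the two frames rebuilt from the question. Plain column subtraction is swapped in for eval here, purely as a matter of taste; the result should be the same.

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['a1', 'a2', 'a3'],
                    'B': ['b1', 'b2', 'b3'],
                    'C': ['c1', 'c2', 'c3'],
                    'D': [1, 2, 4]})
df2 = pd.DataFrame({'A': ['a1', 'a2'],
                    'B': ['b1', 'b2'],
                    'C': ['c1', 'c2'],
                    'D': [2, 1]})

# Inner join on the key columns; the clashing D columns get the suffixes.
result = df1.merge(df2, on=['A', 'B', 'C'], suffixes=('_df1', '_df2'))
result['Diff'] = result['D_df1'] - result['D_df2']

print(result)
```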

How to convert rows to columns in pandas?

I want to convert every three rows of a DataFrame into columns.
Input:
import pandas as pd
df = pd.DataFrame({'a': [1,2,3,11,12,13],'b':['a','b','c','aa','bb','cc']})
print(df)
Output:
a b
0 1 a
1 2 b
2 3 c
3 11 aa
4 12 bb
5 13 cc
Expected:
a1 a2 a3 b1 b2 b3
0 1 2 3 a b c
1 11 12 13 aa bb cc

Use set_index with the floor division and modulo of the row position by 3, then unstack and flatten the resulting MultiIndex columns:
import numpy as np
a = np.arange(len(df))
# if the DataFrame has the default index, this works too:
# a = df.index
df1 = df.set_index([a // 3, a % 3]).unstack()
# Python 3.6+ solution
df1.columns = [f'{i}{j + 1}' for i,j in df1.columns]
# Python below 3.6
# df1.columns = ['{}{}'.format(i,j+1) for i,j in df1.columns]
print (df1)
a1 a2 a3 b1 b2 b3
0 1 2 3 a b c
1 11 12 13 aa bb cc

I'm adding a different approach with group -> apply.
df is first grouped by df.index//3 and then the munge function is applied to each group.
def munge(group):
    g = group.T.stack()
    g.index = ['{}{}'.format(c, i+1) for i, (c, _) in enumerate(g.index)]
    return g
result = df.groupby(df.index//3).apply(munge)
Output:
>>> df.groupby(df.index//3).apply(munge)
a1 a2 a3 b4 b5 b6
0 1 2 3 a b c
1 11 12 13 aa bb cc
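Note that the output above labels the second column's values b4 to b6 rather than b1 to b3, because enumerate counts across the whole flattened index. A possible variant that restarts the numbering for each column is sketched below; munge_per_column is a hypothetical name, and the rest follows the same group -> apply idea.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 11, 12, 13],
                   'b': ['a', 'b', 'c', 'aa', 'bb', 'cc']})

def munge_per_column(group):
    # Number the values 1..n separately inside every column of the group.
    flat = {}
    for col in group.columns:
        for pos, value in enumerate(group[col], start=1):
            flat[f'{col}{pos}'] = value
    return pd.Series(flat)

# Each block of three consecutive rows becomes one output row.
result = df.groupby(df.index // 3).apply(munge_per_column)
print(result)
```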

How to group by two columns with swapped values in pandas?

I want to group by two columns where the order of the values does not matter (the pairing is commutative).
For example, if column 1 and column 2 contain (a, b) in one row and (b, a) in another, I want those two rows to fall into the same group when I perform the group-by operation.
Input:
From To Count
a1 b1 4
b1 a1 3
a1 b2 2
b3 a1 12
a1 b3 6
Output:
From To Count(+)
a1 b1 7
a1 b2 2
b3 a1 18
I tried to apply a group by after swapping the elements, but I could not work out an approach. Please help me solve this problem.
Thanks in advance.

Use numpy.sort for sorting each row:
cols = ['From','To']
df[cols] = pd.DataFrame(np.sort(df[cols], axis=1))
print (df)
From To Count
0 a1 b1 4
1 a1 b1 3
2 a1 b2 2
3 a1 b3 12
4 a1 b3 6
df1 = df.groupby(cols, as_index=False)['Count'].sum()
print (df1)
From To Count
0 a1 b1 7
1 a1 b2 2
2 a1 b3 18
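If the original frame should stay untouched, the same idea can be written without assigning back into df. This is a sketch with the input rebuilt from the question; the names pair and out are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'From': ['a1', 'b1', 'a1', 'b3', 'a1'],
                   'To': ['b1', 'a1', 'b2', 'a1', 'b3'],
                   'Count': [4, 3, 2, 12, 6]})

# Sort each (From, To) pair so that (b1, a1) and (a1, b1) become identical.
pair = np.sort(df[['From', 'To']].to_numpy(), axis=1)

# assign returns a new frame; df itself keeps its original values.
out = (df.assign(From=pair[:, 0], To=pair[:, 1])
         .groupby(['From', 'To'], as_index=False)['Count']
         .sum())
print(out)
```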
