Pandas: sort a subset of a column based on conditions - python

Let's say I have the dataframe:
c1 c2
a1 9
a1 11
a1 12
a1 8
a2 10
a2 14
a2 6
I would like to sort only the a2 subset of column c1, by the values in c2:
c1 c2
a2 6 <=
a1 9
a2 10 <=
a1 11
a1 12
a2 14 <=
a1 8
Traditional sorting with sort_values doesn't seem to work here. Also, c2 contains only unique values, so there are no repeated values to worry about.

Let's say your dataframe is in df. Selecting the subset gives you just the a2 rows:
df = df[df['c1'] == 'a2']
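On its own, though, that reassignment replaces df with the a2 rows and drops the a1 rows; it doesn't sort anything. A minimal sketch of one way to get the effect the question seems to ask for (my reading, not part of the original answer): starting from the original df, sort the c2 values of the a2 rows and write them back to those same positions, leaving the a1 rows untouched:
mask = df['c1'] == 'a2'
# sort only the masked c2 values, then assign them back positionally
df.loc[mask, 'c2'] = df.loc[mask, 'c2'].sort_values().to_numpy()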

Related

How to slice/chop a string using multiple indexes in a pandas DataFrame

I'm in need of some advice on the following issue:
I have a DataFrame that looks like this:
ID SEQ LEN BEG_GAP END_GAP
0 A1 AABBCCDDEEFFGG 14 2 4
1 A1 AABBCCDDEEFFGG 14 10 12
2 B1 YYUUUUAAAAMMNN 14 4 6
3 B1 YYUUUUAAAAMMNN 14 8 12
4 C1 LLKKHHUUTTYYYYYYYYAA 20 7 9
5 C1 LLKKHHUUTTYYYYYYYYAA 20 12 15
6 C1 LLKKHHUUTTYYYYYYYYAA 20 17 18
And what I need to get is the SEQ split apart at the different BEG_GAP and END_GAP positions. I have already worked it out (thanks to a previous question) for sequences that have only one pair of gaps, but here they have multiple.
This is what the sequences should look like:
ID SEQ
0 A1 AA---CDDEE---GG
1 B1 YYUU---A-----NN
2 C1 LLKKHHU---YY----Y--A
Or in an exploded DF:
ID Seq_slice
0 A1 AA
1 A1 CDDEE
2 A1 GG
3 B1 YYUU
4 B1 A
5 B1 NN
6 C1 LLKKHHU
7 C1 YY
8 C1 Y
9 C1 A
At the moment, I'm using a piece of code (that I got thanks to a previous question) that works only if there's one gap, and it looks like this:
import pandas as pd
df = pd.read_csv("..\path_to_the_csv.csv")
df["BEG_GAP"] = df["BEG_GAP"].astype(int)
df["END_GAP"]= df["END_GAP"].astype(int)
df['SEQ'] = df.apply(lambda x: [x.SEQ[:x.BEG_GAP], x.SEQ[x.END_GAP+1:]], axis=1)
output = df.explode('SEQ').query('SEQ!=""')
But this has the problem that it generates a bunch of sequences that don't really exist, because they actually have another gap in the middle.
I.e., what it would generate:
ID Seq_slice
0 A1 AA
1 A1 CDDEEFFG #<- this one shouldn't exist! Because there's another gap in 10-12
2 A1 AABBCCDDEE #<- Also, this one shouldn't exist, it's missing the previous gap.
3 A1 GG
And so on, with the other sequences. As you can see, there are some slices that are not being generated and some that are wrong, because I don't know how to tell the code to have in mind all the gaps while analyzing the sequence.
All advice is appreciated, I hope I was clear!
Let's try defining a function and apply:
def truncate(data):
    seq = data.SEQ.iloc[0]
    ll = data.LEN.iloc[0]
    return [seq[x:y] for x, y in zip([0] + list(data.END_GAP),
                                     list(data.BEG_GAP) + [ll])]

(df.groupby('ID').apply(truncate)
   .explode().reset_index(name='Seq_slice')
)
Output:
ID Seq_slice
0 A1 AA
1 A1 CCDDEE
2 A1 GG
3 B1 YYUU
4 B1 AA
5 B1 NN
6 C1 LLKKHHU
7 C1 TYY
8 C1 YY
9 C1 AA
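The trick in truncate is that the slices to keep run from one gap's end to the next gap's beginning, so zipping [0] plus the END_GAP values against the BEG_GAP values plus the total length produces exactly the (start, stop) pairs to keep. A small standalone illustration for A1 (mine, not from the original answer):
# A1 has BEG_GAP = [2, 10], END_GAP = [4, 12], LEN = 14
list(zip([0, 4, 12], [2, 10, 14]))  # -> [(0, 2), (4, 10), (12, 14)]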
Or as a single chained expression:
(df.groupby('ID')
   .agg({'BEG_GAP': list, 'END_GAP': list, 'SEQ': max, 'LEN': max})
   .apply(lambda x: [x['SEQ'][b:e]
                     for b, e in zip([0] + x['END_GAP'], x['BEG_GAP'] + [x['LEN']])],
          axis=1)
   .explode())
ID
A1 AA
A1 CCDDEE
A1 GG
B1 YYUU
B1 AA
B1 NN
C1 LLKKHHU
C1 TYY
C1 YY
C1 AA

How to compare two data frames with the same columns but different numbers of rows?

df1=
A B C D
a1 b1 c1 1
a2 b2 c2 2
a3 b3 c3 4
df2=
A B C D
a1 b1 c1 2
a2 b2 c2 1
I want to compare the value of column 'D' in both dataframes. If both dataframes had the same number of rows, I would just do this:
newDF = df1['D']-df2['D']
However, there are times when the numbers of rows differ. I want a result dataframe like this:
resultDF=
A B C D_df1 D_df2 Diff
a1 b1 c1 1 2 -1
a2 b2 c2 2 1 1
EDIT: the D values of a row in df1 and a row in df2 should be compared if and only if their A, B, and C values match; repeat for every row.
Use merge and DataFrame.eval:
df1.merge(df2, on=['A','B','C'], suffixes=['_df1','_df2']).eval('Diff=D_df1 - D_df2')
Output:
A B C D_df1 D_df2 Diff
0 a1 b1 c1 1 2 -1
1 a2 b2 c2 2 1 1
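Note that the default inner merge silently drops rows that appear in only one frame. If those rows should be kept as well (an assumption about the desired behavior, not stated in the question), an outer merge leaves NaN where one side is missing:
res = df1.merge(df2, on=['A', 'B', 'C'], how='outer', suffixes=['_df1', '_df2'])
res['Diff'] = res['D_df1'] - res['D_df2']  # NaN where a row exists in only one frame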

How to group by two columns with swapped values in pandas?

I want to group by two columns where the commutative rule applies.
For example, if columns 1 and 2 contain the values (a, b) in one row and (b, a) in another row, I want those two records to land in the same group when I perform the group-by operation.
Input:
From To Count
a1 b1 4
b1 a1 3
a1 b2 2
b3 a1 12
a1 b3 6
Output:
From To Count(+)
a1 b1 7
a1 b2 2
b3 a1 18
I tried applying group by after swapping the elements, but I don't have a working approach to this problem. Help me solve it.
Thanks in advance.
Use numpy.sort for sorting each row:
import numpy as np

cols = ['From', 'To']
df[cols] = pd.DataFrame(np.sort(df[cols], axis=1))
print (df)
From To Count
0 a1 b1 4
1 a1 b1 3
2 a1 b2 2
3 a1 b3 12
4 a1 b3 6
df1 = df.groupby(cols, as_index=False)['Count'].sum()
print (df1)
From To Count
0 a1 b1 7
1 a1 b2 2
2 a1 b3 18
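One caveat: wrapping the sorted array in pd.DataFrame gives it a fresh default index and integer column labels, so the assignment above relies on df also having a default RangeIndex. If your frame has a non-default index, assigning the raw array is safer, since array assignment is purely positional (a defensive variant, not part of the original answer):
df[cols] = np.sort(df[cols].to_numpy(), axis=1)  # positional, no index alignment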

Summing columns from different dataframes according to some column names

Suppose I have a main dataframe
main_df
Cri1 Cri2 Cri3 total
0 A1 A2 A3 4
1 B1 B2 B3 5
2 C1 C2 C3 6
I also have 3 dataframes
df_1
Cri1 Cri2 Cri3 value
0 A1 A2 A3 1
1 B1 B2 B3 2
df_2
Cri1 Cri2 Cri3 value
0 A1 A2 A3 9
1 C1 C2 C3 10
df_3
Cri1 Cri2 Cri3 value
0 B1 B2 B3 15
1 C1 C2 C3 17
What I want is to add the value from each df to total in main_df, matching rows on the Cri columns,
i.e. main_df will become
main_df
Cri1 Cri2 Cri3 total
0 A1 A2 A3 14
1 B1 B2 B3 22
2 C1 C2 C3 33
Of course I can do it using a for loop, but in the end I want to apply the method to a large amount of data, say 50000 rows in each dataframe.
Are there other ways to solve it?
Thank you!
First you should align your numeric column names. In this case:
main_df = main_df.rename(columns={'total': 'value'})
Then you have a couple of options.
concat + groupby
You can concatenate and then perform a groupby with sum:
res = (pd.concat([main_df, df_1, df_2, df_3])
         .groupby(['Cri1', 'Cri2', 'Cri3']).sum()
         .reset_index())
print(res)
Cri1 Cri2 Cri3 value
0 A1 A2 A3 14
1 B1 B2 B3 22
2 C1 C2 C3 33
set_index + reduce / add
Alternatively, you can create a list of dataframes indexed by your criteria columns. Then use functools.reduce with pd.DataFrame.add to sum these dataframes.
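If functools.reduce is unfamiliar: it folds the list pairwise with the given function, so the call below effectively computes ((dfs[0].add(dfs[1])).add(dfs[2])).add(dfs[3]). A tiny standalone illustration (mine, not from the answer):
from functools import reduce
print(reduce(lambda x, y: x + y, [1, 2, 3, 4]))  # ((1 + 2) + 3) + 4 = 10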
from functools import reduce
dfs = [df.set_index(['Cri1', 'Cri2', 'Cri3']) for df in [main_df, df_1, df_2, df_3]]
res = reduce(lambda x, y: x.add(y, fill_value=0), dfs).reset_index()
print(res)
Cri1 Cri2 Cri3 value
0 A1 A2 A3 14.0
1 B1 B2 B3 22.0
2 C1 C2 C3 33.0
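Because fill_value=0 brings missing-value handling into play, the summed column comes back as float (14.0 rather than 14). If integer output matters, one extra line restores it (a small cleanup, not part of the original answer):
res['value'] = res['value'].astype(int)  # safe here: no NaNs remain after the reduce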

Pandas Multiindex Groupby aggregate column with value from another column

I have a pandas dataframe with a multiindex, where I want to aggregate the duplicate key rows as follows:
import numpy as np
import pandas as pd

df = pd.DataFrame({'S': [0, 5, 0, 5, 0, 3, 5, 0],
                   'Q': [6, 4, 10, 6, 2, 5, 17, 4],
                   'A': ['A1', 'A1', 'A1', 'A1', 'A2', 'A2', 'A2', 'A2'],
                   'B': ['B1', 'B1', 'B2', 'B2', 'B1', 'B1', 'B1', 'B2']})
df.set_index(['A', 'B'])
Q S
A B
A1 B1 6 0
B1 4 5
B2 10 0
B2 6 5
A2 B1 2 0
B1 5 3
B1 17 5
B2 4 0
and I would like to group this dataframe so that the Q values are summed and the S value kept is the one from the row with the maximal Q, yielding this:
df2 = pd.DataFrame({'S': [0, 0, 5, 0], 'Q': [10, 16, 24, 4],
                    'A': ['A1', 'A1', 'A2', 'A2'],
                    'B': ['B1', 'B2', 'B1', 'B2']})
df2.set_index(['A', 'B'])
Q S
A B
A1 B1 10 0
B2 16 0
A2 B1 24 5
B2 4 0
I tried the following, but it didn't work:
df.groupby(by=['A','B']).agg({'Q':'sum','S':df.S[df.Q.idxmax()]})
Any hints?
One way is to use agg, apply, and join:
g = df.groupby(['A','B'], group_keys=False)
g.apply(lambda x: x.loc[x.Q == x.Q.max(), ['S']]).join(g.agg({'Q': 'sum'}))
Output:
S Q
A B
A1 B1 0 10
B2 0 16
A2 B1 5 24
B2 0 4
Here's one way, using a custom aggregation function:
def agg(x):
    m = x.S.iloc[np.argmax(x.Q.values)]
    return pd.Series({'Q': x.Q.sum(), 'S': m})

df.groupby(['A', 'B']).apply(agg)
Output:
Q S
A B
A1 B1 10 0
B2 16 0
A2 B1 24 5
B2 4 0
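An equivalent and often faster pattern (my suggestion, not from the original answers) is to sort so that the max-Q row comes first in each group, then let 'first' pick up its S while 'sum' aggregates Q:
out = (df.sort_values('Q', ascending=False)
         .groupby(['A', 'B'])
         .agg({'Q': 'sum', 'S': 'first'}))
This avoids calling a Python function per group, which tends to matter on large frames.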
