How to group by two columns with swapped values in pandas?

I want to group by two columns while treating each pair of values as unordered.
For example, if column 1 and column 2 contain (a, b) in one row and (b, a) in another row, I want those two rows to land in the same group before aggregating.
Input:
From To Count
a1 b1 4
b1 a1 3
a1 b2 2
b3 a1 12
a1 b3 6
Output:
From To Count(+)
a1 b1 7
a1 b2 2
b3 a1 18
I tried to apply a group by after swapping the elements, but I could not find an approach that works. How can I solve this?
Thanks in advance.

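Use numpy.sort to sort the values within each row, so that (a, b) and (b, a) become the same pair:
import numpy as np
cols = ['From','To']
df[cols] = np.sort(df[cols], axis=1)  # assign the array directly so the result does not depend on df's index
print (df)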
From To Count
0 a1 b1 4
1 a1 b1 3
2 a1 b2 2
3 a1 b3 12
4 a1 b3 6
df1 = df.groupby(cols, as_index=False)['Count'].sum()
print (df1)
From To Count
0 a1 b1 7
1 a1 b2 2
2 a1 b3 18
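
For reference, here is a self-contained sketch of the same idea that can be pasted and run. The frame is rebuilt from the question's input; the names `pairs` and `out` are only illustrative, and `assign` is used here so the original `df` is left unchanged (that is a choice of this sketch, not part of the answer above).

```python
import numpy as np
import pandas as pd

# Input data copied from the question
df = pd.DataFrame({
    'From': ['a1', 'b1', 'a1', 'b3', 'a1'],
    'To': ['b1', 'a1', 'b2', 'a1', 'b3'],
    'Count': [4, 3, 2, 12, 6],
})

cols = ['From', 'To']

# Sort the two values inside each row so (b1, a1) becomes (a1, b1)
pairs = np.sort(df[cols].to_numpy(), axis=1)

# Group on the normalized pair and sum the counts
out = (
    df.assign(From=pairs[:, 0], To=pairs[:, 1])
      .groupby(cols, as_index=False)['Count']
      .sum()
)
print(out)
```

Note that the normalized pair is always in sorted order, so the last group is reported as (a1, b3) rather than (b3, a1) as written in the question's expected output.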

Related

How to slice/chop a string using multiple indexes in a pandas DataFrame

I'm in need of some advice on the following issue:
I have a DataFrame that looks like this:
ID SEQ LEN BEG_GAP END_GAP
0 A1 AABBCCDDEEFFGG 14 2 4
1 A1 AABBCCDDEEFFGG 14 10 12
2 B1 YYUUUUAAAAMMNN 14 4 6
3 B1 YYUUUUAAAAMMNN 14 8 12
4 C1 LLKKHHUUTTYYYYYYYYAA 20 7 9
5 C1 LLKKHHUUTTYYYYYYYYAA 20 12 15
6 C1 LLKKHHUUTTYYYYYYYYAA 20 17 18
And what I need to get is the SEQ that's separated between the different BEG_GAP and END_GAP. I already have worked it out (thanks to a previous question) for sequences that have only one pair of gaps, but here they have multiple.
This is what the sequences should look like:
ID SEQ
0 A1 AA---CDDEE---GG
1 B1 YYUU---A-----NN
2 C1 LLKKHHU---YY----Y--A
Or in an exploded DF:
ID Seq_slice
0 A1 AA
1 A1 CDDEE
2 A1 GG
3 B1 YYUU
4 B1 A
5 B1 NN
6 C1 LLKKHHU
7 C1 YY
8 C1 Y
9 C1 A
At the moment, I'm using a piece of code (that I got thanks to a previous question) that works only if there's one gap, and it looks like this:
import pandas as pd
df = pd.read_csv(r"..\path_to_the_csv.csv")
df["BEG_GAP"] = df["BEG_GAP"].astype(int)
df["END_GAP"]= df["END_GAP"].astype(int)
df['SEQ'] = df.apply(lambda x: [x.SEQ[:x.BEG_GAP], x.SEQ[x.END_GAP+1:]], axis=1)
output = df.explode('SEQ').query('SEQ!=""')
The problem is that it generates sequences that don't really exist, because each row is sliced on its own and ignores the other gaps in the same sequence.
For example, this is what it would generate:
ID Seq_slice
0 A1 AA
1 A1 CDDEEFFG #<- this one shouldn't exist! Because there's another gap in 10-12
2 A1 AABBCCDDEE #<- Also, this one shouldn't exist, it's missing the previous gap.
3 A1 GG
And so on, with the other sequences. As you can see, there are some slices that are not being generated and some that are wrong, because I don't know how to tell the code to have in mind all the gaps while analyzing the sequence.
All advice is appreciated, I hope I was clear!

Let's try defining a function and applying it per group:
def truncate(data):
    seq = data.SEQ.iloc[0]
    ll = data.LEN.iloc[0]
    return [seq[x:y] for x,y in zip([0]+list(data.END_GAP),
                                    list(data.BEG_GAP)+[ll])]
(df.groupby('ID').apply(truncate)
   .explode().reset_index(name='Seq_slice')
)
Output:
ID Seq_slice
0 A1 AA
1 A1 CCDDEE
2 A1 GG
3 B1 YYUU
4 B1 AA
5 B1 NN
6 C1 LLKKHHU
7 C1 TYY
8 C1 YY
9 C1 AA
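
As a runnable sketch of the same slicing logic, the version below rebuilds the question's data and loops over the groups explicitly instead of using `groupby().apply`. The helper name `slices` and the variable `out` are made up for this example. It keeps the answer's convention that each slice runs from the previous END_GAP up to (but not including) the next BEG_GAP, and it assumes the gaps within one ID do not overlap.

```python
import pandas as pd

# Data copied from the question
df = pd.DataFrame({
    'ID': ['A1', 'A1', 'B1', 'B1', 'C1', 'C1', 'C1'],
    'SEQ': ['AABBCCDDEEFFGG'] * 2
           + ['YYUUUUAAAAMMNN'] * 2
           + ['LLKKHHUUTTYYYYYYYYAA'] * 3,
    'LEN': [14, 14, 14, 14, 20, 20, 20],
    'BEG_GAP': [2, 10, 4, 8, 7, 12, 17],
    'END_GAP': [4, 12, 6, 12, 9, 15, 18],
})

def slices(group):
    """Return the pieces of SEQ that lie between consecutive gaps."""
    seq = group['SEQ'].iloc[0]
    starts = [0] + group['END_GAP'].tolist()          # each piece starts where a gap ends
    stops = group['BEG_GAP'].tolist() + [len(seq)]    # and stops where the next gap begins
    return [seq[a:b] for a, b in zip(starts, stops)]

# Sort each group's gaps by position, then collect one row per slice
rows = [
    (key, piece)
    for key, group in df.groupby('ID')
    for piece in slices(group.sort_values('BEG_GAP'))
]
out = pd.DataFrame(rows, columns=['ID', 'Seq_slice'])
print(out)
```

If END_GAP is meant to be inclusive, as the `x.END_GAP+1` in the question's code suggests, the start positions would need a `+ 1`; that adjustment is left out here to match the answer's output.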

In one line:
df.groupby('ID').agg({'BEG_GAP': list, 'END_GAP': list, 'SEQ': max, 'LEN': max}).apply(lambda x: [x['SEQ'][b: e] for b, e in zip([0] + x['END_GAP'], x['BEG_GAP'] + [x['LEN']])], axis=1).explode()
ID
A1 AA
A1 CCDDEE
A1 GG
B1 YYUU
B1 AA
B1 NN
C1 LLKKHHU
C1 TYY
C1 YY
C1 AA

Pandas sort a subset of column based on conditions

Let's say I have the dataframe:
c1 c2
a1 9
a1 11
a1 12
a1 8
a2 10
a2 14
a2 6
I would like to sort only subset a2 of column c1:
c1|c2
a2 6 <=
a1 9
a2 10 <=
a1 11
a1 12
a2 14 <=
a1 8
Here the traditional sorting with sort_values doesn't seem to work.
Also, c2 is composed of only unique values, so there is no possibility to have repeated values.

Let's say your dataframe is in df. You can select the a2 subset with:
df = df[df['c1'] == 'a2']
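
The filter above only selects the a2 rows. One possible reading of the question, and this is an assumption, is that the c2 values of the a2 rows should be sorted among themselves while the a1 rows stay exactly where they are. A minimal sketch of that reading follows; it does not reproduce the interleaved ordering shown in the question, whose rule is not stated.

```python
import pandas as pd

# Data copied from the question
df = pd.DataFrame({
    'c1': ['a1', 'a1', 'a1', 'a1', 'a2', 'a2', 'a2'],
    'c2': [9, 11, 12, 8, 10, 14, 6],
})

# Boolean mask for the subset to sort
mask = df['c1'] == 'a2'

# Write the sorted values back into the same row positions;
# to_numpy() drops the index so pandas does not realign them
df.loc[mask, 'c2'] = df.loc[mask, 'c2'].sort_values().to_numpy()
print(df)
```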

How to compare two data frames with same columns but different number of rows?

df1=
A B C D
a1 b1 c1 1
a2 b2 c2 2
a3 b3 c3 4
df2=
A B C D
a1 b1 c1 2
a2 b2 c2 1
I want to compare the value of the column 'D' in both dataframes. If both dataframes had same number of rows I would just do this.
newDF = df1['D']-df2['D']
However, sometimes the number of rows differs. I want a result DataFrame like this:
resultDF=
A B C D_df1 D_df2 Diff
a1 b1 c1 1 2 -1
a2 b2 c2 2 1 1
EDIT: if 1st row in A,B,C from df1 and df2 is same then and only then compare 1st row of column D for each dataframe. Similarly, repeat for all the row.

Use merge and df.eval
df1.merge(df2, on=['A','B','C'], suffixes=['_df1','_df2']).eval('Diff=D_df1 - D_df2')
Out[314]:
A B C D_df1 D_df2 Diff
0 a1 b1 c1 1 2 -1
1 a2 b2 c2 2 1 1
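
A self-contained sketch of the merge approach, using the frames from the question. It computes the difference with plain column arithmetic instead of `eval`, which is only a stylistic choice for this example; the name `result` is illustrative.

```python
import pandas as pd

# Frames copied from the question
df1 = pd.DataFrame({
    'A': ['a1', 'a2', 'a3'],
    'B': ['b1', 'b2', 'b3'],
    'C': ['c1', 'c2', 'c3'],
    'D': [1, 2, 4],
})
df2 = pd.DataFrame({
    'A': ['a1', 'a2'],
    'B': ['b1', 'b2'],
    'C': ['c1', 'c2'],
    'D': [2, 1],
})

# An inner merge keeps only rows whose (A, B, C) appear in both frames;
# the overlapping column D gets the suffixes _df1 and _df2
result = df1.merge(df2, on=['A', 'B', 'C'], how='inner',
                   suffixes=('_df1', '_df2'))
result['Diff'] = result['D_df1'] - result['D_df2']
print(result)
```

With `how='left'` instead, the a3 row from df1 would be kept, with NaN in D_df2 and Diff.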

Summing columns from different dataframe according to some column names

Suppose I have a main dataframe
main_df
Cri1 Cri2 Cri3 total
0 A1 A2 A3 4
1 B1 B2 B3 5
2 C1 C2 C3 6
I also have 3 dataframes
df_1
Cri1 Cri2 Cri3 value
0 A1 A2 A3 1
1 B1 B2 B3 2
df_2
Cri1 Cri2 Cri3 value
0 A1 A2 A3 9
1 C1 C2 C3 10
df_3
Cri1 Cri2 Cri3 value
0 B1 B2 B3 15
1 C1 C2 C3 17
What I want is to add the value from each of these dataframes to total in main_df, matching rows on the Cri columns,
i.e. main_df will become
main_df
Cri1 Cri2 Cri3 total
0 A1 A2 A3 14
1 B1 B2 B3 22
2 C1 C2 C3 33
Of course I can do it using for loop, but at the end I want to apply the method to a large amount of data, say 50000 rows in each dataframe.
Is there other ways to solve it?
Thank you!

First you should align your numeric column names, so that total in main_df lines up with value in the other frames. In this case:
df_main = main_df.rename(columns={'total': 'value'})
Then you have a couple of options.
concat + groupby
You can concatenate and then perform a groupby with sum:
res = pd.concat([df_main, df_1, df_2, df_3])\
.groupby(['Cri1', 'Cri2', 'Cri3']).sum()\
.reset_index()
print(res)
Cri1 Cri2 Cri3 value
0 A1 A2 A3 14
1 B1 B2 B3 22
2 C1 C2 C3 33
set_index + reduce / add
Alternatively, you can create a list of dataframes indexed by your criteria columns. Then use functools.reduce with pd.DataFrame.add to sum these dataframes.
from functools import reduce
dfs = [df.set_index(['Cri1', 'Cri2', 'Cri3']) for df in [df_main, df_1, df_2, df_3]]
res = reduce(lambda x, y: x.add(y, fill_value=0), dfs).reset_index()
print(res)
Cri1 Cri2 Cri3 value
0 A1 A2 A3 14.0
1 B1 B2 B3 22.0
2 C1 C2 C3 33.0
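
Here is a runnable sketch of the concat + groupby option with the question's data filled in. The names `keys`, `frames` and `res` are illustrative, and the final rename back to `total` is an addition of this sketch so the result matches the layout asked for in the question.

```python
import pandas as pd

keys = ['Cri1', 'Cri2', 'Cri3']

# Frames copied from the question
main_df = pd.DataFrame({
    'Cri1': ['A1', 'B1', 'C1'],
    'Cri2': ['A2', 'B2', 'C2'],
    'Cri3': ['A3', 'B3', 'C3'],
    'total': [4, 5, 6],
})
df_1 = pd.DataFrame({'Cri1': ['A1', 'B1'], 'Cri2': ['A2', 'B2'],
                     'Cri3': ['A3', 'B3'], 'value': [1, 2]})
df_2 = pd.DataFrame({'Cri1': ['A1', 'C1'], 'Cri2': ['A2', 'C2'],
                     'Cri3': ['A3', 'C3'], 'value': [9, 10]})
df_3 = pd.DataFrame({'Cri1': ['B1', 'C1'], 'Cri2': ['B2', 'C2'],
                     'Cri3': ['B3', 'C3'], 'value': [15, 17]})

# Give every frame the same numeric column name, stack them,
# then sum per criteria combination
frames = [main_df.rename(columns={'total': 'value'}), df_1, df_2, df_3]
res = (
    pd.concat(frames)
      .groupby(keys, as_index=False)['value']
      .sum()
      .rename(columns={'value': 'total'})
)
print(res)
```

Because every row goes through a single grouped sum, no Python-level loop over rows is needed.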

select the first N elements of each row in a column

I am looking to select the first two elements of each row in column a and column b.
Here is an example
df = pd.DataFrame({'a': ['A123', 'A567','A100'], 'b': ['A156', 'A266666','A35555']})
>>> df
a b
0 A123 A156
1 A567 A266666
2 A100 A35555
desired output
>>> df
a b
0 A1 A1
1 A5 A2
2 A1 A3
I have been trying to use df.loc but not been successful.

Use
In [905]: df.apply(lambda x: x.str[:2])
Out[905]:
a b
0 A1 A1
1 A5 A2
2 A1 A3
Or, element-wise with applymap (deprecated since pandas 2.1 in favor of DataFrame.map):
In [908]: df.applymap(lambda x: x[:2])
Out[908]:
a b
0 A1 A1
1 A5 A2
2 A1 A3

In [107]: df.apply(lambda c: c.str.slice(stop=2))
Out[107]:
a b
0 A1 A1
1 A5 A2
2 A1 A3
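
To generalize from two characters to the first N, the column-wise `.str` slice can be wrapped in a small function. This is only a sketch: `first_n` is a hypothetical helper name, and it assumes every column holds strings.

```python
import pandas as pd

df = pd.DataFrame({'a': ['A123', 'A567', 'A100'],
                   'b': ['A156', 'A266666', 'A35555']})

def first_n(frame, n):
    """Return a new frame keeping the first n characters of every cell.

    Hypothetical helper; assumes all columns contain strings.
    """
    # .str[:n] slices each string in the column; apply runs it per column
    return frame.apply(lambda col: col.str[:n])

out = first_n(df, 2)
print(out)
```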
