find inverse/mirror pair and assign a pair number - python

I am trying to find inverse/mirror pairs and assign a pair number to each pair, but I am stuck on how to move forward from the below.
df1:
col1 col2 no. of records
A B 2
B A 5
C D 4
D C 6
E F 4
G H 6
I am trying to get this result:
col1 col2 pair no. of records totalcount
A B 1 2 7
B A 1 5 7
C D 2 4 10
D C 2 6 10
E F 3 4 4
G H 4 6 6
I tried making a duplicate dataframe df2 and using the isin function, but it only returned True/False and I was stuck for a long time on how to group the rows together:
df1['row_matched'] = np.where((df1.col1 + df1.col2).isin(df2.col2 + df2.col1), df2['row'], "")
I will appreciate any help!

Use a dense rank over the unordered pair of col1 and col2, which you can build with a set:
In [37]: df['pair'] = (df.apply(lambda x: '-'.join(set(x[['col1', 'col2']])), 1)
.rank(method='dense').astype(int))
In [38]: df['totalcount'] = df.groupby('pair')['no.ofrecords'].transform('sum')
In [39]: df
Out[39]:
col1 col2 no.ofrecords pair totalcount
0 A B 2 1 7
1 B A 5 1 7
2 C D 4 2 10
3 D C 6 2 10
4 E F 4 3 4
5 G H 6 4 6
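Joining a set into a string can misbehave if col1 equals col2 (the set collapses to one element) or if the labels themselves contain '-'. A variant that sorts the two columns as an array and factorizes the result avoids string keys entirely; a minimal sketch, assuming the column names above:
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': list('ABCDEG'),
                   'col2': list('BADCFH'),
                   'no.ofrecords': [2, 5, 4, 6, 4, 6]})

# sort each (col1, col2) pair so mirrored rows share the same key
key = pd.Series(map(tuple, np.sort(df[['col1', 'col2']].to_numpy(), axis=1)),
                index=df.index)

# factorize numbers the keys in order of first appearance, starting at 0
df['pair'] = pd.factorize(key)[0] + 1
df['totalcount'] = df.groupby('pair')['no.ofrecords'].transform('sum')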

I want to groupby and drop groups if the shape is 3 and none of the values in a column is zero

I want to groupby ID and drop a group if it satisfies two conditions: its shape is 3 and its value column doesn't contain any zeros.
My df
ID value
A 3
A 2
A 0
B 1
B 1
C 3
C 3
C 4
D 0
D 5
D 5
E 6
E 7
E 7
F 3
F 2
my desired df would be
ID value
A 3
A 2
A 0
B 1
B 1
D 0
D 5
D 5
F 3
F 2
You can use boolean indexing with groupby operations:
g = df['value'].eq(0).groupby(df['ID'])
# group contains a 0
m1 = g.transform('any')
# group doesn't have size 3
m2 = g.transform('size').ne(3)
# keep if either condition above is met
# this is equivalent to dropping groups of size 3 that contain no 0
out = df[m1|m2]
Output:
ID value
0 A 3
1 A 2
2 A 0
3 B 1
4 B 1
8 D 0
9 D 5
10 D 5
14 F 3
15 F 2
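The same logic reads naturally with groupby.filter, which keeps a group whenever the callable returns True, though transform-based masks are usually faster on large frames; a minimal sketch assuming the same df:
out = df.groupby('ID').filter(lambda g: g['value'].eq(0).any() or len(g) != 3)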

coding 2 columns in pandas with the same key

I am asked to encode the following two columns. When I use cat.codes, the problem is that the two columns do not end up with the same codes; what I want is for equal values to get the same code in both columns.
Example:
The input is a dataframe
col1 col2
0 A E
1 B F
2 C A
3 D B
4 A B
5 E A
Assuming the above as df, you can compute the unique values and use them to factorize:
vals = df[['col1', 'col2']].stack().unique()  # unique values across both columns, in order of appearance
d = {k: v for v, k in enumerate(vals)}        # value -> integer code
df['col1_codes'] = df['col1'].map(d)
df['col2_codes'] = df['col2'].map(d)
Output:
col1 col2 col1_codes col2_codes
0 A E 0 1
1 B F 2 3
2 C A 4 0
3 D B 5 2
4 A B 0 2
5 E A 1 0
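Equivalently, pd.factorize on the stacked columns produces the same mapping in one pass; a minimal sketch, assuming the same df:
import pandas as pd

# stack interleaves col1/col2 row by row; factorize codes the values in
# order of first appearance, and reshape restores the two columns
codes, uniques = pd.factorize(df[['col1', 'col2']].stack())
df[['col1_codes', 'col2_codes']] = codes.reshape(-1, 2)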
You can also try the approach below, which uses sklearn's LabelEncoder. Given this df:
a b
0 apple nokia
1 xiomi samsung
2 samsung apple
3 moto oneplus
import pandas as pd
from sklearn import preprocessing

# fit the encoder on the values of both columns so the codes are shared
cat_var = list(df.a.values) + list(df.b.values)
le = preprocessing.LabelEncoder()
le.fit(cat_var)
df['a'] = le.transform(df.a)
df['b'] = le.transform(df.b)
This will give you the output below:
a b
0 0 2
1 5 4
2 4 0
3 1 3
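Since the encoder keeps its fitted classes, the codes can later be mapped back to the original labels; for example:
# recover the original labels from the shared codes
df['a_labels'] = le.inverse_transform(df['a'])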

Pandas self join on a single column with no duplicates

Is there a way to find unique rows, where unique is in the sense of two "identical" columns?
>>> d = pandas.DataFrame([['A',1],['A',2],['A',3],['B',1],['B',4],['B',2]], columns = ['col_a','col_b'])
>>> d
col_a col_b
0 A 1
1 A 2
2 A 3
3 B 1
4 B 4
5 B 2
>>> d.merge(d, left_on='col_b', right_on='col_b')
col_a_x col_b col_a_y
0 A 1 A
1 A 1 B
2 B 1 A
3 B 1 B
4 A 2 A
5 A 2 B
6 B 2 A
7 B 2 B
8 A 3 A
9 B 4 B
>>> d_desired
0 A 1 A
1 A 1 B
3 B 1 B
4 A 2 A
5 A 2 B
7 B 2 B
8 A 3 A
9 B 4 B
But I would like to drop the mirrored duplicate entries, e.g. B 1 A and B 2 A.
I later want to group by the two columns, so I need to always drop the same side of each "duplicate": if I drop B 1 A I should also drop B 2 A, not A 2 B.
Try this and see if it works for you:
M = d.merge(d,left_on='col_b',right_on='col_b')
# flag rows where col_a_x is greater than col_a_y
# (the inequality check is redundant, since > already implies !=)
cond = (M.col_a_x > M.col_a_y) & (M.col_a_x != M.col_a_y)
# drop the flagged rows
M.loc[~cond]
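Since the stated goal is to group by the two columns afterwards, the filtered frame can be aggregated directly; a hypothetical follow-up:
M = d.merge(d, on='col_b')
out = M.loc[M.col_a_x <= M.col_a_y]
# count matches per unordered (col_a_x, col_a_y) pair
counts = out.groupby(['col_a_x', 'col_a_y'])['col_b'].count()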

pandas: group consecutive column values below a certain number and assign the group numbers as a new column

I have a data frame like this,
df
col1 col2
A 2
B 3
C 1
D 4
E 6
F 1
G 2
H 8
I 1
J 10
Now I want to create another column col3 by grouping together consecutive col2 values that are below 5, with each value above 5 starting a new group, and numbering the groups from 1 up to the number of groups; the final data frame would look like:
col1 col2 col3
A 2 1
B 3 1
C 1 1
D 4 1
E 6 2
F 1 2
G 2 2
H 8 3
I 1 3
J 10 4
I could do this by comparing the previous value with the current one, storing the results in a list, and making that list col3.
But the execution time would be huge that way, so I am looking for a shortcut/pythonic way to do it efficiently.
Compare with Series.gt (>) and take the cumulative sum with Series.cumsum. Here the new column starts at 0 because the first value of the column is below 5; if the first value were above 5, the numbering would start at 1:
df['col3'] = df['col2'].gt(5).cumsum()
print (df)
col1 col2 col3
0 A 2 0
1 B 3 0
2 C 1 0
3 D 4 0
4 E 6 1
5 F 1 1
6 G 2 1
7 H 8 2
8 I 1 2
9 J 10 3
For a general solution that always starts at 1, use this trick: test whether the first value is below N, convert the boolean to an integer (True -> 1, False -> 0) and add it to the cumulative sum:
N = 5
df['col3'] = df['col2'].gt(N).cumsum() + int(df.loc[0, 'col2'] < N)

# test column whose first value is above N (2 + 5 = 7)
df = df.assign(col21 = df['col2'].add(pd.Series({0: 5}), fill_value=0).astype(int))
df['col31'] = df['col21'].gt(N).cumsum() + int(df.loc[0, 'col21'] < N)
print (df)
col1 col2 col21 col3 col31
0 A 2 7 1 1
1 B 3 3 1 1
2 C 1 1 1 1
3 D 4 4 1 1
4 E 6 6 2 2
5 F 1 1 2 2
6 G 2 2 2 2
7 H 8 8 3 3
8 I 1 1 3 3
9 J 10 10 4 4
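A variant that avoids special-casing the first row is to renumber the cumulative sum with pd.factorize, which always yields group ids starting at 1; a minimal sketch:
import pandas as pd

N = 5
# factorize relabels the distinct cumsum levels in order of appearance
# starting at 0, so adding 1 numbers the groups from 1 either way
df['col3'] = pd.factorize(df['col2'].gt(N).cumsum())[0] + 1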

How to groupby consecutive occurrences of duplicates in pandas

I have a dataframe which contains two columns [Name, In.Cl]. I want to groupby Name, but based on consecutive occurrences. For example, consider the DataFrame below.
Code to generate below DF:
df=pd.DataFrame({'Name':['A','B','B','A','A','B','C','C','C','B','C'],'In.Cl':[2,1,5,2,4,2,3,1,8,5,7]})
Input:
In.Cl Name
0 2 A
1 1 B
2 5 B
3 2 A
4 4 A
5 2 B
6 3 C
7 1 C
8 8 C
9 5 B
10 7 C
I want to group the rows where the Name repeats consecutively, e.g. group [B] (rows 1-2), [A] (rows 3-4), [C] (rows 6-8), etc., and perform a sum over the In.Cl column within each group.
Expected Output:
In.Cl Name col1 col2
0 2 A A(1) 2
1 1 B B(2) 6
2 5 B B(2) 6
3 2 A A(2) 6
4 4 A A(2) 6
5 2 B B(1) 2
6 3 C C(3) 12
7 1 C C(3) 12
8 8 C C(3) 12
9 5 B B(1) 5
10 7 C C(1) 7
So far I have tried combinations of duplicated and groupby, but they didn't work as I expected. I think I need something like groupby + consecutive, but I don't have an idea how to solve this problem.
Any help would be appreciated.
Compare Name with its shifted self and take the cumulative sum to build an id for each consecutive run, then transform per run:
In [37]: g = df.groupby((df.Name != df.Name.shift()).cumsum())
In [38]: df['col1'] = df['Name'] + '(' + g['In.Cl'].transform('size').astype(str) + ')'
In [39]: df['col2'] = g['In.Cl'].transform('sum')
In [40]: df
Out[40]:
Name In.Cl col1 col2
0 A 2 A(1) 2
1 B 1 B(2) 6
2 B 5 B(2) 6
3 A 2 A(2) 6
4 A 4 A(2) 6
5 B 2 B(1) 2
6 C 3 C(3) 12
7 C 1 C(3) 12
8 C 8 C(3) 12
9 B 5 B(1) 5
10 C 7 C(1) 7
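If a one-row-per-run summary is wanted instead of broadcasting the results back to every row, the same run ids work with named aggregation; a minimal sketch:
runs = (df.groupby((df['Name'] != df['Name'].shift()).cumsum())
          .agg(Name=('Name', 'first'),
               size=('In.Cl', 'size'),
               total=('In.Cl', 'sum')))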
Here is a slightly long-winded alternative utilizing itertools.groupby.
For more than ~1000 rows, use @MaxU's solution - it's faster.
from itertools import groupby, chain, repeat
from operator import itemgetter

chainer = chain.from_iterable

def sumfunc(x):
    # x is a list of (Name, In.Cl) tuples for one consecutive run
    return (sum(map(itemgetter(1), x)), len(x))

# group consecutive rows by Name
grouper = groupby(zip(df['Name'], df['In.Cl']), key=itemgetter(0))
summer = [sumfunc(list(j)) for _, j in grouper]

# append each run's length to Name, and broadcast each run's sum
df['Name'] += pd.Series(list(chainer(repeat(j, j) for i, j in summer))).astype(str)
df['col2'] = list(chainer(repeat(i, j) for i, j in summer))
print(df)
In.Cl Name col2
0 2 A1 2
1 1 B2 6
2 5 B2 6
3 2 A2 6
4 4 A2 6
5 2 B1 2
6 3 C3 12
7 1 C3 12
8 8 C3 12
9 5 B1 5
10 7 C1 7
