Is there a way to find unique rows, where unique is in the sense of two "identical" columns?
>>> d = pandas.DataFrame([['A',1],['A',2],['A',3],['B',1],['B',4],['B',2]], columns = ['col_a','col_b'])
>>> d
  col_a  col_b
0 A 1
1 A 2
2 A 3
3 B 1
4 B 4
5 B 2
>>> d.merge(d, left_on='col_b', right_on='col_b')
  col_a_x  col_b  col_a_y
0 A 1 A
1 A 1 B
2 B 1 A
3 B 1 B
4 A 2 A
5 A 2 B
6 B 2 A
7 B 2 B
8 A 3 A
9 B 4 B
>>> d_desired
  col_a_x  col_b  col_a_y
0 A 1 A
1 A 1 B
3 B 1 B
4 A 2 A
5 A 2 B
7 B 2 B
8 A 3 A
9 B 4 B
But I would like to drop the duplicate entries, e.g. B 1 A and B 2 A.
I will later want to group by the two columns, so I need to always drop the same "duplicate": if I drop B 1 A I should also drop B 2 A, and not A 2 B.
Try this and see if it works for you:
M = d.merge(d, left_on='col_b', right_on='col_b')
# find rows where the first col_a is greater than the second;
# these are the reversed twins of the pairs we already keep
cond = M.col_a_x > M.col_a_y
# filter out those rows
M.loc[~cond]
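Equivalently, here is a sketch of an alternative (my own, not from the thread) that normalizes each pair before dropping duplicates: numpy.sort orders the two col_a values row-wise, so B 1 A and A 1 B collapse to the same row.
import numpy as np
M = d.merge(d, on='col_b')
# sort the pair columns within each row so (B, A) becomes (A, B)
M[['col_a_x', 'col_a_y']] = np.sort(M[['col_a_x', 'col_a_y']], axis=1)
M = M.drop_duplicates()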
I want to group by ID and drop a group if it satisfies two conditions: its size is 3 and its value column doesn't contain any zeros.
My df:
ID value
A 3
A 2
A 0
B 1
B 1
C 3
C 3
C 4
D 0
D 5
D 5
E 6
E 7
E 7
F 3
F 2
My desired df would be:
ID value
A 3
A 2
A 0
B 1
B 1
D 0
D 5
D 5
F 3
F 2
You can use boolean indexing with groupby operations:
g = df['value'].eq(0).groupby(df['ID'])
# group contains a 0
m1 = g.transform('any')
# group doesn't have size 3
m2 = g.transform('size').ne(3)
# keep if either condition above is met;
# equivalent to dropping groups that have size 3 AND contain no 0
out = df[m1|m2]
Output:
ID value
0 A 3
1 A 2
2 A 0
3 B 1
4 B 1
8 D 0
9 D 5
10 D 5
14 F 3
15 F 2
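For readability, the same keep/drop rule can also be written with groupby.filter; a minimal sketch (generally slower than the transform approach on large frames):
# keep a group if it contains a 0 or its size differs from 3
out = df.groupby('ID').filter(lambda g: g['value'].eq(0).any() or len(g) != 3)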
I'm new to Python.
My code:
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [2, 3, 2, 2, 2],
                   'B': [1, 5, 5, 1, 1],
                   'C': [1, 6, 6, 2, 1],
                   'D': [1, 2, 3, 1, 1]})
df
dataframe:
A B C D
0 2 1 1 1
1 3 5 6 2
2 2 5 6 3
3 2 1 2 1
4 2 1 1 1
I want to delete rows and keep only the first one when column B and column C are both the same.
For example:
for row 0 & row 4, column B and column C are the same, so delete row 4;
for row 1 & row 2, column B and column C are the same, so delete row 2.
Use drop_duplicates on the 'B' and 'C' columns (subset=['B', 'C']) and keep the first occurrence (keep='first'):
>>> df.drop_duplicates(subset=['B', 'C'], keep='first')
A B C D
0 2 1 1 1
1 3 5 6 2
3 2 1 2 1
keep='first' is the default option so you don't have to set it.
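For completeness, the other keep options behave as you would expect; a quick sketch:
# keep the last occurrence of each (B, C) pair instead of the first
df.drop_duplicates(subset=['B', 'C'], keep='last')
# drop every row whose (B, C) pair occurs more than once
df.drop_duplicates(subset=['B', 'C'], keep=False)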
You can do something like:
df.groupby(['B', 'C']).head(1)
This takes the first element from each group:
A B C D
0 2 1 1 1
1 3 5 6 2
3 2 1 2 1
Or:
>>> df[~df[['B', 'C']].duplicated()]
A B C D
0 2 1 1 1
1 3 5 6 2
3 2 1 2 1
How to drop duplicates in this specific way:
Index B C
1 2 1
2 2 0
3 3 1
4 3 1
5 4 0
6 4 0
7 4 0
8 5 1
9 5 0
10 5 1
Desired output:
Index B C
3 3 1
5 4 0
So drop duplicates on B, but when C is the same on all rows of a group, keep one sample/record.
For example, B = 3 for indexes 3/4, but since C = 1 for both, I do not drop them all.
But B = 5 for indexes 8/9/10 has C = 1 or 0, so all of it gets dropped.
Try this, using transform with nunique and drop_duplicates:
df[df.groupby('B')['C'].transform('nunique') == 1].drop_duplicates(subset='B')
Output:
B C
Index
3 3 1
5 4 0
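The same logic can also be spelled out with groupby.filter, which may read more naturally; a sketch of an equivalent approach:
out = (df.groupby('B')
         .filter(lambda g: g['C'].nunique() == 1)  # keep groups where C is constant
         .drop_duplicates(subset='B'))             # keep one sample per B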
I have a df:
df1
a b c d
0 2 4 1
0 2 5 1
0 1 6 2
1 2 7 2
1 1 8 1
1 1 4 1
I need to group by a and b, and if two consecutive values of d within a group are both 1, I want the second row's c placed in a column next to the first row. Like:
df1
a b c d c1
0 2 4 1 5
0 1 6 2 nan
1 2 7 2 nan
1 1 8 1 4
Any ideas?
I tried
df1.groupby([df1.a, df1.b, df1.d.diff().ne(0)])
then .loc[] only the rows with 1s and merging the two dataframes again, but the first function is not completely correct.
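One possible sketch (my own, not from the original thread): within each (a, b) group, pull the next row's c with a group-wise shift, keep it only where this row and the next both have d == 1, then drop the second row of each matched pair.
import pandas as pd

df1 = pd.DataFrame({'a': [0, 0, 0, 1, 1, 1],
                    'b': [2, 2, 1, 2, 1, 1],
                    'c': [4, 5, 6, 7, 8, 4],
                    'd': [1, 1, 2, 2, 1, 1]})

g = df1.groupby(['a', 'b'])
# next row's c within the same group
next_c = g['c'].shift(-1)
# True where this row and the next row of the group both have d == 1
pair = df1['d'].eq(1) & g['d'].shift(-1).eq(1)
df1['c1'] = next_c.where(pair)
# drop the second row of each matched pair
# (plain shift is enough here because each group's rows are contiguous)
second = pair.shift(fill_value=False)
out = df1[~second]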
I have a dataframe which is grouped by the y column and sorted on a count column derived from y.
Code:
df['count'] = df.groupby(['y'])['y'].transform(pd.Series.value_counts)
df = df.sort('count', ascending=False)
Output:
x y count
1 a 4
3 a 4
2 a 4
1 a 4
2 c 3
1 c 3
2 c 3
2 b 2
1 b 2
Now, I want to sort the x column by its frequency within the groups of same y values, like below:
Expected Output:
x y count
1 a 4
1 a 4
2 a 4
3 a 4
2 c 3
2 c 3
1 c 3
2 b 2
1 b 2
It seems you need groupby with value_counts, and then numpy.repeat to expand the index values by their counts into a DataFrame:
s = df.groupby('y', sort=False)['x'].value_counts()
#alternative
#s = df.groupby('y', sort=False)['x'].apply(pd.Series.value_counts)
print (s)
y x
a 1 2
2 1
3 1
c 2 2
1 1
b 1 1
2 1
Name: x, dtype: int64
df1 = pd.DataFrame(np.repeat(s.index.values, s.values).tolist(), columns=['y','x'])
# change order of columns (reindex_axis was removed in later pandas versions)
df1 = df1[['x','y']]
print (df1)
x y
0 1 a
1 1 a
2 2 a
3 3 a
4 2 c
5 2 c
6 1 c
7 1 b
8 2 b
If you are using an older version where df.sort_values is not supported, you can use:
df.sort(columns=['count','x'], ascending=[False,True])
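On modern pandas, a sketch that keeps all columns instead: compute a per-(y, x) frequency with transform and use it as a secondary sort key (the x_count helper column is my own name; the tie order among equally frequent x values may differ from the example output).
# per-(y, x) frequency; 'count' (the per-y frequency) already exists per the question
df['x_count'] = df.groupby(['y', 'x'])['x'].transform('size')
out = (df.sort_values(['count', 'x_count'], ascending=False)
         .drop(columns='x_count'))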