How do I delete rows that are duplicated on specified columns? - python

I'm new to Python.
My code:
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [2,3,2,2,2],
                   'B': [1,5,5,1,1],
                   'C': [1,6,6,2,1],
                   'D': [1,2,3,1,1]})
df
dataframe:
A B C D
0 2 1 1 1
1 3 5 6 2
2 2 5 6 3
3 2 1 2 1
4 2 1 1 1
I want to keep the first row and delete the later rows whenever column B and column C are both the same.
Like:
for row 0 & row 4, column B and column C are the same, so delete row 4;
for row 1 & row 2, column B and column C are the same, so delete row 2.

Use drop_duplicates on the 'B' and 'C' columns (subset=['B', 'C']) and keep the first occurrence (keep='first'):
>>> df.drop_duplicates(subset=['B', 'C'], keep='first')
A B C D
0 2 1 1 1
1 3 5 6 2
3 2 1 2 1
keep='first' is the default, so you don't have to set it.
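For reference, keep='last' keeps the last occurrence instead, and keep=False drops every row that has a duplicate:
>>> df.drop_duplicates(subset=['B', 'C'], keep='last')
A B C D
2 2 5 6 3
3 2 1 2 1
4 2 1 1 1
>>> df.drop_duplicates(subset=['B', 'C'], keep=False)
A B C D
3 2 1 2 1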

You can do something like:
df.groupby(['B', 'C']).head(1)
This takes the first element from each group:
A B C D
0 2 1 1 1
1 3 5 6 2
3 2 1 2 1
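If you don't need to preserve the original index, groupby(...).first() is an equivalent spelling here; note the result comes back sorted by the group keys with the key columns first, and first() takes the first non-null value per column (which only matters if the data contains NaN):
df.groupby(['B', 'C'], as_index=False).first()
B C A D
0 1 1 2 1
1 1 2 2 1
2 5 6 3 2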

Or:
>>> df[~df[['B', 'C']].duplicated()]
A B C D
0 2 1 1 1
1 3 5 6 2
3 2 1 2 1
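duplicated() marks every repeat after the first, so the mask being inverted here is:
>>> df[['B', 'C']].duplicated()
0    False
1    False
2     True
3    False
4     True
dtype: bool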

Related

How to create new columns based on whether another group of Columns Exists

My problem is as follows:
I have a dataframe df which has 5 columns, say ('A', 'B', 'C', 'D', 'E').
Now I am looking to combine these columns based on the sets they belong to, say GP1 = ['A', 'B', 'D'] and GP2 = ['C', 'E'], from which I will create two new columns:
df['Group1'] = df[GP1].min(axis=1)
df['Group2'] = df[GP2].max(axis=1)
However, depending on the data, column 'A' (or 'D' or 'B', or maybe all of them) may be missing from the first set, or column 'C' or 'E' (or both) may be missing from the second set.
So I am looking for code that checks whether any column from either set is missing, creates 'Group1' or 'Group2' only if all the columns of that group exist, and skips creating the new column otherwise.
How can I achieve that? I was trying for loops, but they weren't helping and the logic was getting complicated.
An example when all the columns in both sets are there:
df_in
A B C D E
1 2 3 4 5
2 4 6 2 3
1 0 2 4 2
df_out
A B C D E Group1 Group2
1 2 3 4 5 1 5
2 4 6 2 3 2 6
1 0 2 4 2 0 2
An example when, say, column E from the second group is not there:
df_in
A B C D
1 2 3 4
2 4 6 2
1 0 2 4
df_out
A B C D Group1
1 2 3 4 1
2 4 6 2 2
1 0 2 4 0
When both A & D are missing from set 1 (and only B is there from set/group 1):
df_in
B C E
2 3 5
4 6 3
0 2 2
df_out
B C E Group2
2 3 5 5
4 6 3 6
0 2 2 2
The following case is when A from set 1 is missing and C from set 2 is missing:
df_in
B D E
2 4 5
4 2 3
0 4 2
df_out
B D E
2 4 5
4 2 3
0 4 2
Any help in this direction will be immensely appreciated. Thanks
Here you go, I think you can use this:
df_out = (df_in.assign(Group1=df_in.reindex(gp1, axis=1).dropna().min(axis=1),
                       Group2=df_in.reindex(gp2, axis=1).dropna().max(axis=1))
               .dropna(axis=1, how='all'))
The trick: reindex inserts any missing group column as all-NaN, so dropna() then empties the frame, min/max produce an all-NaN column, and the final dropna(axis=1, how='all') removes it again.
MCVE:
import pandas as pd

df_in = pd.read_clipboard()  # read a copy of df_in from the question above
print(df_in)
# A B C D E
# 0 1 2 3 4 5
# 1 2 4 6 2 3
# 2 1 0 2 4 2
gp1 = ['A','B','D']
gp2 = ['C','E']
df_out = (df_in.assign(Group1=df_in.reindex(gp1, axis=1).dropna().min(axis=1),
                       Group2=df_in.reindex(gp2, axis=1).dropna().max(axis=1))
               .dropna(axis=1, how='all'))
print(df_out)
# A B C D E Group1 Group2
# 0 1 2 3 4 5 1 5
# 1 2 4 6 2 3 2 6
# 2 1 0 2 4 2 0 2
df_in_copy = df_in.copy()  # make a copy to reuse later
df_in = df_in.drop('E', axis=1)  # drop column E
print(df_in)
# A B C D
# 0 1 2 3 4
# 1 2 4 6 2
# 2 1 0 2 4
df_out = (df_in.assign(Group1=df_in.reindex(gp1, axis=1).dropna().min(axis=1),
                       Group2=df_in.reindex(gp2, axis=1).dropna().max(axis=1))
               .dropna(axis=1, how='all'))
print(df_out)
# A B C D Group1
# 0 1 2 3 4 1
# 1 2 4 6 2 2
# 2 1 0 2 4 0
df_in = df_in_copy.copy()  # restore the original df_in from the copy
df_in = df_in.drop(['A','D'], axis=1)  # drop columns A and D
print(df_in)
# B C E
# 0 2 3 5
# 1 4 6 3
# 2 0 2 2
df_out = (df_in.assign(Group1=df_in.reindex(gp1, axis=1).dropna().min(axis=1),
                       Group2=df_in.reindex(gp2, axis=1).dropna().max(axis=1))
               .dropna(axis=1, how='all'))
print(df_out)
# B C E
# 0 2 3 5
# 1 4 6 3
# 2 0 2 2
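If you prefer something more explicit than the reindex/dropna trick, a minimal sketch (not from the original answer) is to test column membership directly before creating each group column:
gp1 = ['A', 'B', 'D']
gp2 = ['C', 'E']
# create each group column only when every column of that set is present
if set(gp1).issubset(df_in.columns):
    df_in['Group1'] = df_in[gp1].min(axis=1)
if set(gp2).issubset(df_in.columns):
    df_in['Group2'] = df_in[gp2].max(axis=1)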

search for duplicated consecutive rows and put in additional column pandas

I have a df:
df1
a b c d
0 2 4 1
0 2 5 1
0 1 6 2
1 2 7 2
1 1 8 1
1 1 4 1
I need to group by a and b, and if two consecutive rows within a group both have d = 1, I want the second row's c value in a new column next to the first row. Like:
df1
a b c d c1
0 2 4 1 5
0 1 6 2 nan
1 2 7 2 nan
1 1 8 1 4
Any ideas?
I tried
df1.groupby([df1.a, df1.b, df1.d.diff().ne(0)])
and then planned to loc() only the rows with 1s and merge the two dataframes again, but the grouping function is not completely correct.
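A minimal sketch of one way to do this (an assumption about the intent, not a tested answer from the thread): compare each row with the physically next row, pull the neighbour's c where both rows share (a, b) and have d == 1, then drop the second row of each pair:
import pandas as pd

df1 = pd.DataFrame({'a': [0, 0, 0, 1, 1, 1],
                    'b': [2, 2, 1, 2, 1, 1],
                    'c': [4, 5, 6, 7, 8, 4],
                    'd': [1, 1, 2, 2, 1, 1]})

nxt = df1.shift(-1)  # the next row, aligned with the current one
# a pair qualifies when the next row has the same a and b and both d values are 1
pair = df1['a'].eq(nxt['a']) & df1['b'].eq(nxt['b']) & df1['d'].eq(1) & nxt['d'].eq(1)
df1['c1'] = nxt['c'].where(pair)
# drop the second row of each qualifying pair
out = df1[~pair.shift(fill_value=False)].reset_index(drop=True)
print(out)
#    a  b  c  d   c1
# 0  0  2  4  1  5.0
# 1  0  1  6  2  NaN
# 2  1  2  7  2  NaN
# 3  1  1  8  1  4.0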

Pandas self join on a single column with no duplicates

Is there a way to find unique rows, where unique is in the sense of two "identical" columns?
>>> d = pandas.DataFrame([['A',1],['A',2],['A',3],['B',1],['B',4],['B',2]], columns = ['col_a','col_b'])
>>> d
col_a col_b
0 A 1
1 A 2
2 A 3
3 B 1
4 B 4
5 B 2
>>> d.merge(d, left_on='col_b', right_on='col_b')
col_a_x col_b col_a_y
0 A 1 A
1 A 1 B
2 B 1 A
3 B 1 B
4 A 2 A
5 A 2 B
6 B 2 A
7 B 2 B
8 A 3 A
9 B 4 B
>>> d_desired
col_a_x col_b col_a_y
0 A 1 A
1 A 1 B
3 B 1 B
4 A 2 A
5 A 2 B
7 B 2 B
8 A 3 A
9 B 4 B
But I would like to drop the duplicate entries, e.g. B 1 A and B 2 A.
I will later want to group by the two columns, so I need to always drop the same "duplicate": if I drop B 1 A I should also drop B 2 A, and not A 2 B.
Try this and see if it works for you:
M = d.merge(d, left_on='col_b', right_on='col_b')
# find the mirrored duplicates: rows where col_a_x is greater than col_a_y
# (the != check is redundant, since > already excludes equality)
cond = (M.col_a_x > M.col_a_y) & (M.col_a_x != M.col_a_y)
# filter out those rows
M.loc[~cond]
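This leaves exactly the rows of d_desired:
col_a_x col_b col_a_y
0 A 1 A
1 A 1 B
3 B 1 B
4 A 2 A
5 A 2 B
7 B 2 B
8 A 3 A
9 B 4 B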

Pandas: Sort the column on frequency by another column having same value grouped

I have a dataframe which is grouped by the y column and sorted on the count of each y value.
Code:
df['count'] = df.groupby(['y'])['y'].transform(pd.Series.value_counts)
df = df.sort_values('count', ascending=False)
Output:
x y count
1 a 4
3 a 4
2 a 4
1 a 4
2 c 3
1 c 3
2 c 3
2 b 2
1 b 2
Now, I want to sort the x column by its frequency, with rows that share the same y value kept grouped together, like below:
Expected Output:
x y count
1 a 4
1 a 4
2 a 4
3 a 4
2 c 3
2 c 3
1 c 3
2 b 2
1 b 2
It seems you need groupby and value_counts, and then numpy.repeat to expand the index values by their counts into a DataFrame:
s = df.groupby('y', sort=False)['x'].value_counts()
#alternative
#s = df.groupby('y', sort=False)['x'].apply(pd.Series.value_counts)
print (s)
y x
a 1 2
2 1
3 1
c 2 2
1 1
b 1 1
2 1
Name: x, dtype: int64
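To see what numpy.repeat does here: it expands each (y, x) index tuple by its count before the DataFrame is rebuilt:
>>> np.repeat(s.index.values, s.values).tolist()
[('a', 1), ('a', 1), ('a', 2), ('a', 3), ('c', 2), ('c', 2), ('c', 1), ('b', 1), ('b', 2)]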
df1 = pd.DataFrame(np.repeat(s.index.values, s.values).tolist(), columns=['y','x'])
#change order of columns (reindex_axis is deprecated; use reindex)
df1 = df1.reindex(['x','y'], axis=1)
print (df1)
x y
0 1 a
1 1 a
2 2 a
3 3 a
4 2 c
5 2 c
6 1 c
7 1 b
8 2 b
If you are using an older pandas version where df.sort_values is not supported, you can use:
df.sort(columns=['count','x'], ascending=[False,True])
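On current pandas, the equivalent call is:
df.sort_values(['count','x'], ascending=[False,True])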

How to groupby based on two columns in pandas?

A similar question might have been asked before, but I couldn't find the exact one fitting my problem.
I want to group a dataframe by two columns.
For example, to turn this:
id product quantity
1 A 2
1 A 3
1 B 2
2 A 1
2 B 1
3 B 2
3 B 1
Into this:
id product quantity
1 A 5
1 B 2
2 A 1
2 B 1
3 B 3
Meaning: sum the "quantity" column for rows with the same "id" and "product".
You need groupby with the parameter as_index=False to get a DataFrame back, aggregating with sum:
df = df.groupby(['id','product'], as_index=False)['quantity'].sum()
print (df)
id product quantity
0 1 A 5
1 1 B 2
2 2 A 1
3 2 B 1
4 3 B 3
Or add reset_index:
df = df.groupby(['id','product'])['quantity'].sum().reset_index()
print (df)
id product quantity
0 1 A 5
1 1 B 2
2 2 A 1
3 2 B 1
4 3 B 3
You can use pivot_table with aggfunc='sum' (the default aggfunc is 'mean', so it has to be set explicitly):
df.pivot_table('quantity', ['id', 'product'], aggfunc='sum').reset_index()
id product quantity
0 1 A 5
1 1 B 2
2 2 A 1
3 2 B 1
4 3 B 3
You can use groupby and an aggregate function:
import pandas as pd
df = pd.DataFrame({
    'id': [1,1,1,2,2,3,3],
    'product': ['A','A','B','A','B','B','B'],
    'quantity': [2,3,2,1,1,2,1]
})
print(df)
id product quantity
0 1 A 2
1 1 A 3
2 1 B 2
3 2 A 1
4 2 B 1
5 3 B 2
6 3 B 1
df = df.groupby(['id','product']).agg({'quantity':'sum'}).reset_index()
print(df)
id product quantity
0 1 A 5
1 1 B 2
2 2 A 1
3 2 B 1
4 3 B 3
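On pandas 0.25+ the same aggregation can also be written with named aggregation, which names the output column in one step:
df.groupby(['id', 'product'], as_index=False).agg(quantity=('quantity', 'sum'))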
