Remove duplicate rows with one different value [duplicate] - python

This question already has answers here:
Get the row(s) which have the max value in groups using groupby
(15 answers)
Closed 4 years ago.
I have a dataframe with duplicate rows except for one value. I want to filter them out and only keep the row with the higher value.
User_ID  Skill    Year_used
1        skill_a  2017
1        skill_b  2015
1        skill_a  2018
2        skill_c  2011
etc.
So for example rows with skill_a and the same User_ID need to be compared and only the one with the latest year should be kept.
.transform('count')
only gives me the number of rows in each User_ID group.
value_counts()
only gives me a Series that I can't merge back into the df.
Any ideas?
Thank you

You can use drop_duplicates after sorting by Year_used, so that keep='last' retains the row with the max:
df = df.sort_values('Year_used').drop_duplicates(['User_ID','Skill'], keep='last')
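For reference, a minimal runnable version on the sample data (the frame construction here is mine, not from the original post):
import pandas as pd

df = pd.DataFrame({'User_ID': [1, 1, 1, 2],
                   'Skill': ['skill_a', 'skill_b', 'skill_a', 'skill_c'],
                   'Year_used': [2017, 2015, 2018, 2011]})
# sort ascending by year, then keep the last (= latest) row per (User_ID, Skill)
df = df.sort_values('Year_used').drop_duplicates(['User_ID', 'Skill'], keep='last')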

Another option is to group by User_ID and Skill and keep the max Year_used:
df.groupby(['User_ID','Skill']).Year_used.max().reset_index()
   User_ID    Skill  Year_used
0        1  skill_a       2018
1        1  skill_b       2015
2        2  skill_c       2011
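Note the trade-off between the two answers: the groupby/max version returns only the grouping keys and the aggregated column, while the sort_values/drop_duplicates version keeps any other columns of the original frame intact.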

Error while trying to replace multiple values of a pandas data-frame column based on matching condition [duplicate]

This question already has an answer here:
pandas ValueError: Cannot setitem on a Categorical with a new category, set the categories first
(1 answer)
Closed 3 months ago.
I have a column in a pandas data frame, among other columns, as such:
Remarks
Left_only
Right_only
Left_only
Right_only
For this column, I want to replace all Left_only values with Yesterday and all Right_only values with Today.
I use this line of code:
df.loc[df['Remarks'] == 'Left_only', 'Remarks'] = 'Yesterday'
Similarly for the other one. But I get this error:
Cannot setitem on a Categorical with a new category (Yesterday), set the categories first
What am I doing wrong?
# create a dictionary to map the two values
d = {'Left_only': 'Yesterday', 'Right_only': 'Today'}
df['Remarks'] = df['Remarks'].map(d)
df
0 Yesterday
1 Today
2 Yesterday
3 Today
Name: Remarks, dtype: object
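Note that .map returns a plain object column, dropping the categorical dtype. If the column should stay categorical, the error message's own advice applies: rename the categories instead of assigning new values (a sketch, assuming the column is categorical as the error indicates):
# rename the categories themselves; the dtype stays categorical and
# the "Cannot setitem on a Categorical" error does not occur
df['Remarks'] = df['Remarks'].cat.rename_categories(
    {'Left_only': 'Yesterday', 'Right_only': 'Today'}
)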

Remove all rows with specific ID if other column condition is met [duplicate]

This question already has answers here:
Drop groups in groupby that do not contain an element (Python Pandas)
(1 answer)
Pandas: remove group from the data when a value in the group contains similar value
(4 answers)
Python Pandas - filter groups based on existence of value in group
(1 answer)
Closed last year.
I have a dataframe:
id  country
1   usa
1   mex
1   de
2   br
2   mex
3   usa
I want to remove every id that has any row where country == usa.
Desired output:
id  country
2   br
2   mex
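The linked duplicates cover this; a minimal sketch of the usual approach (the frame construction is mine, not from the original post) is to collect the offending ids and filter them out:
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 3],
                   'country': ['usa', 'mex', 'de', 'br', 'mex', 'usa']})
# ids that have at least one 'usa' row
bad_ids = df.loc[df['country'] == 'usa', 'id'].unique()
# keep only rows whose id never appears with 'usa'
out = df[~df['id'].isin(bad_ids)]
An equivalent one-liner is df.groupby('id').filter(lambda g: (g['country'] != 'usa').all()).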

Pandas dataframe - get column index for minimum value in a row [duplicate]

This question already has an answer here:
Python - Pandas: number/index of the minimum value in the given row
(1 answer)
Closed 2 years ago.
I am trying to get the column index for the lowest value in a row. For example, I have the dataframe
            0           1   Min. dist
0  765.180690  672.136265  672.136265
1  512.437288  542.701564  512.437288
and need the following output
            0           1   Min. dist  ColumnID
0  765.180690  672.136265  672.136265         1
1  512.437288  542.701564  512.437288         0
I've gotten the Min. dist column by using the code df['Min. dist'] = df.min(axis=1)
Can anyone help with this? Thanks
Try using idxmin:
df['ColumnID'] = df.idxmin(axis=1)
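One subtlety: idxmin returns the first column that attains the row minimum, so even though Min. dist ties with the true minimum, columns 0 and 1 come earlier and are the ones reported. To make the result independent of column order, you can restrict idxmin to the distance columns (a sketch; the integer column names 0 and 1 are taken from the example above):
# compute the argmin over the two distance columns only, so the helper
# 'Min. dist' column can never be returned
df['ColumnID'] = df[[0, 1]].idxmin(axis=1)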

Transform the rows to columns and count the occurrences by doing a groupby [duplicate]

This question already has answers here:
Pandas, Pivot table from 2 columns with values being a count of one of those columns
(2 answers)
Most efficient way to melt dataframe with a ton of possible values pandas
(2 answers)
How to form a pivot table on two categorical columns and count for each index?
(2 answers)
Closed 2 years ago.
I am trying to transform the rows and count the occurrences of the values, grouping by the id.
Dataframe:
id value
A cake
A cookie
B cookie
B cookie
C cake
C cake
C cookie
expected:
id  cake  cookie
A      1       1
B      0       2
C      2       1
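The question was closed before an answer was posted here; a minimal sketch in the spirit of the linked pivot/crosstab duplicates (the frame construction is mine, not from the original post):
import pandas as pd

df = pd.DataFrame({'id': ['A', 'A', 'B', 'B', 'C', 'C', 'C'],
                   'value': ['cake', 'cookie', 'cookie', 'cookie',
                             'cake', 'cake', 'cookie']})
# cross-tabulate id against value; absent combinations are filled with 0
out = pd.crosstab(df['id'], df['value']).reset_index()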

How to remove rows that has duplicate values on part of the columns?

I am creating a script that reads an xlsx file into a pandas dataframe and appends new rows to it. However, my problem is that I don't want to add duplicates that have the same values in the first four columns (the frame has five columns overall). The fifth column's value can be anything, but based on duplicates in these four columns I would like to delete the whole row.
My code is fully functional apart from this. I could do this by looping over the dataframe, but I believe there is a smarter way to do it.
Example of the data below. How can I delete the last row, when it has the same first four columns as row 4 but a different fifth column?
Category Year Week Price Amount
0 1 2019 27 2 1
1 1 2019 28 3 2
2 1 2019 29 4 3
3 2 2019 29 4 4
4 3 2019 30 5 3
5 3 2019 30 5 4
Part of the code:
# Append new rows to dataframe
file_df = file_df.append(new_rows, sort=False, ignore_index=True)
# Delete duplicate rows
combined_df = combined_df.drop_duplicates()
This code currently removes only rows whose values match in every column. I could not find a smart solution for removing these partial duplicates. Please correct me if the question is not relevant.
Try drop_duplicates and set the subset parameter to the columns whose values you want to compare:
df.drop_duplicates(subset=['Category', 'Year', 'Week', 'Price'], inplace=True)
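Two caveats worth noting. drop_duplicates keeps the first occurrence by default, so after appending, the pre-existing row wins; pass keep='last' if the newly appended version should survive instead (a sketch, reusing the asker's file_df name):
# prefer the newly appended version of a partial duplicate
# (default keep='first' would keep the original row instead)
file_df = file_df.drop_duplicates(
    subset=['Category', 'Year', 'Week', 'Price'], keep='last')
Also, DataFrame.append was removed in pandas 2.0; pd.concat([file_df, new_rows], ignore_index=True, sort=False) is the current replacement.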
