Remove duplicates if column values are the same after swapping in Pandas - python

Let's say I have the following dataframe:
df = pd.DataFrame({'name':['john','mary','peter','jeff','bill'], 'matched_name':['mary','john','jeff','lisa','jose'], 'ratio':[78, 78, 22, 19, 45]})
print(df)
name matched_name ratio
0 john mary 78
1 mary john 78
2 peter jeff 22
3 jeff lisa 19
4 bill jose 45
I want to remove duplicated rows based on this condition: if the values in columns name and matched_name are the same after swapping them, and ratio is also the same, then those rows are considered duplicates.
Under this rule, row 0 and row 1 are duplicates, so I want to keep only row 0. How can I do this using Pandas? Thanks.
This is the expected result:
name matched_name ratio
0 john mary 78
1 peter jeff 22
2 jeff lisa 19
3 bill jose 45

Use np.sort to sort the values within each row, add the ratio column, and test for duplicates with DataFrame.duplicated. Finally, filter with the inverted mask (~) using boolean indexing:
m = (pd.DataFrame(np.sort(df[['name', 'matched_name']], axis=1), index=df.index)
       .assign(ratio=df['ratio'])
       .duplicated())
df = df[~m]
print(df)
name matched_name ratio
0 john mary 78
2 peter jeff 22
3 jeff lisa 19
4 bill jose 45
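For a self-contained version of the approach above, the following sketch rebuilds the sample frame and wraps the steps in a helper. The function name drop_swapped_duplicates and its parameters are made up for illustration; they are not part of pandas.

```python
import numpy as np
import pandas as pd


def drop_swapped_duplicates(df, pair_cols, extra_cols):
    """Hypothetical helper: drop rows that repeat an earlier row once the
    values in pair_cols are put into a fixed order within each row."""
    # Sort the pair columns within each row so (john, mary) == (mary, john).
    key = pd.DataFrame(np.sort(df[pair_cols].to_numpy(), axis=1), index=df.index)
    # Add the other columns that must also match (here: ratio).
    for col in extra_cols:
        key[col] = df[col]
    # duplicated() flags every repeat after the first occurrence.
    return df[~key.duplicated()]


df = pd.DataFrame({
    "name": ["john", "mary", "peter", "jeff", "bill"],
    "matched_name": ["mary", "john", "jeff", "lisa", "jose"],
    "ratio": [78, 78, 22, 19, 45],
})

result = drop_swapped_duplicates(df, ["name", "matched_name"], ["ratio"])
print(result)
```

Because the key frame reuses df.index, the mask lines up with df even when the index is not the default 0..n-1.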

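Try the following, which sorts each row's values as strings, drops the duplicate rows, and keeps the rows whose index labels survive (this assumes df has the default integer index):
m = pd.DataFrame(np.sort(df.astype(str).values, axis=1)).drop_duplicates().index
df = df.loc[df.index.isin(m)].reset_index()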
print(df)
index name matched_name ratio
0 0 john mary 78
1 2 peter jeff 22
2 3 jeff lisa 19
3 4 bill jose 45

Related

Identify value based on matching Rows and Columns across two Dataframes

I'm very new to Python and I have two dataframes. I'm trying to match the names (the columns of Dataframe 1) with the rows of Dataframe 2 and collect the value for the year 2022, with the desired output looking like Dataframe 3. I've looked through other questions but haven't found anything that helps. Any help would be greatly appreciated!
Dataframe 1 - Money
Date Alex Rob Kev Ben
2022 29 45 65 12
2021 11 32 11 19
2019 45 12 22 76
Dataframe 2
Name
James
Alex
Carl
Rob
Kev
Dataframe 3
Name Amount
James
Alex 29
Carl
Rob 45
Kev 65
There are many different ways to achieve this.
One option is using map:
s = df1.set_index('Date').loc[2022]
df2['Amount'] = df2['Name'].map(s)
output:
Name Amount
0 James NaN
1 Alex 29.0
2 Carl NaN
3 Rob 45.0
4 Kev 65.0
Another option is using merge:
s = df1.set_index('Date').loc[2022]
df3 = df2.merge(s.rename('Amount'), left_on='Name', right_index=True, how='left')
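To compare the two options above side by side, here is a small runnable sketch that rebuilds the frames from the question. The variable names via_map and via_merge are only for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Date": [2022, 2021, 2019],
    "Alex": [29, 11, 45],
    "Rob": [45, 32, 12],
    "Kev": [65, 11, 22],
    "Ben": [12, 19, 76],
})
df2 = pd.DataFrame({"Name": ["James", "Alex", "Carl", "Rob", "Kev"]})

# Row for 2022 as a Series whose index holds the names (Alex, Rob, Kev, Ben).
s = df1.set_index("Date").loc[2022]

# Option 1: look each name up in the Series; unknown names become NaN.
via_map = df2.assign(Amount=df2["Name"].map(s))

# Option 2: left-merge on the Series index; unknown names also become NaN.
via_merge = df2.merge(s.rename("Amount"), left_on="Name", right_index=True, how="left")

print(via_map)
print(via_merge)
```

Both options leave James and Carl as NaN, which is why the Amount column comes back as float rather than int.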

Get the top 2 values for each unique value in another column

I have a DataFrame like this:
student marks term
steve 55 1
jordan 66 2
steve 53 1
alan 74 2
jordan 99 1
steve 81 2
alan 78 1
alan 76 2
jordan 48 1
I would like to return the highest two scores for each student:
student marks term
steve 81 2
steve 55 1
jordan 99 1
jordan 66 2
alan 78 1
alan 76 2
I have tried
df = df.groupby('student')['marks'].max()
but it returns 1 row, I would like each student in the order they are mentioned with top two scores.
You could use groupby + nlargest to find the 2 largest values; then use loc to sort in the order they appear in df:
out = (df.groupby('student')['marks'].nlargest(2)
         .droplevel(1)
         .loc[df['student'].drop_duplicates()]
         .reset_index())
Output:
student marks
0 steve 81
1 steve 55
2 jordan 99
3 jordan 66
4 alan 78
5 alan 76
If you want to keep "terms" as well, you could use the index:
idx = df.groupby('student')['marks'].nlargest(2).index.get_level_values(1)
out = df.loc[idx].set_index('student').loc[df['student'].drop_duplicates()].reset_index()
Output:
student marks term
0 steve 81 2
1 steve 55 1
2 jordan 99 1
3 jordan 66 2
4 alan 78 1
5 alan 76 2
@sammywemmy suggested a better way to derive the second result:
out = (df.loc[df.groupby('student', sort=False)['marks'].nlargest(2)
                .index.get_level_values(1)]
         .reset_index(drop=True))
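As a runnable check of the index-based approach above, this sketch rebuilds the sample data from the question and applies the same steps. The intermediate name top2 is only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "student": ["steve", "jordan", "steve", "alan", "jordan",
                "steve", "alan", "alan", "jordan"],
    "marks": [55, 66, 53, 74, 99, 81, 78, 76, 48],
    "term": [1, 2, 1, 2, 1, 2, 1, 2, 1],
})

# sort=False keeps the groups in order of first appearance (steve, jordan, alan).
top2 = df.groupby("student", sort=False)["marks"].nlargest(2)

# Level 1 of the resulting MultiIndex holds the original row labels,
# so it can be used to pull the full rows (including term) back out of df.
out = df.loc[top2.index.get_level_values(1)].reset_index(drop=True)
print(out)
```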
You should use:
df = df.groupby(['student', 'term'])['marks'].max()
(with an optional .reset_index() )
Sorting before grouping should suffice, since you need to keep the term column:
df.sort_values('marks').groupby('student', sort = False).tail(2)
student marks term
0 steve 55 1
1 jordan 66 2
7 alan 76 2
6 alan 78 1
5 steve 81 2
4 jordan 99 1

One to many comparison of data in two different DFs

I have two Dataframes
df1
fname lname age
0 Jack Lee 45
1 Joy Kay 34
2 Jeff Kim 54
3 Josh Chris 29
df2
fname lname Position
0 Jack Ray 25
1 Chris Kay 34
2 Kim Xi 34
3 Josh McC 24
4 David Lee 56
5 Aron Dev 41
6 Jack Lee 45
7 Shane Gab 43
8 Joy Kay 34
9 Jack Lee 45
I want to compare fname and lname across the two dataframes and append the matches to a list, since an entry from df1 may appear multiple times in df2.
(For example, the first row of df1, Jack Lee, is present in rows 6 and 9 of df2.)
I am not clear on how to take one row from df1 and compare it with all the rows of df2 (a one-to-many comparison).
Please assist.
Using pd.merge() with indicator=True will return a clear comparison between the two dataframes based on the columns 'fname' and 'lname':
df = pd.merge(df2,
              df1[['fname', 'lname']],
              on=['fname', 'lname'],
              how='left',
              indicator=True)
print(df)
prints:
fname lname Position _merge
0 Jack Ray 25 left_only
1 Chris Kay 34 left_only
2 Kim Xi 34 left_only
3 Josh McC 24 left_only
4 David Lee 56 left_only
5 Aron Dev 41 left_only
6 Jack Lee 45 both
7 Shane Gab 43 left_only
8 Joy Kay 34 both
9 Jack Lee 45 both
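Since the question asks for the matches collected in a list, one possible follow-up is to merge on the names while carrying both original row labels along, then group them. This is only a sketch: the helper column names df1_row and df2_row and the variable matches are assumptions introduced here, not something from the answer above.

```python
import pandas as pd

df1 = pd.DataFrame({
    "fname": ["Jack", "Joy", "Jeff", "Josh"],
    "lname": ["Lee", "Kay", "Kim", "Chris"],
    "age": [45, 34, 54, 29],
})
df2 = pd.DataFrame({
    "fname": ["Jack", "Chris", "Kim", "Josh", "David",
              "Aron", "Jack", "Shane", "Joy", "Jack"],
    "lname": ["Ray", "Kay", "Xi", "McC", "Lee",
              "Dev", "Lee", "Gab", "Kay", "Lee"],
    "Position": [25, 34, 34, 24, 56, 41, 45, 43, 34, 45],
})

# Keep each frame's row label as an ordinary column so it survives the merge.
left = df1.reset_index().rename(columns={"index": "df1_row"})
right = df2.reset_index().rename(columns={"index": "df2_row"})

# An inner merge yields one row per (df1 row, matching df2 row) pair.
pairs = left.merge(right, on=["fname", "lname"], how="inner")

# For every df1 row that has a match, collect the matching df2 row labels.
matches = pairs.groupby("df1_row")["df2_row"].apply(list).to_dict()
print(matches)
```

Rows of df1 with no match in df2 (Jeff Kim and Josh Chris here) simply do not appear as keys in matches.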

Conditionally populating a column with a row value in pandas

I have a data set of wages for male and female workers, identified by their names.
Male Female Male_Wage Female_Wage
James Lori 8 9
Mike Nancy 10 8
Ron Cathy 11 12
Jon Ruth 15 9
Jason Jackie 10 10
In pandas, I would like to create a new column in the data frame that displays the name of the highest-paid person in each row. If both are paid the same, the value should be Same.
Male Female Male_Wage Female_Wage Highest_Paid
James Lori 8 9 Lori
Mike Nancy 10 8 Mike
Ron Cathy 11 12 Cathy
Jon Ruth 15 9 Jon
Jason Jackie 10 10 Same
I have been able to add a column and populate it with values, calculate a value based on other columns, and so on. What I can't work out is how to fill the new column conditionally based on the values of other columns; handling the case where the wages are equal (Same) is what causes me trouble. I have searched quite a bit and have not found an answer that covers all the elements of this situation.
Thanks for the help.
You can do this with three loc assignments, one per case:
df.loc[df['Male_Wage'] == df['Female_Wage'], 'Highest_Paid'] = 'Same'
df.loc[df['Male_Wage'] > df['Female_Wage'], 'Highest_Paid'] = df['Male']
df.loc[df['Male_Wage'] < df['Female_Wage'], 'Highest_Paid'] = df['Female']
Method 1: np.select:
We can specify our conditions and, based on those conditions, take the value of Male or Female; rows that match neither condition fall back to default='Same':
conditions = [
df['Male_Wage'] > df['Female_Wage'],
df['Female_Wage'] > df['Male_Wage']
]
choices = [df['Male'], df['Female']]
df['Highest_Paid'] = np.select(conditions, choices, default='Same')
Male Female Male_Wage Female_Wage Highest_Paid
0 James Lori 8 9 Lori
1 Mike Nancy 10 8 Mike
2 Ron Cathy 11 12 Cathy
3 Jon Ruth 15 9 Jon
4 Jason Jackie 10 10 Same
Method 2: np.where + loc
Using np.where and .loc to conditionally assign the correct value:
df['Highest_Paid'] = np.where(df['Male_Wage'] > df['Female_Wage'],
df['Male'],
df['Female'])
df.loc[df['Male_Wage'] == df['Female_Wage'], 'Highest_Paid'] = 'Same'
Male Female Male_Wage Female_Wage Highest_Paid
0 James Lori 8 9 Lori
1 Mike Nancy 10 8 Mike
2 Ron Cathy 11 12 Cathy
3 Jon Ruth 15 9 Jon
4 Jason Jackie 10 10 Same
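To try both methods end to end, here is a self-contained sketch that rebuilds the sample data and checks that they agree. The names male_higher, female_higher, via_select and via_where are only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Male": ["James", "Mike", "Ron", "Jon", "Jason"],
    "Female": ["Lori", "Nancy", "Cathy", "Ruth", "Jackie"],
    "Male_Wage": [8, 10, 11, 15, 10],
    "Female_Wage": [9, 8, 12, 9, 10],
})

male_higher = df["Male_Wage"] > df["Female_Wage"]
female_higher = df["Female_Wage"] > df["Male_Wage"]

# Method 1: np.select takes the first matching condition per row;
# rows matching no condition (equal wages) get the default.
via_select = np.select([male_higher, female_higher],
                       [df["Male"], df["Female"]],
                       default="Same")

# Method 2: np.where picks Male or Female, then ties are overwritten.
via_where = pd.Series(np.where(male_higher, df["Male"], df["Female"]), index=df.index)
via_where[df["Male_Wage"] == df["Female_Wage"]] = "Same"

df["Highest_Paid"] = via_select
print(df)
```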

How to sort a multiindex pandas dataframe pivot table by the totals of an agg='sum' at each level of the index

Example generated with something like
df = pd.pivot_table(df, index=['name1','name2','name3'], values='total', aggfunc='sum')
No logical sort initially
name1 name2 name3 total
Bob Mario Luigi 5
John Dan 16
Dave Tom Jim 2
Joe 6
Jack 3
Jill Frank 6
Kevin 7
Should become
name1 name2 name3 total
Dave Jill Kevin 7
Frank 6
Tom Joe 6
Jack 3
Jim 2
Bob John Dan 16
Mario Luigi 5
Dave is on top because his "total of totals" (24) is higher than Bob's 21. This propagates to each subsequent index level as well, so Jill's 13 > Tom's 11, and so on. I've been messing around with groupby(), sort_values(), and sort_index() and have determined I don't really know what I'm doing.
What you could do is create additional columns and then do a multicolumn sort.
For the additional column, transform will return a Series with the index aligned to the df so you can then add it as a new column:
import pandas as pd

mydf = pd.DataFrame({
    "name1": ["bob", "bob", "dave", "dave", "dave", "dave", "dave"],
    "name2": ["mario", "john", "tom", "tom", "tom", "jill", "jill"],
    "name3": ["luigi", "dan", "jim", "joe", "jack", "frank", "kevin"],
    "total": [5, 16, 2, 6, 3, 6, 7],
})
# Total at each level; grouping on the full prefix keeps equal names under different parents apart.
mydf["tot1"] = mydf.groupby("name1")["total"].transform("sum")
mydf["tot2"] = mydf.groupby(["name1", "name2"])["total"].transform("sum")
mydf["tot3"] = mydf.groupby(["name1", "name2", "name3"])["total"].transform("sum")
mydf.sort_values(by=["tot1", "tot2", "tot3"], ascending=[False, False, False])
Which yields:
name1 name2 name3 total tot1 tot2 tot3
6 dave jill kevin 7 24 13 7
5 dave jill frank 6 24 13 6
3 dave tom joe 6 24 11 6
4 dave tom jack 3 24 11 3
2 dave tom jim 2 24 11 2
1 bob john dan 16 21 16 16
0 bob mario luigi 5 21 5 5
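To get back to the MultiIndex layout from the question, the same idea can be applied to the pivot table itself: flatten it, sort on per-level totals, drop the helper columns, and restore the index. The following is a sketch under the assumption that each level's total should be computed within its parent level; the names raw, flat and result are made up for illustration.

```python
import pandas as pd

raw = pd.DataFrame({
    "name1": ["Bob", "Bob", "Dave", "Dave", "Dave", "Dave", "Dave"],
    "name2": ["Mario", "John", "Tom", "Tom", "Tom", "Jill", "Jill"],
    "name3": ["Luigi", "Dan", "Jim", "Joe", "Jack", "Frank", "Kevin"],
    "total": [5, 16, 2, 6, 3, 6, 7],
})
pivot = pd.pivot_table(raw, index=["name1", "name2", "name3"],
                       values="total", aggfunc="sum")

# Flatten the index so the levels can be used as grouping columns.
flat = pivot.reset_index()
# Helper totals: one per name1, one per (name1, name2) pair.
flat["tot1"] = flat.groupby("name1")["total"].transform("sum")
flat["tot2"] = flat.groupby(["name1", "name2"])["total"].transform("sum")

# Sort by the outer total, then the inner total, then the row's own total.
result = (flat.sort_values(["tot1", "tot2", "total"], ascending=False)
              .drop(columns=["tot1", "tot2"])
              .set_index(["name1", "name2", "name3"]))
print(result)
```

Avoid calling sort_index() on the result afterwards, since that would put the rows back into alphabetical order.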
