Conditionally populating a column with a row value in pandas - python

I have a data set of wages for male and female workers, identified by their names.
Male Female Male_Wage Female_Wage
James Lori 8 9
Mike Nancy 10 8
Ron Cathy 11 12
Jon Ruth 15 9
Jason Jackie 10 10
In pandas I would like to create a new column in the data frame that displays the name of the person who is paid the most. If both are paid the same, the value should be Same.
Male Female Male_Wage Female_Wage Highest_Paid
James Lori 8 9 Lori
Mike Nancy 10 8 Mike
Ron Cathy 11 12 Cathy
Jon Ruth 15 9 Jon
Jason Jackie 10 10 Same
I have been able to add a column and populate it with values, calculate a value based on other columns, etc., but I can't work out how to fill the new column conditionally based on the value of another column. The extra condition of Same when the wages are equal is what is causing me trouble. I have searched for an answer quite a bit and have not found anything that covers all the elements of this situation.
Thanks for the help.

You can do this with three .loc assignments:
df.loc[df['Male_Wage'] == df['Female_Wage'], 'Highest_Paid'] = 'Same'
df.loc[df['Male_Wage'] > df['Female_Wage'], 'Highest_Paid'] = df['Male']
df.loc[df['Male_Wage'] < df['Female_Wage'], 'Highest_Paid'] = df['Female']
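For anyone who wants to run this end to end, here is a minimal sketch that rebuilds the sample frame and applies the same three .loc assignments. The frame construction is my own reconstruction of the table posted in the question, not code from the question itself.

```python
import pandas as pd

# Sample data reconstructed from the table in the question
df = pd.DataFrame({
    "Male": ["James", "Mike", "Ron", "Jon", "Jason"],
    "Female": ["Lori", "Nancy", "Cathy", "Ruth", "Jackie"],
    "Male_Wage": [8, 10, 11, 15, 10],
    "Female_Wage": [9, 8, 12, 9, 10],
})

# Ties first, then whichever side earns more. The right-hand Series is
# aligned on the index, so only the rows selected by the mask are written.
df.loc[df["Male_Wage"] == df["Female_Wage"], "Highest_Paid"] = "Same"
df.loc[df["Male_Wage"] > df["Female_Wage"], "Highest_Paid"] = df["Male"]
df.loc[df["Male_Wage"] < df["Female_Wage"], "Highest_Paid"] = df["Female"]

print(df)
```

The three masks are mutually exclusive and cover every row, so the order of the assignments does not matter here.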

Method 1: np.select:
We can specify our conditions and, based on those conditions, pick the value of Male or Female; rows matching neither condition fall back to default='Same'.
conditions = [
df['Male_Wage'] > df['Female_Wage'],
df['Female_Wage'] > df['Male_Wage']
]
choices = [df['Male'], df['Female']]
df['Highest_Paid'] = np.select(conditions, choices, default='Same')
Male Female Male_Wage Female_Wage Highest_Paid
0 James Lori 8 9 Lori
1 Mike Nancy 10 8 Mike
2 Ron Cathy 11 12 Cathy
3 Jon Ruth 15 9 Jon
4 Jason Jackie 10 10 Same
Method 2: np.where + loc
Using np.where and .loc to conditionally assign the correct value:
df['Highest_Paid'] = np.where(df['Male_Wage'] > df['Female_Wage'],
                              df['Male'],
                              df['Female'])
df.loc[df['Male_Wage'] == df['Female_Wage'], 'Highest_Paid'] = 'Same'
Male Female Male_Wage Female_Wage Highest_Paid
0 James Lori 8 9 Lori
1 Mike Nancy 10 8 Mike
2 Ron Cathy 11 12 Cathy
3 Jon Ruth 15 9 Jon
4 Jason Jackie 10 10 Same
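A third variant, not part of the answer above, is to nest two np.where calls so the tie case is handled in the same expression. This is only a sketch on the sample data (the frame construction is my reconstruction of the question's table), with a cross-check against np.select.

```python
import numpy as np
import pandas as pd

# Sample data reconstructed from the table in the question
df = pd.DataFrame({
    "Male": ["James", "Mike", "Ron", "Jon", "Jason"],
    "Female": ["Lori", "Nancy", "Cathy", "Ruth", "Jackie"],
    "Male_Wage": [8, 10, 11, 15, 10],
    "Female_Wage": [9, 8, 12, 9, 10],
})

male_more = df["Male_Wage"] > df["Female_Wage"]
female_more = df["Female_Wage"] > df["Male_Wage"]

# Outer where: male earns more; inner where: female earns more, else a tie
df["Highest_Paid"] = np.where(
    male_more,
    df["Male"],
    np.where(female_more, df["Female"], "Same"),
)

# Cross-check against the np.select formulation
expected = np.select([male_more, female_more],
                     [df["Male"], df["Female"]],
                     default="Same")
matches = bool((df["Highest_Paid"].to_numpy() == expected).all())
print(df)
print("agrees with np.select:", matches)
```

With only two conditions the nesting stays readable; with more branches np.select is usually easier to follow.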

Related

One to many comparison of data in two different DFs

I have two Dataframes
df1
  fname lname age
0 Jack Lee 45
1 Joy Kay 34
2 Jeff Kim 54
3 Josh Chris 29
df2
  fname lname Position
0 Jack Ray 25
1 Chris Kay 34
2 Kim Xi 34
3 Josh McC 24
4 David Lee 56
5 Aron Dev 41
6 Jack Lee 45
7 Shane Gab 43
8 Joy Kay 34
9 Jack Lee 45
I want to compare fname and lname across the two dfs and append the matches to a list, since entries from df1 may be repeated multiple times in df2.
(Ex. data of row 1 in df1 is present in row 6 and 9 of df2.)
I'm not very clear on how to fetch one row from df1 and compare it with all the rows of df2 (a one-to-many comparison).
Please assist me with this.
Using pd.merge() with indicator=True will return a clear comparison between the two dataframes based on the columns 'fname' and 'lname':
df = pd.merge(df2,
df1[['fname','lname']],
on=['fname','lname'],
how='left',
indicator=True)
prints:
print(df)
fname lname Position _merge
0 Jack Ray 25 left_only
1 Chris Kay 34 left_only
2 Kim Xi 34 left_only
3 Josh McC 24 left_only
4 David Lee 56 left_only
5 Aron Dev 41 left_only
6 Jack Lee 45 both
7 Shane Gab 43 left_only
8 Joy Kay 34 both
9 Jack Lee 45 both
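Since the question asks for the matches as a list, here is a small sketch that rebuilds both frames from the posted tables and pulls the 'both' rows out of the merge result. The variable names (keys, merged, matched_rows, matched_positions) are mine, and the drop_duplicates step is an extra precaution I added in case df1 ever contains a repeated name.

```python
import pandas as pd

# Frames reconstructed from the tables in the question
df1 = pd.DataFrame({
    "fname": ["Jack", "Joy", "Jeff", "Josh"],
    "lname": ["Lee", "Kay", "Kim", "Chris"],
    "age": [45, 34, 54, 29],
})
df2 = pd.DataFrame({
    "fname": ["Jack", "Chris", "Kim", "Josh", "David",
              "Aron", "Jack", "Shane", "Joy", "Jack"],
    "lname": ["Ray", "Kay", "Xi", "McC", "Lee",
              "Dev", "Lee", "Gab", "Kay", "Lee"],
    "Position": [25, 34, 34, 24, 56, 41, 45, 43, 34, 45],
})

# Unique name pairs from df1, so a repeated name cannot multiply df2 rows
keys = df1[["fname", "lname"]].drop_duplicates()
merged = pd.merge(df2, keys, on=["fname", "lname"],
                  how="left", indicator=True)

# Keep only the df2 rows whose name pair also exists in df1
matched = merged[merged["_merge"] == "both"]
matched_rows = matched[["fname", "lname"]].values.tolist()
matched_positions = matched.index.tolist()

print(matched_rows)
print(matched_positions)
```

Because the merge is a left join on unique keys, the row labels of merged line up with the rows of df2, so matched_positions tells you where in df2 each match sits.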

How to replace the names of rows in a dataframe using to_replace and regex in Python?

So I have a dataframe and a Series of gender values. The gender Series looks like this:
Male 615
male 206
Female 121
M 116
female 62
F 38
m 34
f 15
Make 4
Woman 3
Male 3
Female 2
Man 2
Cis Male 2
Female (trans) 2
Neuter 1
something kinda male? 1
Femake 1
I'm trying to use regex to change all female-related keywords to "Female". I thought I could do this with:
survey['Gender'].replace(to_replace=r'(?i)\bfemale\b', value='Female', regex=True)
But for some reason this did not change all the data containing the 'female' keyword, such as 'Female (trans)', even though I checked the pattern in a regex tester and it does match 'Female (trans)'.
Another thing I tried was replace with a dictionary, but I found that somewhat inconvenient.
If I want to replace all those "female"-related keywords such as 'f', 'femake', 'Female (trans)', how should I do this? What kind of functions should I look into? What would be the most efficient way of doing this?
Use re.IGNORECASE to make the match case-insensitive.
import re
import numpy as np
obj = survey['Gender']
survey['gender'] = np.where(obj.str.contains('f|W', flags=re.IGNORECASE), 'Female', 'Male')
Gender gender
0 Male Male
1 male Male
2 Female Female
3 M Male
4 female Female
5 F Female
6 m Male
7 f Female
8 Make Male
9 Woman Female
10 Male Male
11 Female Female
12 Man Male
13 Cis Male Male
14 Female (trans) Female
15 Neuter Male
16 something kinda male? Male
17 Femake Female
Handle the exceptions:
cond = obj.str.contains('Make|Neuter|Femake|kinda')
# obj[cond]
survey.loc[cond, 'gender'] = 'other'
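On why the original replace call appeared to skip 'Female (trans)': as far as I can tell, with regex=True replace substitutes only the matched substring, so 'Female' inside 'Female (trans)' is swapped for 'Female' and the ' (trans)' suffix stays. The call also returns a new Series, so the result has to be assigned back. The sketch below shows the difference on a small subset of the values from the question; the anchored pattern is my own suggestion, not something from the answer above.

```python
import pandas as pd

# A small subset of the values listed in the question
s = pd.Series(["Male", "female", "Female (trans)", "F", "Femake"])

# Substring replacement: only the matched word is rewritten
partial = s.replace(to_replace=r"(?i)\bfemale\b", value="Female", regex=True)

# Anchored pattern: the match spans the whole cell, so the whole cell
# is replaced whenever the word "female" appears anywhere in it
whole = s.replace(to_replace=r"(?i)^.*\bfemale\b.*$", value="Female", regex=True)

print(partial.tolist())
print(whole.tolist())
```

Neither pattern touches 'F' or 'Femake', since they do not contain the whole word; values like those still need an explicit list or the str.contains approach shown above.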

Remove duplicates if columns values are same after exchange in Pandas

Let's say I have the following dataframe:
df = pd.DataFrame({'name':['john','mary','peter','jeff','bill'], 'matched_name':['mary','john','jeff','lisa','jose'], 'ratio':[78, 78, 22, 19, 45]})
print(df)
name matched_name ratio
0 john mary 78
1 mary john 78
2 peter jeff 22
3 jeff lisa 19
4 bill jose 45
I want to remove duplicated rows based on this condition: if two rows have the same values in name and matched_name once those two cells are swapped, and the same ratio, they are considered duplicates.
Under these rules, row 0 and row 1 are duplicates, so I want to keep only row 0. How can I do this with pandas? Thanks.
This is expected result:
name matched_name ratio
0 john mary 78
1 peter jeff 22
2 jeff lisa 19
3 bill jose 45
Use np.sort to sort the values within each row, add the ratio column, and test for duplicates with DataFrame.duplicated. Finally, filter with the inverted mask (~) using boolean indexing:
m = (pd.DataFrame(np.sort(df[['name', 'matched_name']], axis=1), index=df.index)
.assign(ratio=df['ratio'])
.duplicated())
df = df[~m]
print (df)
name matched_name ratio
0 john mary 78
2 peter jeff 22
3 jeff lisa 19
4 bill jose 45
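Here is the same idea as a self-contained sketch, split into named steps so each one can be inspected. The intermediate names (pair, key, deduped) and the placeholder column labels "a" and "b" are mine.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["john", "mary", "peter", "jeff", "bill"],
    "matched_name": ["mary", "john", "jeff", "lisa", "jose"],
    "ratio": [78, 78, 22, 19, 45],
})

# Sort the two name cells within each row, so (john, mary) and
# (mary, john) both become (john, mary)
pair = np.sort(df[["name", "matched_name"]].to_numpy(), axis=1)

# Rebuild a frame on the original index and attach ratio to the key
key = pd.DataFrame(pair, index=df.index, columns=["a", "b"]).assign(ratio=df["ratio"])

# duplicated() marks every repeat after the first occurrence
deduped = df[~key.duplicated()]
print(deduped)
```

Note that (peter, jeff) and (jeff, lisa) are kept: they share a name but are not the same pair.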
Alternatively, try the following, which sorts all the values in each row as strings and drops the duplicate rows:
m=pd.DataFrame(np.sort(df.astype(str).values,axis=1)).drop_duplicates().index
df=df.loc[df.index.isin(m)].reset_index()
print(df)
index name matched_name ratio
0 0 john mary 78
1 2 peter jeff 22
2 3 jeff lisa 19
3 4 bill jose 45

How to sort a multiindex pandas dataframe pivot table by the totals of an agg='sum' at each level of the index

Example generated with something like
df = pd.pivot_table(df, index=['name1','name2','name3'], values='total', aggfunc='sum')
No logical sort initially
name1 name2 name3 total
Bob Mario Luigi 5
John Dan 16
Dave Tom Jim 2
Joe 6
Jack 3
Jill Frank 6
Kevin 7
Should become
name1 name2 name3 total
Dave Jill Kevin 7
Frank 6
Tom Joe 6
Jack 3
Jim 2
Bob John Dan 16
Mario Luigi 5
Where Dave is on top because his "total of totals" of 24 is higher than Bob's 21. It propagates to each subsequent index level as well, so Jill's 13 > Tom's 11, etc. I've been messing around with groupby(), sort_values() and sort_index() and have determined I don't really know what I'm doing.
What you could do is create additional columns and then do a multicolumn sort.
For the additional column, transform will return a Series with the index aligned to the df so you can then add it as a new column:
from pandas import DataFrame
mydf = DataFrame({
    "name1": ["bob", "bob", "dave", "dave", "dave", "dave", "dave"],
    "name2": ["mario", "john", "tom", "tom", "tom", "jill", "jill"],
    "name3": ["luigi", "dan", "jim", "joe", "jack", "frank", "kevin"],
    "total": [5, 16, 2, 6, 3, 6, 7],
})
# Total per name at each level, broadcast back onto the rows by transform
mydf["tot1"] = mydf["total"].groupby(mydf["name1"]).transform("sum")
mydf["tot2"] = mydf["total"].groupby(mydf["name2"]).transform("sum")
mydf["tot3"] = mydf["total"].groupby(mydf["name3"]).transform("sum")
# Sort by the level totals, largest first
mydf.sort_values(by=["tot1", "tot2", "tot3"], ascending=[False, False, False])
Which yields:
name1 name2 name3 total tot1 tot2 tot3
6 dave jill kevin 7 24 13 7
5 dave jill frank 6 24 13 6
3 dave tom joe 6 24 11 6
4 dave tom jack 3 24 11 3
2 dave tom jim 2 24 11 2
1 bob john dan 16 21 16 16
0 bob mario luigi 5 21 5 5
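One caveat with grouping each level on its own column: if the same name2 value appeared under two different name1 values, their totals would be pooled. The sketch below is a variation that groups the second level on the (name1, name2) pair and then puts the names back into the index. It gives the same order on this sample data; the variable names are my own.

```python
import pandas as pd

mydf = pd.DataFrame({
    "name1": ["bob", "bob", "dave", "dave", "dave", "dave", "dave"],
    "name2": ["mario", "john", "tom", "tom", "tom", "jill", "jill"],
    "name3": ["luigi", "dan", "jim", "joe", "jack", "frank", "kevin"],
    "total": [5, 16, 2, 6, 3, 6, 7],
})

# Level totals, with the second level scoped to its parent name1
mydf["tot1"] = mydf.groupby("name1")["total"].transform("sum")
mydf["tot2"] = mydf.groupby(["name1", "name2"])["total"].transform("sum")

# Largest totals first at every level; the row's own total breaks ties last
ordered = mydf.sort_values(by=["tot1", "tot2", "total"], ascending=False)

# Back to a pivot-like shape: names in the index, only the total column left
result = ordered.set_index(["name1", "name2", "name3"])[["total"]]
print(result)
```

Dropping the helper columns at the end keeps the output in the same shape as the original pivot table.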

python pandas groupby sort rank/top n

I have a dataframe that is grouped by state and aggregated to total revenue, ignoring sector and name. I would now like to break the underlying dataset out to show state, sector, name and the top 2 rows by revenue, with the states in a specific order (I have created an index from a previous dataframe that lists the states in that order). Using the example below, I would like to use my sorted index (Kentucky, California, New York) and list only the top two results per state, ordered by Revenue within each state:
Dataset:
State Sector Name Revenue
California 1 Tom 10
California 2 Harry 20
California 3 Roger 30
California 2 Jim 40
Kentucky 2 Bob 15
Kentucky 1 Roger 25
Kentucky 3 Jill 45
New York 1 Sally 50
New York 3 Harry 15
End Goal Dataframe:
State Sector Name Revenue
Kentucky 3 Jill 45
Kentucky 1 Roger 25
California 2 Jim 40
California 3 Roger 30
New York 1 Sally 50
New York 3 Harry 15
You could use a groupby in conjunction with apply:
df.groupby('State').apply(lambda grp: grp.nlargest(2, 'Revenue'))
Output:
Sector Name Revenue
State State
California California 2 Jim 40
California 3 Roger 30
Kentucky Kentucky 3 Jill 45
Kentucky 1 Roger 25
New York New York 1 Sally 50
New York 3 Harry 15
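Then, after assigning that result back to df, you can drop the first level of the MultiIndex to get the result you're after:
df = df.groupby('State').apply(lambda grp: grp.nlargest(2, 'Revenue'))
df.index = df.index.droplevel()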
Output:
Sector Name Revenue
State
California 2 Jim 40
California 3 Roger 30
Kentucky 3 Jill 45
Kentucky 1 Roger 25
New York 1 Sally 50
New York 3 Harry 15
You can also use sort_values followed by groupby + head:
df.sort_values('Revenue',ascending=False).groupby('State').head(2)
State Sector Name Revenue
7 NewYork 1 Sally 50
6 Kentucky 3 Jill 45
3 California 2 Jim 40
2 California 3 Roger 30
5 Kentucky 1 Roger 25
8 NewYork 3 Harry 15
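Neither output above follows the custom state order from the question (Kentucky, California, New York). One way to get it, sketched here under the assumption that the order is available as a plain list (state_order is a name I made up), is to turn State into an ordered categorical after taking the top two rows per state and then sort on it.

```python
import pandas as pd

# Data reconstructed from the table in the question
df = pd.DataFrame({
    "State": ["California"] * 4 + ["Kentucky"] * 3 + ["New York"] * 2,
    "Sector": [1, 2, 3, 2, 2, 1, 3, 1, 3],
    "Name": ["Tom", "Harry", "Roger", "Jim", "Bob",
             "Roger", "Jill", "Sally", "Harry"],
    "Revenue": [10, 20, 30, 40, 15, 25, 45, 50, 15],
})

# Hypothetical stand-in for the pre-sorted index mentioned in the question
state_order = ["Kentucky", "California", "New York"]

# Top two rows per state by revenue
top2 = df.sort_values("Revenue", ascending=False).groupby("State").head(2).copy()

# An ordered categorical makes sort_values follow state_order, not A-Z
top2["State"] = pd.Categorical(top2["State"], categories=state_order, ordered=True)
top2 = top2.sort_values(["State", "Revenue"], ascending=[True, False])
top2 = top2.reset_index(drop=True)
print(top2)
```

Any state missing from state_order would become NaN in the categorical column, so the list needs to cover every state in the data.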
