Pandas to have one row per email - python

Say I have the following table:
Name Age occupation BillingContactEmail
Peter 44 Salesman a#a.com
Andy 43 Manager a#a.com
Halla 33 Fisherman b#b.com
How can I make the DataFrame contain only
Name Age occupation BillingContactEmail
Peter 44 Salesman a#a.com
Halla 33 Fisherman b#b.com
so that each email appears only once (that is, the emails are distinct in the end)?

Use drop_duplicates:
df.drop_duplicates(subset=['BillingContactEmail'])
Name Age occupation BillingContactEmail
0 Peter 44 Salesman a#a.com
2 Halla 33 Fisherman b#b.com
Addressing @DSM's comment:
You should be more specific about what criterion you want to use to decide which one to keep. The first seen with that email? The oldest? Etc.
By default, drop_duplicates keeps the first instance found. This is equivalent to
df.drop_duplicates(subset=['BillingContactEmail'], keep='first')
However, you could also specify to keep the last instance via keep='last'
df.drop_duplicates(subset=['BillingContactEmail'], keep='last')
Name Age occupation BillingContactEmail
1 Andy 43 Manager a#a.com
2 Halla 33 Fisherman b#b.com
Or drop every row whose email occurs more than once with keep=False:
df.drop_duplicates(subset=['BillingContactEmail'], keep=False)
Name Age occupation BillingContactEmail
2 Halla 33 Fisherman b#b.com
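If the criterion is something other than position, for example "keep the oldest person per email", one possible approach is to sort first and then drop duplicates. This is a minimal sketch assuming the sample data from the question; the variable name oldest_per_email is only illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Name": ["Peter", "Andy", "Halla"],
    "Age": [44, 43, 33],
    "occupation": ["Salesman", "Manager", "Fisherman"],
    "BillingContactEmail": ["a#a.com", "a#a.com", "b#b.com"],
})

# Sort so the preferred row (here: the highest Age) comes first,
# then keep the first row seen for every email.
oldest_per_email = (
    df.sort_values("Age", ascending=False)
      .drop_duplicates(subset=["BillingContactEmail"], keep="first")
      .sort_index()  # restore the original row order
)
print(oldest_per_email)
```

Changing the sort column or direction changes which row survives for each email.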

Related

How do I append a column in a dataframe and give each unique string a number?

I'm looking to append a column in a pandas data frame that is similar to the following "Identifier" column:
Name Age Identifier
Peter Pan 13 PanPe
James Jones 24 JonesJa
Peter Pan 22 PanPe
Chris Smith 19 SmithCh
I need the "Identifier" column to look like:
Identifier
PanPe01
JonesJa01
PanPe02
SmithCh01
How would I number each original string with 01? And if there are duplicates (for example Peter Pan), then the following duplicate strings (after the original 01) will have 02, 03, and so forth?
I've been referred to the following theory:
combo = "PanPe"
counts = {}
if combo in counts:
    count = counts[combo]
    counts[combo] = count + 1
else:
    counts[combo] = 1
However, getting a good example of code would be ideal, as I am relatively new to Python, and would love to know the syntax as how to implement an entire column iterated through this process, instead of just one string as shown above with "PanPe".
You can use cumcount here:
df['new_Identifier'] = df['Identifier'] + (df.groupby('Identifier').cumcount() + 1).astype(str).str.pad(2, 'left', '0')  # thanks @dm2 for the str.pad part
Output:
Name Age Identifier new_Identifier
0 Peter Pan 13 PanPe PanPe01
1 James Jones 24 JonesJa JonesJa01
2 Peter Pan 22 PanPe PanPe02
3 Chris Smith 19 SmithCh SmithCh01
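For comparison, the dictionary-counting idea from the question can also be applied to the whole column with a plain loop. This is only a sketch, assuming the sample data above; the names counts and numbered are illustrative, and the cumcount version is usually the more idiomatic choice for larger frames.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Name": ["Peter Pan", "James Jones", "Peter Pan", "Chris Smith"],
    "Age": [13, 24, 22, 19],
    "Identifier": ["PanPe", "JonesJa", "PanPe", "SmithCh"],
})

counts = {}    # how many times each identifier has been seen so far
numbered = []  # identifiers with a two-digit running number appended
for combo in df["Identifier"]:
    counts[combo] = counts.get(combo, 0) + 1
    numbered.append(f"{combo}{counts[combo]:02d}")

df["new_Identifier"] = numbered
print(df)
```

The loop visits the rows in order, so the first occurrence of each string gets 01, the next 02, and so on.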
Thank you @dm2 and @Bushmaster

Get names based on another column in pandas dataframe

I need to get the first and last names of people who work in HR department.
FirstName LastName Year Department
83 Joe Faulk 2 Austin Public Library
84 Bryce Benton 5 HR
85 Sarah Cronin 7 Austin Public Library
86 Gabriel Montgomery 2 Austin Resource Recovery
87 Patricia Genty-Andrade 3 HR
This is my code; it fails with AttributeError: 'DataFrame' object has no attribute 'unique'
names = df.iloc[:, 0:4][df['Department'] == 'HR'].unique()
I need the output to be like this
FirstName LastName Department
0 Joe Faulk HR
1 Bryce Benton HR
2 Sarah Cronin HR
3 Gabriel Montgomery HR
4 Patricia Genty-Andrade HR
Use drop_duplicates instead of unique; unique is a Series method and does not exist on a DataFrame.
df.loc[df['Department'] == 'HR',
       ['FirstName', 'LastName', 'Department']].drop_duplicates()
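To see what this returns, here is a self-contained sketch built from the rows shown in the question. The reset_index call and the name hr_names are additions for illustration, not part of the answer.

```python
import pandas as pd

# Rows copied from the question, including the original index labels.
df = pd.DataFrame(
    {
        "FirstName": ["Joe", "Bryce", "Sarah", "Gabriel", "Patricia"],
        "LastName": ["Faulk", "Benton", "Cronin", "Montgomery", "Genty-Andrade"],
        "Year": [2, 5, 7, 2, 3],
        "Department": ["Austin Public Library", "HR", "Austin Public Library",
                       "Austin Resource Recovery", "HR"],
    },
    index=[83, 84, 85, 86, 87],
)

# Boolean mask selects the HR rows, the list selects the columns,
# and drop_duplicates removes repeated (FirstName, LastName, Department) rows.
hr_names = (
    df.loc[df["Department"] == "HR", ["FirstName", "LastName", "Department"]]
      .drop_duplicates()
      .reset_index(drop=True)  # optional: renumber the result from 0
)
print(hr_names)
```

With the five sample rows, only the two people whose Department is HR remain.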

split pandas df row on delimiter and then rename another row based on index

Let's say I have the following df:
name job age&dob
bob teacher 35/1-1-85
kyle doctor 25/1-1-95
I want to split the age&dob values on the '/' delimiter, which can be achieved by putting age&dob into a list and then stacking it. However, I do not know how to label each resulting row based on its position in the split. For example, I want this:
name metadata age&dob job
bob age 35 teacher
bob dob 1-1-85 teacher
kyle age 25 doctor
kyle dob 1-1-95 doctor
I want metadata to be derived from the position within the split. In this case, since I know that age&dob.split('/')[0] is always going to be the age, I want 35 in that row and metadata set to age. I know how to split the df; it is just the labelling of the additional rows that I am missing.
Let us do
df['metadata'] = 'age&dob'
df['age&dob'] = df['age&dob'].str.split('/')
s=df.explode('age&dob').assign(metadata=df['metadata'].str.split('&').explode().tolist())
name job age&dob metadata
0 bob teacher 35 age
0 bob teacher 1-1-85 dob
1 kyle doctor 25 age
1 kyle doctor 1-1-95 dob
IIUC, let's use str.split, rename, stack and finally concat:
s = df['age&dob'].str.split('/',expand=True).rename({0 : 'age', 1 : 'dob'},axis=1)\
.stack().reset_index(1)\
.rename({'level_1' : 'metadata', 0 : 'age&dob'},axis=1)
df2 = pd.concat([df.drop(['age&dob'],axis=1),s],axis=1)
name job metadata age&dob
0 bob teacher age 35
0 bob teacher dob 1-1-85
1 kyle doctor age 25
1 kyle doctor dob 1-1-95
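Another way to reach the same long layout is to split into two named columns and then melt them. This is a sketch under the assumption that every value contains exactly one '/', with age first and dob second; the names parts, wide and result are illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "name": ["bob", "kyle"],
    "job": ["teacher", "doctor"],
    "age&dob": ["35/1-1-85", "25/1-1-95"],
})

# Split into two columns and name them by position: 0 -> age, 1 -> dob.
parts = df["age&dob"].str.split("/", expand=True)
parts.columns = ["age", "dob"]
wide = pd.concat([df[["name", "job"]], parts], axis=1)

# melt turns the age/dob columns into (metadata, value) pairs;
# ignore_index=False keeps the original row labels so the rows can be
# regrouped per person with a stable sort.
result = (
    wide.melt(id_vars=["name", "job"], value_vars=["age", "dob"],
              var_name="metadata", value_name="age&dob", ignore_index=False)
        .sort_index(kind="mergesort")
)
print(result)
```

The split values stay strings here; converting age to a number would be a separate step.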

find the maximum value in a column with respect to other column

I have the data frame below.
Input:
first_name last_name age preTestScore postTestScore
0 Jason Miller 42 4 25
1 Molly Jacobson 52 24 94
2 Tina Ali 36 31 57
3 Jake Milner 24 2 62
4 Amy Cooze 73 3 70
I want the output to be:
Amy 73
So basically I want to find the highest value in the age column, together with the name of the person who has it.
I tried pandas groupby as below:
df2=df.groupby(['first_name'])['age'].max()
But with this I get the following output:
first_name
Amy 73
Jake 24
Jason 42
Molly 52
Tina 36
Name: age, dtype: int64
whereas I only want
Amy 73
How should I go about it in pandas?
You can get your result with the code below
df.loc[df.age.idxmax(),['first_name','age']]
Here, with df.age.idxmax() we are getting the index of the row which has the maximum age value.
Then with df.loc[df.age.idxmax(),['first_name','age']] we are getting the columns 'first_name' & 'age' at that index.
This line of code should do the job:
df[df['age']==df['age'].max()][['first_name','age']]
The [['first_name','age']] part lists the columns you want in the result; change it as needed.
In this case the output will be
first_name age
4 Amy 73
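The two approaches return different shapes, which matters when several rows share the maximum: idxmax gives the label of the first maximum only, while the boolean mask keeps every matching row. A small sketch using the question's data (the names oldest_row and all_oldest are illustrative):

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "first_name": ["Jason", "Molly", "Tina", "Jake", "Amy"],
    "last_name": ["Miller", "Jacobson", "Ali", "Milner", "Cooze"],
    "age": [42, 52, 36, 24, 73],
    "preTestScore": [4, 24, 31, 2, 3],
    "postTestScore": [25, 94, 57, 62, 70],
})

# idxmax returns the index label of the first row holding the maximum,
# so .loc with that label yields a single row as a Series.
oldest_row = df.loc[df["age"].idxmax(), ["first_name", "age"]]

# The boolean mask keeps every row equal to the maximum, as a DataFrame.
all_oldest = df.loc[df["age"] == df["age"].max(), ["first_name", "age"]]

print(oldest_row)
print(all_oldest)
```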

Replacing values from one dataframe to another

I'm trying to fix discrepancies between a column in one df and the matching column in another.
The tables are also not sorted the same way.
How can I do this in Python? Example:
df1
Age Name
40 Sid Jones
50 Alex, Bot
32 Tony Jar
65 Fred, Smith
24 Brad, Mans
df2
Age Name
24 Brad Mans
32 Tony Jar
40 Sid Jones
65 Fred Smith
50 Alex Bot
I need to replace the values in df2 to match those in df1. As you can see in my example, the discrepancies are commas in the names.
Expected outcome for df2:
Age Name
24 Brad, Mans
32 Tony Jar
40 Sid Jones
65 Fred, Smith
50 Alex, Bot
The values in df2 should be changed to match the df1s values.
Create a column in df1 with commas removed from the Name column
df1['Name_nocomma'] = df1.Name.str.replace(',', '')
Merge df1 into df2, matching df2's Name against Name_nocomma, to pick up the corrected Name and create a new version of df2
df2_out = df2.merge(df1, left_on='Name', right_on='Name_nocomma', how='left')[['Age_x', 'Name_x', 'Name_y']]
use combine_first to coalesce Name_y & Name_x into a new column Name
df2_out['Name'] = df2_out.Name_y.combine_first(df2_out.Name_x)
drop / rename the intermediate columns
del df1['Name_nocomma']
del df2_out['Name_x']
del df2_out['Name_y']
df2_out.rename({'Age_x': 'Age'}, axis=1, inplace=True)
df2_out
#outputs:
Age Name
0 24 Brad, Mans
1 32 Tony Jar
2 40 Sid Jones
3 65 Fred, Smith
4 50 Alex, Bot
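The same correction can also be sketched without a merge, by building a lookup from the comma-free spelling to df1's spelling and mapping df2 through it. This assumes, as in the example, that commas are the only discrepancy and that the comma-free names are unique; the name lookup is illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame({
    "Age": [40, 50, 32, 65, 24],
    "Name": ["Sid Jones", "Alex, Bot", "Tony Jar", "Fred, Smith", "Brad, Mans"],
})
df2 = pd.DataFrame({
    "Age": [24, 32, 40, 65, 50],
    "Name": ["Brad Mans", "Tony Jar", "Sid Jones", "Fred Smith", "Alex Bot"],
})

# Lookup from the comma-free spelling to the spelling used in df1.
lookup = dict(zip(df1["Name"].str.replace(",", "", regex=False), df1["Name"]))

# Map each df2 name through the lookup; names without a match
# come back as NaN and fall back to their original value.
df2["Name"] = df2["Name"].map(lookup).fillna(df2["Name"])
print(df2)
```

Row order in df2 is untouched, so no sorting is needed beforehand.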
You need sort_values and concat:
df1.sort_values(by=['Age'], inplace=True)
df2.sort_values(by=['Age'], inplace=True)
result_df = pd.concat([df1, df2])
