Matching IDs with names on a pandas DataFrame - python

I have a DataFrame with 3 columns: id, bossId and name. Each row has a unique id and a corresponding name. bossId is the id of that person's boss. Suppose I have the following DataFrame:
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                   'bossId': [np.nan, 1, 2, 2, 3],
                   'name': ['Anne Boe', 'Ben Coe', 'Cate Doe', 'Dan Ewe', 'Erin Aoi']})
So here, Anne Boe is Ben Coe's boss, and Ben Coe is Cate's and Dan's boss, etc.
Now, I want to have another column that has the boss's name for each person.
The desired output is:
   id  bossId      name boss_name
0   1     NaN  Anne Boe       NaN
1   2     1.0   Ben Coe  Anne Boe
2   3     2.0  Cate Doe   Ben Coe
3   4     2.0   Dan Ewe   Ben Coe
4   5     3.0  Erin Aoi  Cate Doe
I can get my output using an ugly double for-loop. Is there a cleaner way to obtain the desired output?

This should work:
# Build a Series mapping id -> name, then look each bossId up in it
bossmap = df.set_index('id')['name']
df['boss_name'] = df['bossId'].map(bossmap)
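Series.map does a dict-style lookup against the Series index, and keys it cannot find (including Anne's NaN bossId) simply come back as NaN, so the no-boss case needs no special handling.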

You can set id as the index, then use pd.Series.reindex. The trailing .to_numpy() (or .to_list()) strips the index from the reindexed values so the assignment doesn't try to realign them:
df = df.set_index('id')
df['boss_name'] = df['name'].reindex(df['bossId']).to_numpy()  # or .to_list()
    bossId      name boss_name
id
1      NaN  Anne Boe       NaN
2      1.0   Ben Coe  Anne Boe
3      2.0  Cate Doe   Ben Coe
4      2.0   Dan Ewe   Ben Coe
5      3.0  Erin Aoi  Cate Doe

Create a separate dataframe with just 'id' and 'name'.
Rename 'name' to 'boss_name' and set 'id' as the index.
.merge df with the new dataframe, as shown below.
import numpy as np
import pandas as pd

# test dataframe
df = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                   'bossId': [np.nan, 1, 2, 2, 3],
                   'name': ['Anne Boe', 'Ben Coe', 'Cate Doe', 'Dan Ewe', 'Erin Aoi']})
# separate dataframe with id and name
names = df[['id', 'name']].dropna().set_index('id').rename(columns={'name': 'boss_name'})
# merge the two
df = df.merge(names, left_on='bossId', right_index=True, how='left')
# df
id bossId name boss_name
0 1 NaN Anne Boe NaN
1 2 1.0 Ben Coe Anne Boe
2 3 2.0 Cate Doe Ben Coe
3 4 2.0 Dan Ewe Ben Coe
4 5 3.0 Erin Aoi Cate Doe
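Here left_on='bossId' with right_index=True matches each row's bossId against the id index of names, and how='left' keeps the boss-less row (Anne) with a NaN boss_name.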

Related

replace values to a dataframe column from another dataframe

I am trying to replace values in a dataframe column with values from another dataframe, matching rows on a shared 'name' column, while keeping the rest of the values from the first df.
# df1
country name value
romania john 100
russia emma 200
sua mark 300
china jack 400
# df2
name value
emma 2
mark 3
Desired result:
# df3
country name value
romania john 100
russia emma 2
sua mark 3
china jack 400
Thank you
One approach could be as follows:
Use Series.map on column name and turn df2 into a Series for mapping by setting its index to name (df.set_index).
Next, chain Series.fillna to replace NaN values with original values from df.value (i.e. whenever mapping did not result in a match) and assign to df['value'].
df['value'] = df['name'].map(df2.set_index('name')['value']).fillna(df['value'])
print(df)
country name value
0 romania john 100.0
1 russia emma 2.0
2 sua mark 3.0
3 china jack 400.0
N.B. The result will now contain floats. If you prefer integers, chain .astype(int) as well.
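For instance, the same line with the cast chained on:
df['value'] = df['name'].map(df2.set_index('name')['value']).fillna(df['value']).astype(int)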
Another option could be using pandas.DataFrame.update, which aligns on the index and overwrites matching cells in place:
df1.set_index('name', inplace=True)
df1.update(df2.set_index('name'))
df1.reset_index(inplace=True)
name country value
0 john romania 100.0
1 emma russia 2.0
2 mark sua 3.0
3 jack china 400.0
Another option:
df3 = df1.merge(df2, on = 'name', how = 'left')
df3['value'] = df3.value_y.fillna(df3.value_x)
df3.drop(['value_x', 'value_y'], axis = 1, inplace = True)
# country name value
# 0 romania john 100.0
# 1 russia emma 2.0
# 2 sua mark 3.0
# 3 china jack 400.0
Reproducible data:
df1 = pd.DataFrame({'country': ['romania', 'russia', 'sua', 'china'],
                    'name': ['john', 'emma', 'mark', 'jack'],
                    'value': [100, 200, 300, 400]})
df2 = pd.DataFrame({'name': ['emma', 'mark'], 'value': [2, 3]})

Correct way to iterate over two dataframes to set specific values based on the value of another df

Edited to add an easier-to-reproduce dataframe.
I have two dataframes that look something like this:
df1
import numpy as np
import pandas as pd

index = [0, 1, 2, 3, 4, 5, 6, 7, 8]
a = pd.Series(['John Smith', 'John Smith', 'John Smith', 'Kobe Bryant', 'Kobe Bryant',
               'Kobe Bryant', 'Jeff Daniels', 'Jeff Daniels', 'Jeff Daniels'], index=index)
b = pd.Series(['7/29/2022', '8/7/2022', '8/29/2022', '7/9/2022', '7/29/2022',
               '8/9/2022', '7/28/2022', '8/8/2022', '8/28/2022'], index=index)
c = pd.Series([185, 187, 186.5, 212.5, 217.5, 220.5, 211.1, 210.5, 213], index=index)
d = pd.Series([np.nan] * len(index), index=index)
df1 = pd.DataFrame(np.c_[a, b, c, d], columns=["Name", "Date", "Weight", "Goal"])
or df1 in this format:
Name          Date       Weight  Goal
John Smith    7/29/2022  185     NaN
John Smith    8/7/2022   187     NaN
John Smith    8/29/2022  186.5   NaN
Kobe Bryant   7/9/2022   212.5   NaN
Kobe Bryant   7/29/2022  217.5   NaN
Kobe Bryant   8/9/2022   220.5   NaN
Jeff Daniels  7/28/2022  211.1   NaN
Jeff Daniels  8/8/2022   210.5   NaN
Jeff Daniels  8/28/2022  213     NaN
df2
index = [0, 1, 2]
a = pd.Series(['John Smith', 'Kobe Bryant', 'Jeff Daniels'], index=index)
b = pd.Series([195, 230, 220], index=index)
df2 = pd.DataFrame(np.c_[a, b], columns=["Name", "Weight Goal"])
or df2 in this format:
Name          Weight Goal
John Smith    195
Kobe Bryant   230
Jeff Daniels  220
What I want to do is iterate through df1 and set each player's respective weight goal from df2, but only for the August dates; I want to ignore the July dates.
I know that I shouldn't be using a for loop with a dataframe/pandas, but showing my mental thought process with one might make the intent of my code attempts clearer.
for player in df1['Name']:
    df1 = df1.loc[(df1['Name'] == f'{player}') & (df1['Date'] > '8/1/2022')]
    df1.at[df2['Name'] == f'{player}', 'Goal'] = (df2.loc[df2.Name == f'{player}']['Weight Goal'])
This just ends up delivering an empty dataframe and a SettingWithCopyWarning. I know this is not the right way to do it, but I thought it might help to direct me.
Thank You.
If I correctly understand the output you are after (stack overflow tip: it can be useful to provide a sample of your desired output to help people trying to answer your question), then this should work:
# make the Date column into datetime type so it is easier to filter on
df1 = df1.assign(Date=pd.to_datetime(df1.Date))
# separate out the august rows from the other months
df1_august = df1.loc[df1.Date.apply(lambda x: x.month == 8)]
df1_other_months = df1.loc[df1.Date.apply(lambda x: x.month != 8)]
# use a merge rather than a loop to get WeightGoal column in place
df1_august_merged = df1_august.merge(df2, on="Name")
# finally add the rows for the other months back in
final_df = pd.concat([df1_august_merged, df1_other_months])
print(final_df)
Name Date Weight Goal Weight Goal
0 John Smith 2022-08-07 187.0 NaN 195.0
1 John Smith 2022-08-29 186.5 NaN 195.0
2 Kobe Bryant 2022-08-09 220.5 NaN 230.0
3 Jeff Daniels 2022-08-08 210.5 NaN 220.0
4 Jeff Daniels 2022-08-28 213.0 NaN 220.0
0 John Smith 2022-07-29 185.0 NaN NaN
3 Kobe Bryant 2022-07-09 212.5 NaN NaN
4 Kobe Bryant 2022-07-29 217.5 NaN NaN
6 Jeff Daniels 2022-07-28 211.1 NaN NaN
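A more direct variant (a sketch, not from the thread): since Date has already been converted to datetime above, you can fill Goal in place with a boolean mask and Series.map, skipping the concat:
# select the August rows, then map each name to its goal from df2
aug = df1['Date'].dt.month == 8
df1.loc[aug, 'Goal'] = df1.loc[aug, 'Name'].map(df2.set_index('Name')['Weight Goal'])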

In-place update in pandas: update the value of the cell based on a condition

DOB Name
0 1956-10-30 Anna
1 1993-03-21 Jerry
2 2001-09-09 Peter
3 1993-01-15 Anna
4 1999-05-02 James
5 1962-12-17 Jerry
6 1972-05-04 Kate
In a dataframe similar to the one above, I have duplicate names. I want to add the suffix '_0' to a name if the DOB is before 1990 and the name is a duplicate.
I am expecting a result like this
DOB Name
0 1956-10-30 Anna_0
1 1993-03-21 Jerry
2 2001-09-09 Peter
3 1993-01-15 Anna
4 1999-05-02 James
5 1962-12-17 Jerry_0
6 1972-05-04 Kate
I am using the following
df['Name'] = df[(df['DOB'] < '01-01-1990') & (df['Name'].isin(['Anna','Jerry']))].Name.apply(lambda x: x+'_0')
But I am getting this result
DOB Name
0 1956-10-30 Anna_0
1 1993-03-21 NaN
2 2001-09-09 NaN
3 1993-01-15 NaN
4 1999-05-02 NaN
5 1962-12-17 Jerry_0
6 1972-05-04 NaN
How can I add a suffix only to names that are duplicated and where the person was born before 1990?
The problem with df['Name'] = df[(df['DOB'] < '01-01-1990') & (df['Name'].isin(['Anna','Jerry']))].Name.apply(lambda x: x+'_0') is that df[(df['DOB'] < '01-01-1990') & (df['Name'].isin(['Anna','Jerry']))] is a filtered dataframe with fewer rows than the original. When you assign it back, the rows that were filtered out have no corresponding value in the filtered dataframe, so they become NaN.
You can try mask instead
m = (df['DOB'] < '1990-01-01') & df['Name'].duplicated(keep=False)
df['Name'] = df['Name'].mask(m, df['Name']+'_0')
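Series.mask(cond, other) replaces values where cond is True and leaves the rest untouched, so the result keeps the original shape and the assignment produces no NaNs.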
You can use masks and boolean indexing:
# is the year before 1990?
m1 = pd.to_datetime(df['DOB']).dt.year.lt(1990)
# is the name duplicated?
m2 = df['Name'].duplicated(keep=False)
# if both conditions are True, add '_0' to the name
df.loc[m1&m2, 'Name'] += '_0'
output:
DOB Name
0 1956-10-30 Anna_0
1 1993-03-21 Jerry
2 2001-09-09 Peter
3 1993-01-15 Anna
4 1999-05-02 James
5 1962-12-17 Jerry_0
6 1972-05-04 Kate
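Note that duplicated(keep=False) flags every occurrence of a repeated name, not just the second and later ones; the DOB mask then narrows the selection to the pre-1990 rows.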

Update a dataframe based on values from another: a traditional UPSERT task with a new indicator column

I am trying to do an UPSERT task over two dataframes.
Here I am updating df2 with df1.
I have used something like this:
final_df = df1.set_index('EmpID').combine_first(df2.set_index('EmpID'))
final_df = final_df.reset_index()
My result here is:
EmpID Name Salary Status
0 A John 1000.0 Left
1 B Mary 2000.0 Working
2 C Samie 3000.0 Left
3 D Doe 4000.0 NaN
4 E Lance 2500.0 Contractor
Also, I am not able to add the 'Indicator' column.
I did the following and almost achieved my goal, but is there a better way? And what should I do about the indicator column?
df = pd.concat([df1, df2[~df2.EmpID.isin(df1.EmpID)]])
df = df.set_index('EmpID').join(df2.set_index('EmpID'), how='outer', rsuffix='_R')
df[['Name', 'Salary', 'Status_R']].reset_index()
EmpID Name Salary Status_R
0 A John 1000.0 Left
1 B Mary 2000.0 Working
2 C Samie NaN Left
3 D Doe 4000.0 NaN
4 E Lance 2500.0 Contractor
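For the indicator column specifically, one option (a sketch, not from the thread; the df1/df2 below are hypothetical reconstructions inferred from the outputs shown above) is pandas.merge with indicator=, which labels each row's source:
import pandas as pd

# Hypothetical inputs reconstructed from the outputs above (an assumption,
# since the thread does not show df1/df2 themselves)
df1 = pd.DataFrame({'EmpID': ['A', 'C', 'E'],
                    'Name': ['John', 'Samie', 'Lance'],
                    'Salary': [1000.0, 3000.0, 2500.0],
                    'Status': ['Left', 'Left', 'Contractor']})
df2 = pd.DataFrame({'EmpID': ['A', 'B', 'C', 'D'],
                    'Name': ['John', 'Mary', 'Samie', 'Doe'],
                    'Salary': [1000.0, 2000.0, 3000.0, 4000.0],
                    'Status': ['Working', 'Working', 'Working', None]})

# The upsert itself: df1 values win, df2 fills the gaps
final_df = df1.set_index('EmpID').combine_first(df2.set_index('EmpID'))

# merge's indicator flag labels each EmpID by where it came from:
# 'both' = updated, 'left_only' = inserted, 'right_only' = untouched
flags = df1[['EmpID']].merge(df2[['EmpID']], on='EmpID',
                             how='outer', indicator='Indicator')
final_df = final_df.reset_index().merge(flags, on='EmpID')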

Fill dataframe nan values from a join

I am trying to map owners to IP addresses through the use of two tables, df1 and df2. df1 contains the IP list to be mapped and df2 contains an IP, an alias, and the owner. After running a join on the IP column, I get a half-joined dataframe. Most of the remaining rows could be matched by a second join on the Alias column to fill the NaN values, but I can't figure out how to do it.
My initial thoughts were to try nesting pd.merge inside fillna(), but it won't accept a dataframe. Any help would be greatly appreciated.
df1 = pd.DataFrame({'IP' : ['192.18.0.100', '192.18.0.101', '192.18.0.102', '192.18.0.103', '192.18.0.104']})
df2 = pd.DataFrame({'IP' : ['192.18.0.100', '192.18.0.101', '192.18.1.206', '192.18.1.218', '192.18.1.118'],
'Alias' : ['192.18.1.214', '192.18.1.243', '192.18.0.102', '192.18.0.103', '192.18.1.180'],
'Owner' : ['Smith, Jim', 'Bates, Andrew', 'Kline, Jenny', 'Hale, Fred', 'Harris, Robert']})
new_df = df1.merge(df2[['IP', 'Owner']], on='IP', how='left')
Expected output is:
IP Owner
192.18.0.100 Smith, Jim
192.18.0.101 Bates, Andrew
192.18.0.102 Kline, Jenny
192.18.0.103 Hale, Fred
192.18.0.104 nan
No need to merge; just pull the data where the condition is satisfied. This is faster than a merge and less complicated.
import numpy as np

condition = (df1['IP'] == df2['IP']) | (df1['IP'] == df2['Alias'])
df1['Owner'] = np.where(condition, df2['Owner'], np.nan)
print(df1)
IP Owner
0 192.18.0.100 Smith, Jim
1 192.18.0.101 Bates, Andrew
2 192.18.0.102 Kline, Jenny
3 192.18.0.103 Hale, Fred
4 192.18.0.104 NaN
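Note, though, that == between two Series aligns row by row on the index, so this approach only gives the right answer when the matching rows of df1 and df2 happen to line up positionally, as they do in this sample.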
Try this one:
lookup = pd.concat([df2[['IP', 'Owner']],
                    df2[['Alias', 'Owner']].rename(columns={"Alias": "IP"})]).drop_duplicates()
new_df = df1.merge(lookup, on='IP', how='left')
The result:
>>> new_df
IP Owner
0 192.18.0.100 Smith, Jim
1 192.18.0.101 Bates, Andrew
2 192.18.0.102 Kline, Jenny
3 192.18.0.103 Hale, Fred
4 192.18.0.104 NaN
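This works because the concat builds one lookup table in which each owner appears under both their IP and their Alias, so a single left merge covers both match paths.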
Let's melt then use map:
df1['IP'].map(df2.melt('Owner').set_index('value')['Owner'])
Output:
0 Smith, Jim
1 Bates, Andrew
2 Kline, Jenny
3 Hale, Fred
4 NaN
Name: IP, dtype: object
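Here df2.melt('Owner') stacks the IP and Alias columns into a single value column, so setting value as the index yields one lookup Series covering both; assign the result to df1['Owner'] to get the expected column.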
