Match two columns in dataframe

Match two columns in dataframe - python

I have two columns in dataframe df
ID Name
AXD2 SAM S
AXD2 SAM
SCA4 JIM
SCA4 JIM JONES
ASCQ JOHN
I need the output to get a unique id and should match the first name only,
ID Name
AXD2 SAM S
SCA4 JIM
ASCQ JOHN
Any suggestions?

You can use groupby with agg and get first of Name
df.groupby(['ID']).agg(first_name=('Name', 'first')).reset_index()

Use drop_duplicates:
out = df.drop_duplicates('ID', ignore_index=True)
print(out)
# Output
ID Name
0 AXD2 SAM S
1 SCA4 JIM
2 ASCQ JOHN

You can use cumcount() to find the first iteration name of the ID
df['RN'] = df.groupby(['ID']).cumcount() + 1
df = df.loc[df['RN'] == 1]
df[['ID', 'Name']]

Related

Check if pandas column contains text in another dataframe and replace values

I have two df's, one for user names and another for real names. I'd like to know how I can check if I have a real name in my first df using the data of the other, and then replace it.
For example:
import pandas as pd
df1 = pd.DataFrame({'userName':['peterKing', 'john', 'joe545', 'mary']})
df2 = pd.DataFrame({'realName':['alice','peter', 'john', 'francis', 'joe', 'carol']})
df1
userName
0 peterKing
1 john
2 joe545
3 mary
df2
realName
0 alice
1 peter
2 john
3 francis
4 joe
5 carol
My code should replace 'peterKing' and 'joe545' since these names appear in my df2. I tried using pd.contains, but I can only verify if a name appears or not.
The output should be like this:
userName
0 peter
1 john
2 joe
3 mary
Can someone help me with that? Thanks in advance!

You can use loc[row, colum], here you can see the documentation about loc method. And Series.str.contain method to select the usernames you need to replace with the real names. In my opinion, this solution is clear in terms of readability.
for real_name in df2['realName'].to_list():
df1.loc[ df1['userName'].str.contains(real_name), 'userName' ] = real_name
Output:
userName
0 peter
1 john
2 joe
3 mary

Compare two data-frames with different column names and update first data-frame with the column from second data-frame

I am working on two data-frames which have different column names and dimensions.
First data-frame "df1" contains single column "name" that has names need to be located in second data-frame. If matched, value from df2 first column df2[0] needs to be returned and added in the result_df
Second data-frame "df2" has multiple columns and no header. This contains all the possible diminutive names and full names. Any of the column can have the "name" that needs to be matched
Goal: Locate the name in "df1" in "df2" and if it is matched, return the value from first column of the df2 and add in the respective row of df1
df1
name
ab
alex
bob
robert
bill
df2
0
1
2
3
abram
ab
robert
rob
bob
robbie
alexander
alex
al
william
bill
result_df
name
matched_name
ab
abram
alex
alexander
bob
robert
robert
robert
bill
william
The code i have written so far is giving error. I need to write it as an efficient code as it will be checking millions of entries in df1 with df2:
'''
result_df = process_name(df1, df2)
def process_name(df1, df2):
for elem in df2.values:
if elem in df1['name']:
df1["matched_name"] = df2[0]
'''

Try via concat(),merge(),drop() and rename() and reset_index() method:
df=(pd.concat((df1.merge(df2,left_on='name',right_on=x) for x in df2.columns))
.drop(['1','2','3'],1)
.rename(columns={'0':'matched_name'})
.reset_index(drop=True))
Output of df:
name matched_name
0 robert robert
1 ab abram
2 alex alexander
3 bill william
4 bob robert

Python: Sum values in DataFrame if other values match between DataFrames

I have two dataframes of different length like those:
DataFrame A:
FirstName LastName
Adam Smith
John Johnson
DataFrame B:
First Last Value
Adam Smith 1.2
Adam Smith 1.5
Adam Smith 3.0
John Johnson 2.5
Imagine that what I want to do is to create a new column in "DataFrame A" summing all the values with matching last names, so the output in "A" would be:
FirstName LastName Sums
Adam Smith 5.7
John Johnson 2.5
If I were in Excel, I'd use
=SUMIF(dfB!B:B, B2, dfB!C:C)
In Python I've been trying multiple solutions but using both np.where, df.sum(), dropping indexes etc., but I'm lost. Below code is returning "ValueError: Can only compare identically-labeled Series objects", but I don't think it's written correctly anyways.
df_a['Sums'] = df_a[df_a['LastName'] == df_b['Last']].sum()['Value']
Huge thanks in advance for any help.

Use boolean indexing with Series.isin for filtering and then aggregate sum:
df = (df_b[df_b['Last'].isin(df_a['LastName'])]
.groupby(['First','Last'], as_index=False)['Value']
.sum())
If want match both, first and last name:
df = (df_b.merge(df_a, left_on=['First','Last'], right_on=['FirstName','LastName'])
.groupby(['First','Last'], as_index=False)['Value']
.sum())

df_b_a = (pd.merge(df_b, df_a, left_on=['FirstName', 'LastName'], right_on=['First', 'Last'], how='left')
.groupby(by=['First', 'Last'], as_index=False)['Value'].sum())
print(df_b_a)
First Last Value
0 Adam Smith 5.7
1 John Johnson 2.5

Use DataFrame.merge + DataFrame.groupby:
new_df=( dfa.merge(dfb.groupby(['First','Last'],as_index=False).Value.sum() ,
left_on='LastName',right_on='Last',how='left')
.drop('Last',axis=1) )
print(new_df)
to join for both columns:
new_df=( dfa.merge(dfb.groupby(['First','Last'],as_index=False).Value.sum() ,
left_on=['FirstName','LastName'],right_on=['First','Last'],how='left')
.drop(['First','Last'],axis=1) )
print(new_df)
Output:
FirstName LastName Value
0 Adam Smith 5.7
1 John Johnson 2.5

Fill dataframe nan values from a join

I am trying to map owners to an IP address through the use of two tables, df1 & df2. df1 contains the IP list to be mapped and df2 contains an IP, an alias, and the owner. After running a join on the IP column, it gives me a half joined dataframe. Most of the remaining data can be joined by replacing the NaN values with a join on the Alias column, but I can’t figure out how to do it.
My initial thoughts were to try nesting pd.merge inside fillna(), but it won't accept a dataframe. Any help would be greatly appreciated.
df1 = pd.DataFrame({'IP' : ['192.18.0.100', '192.18.0.101', '192.18.0.102', '192.18.0.103', '192.18.0.104']})
df2 = pd.DataFrame({'IP' : ['192.18.0.100', '192.18.0.101', '192.18.1.206', '192.18.1.218', '192.18.1.118'],
'Alias' : ['192.18.1.214', '192.18.1.243', '192.18.0.102', '192.18.0.103', '192.18.1.180'],
'Owner' : ['Smith, Jim', 'Bates, Andrew', 'Kline, Jenny', 'Hale, Fred', 'Harris, Robert']})
new_df = pd.DataFrame(pd.merge(df1, df2[['IP', 'Owner']], on='IP', how= 'left'))
Expected output is:
IP Owner
192.18.0.100 Smith, Jim
192.18.0.101 Bates, Andrew
192.18.0.102 Kline, Jenny
192.18.0.103 Hale, Fred
192.18.0.104 nan

No need to merge, Just pull data where condition satisfies. This is way faster than merge and less complicated.
condition = (df1['IP'] == df2['IP']) | (df1['IP'] == df2['Alias'])
df1['Owner'] = np.where(condition, df2['Owner'], np.nan)
print(df1)
IP Owner
0 192.18.0.100 Smith, Jim
1 192.18.0.101 Bates, Andrew
2 192.18.0.102 Kline, Jenny
3 192.18.0.103 Hale, Fred
4 192.18.0.104 NaN

Try this one:
new_df = pd.DataFrame(pd.merge(df1, pd.concat([df2[['IP', 'Owner']], df2[['Alias', 'Owner']].rename(columns={"Alias": "IP"})]).drop_duplicates(), on='IP', how= 'left'))
The result:
>>> new_df
IP Owner
0 192.18.0.100 Smith, Jim
1 192.18.0.101 Bates, Andrew
2 192.18.0.102 Kline, Jenny
3 192.18.0.103 Hale, Fred
4 192.18.0.104 NaN

Let's melt then use map:
df1['IP'].map(df2.melt('Owner').set_index('value')['Owner'])
Output:
0 Smith, Jim
1 Bates, Andrew
2 Kline, Jenny
3 Hale, Fred
4 NaN
Name: IP, dtype: object

Python - Sort a Pandas Dataframe twice

I would like to sort a Pandas dataframe twice the same way excel does. Given the following df:
Name Date
John 13/01
Mike 13/01
John 15/01
John 14/01
Mike 12/01
When adding the following code:
df=df.sort_values(['Date','Name'], ascending=[True, True])
I would expect the following result:
Name Date
John 13/01
John 14/01
John 15/01
Mike 12/01
Mike 13/01
I'm getting nothing close to this result with the code above. Any idea where's the mistake?
Many thanks!

You need swap columns, because first sort by Name and then by Date, ascending=[True, True] should be removed, because default parameter:
df = df.sort_values(['Name','Date'])
print (df)
Name Date
0 John 13/01
3 John 14/01
2 John 15/01
4 Mike 12/01
1 Mike 13/01

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Match two columns in dataframe - python

I have two columns in dataframe df ID Name AXD2 SAM S AXD2 SAM SCA4 JIM SCA4 JIM JONES ASCQ JOHN I need the output to get a unique id and should match the first name only, ID Name AXD2 SAM S SCA4 JIM ASCQ JOHN Any suggestions?

You can use groupby with agg and get first of Name df.groupby(['ID']).agg(first_name=('Name', 'first')).reset_index()

Use drop_duplicates: out = df.drop_duplicates('ID', ignore_index=True) print(out) # Output ID Name 0 AXD2 SAM S 1 SCA4 JIM 2 ASCQ JOHN

You can use cumcount() to find the first iteration name of the ID df['RN'] = df.groupby(['ID']).cumcount() + 1 df = df.loc[df['RN'] == 1] df[['ID', 'Name']]

Related

Check if pandas column contains text in another dataframe and replace values

Compare two data-frames with different column names and update first data-frame with the column from second data-frame

Python: Sum values in DataFrame if other values match between DataFrames

Fill dataframe nan values from a join

Python - Sort a Pandas Dataframe twice

Categories

Resources