Merge based on partial string match in pandas dfs - python

I have a df that looks like this
first_name last_name
John Doe
Kelly Stevens
Dorey Chang
and another that looks like this
name email
John Doe jdoe23@gmail.com
Kelly M Stevens kelly.stevens@hotmail.com
D Chang chang79@yahoo.com
To merge these 2 tables, such that the end result is
first_name last_name email
John Doe jdoe23@gmail.com
Kelly Stevens kelly.stevens@hotmail.com
Dorey Chang chang79@yahoo.com
I can't merge on name, but all emails contain each person's last name even if the overall format differs. Is there a way to merge these using only a partial string match?
I've tried things like this with no success:
df1['email']= df2[df2['email'].str.contains(df['last_name'])==True]

IIUC, you can merge on the result of an extract:
df1.merge(df2.assign(last_name=df2['name'].str.extract(r' (\w+)$', expand=False))
          .drop('name', axis=1),
          on='last_name',
          how='left')
Output:
first_name last_name email
0 John Doe jdoe23@gmail.com
1 Kelly Stevens kelly.stevens@hotmail.com
2 Dorey Chang chang79@yahoo.com
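A self-contained sketch of the extract-and-merge approach (dataframes reconstructed from the question; `expand=False` makes `str.extract` return a Series so it can be assigned as a column):

```python
import pandas as pd

df1 = pd.DataFrame({'first_name': ['John', 'Kelly', 'Dorey'],
                    'last_name': ['Doe', 'Stevens', 'Chang']})
df2 = pd.DataFrame({'name': ['John Doe', 'Kelly M Stevens', 'D Chang'],
                    'email': ['jdoe23@gmail.com', 'kelly.stevens@hotmail.com',
                              'chang79@yahoo.com']})

# Pull the last word of each full name out as a merge key.
df2['last_name'] = df2['name'].str.extract(r' (\w+)$', expand=False)

# Merge on the extracted last name; drop the now-redundant name column.
result = df1.merge(df2.drop(columns='name'), on='last_name', how='left')
print(result)
```

Note this keys the join solely on the last word of the name, so it assumes last names are unique in the data.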

How to create conditional clause if column in dataframe is empty?

I have a df that looks like this:
fname lname
joe smith
john smith
jane@jane.com
jacky /jax jack
a@a.com non
john (jack) smith
Bob J. Smith
I want to create logic that says: if lname is empty and fname contains two or three strings, separate the second or third string and push it into the lname column. If fname holds an email address, leave it as is; and if fname contains slashes or parentheses and lname has no value, leave it as is.
new df:
fname lname
joe smith
john smith
jane@jane.com
jacky /jax jack
a@a.com non
john (jack) smith
Bob J. smith
Code so far to separate two strings:
df[['lname']] = df['name'].loc[df['fname'].str.split().str.len() == 2].str.split(expand=True)
With the following sample dataframe:
df = pd.DataFrame({'fname': ['joe', 'john smith', 'jane@jane.com', 'jacky /jax',
                             'a@a.com', 'john (jack)', 'Bob J. Smith'],
                   'lname': ['smith', '', '', 'jack', 'non', 'smith', '']})
You can use np.where():
import numpy as np

conditions = (df['lname']=='') & (df['fname'].str.split().str.len()>1)
df['lname'] = np.where(conditions, df['fname'].str.split().str[-1].str.lower(), df['lname'])
Yields:
fname lname
0 joe smith
1 john smith smith
2 jane@jane.com
3 jacky /jax jack
4 a@a.com non
5 john (jack) smith
6 Bob J. Smith smith
To remove the last string from the fname column of the rows that had their lname column populated:
df['fname'] = np.where(conditions, df['fname'].str.split().str[:-1].str.join(' '), df['fname'])
Yields:
fname lname
0 joe smith
1 john smith
2 jane@jane.com
3 jacky /jax jack
4 a@a.com non
5 john (jack) smith
6 Bob J. smith
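Assembled into one runnable sketch (using the sample dataframe reconstructed above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'fname': ['joe', 'john smith', 'jane@jane.com', 'jacky /jax',
                             'a@a.com', 'john (jack)', 'Bob J. Smith'],
                   'lname': ['smith', '', '', 'jack', 'non', 'smith', '']})

# Rows whose lname is empty and whose fname holds more than one word.
conditions = (df['lname'] == '') & (df['fname'].str.split().str.len() > 1)

# Move the last word of fname into lname (lowercased),
# then strip that word from fname; other rows are untouched.
df['lname'] = np.where(conditions, df['fname'].str.split().str[-1].str.lower(), df['lname'])
df['fname'] = np.where(conditions, df['fname'].str.split().str[:-1].str.join(' '), df['fname'])
print(df)
```

The conditions mask is computed once, before either column is modified, so the fname update still sees the original values.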
If I understand correctly, you have a dataframe with columns fname and lname. If so, you can modify the empty rows in column lname with:
condition = (df.loc[:, 'lname'] == '') & (df.loc[:, 'fname'].str.contains(' '))
df.loc[condition, 'lname'] = df.loc[condition, 'fname'].str.split().str[-1]
The code works for the sample data provided in the question but should be improved for more general use.
To modify column fname you may use:
df.loc[condition, 'fname'] = df.loc[condition, 'fname'].str.split().str[:-1].str.join(sep=' ')
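For reference, the .loc variant end to end with the same sample data (this version keeps the original capitalisation of the moved word):

```python
import pandas as pd

df = pd.DataFrame({'fname': ['joe', 'john smith', 'jane@jane.com', 'jacky /jax',
                             'a@a.com', 'john (jack)', 'Bob J. Smith'],
                   'lname': ['smith', '', '', 'jack', 'non', 'smith', '']})

# Rows with an empty lname and a multi-word fname.
condition = (df.loc[:, 'lname'] == '') & (df.loc[:, 'fname'].str.contains(' '))

# Move the last word of fname into lname, then drop it from fname.
df.loc[condition, 'lname'] = df.loc[condition, 'fname'].str.split().str[-1]
df.loc[condition, 'fname'] = df.loc[condition, 'fname'].str.split().str[:-1].str.join(sep=' ')
print(df)
```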

How to combine sparse rows in a pandas dataframe

So lets say I have a pandas data frame:
In [1]: import pandas as pd
...
In [4]: df
Out[3]:
Person_ID First_Name Last_Name Phone_Number Email
1 A456 John Doe None None
2 A456 John Doe 123-123-1234 john.doe@test.com
3 A890 Joe Dirt 321-321-4321 None
4 A890 Joe Dirt None joe@email.com
and I would like to cook up some transformation to turn it into this:
Person_ID First_Name Last_Name Phone_Number Email
1 A456 John Doe 123-123-1234 john.doe@test.com
2 A890 Joe Dirt 321-321-4321 joe@email.com
i.e. I would like to be able to take a data frame which may have multiple rows of data corresponding to the same person (and sharing a Person_ID) but missing entries in different places and combine those entries intro a row containing all of the information.
Importantly, this is not just a task of filtering out the rows with more None values in them as for instance in my toy example, line 3 and 4 have an equal number of None's but the data is populated in different places.
Would anyone have advice on how to go about doing this?
Try groupby with apply using ffill and bfill, then drop_duplicates:
df1 = df.groupby('Person_ID').apply(lambda x: x.ffill().bfill()).drop_duplicates()
print (df1)
Person_ID First_Name Last_Name Phone_Number Email
1 A456 John Doe 123-123-1234 john.doe@test.com
3 A890 Joe Dirt 321-321-4321 joe@email.com
You could:
In [1571]: df.groupby('Person_ID', as_index=False).apply(
lambda x: x.ffill().bfill().iloc[0])
Out[1571]:
Person_ID First_Name Last_Name Phone_Number Email
0 A456 John Doe 123-123-1234 john.doe@test.com
1 A890 Joe Dirt 321-321-4321 joe@email.com
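An alternative sketch: GroupBy.first() returns the first non-null value per column within each group, which collapses the sparse rows in a single pass (sample data reconstructed from the question):

```python
import pandas as pd

df = pd.DataFrame({
    'Person_ID': ['A456', 'A456', 'A890', 'A890'],
    'First_Name': ['John', 'John', 'Joe', 'Joe'],
    'Last_Name': ['Doe', 'Doe', 'Dirt', 'Dirt'],
    'Phone_Number': [None, '123-123-1234', '321-321-4321', None],
    'Email': [None, 'john.doe@test.com', None, 'joe@email.com'],
})

# first() skips missing values, so each group collapses to one
# fully populated row regardless of where the gaps are.
result = df.groupby('Person_ID', as_index=False).first()
print(result)
```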

Merging two columns with different information, python

I have a dataframe with one column of last names, and one column of first names. How do I merge these columns so that I have one column with first and last names?
Here is what I have:
First Name (Column 1)
John
Lisa
Jim
Last Name (Column 2)
Smith
Brown
Dandy
This is what I want:
Full Name
John Smith
Lisa Brown
Jim Dandy
Thank you!
Try
df.assign(name = df.apply(' '.join, axis = 1)).drop(['first name', 'last name'], axis = 1)
You get
name
0 bob smith
1 john smith
2 bill smith
Here's a sample df:
df
first name last name
0 bob smith
1 john smith
2 bill smith
You can do the following to combine columns:
df['combined']= df['first name'] + ' ' + df['last name']
df
first name last name combined
0 bob smith bob smith
1 john smith john smith
2 bill smith bill smith
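A third option is Series.str.cat, which concatenates string columns element-wise and, unlike +, lets you handle missing values via na_rep (column names follow the question's sample):

```python
import pandas as pd

df = pd.DataFrame({'first name': ['John', 'Lisa', 'Jim'],
                   'last name': ['Smith', 'Brown', 'Dandy']})

# Join the two name columns with a space between them.
df['Full Name'] = df['first name'].str.cat(df['last name'], sep=' ')
print(df['Full Name'].tolist())
```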

Pandas merge two dataframes without duplicating column

My question is similar to Pandas Merge - How to avoid duplicating columns but I cannot find a solution for the specific example below.
I have DateFrame df:
Customer Address
J. Smith 10 Sunny Rd Timbuktu
and Dataframe emails:
Name Email
J. Smith j.smith@myemail.com
I want to merge the two dataframes to produce:
Customer Address Email
J. Smith 10 Sunny Rd Timbuktu j.smith@myemail.com
I am using the following code:
data_names = {'Name':data_col[1], ...}
mapped_name = data_names['Name']
df = df.merge(emails, how='inner', left_on='Customer', right_on=mapped_name)
The result is:
Customer Address Email Name
J. Smith 10 Sunny Rd Timbuktu j.smith@myemail.com J. Smith
While I could just delete the column named mapped_name, there is the possibility that the mapped_name could be 'Customer' and in that case I dont want to remove both Customer columns.
Any ideas?
I think you can rename the first column of the emails dataframe to Customer; how='inner' can be omitted because it is the default:
emails.columns = ['Customer'] + emails.columns[1:].tolist()
df = df.merge(emails, on='Customer')
print (df)
Customer Address Email
0 J. Smith 10 Sunny Rd Timbuktu j.smith@myemail.com
A similar solution to the other answer: rename the first column, selected by [0]:
df = df.merge(emails.rename(columns={emails.columns[0]:'Customer'}), on='Customer')
print (df)
Customer Address Email
0 J. Smith 10 Sunny Rd Timbuktu j.smith@myemail.com
You can just rename your email name column to 'Customer' and then merge. This way, you don't need to worry about dropping the column at all.
df.merge(emails.rename(columns={mapped_name:'Customer'}), how='inner', on='Customer')
Out[53]:
Customer Address Email
0 J. Smith 10 Sunny Rd Timbuktu j.smith@myemail.com
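A minimal runnable sketch of the rename-then-merge approach (data reconstructed from the question; mapped_name is looked up from the emails frame's columns rather than the question's data_names dict):

```python
import pandas as pd

df = pd.DataFrame({'Customer': ['J. Smith'],
                   'Address': ['10 Sunny Rd Timbuktu']})
emails = pd.DataFrame({'Name': ['J. Smith'],
                       'Email': ['j.smith@myemail.com']})

# Rename the key column so both sides share one merge key; if the
# column were already named 'Customer' the rename is a harmless no-op.
mapped_name = emails.columns[0]
merged = df.merge(emails.rename(columns={mapped_name: 'Customer'}), on='Customer')
print(merged)
```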

Duplicated rows when merging dataframes in Python

I am currently merging two dataframes with an outer join. However, after merging, I see all the rows are duplicated even when the columns that I merged upon contain the same values.
Specifically, I have the following code.
merged_df = pd.merge(df1, df2, on=['email_address'], how='inner')
Here are the two dataframes and the results.
df1
email_address name surname
0 john.smith@email.com john smith
1 john.smith@email.com john smith
2 elvis@email.com elvis presley
df2
email_address street city
0 john.smith@email.com street1 NY
1 john.smith@email.com street1 NY
2 elvis@email.com street2 LA
merged_df
email_address name surname street city
0 john.smith@email.com john smith street1 NY
1 john.smith@email.com john smith street1 NY
2 john.smith@email.com john smith street1 NY
3 john.smith@email.com john smith street1 NY
4 elvis@email.com elvis presley street2 LA
5 elvis@email.com elvis presley street2 LA
My question is, shouldn't it be like this?
This is how I would like my merged_df to be like.
email_address name surname street city
0 john.smith@email.com john smith street1 NY
1 john.smith@email.com john smith street1 NY
2 elvis@email.com elvis presley street2 LA
Are there any ways I can achieve this?
list_2_nodups = list_2.drop_duplicates()
pd.merge(list_1 , list_2_nodups , on=['email_address'])
The duplicate rows are expected. Each john smith in list_1 matches with each john smith in list_2. I had to drop the duplicates in one of the lists. I chose list_2.
DO NOT drop duplicates BEFORE the merge, but after!
The best solution is to do the merge and then drop the duplicates.
In your case:
merged_df = pd.merge(df1, df2, on=['email_address'], how='inner')
merged_df.drop_duplicates(subset=['email_address'], keep='first', inplace=True, ignore_index=True)
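For comparison, a runnable sketch of the dedupe-before-merge approach from the first answer, which keeps df1's two john.smith rows and reproduces the desired three-row output (data reconstructed from the question):

```python
import pandas as pd

df1 = pd.DataFrame({'email_address': ['john.smith@email.com', 'john.smith@email.com',
                                      'elvis@email.com'],
                    'name': ['john', 'john', 'elvis'],
                    'surname': ['smith', 'smith', 'presley']})
df2 = pd.DataFrame({'email_address': ['john.smith@email.com', 'john.smith@email.com',
                                      'elvis@email.com'],
                    'street': ['street1', 'street1', 'street2'],
                    'city': ['NY', 'NY', 'LA']})

# Dropping duplicate rows from df2 before the merge prevents the
# many-to-many row explosion while leaving df1's rows intact.
merged = pd.merge(df1, df2.drop_duplicates(), on='email_address', how='inner')
print(merged)
```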
