Duplicated rows when merging dataframes in Python

Duplicated rows when merging dataframes in Python - python

I am currently merging two dataframes with an outer join. However, after merging, I see all the rows are duplicated even when the columns that I merged upon contain the same values.
Specifically, I have the following code.
merged_df = pd.merge(df1, df2, on=['email_address'], how='inner')
Here are the two dataframes and the results.
df1
email_address name surname
0 john.smith#email.com john smith
1 john.smith#email.com john smith
2 elvis#email.com elvis presley
df2
email_address street city
0 john.smith#email.com street1 NY
1 john.smith#email.com street1 NY
2 elvis#email.com street2 LA
merged_df
email_address name surname street city
0 john.smith#email.com john smith street1 NY
1 john.smith#email.com john smith street1 NY
2 john.smith#email.com john smith street1 NY
3 john.smith#email.com john smith street1 NY
4 elvis#email.com elvis presley street2 LA
5 elvis#email.com elvis presley street2 LA
My question is, shouldn't it be like this?
This is how I would like my merged_df to be like.
email_address name surname street city
0 john.smith#email.com john smith street1 NY
1 john.smith#email.com john smith street1 NY
2 elvis#email.com elvis presley street2 LA
Are there any ways I can achieve this?

list_2_nodups = list_2.drop_duplicates()
pd.merge(list_1 , list_2_nodups , on=['email_address'])
The duplicate rows are expected. Each john smith in list_1 matches with each john smith in list_2. I had to drop the duplicates in one of the lists. I chose list_2.

DO NOT drop duplicates BEFORE the merge, but after!
Best solution is do the merge and then drop the duplicates.
In your case:
merged_df = pd.merge(df1, df2, on=['email_address'], how='inner')
merged_df.drop_duplicates(subset=['email_address'], keep='first', inplace=True, ignore_index=True)

Related

Split pandas dataframe column of type string into multiple columns based on number of ',' characters

Let's say I have a pandas dataframe that looks like this:
import pandas as pd
data = {'name': ['Tom, Jeffrey, Henry', 'Nick, James', 'Chris', 'David, Oscar']}
df = pd.DataFrame(data)
df
name
0 Tom, Jeffrey, Henry
1 Nick, James
2 Chris
3 David, Oscar
I know I can split the names into separate columns using the comma as separator, like so:
df[["name1", "name2", "name3"]] = df["name"].str.split(", ", expand=True)
df
name name1 name2 name3
0 Tom, Jeffrey, Henry Tom Jeffrey Henry
1 Nick, James Nick James None
2 Chris Chris None None
3 David, Oscar David Oscar None
However, if the name column would have a row that contains 4 names, like below, the code above will yield a ValueError: Columns must be same length as key
data = {'name': ['Tom, Jeffrey, Henry', 'Nick, James', 'Chris', 'David, Oscar', 'Jim, Jones, William, Oliver']}
# Create DataFrame
df = pd.DataFrame(data)
df
name
0 Tom, Jeffrey, Henry
1 Nick, James
2 Chris
3 David, Oscar
4 Jim, Jones, William, Oliver
How can automatically split the name column into n-number of separate columns based on the ',' separator? The desired output would be this:
name name1 name2 name3 name4
0 Tom, Jeffrey, Henry Tom Jeffrey Henry None
1 Nick, James Nick James None None
2 Chris Chris None None None
3 David, Oscar David Oscar None None
4 Jim, Jones, William, Oliver Jim Jones William Oliver

Use DataFrame.join for new DataFrame with rename for new columns names:
f = lambda x: f'name{x+1}'
df = df.join(df["name"].str.split(", ", expand=True).rename(columns=f))
print (df)
name name1 name2 name3 name4
0 Tom, Jeffrey, Henry Tom Jeffrey Henry None
1 Nick, James Nick James None None
2 Chris Chris None None None
3 David, Oscar David Oscar None None
4 Jim, Jones, William, Oliver Jim Jones William Oliver

How to find records with same value in one column but different value in another column

I have two pandas df with the exact same column names. One of these columns is named id_number which is unique to each table (What I mean is an id_number can only appear once in each df). I want to find all records that have the same id_number but have at least one different value in any column and store these records in a new pandas df.
I've tried merging (more specifically inner join), but it keeps only one record with that specific id_number so I can't look for any differences between the two dfs.
Let me provide some example to provide a clearer explanation:
Example dfs:
First DF:
id_number name type city
1 John dev Toronto
2 Alex dev Toronto
3 Tyler dev Toronto
4 David dev Toronto
5 Chloe dev Toronto
Second DF:
id_number name type city
1 John boss Vancouver
2 Alex dev Vancouver
4 David boss Toronto
5 Chloe dev Toronto
6 Kyle dev Vancouver
I want the resulting df to contain the following records:
id_number name type city
1 John dev Toronto
1 John boss Vancouver
2 Alex dev Toronto
2 Alex dev Vancouver
4 David dev Toronto
4 David Boss Toronto
NOTE: I would not want records with id_number 5 to appear in the resulting df, that is because the records with id_number 5 are exactly the same in both dfs.
In reality, there are 80 columns for each record, but I think these tables make my point a little clearer. Again to summarize, I want the resulting df to contain records with same id_numbers, but a different value in any of the other columns. Thanks in advance for any help!

Here is one way using nunique then we pick those id_number more than 1 and slice them out
s = pd.concat([df1, df2])
s = s.loc[s.id_number.isin(s.groupby(['id_number']).nunique().gt(1).any(1).loc[lambda x : x].index)]
s
Out[654]:
id_number name type city
0 1 John dev Toronto
1 2 Alex dev Toronto
3 4 David dev Toronto
0 1 John boss Vancouver
1 2 Alex dev Vancouver
2 4 David boss Toronto

Here is, a way using pd.concat, drop_duplicates and duplicated:
pd.concat([df1, df2]).drop_duplicates(keep=False).sort_values('id_number')\
.loc[lambda x: x.id_number.duplicated(keep=False)]
Output:
id_number name type city
0 1 John dev Toronto
0 1 John boss Vancouver
1 2 Alex dev Toronto
1 2 Alex dev Vancouver
3 4 David dev Toronto
2 4 David boss Toronto

How to create conditional clause if column in dataframe is empty?

I have a df that looks like this:
fname lname
joe smith
john smith
jane#jane.com
jacky /jax jack
a#a.com non
john (jack) smith
Bob J. Smith
I want to create logic that says that if lname is empty, and if there are two OR three strings in fname seperate the second string OR third string and push it into lname column. If email address in fname leave as is, and if slashes or parenthesis in the fname column and no value in lname leave as is.
new df:
fname lname
joe smith
john smith
jane#jane.com
jacky /jax jack
a#a.com non
john (jack) smith
Bob J. smith
Code so far to seperate two strings:
df[['lname']] = df['name'].loc[df['fname'].str.split().str.len() == 2].str.split(expand=True)

With the following sample dataframe:
df = pd.DataFrame({'fname': ['joe', 'john smith', 'jane#jane.com', 'jacky /jax', 'a#a.com', 'john (jack)', 'Bob J. Smith'],
'lname': ['smith', '', '', 'jack', 'non', 'smith', '']})
You can use np.where():
conditions = (df['lname']=='') & (df['fname'].str.split().str.len()>1)
df['lname'] = np.where(conditions, df['fname'].str.split().str[-1].str.lower(), df['lname'])
Yields:
fname lname
0 joe smith
1 john smith smith
2 jane#jane.com
3 jacky /jax jack
4 a#a.com non
5 john (jack) smith
6 Bob J. Smith smith
To remove the last string from the fname column of the rows that had their lname column populated:
df['fname'] = np.where(conditions, df['fname'].str.split().str[:-1].str.join(' '), df['fname'])
Yields:
fname lname
0 joe smith
1 john smith
2 jane#jane.com
3 jacky /jax jack
4 a#a.com non
5 john (jack) smith
6 Bob J. smith

If I understand correctly you have a dataframe with columns fname and lname. If so then you can modify empty rows in column lname with:
condition = (df.loc[:, 'lname'] == '') & (df.loc[:, 'fname'].str.contains(' '))
df.loc[condition, 'lname'] = df.loc[condition, 'fname'].str.split().str[-1]
The code works for the sample data you have provided in the question but should be improved to be used in more general case.
To modify column fname you may use:
df.loc[condition, 'fname'] = df.loc[condition, 'fname'].str.split().str[:-1].str.join(sep=' ')

Make Pandas Dataframe column equal to value in another Dataframe based on index

I have 3 dataframes as below
df1
id first_name surname state
1
88
190
2509
....
df2
id given_name surname state street_num
17 John Doe NY 5
88 Tom Murphy CA 423
190 Dave Casey KY 250
....
df3
id first_name family_name state car
1 John Woods NY ford
74 Tom Kite FL vw
2509 Mike Johnson KY toyota
Some id's from df1 are in df2 and others are in df3. There are also id's in df2 and df3 that are not in df1.
EDIT: there are also some id's in df1 that re not in either df2 or df3.
I want to fill the columns in df1 with the values in the dataframe containing the id. However, I do not want all columns (so i think merge is not suitable). I have tried to use the isin function but that way I could not update records individually and got an error. This was my attempt using isin:
df1.loc[df1.index.isin(df2.index), 'first_name'] = df2.given_name
Is there an easy way to do this without iterating through the dataframes checking if index matches?

I think you first need to rename your columns to align the DataFrames in concat and then reindex to filter by df1.index and df1.columns:
df21 = df2.rename(columns={'given_name':'first_name'})
df31 = df3.rename(columns={'family_name':'surname'})
df = pd.concat([df21, df31]).reindex(index=df1.index, columns=df1.columns)
print (df)
first_name surname state
d
1 John Woods NY
88 Tom Murphy CA
190 Dave Casey KY
2509 Mike Johnson KY
EDIT: If need intersection of indices only:
df4 = pd.concat([df21, df31])
df = df4.reindex(index=df1.index.intersection(df4.index), columns=df1.columns)

Merging two columns with different information, python

I have a dataframe with one column of last names, and one column of first names. How do I merge these columns so that I have one column with first and last names?
Here is what I have:
First Name (Column 1)
John
Lisa
Jim
Last Name (Column 2)
Smith
Brown
Dandy
This is what I want:
Full Name
John Smith
Lisa Brown
Jim Dandy.
Thank you!

Try
df.assign(name = df.apply(' '.join, axis = 1)).drop(['first name', 'last name'], axis = 1)
You get
name
0 bob smith
1 john smith
2 bill smith

Here's a sample df:
df
first name last name
0 bob smith
1 john smith
2 bill smith
You can do the following to combine columns:
df['combined']= df['first name'] + ' ' + df['last name']
df
first name last name combined
0 bob smith bob smith
1 john smith john smith
2 bill smith bill smith

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Duplicated rows when merging dataframes in Python - python

list_2_nodups = list_2.drop_duplicates() pd.merge(list_1 , list_2_nodups , on=['email_address']) The duplicate rows are expected. Each john smith in list_1 matches with each john smith in list_2. I had to drop the duplicates in one of the lists. I chose list_2.

DO NOT drop duplicates BEFORE the merge, but after! Best solution is do the merge and then drop the duplicates. In your case: merged_df = pd.merge(df1, df2, on=['email_address'], how='inner') merged_df.drop_duplicates(subset=['email_address'], keep='first', inplace=True, ignore_index=True)

Related

Split pandas dataframe column of type string into multiple columns based on number of ',' characters

How to find records with same value in one column but different value in another column

How to create conditional clause if column in dataframe is empty?

Make Pandas Dataframe column equal to value in another Dataframe based on index

Merging two columns with different information, python

Categories

Resources