Pandas merge two dataframes without duplicating column


My question is similar to Pandas Merge - How to avoid duplicating columns but I cannot find a solution for the specific example below.
I have a DataFrame df:
Customer Address
J. Smith 10 Sunny Rd Timbuktu
and a DataFrame emails:
Name Email
J. Smith j.smith#myemail.com
I want to merge the two dataframes to produce:
Customer Address Email
J. Smith 10 Sunny Rd Timbuktu j.smith#myemail.com
I am using the following code:
data_names = {'Name':data_col[1], ...}
mapped_name = data_names['Name']
df = df.merge(emails, how='inner', left_on='Customer', right_on=mapped_name)
The result is:
Customer Address Email Name
J. Smith 10 Sunny Rd Timbuktu j.smith#myemail.com J. Smith
While I could just delete the column named mapped_name, mapped_name could itself be 'Customer', and in that case I don't want to remove both Customer columns.
Any ideas?

I think you can rename the first column of the emails dataframe to Customer. how='inner' can be omitted because it is the default:
emails.columns = ['Customer'] + emails.columns[1:].tolist()
df = df.merge(emails, on='Customer')
print (df)
Customer Address Email
0 J. Smith 10 Sunny Rd Timbuktu j.smith#myemail.com
A similar solution to the other answer: rename the first column, selected by position with [0]:
df = df.merge(emails.rename(columns={emails.columns[0]:'Customer'}), on='Customer')
print (df)
Customer Address Email
0 J. Smith 10 Sunny Rd Timbuktu j.smith#myemail.com

You can just rename your email name column to 'Customer' and then merge. This way, you don't need to worry about dropping the column at all.
df.merge(emails.rename(columns={mapped_name:'Customer'}), how='inner', on='Customer')
Out[53]:
Customer Address Email
0 J. Smith 10 Sunny Rd Timbuktu j.smith#myemail.com
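To see how the rename approach behaves when mapped_name is already 'Customer', here is a minimal sketch. The helper name merge_on_customer is hypothetical, and the sample frames are rebuilt from the question. Renaming a column to its own name is a no-op, so both cases go through the same code path.

```python
import pandas as pd

# Sample frames rebuilt from the question
df = pd.DataFrame({"Customer": ["J. Smith"], "Address": ["10 Sunny Rd Timbuktu"]})
emails = pd.DataFrame({"Name": ["J. Smith"], "Email": ["j.smith#myemail.com"]})


def merge_on_customer(left, right, right_key):
    """Hypothetical helper: align the right-hand key name with 'Customer', then merge."""
    # rename() leaves the frame unchanged when right_key is already 'Customer'
    return left.merge(right.rename(columns={right_key: "Customer"}), on="Customer")


# Case 1: the key column in emails is called 'Name'
merged = merge_on_customer(df, emails, "Name")

# Case 2: the key column in emails is already called 'Customer'
emails_customer = emails.rename(columns={"Name": "Customer"})
same_key = merge_on_customer(df, emails_customer, "Customer")

print(merged)
print(same_key)
```

In both cases the result should contain a single Customer column, so nothing needs to be dropped afterwards.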

Related

Compare two data-frames with different column names and update first data-frame with the column from second data-frame

I am working on two data-frames which have different column names and dimensions.
The first data-frame, "df1", contains a single column "name" holding the names that need to be located in the second data-frame. If a name is matched, the value from the first column of df2, df2[0], should be returned and added to result_df.
The second data-frame, "df2", has multiple columns and no header. It contains all the possible diminutive names and full names. Any of its columns can hold the "name" that needs to be matched.
Goal: locate each name from "df1" in "df2" and, if it is matched, return the value from the first column of df2 and add it to the respective row of df1.
df1
name
ab
alex
bob
robert
bill
df2
0 1 2 3
abram ab
robert rob bob robbie
alexander alex al
william bill
result_df
name matched_name
ab abram
alex alexander
bob robert
robert robert
bill william
The code I have written so far raises an error. It needs to be efficient, since it will check millions of entries in df1 against df2:
result_df = process_name(df1, df2)
def process_name(df1, df2):
    for elem in df2.values:
        if elem in df1['name']:
            df1["matched_name"] = df2[0]

Try concat(), merge(), drop(), rename() and reset_index(). This assumes the column labels of df2 are the strings '0' to '3':
df = (pd.concat([df1.merge(df2, left_on='name', right_on=x) for x in df2.columns])
        .drop(columns=['1', '2', '3'])
        .rename(columns={'0': 'matched_name'})
        .reset_index(drop=True))
Output of df:
name matched_name
0 robert robert
1 ab abram
2 alex alexander
3 bill william
4 bob robert
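Since the question mentions millions of rows in df1, another option is to build the lookup once and then map it over the name column. This is only a sketch under some assumptions: df2 has integer column labels (no header), each spelling belongs to a single row of df2, and the sample frames below are rebuilt from the question. The variable name lookup is my own.

```python
import pandas as pd

# Sample frames rebuilt from the question; df2 has no header, so its labels are 0..3
df1 = pd.DataFrame({"name": ["ab", "alex", "bob", "robert", "bill"]})
df2 = pd.DataFrame([
    ["abram", "ab", None, None],
    ["robert", "rob", "bob", "robbie"],
    ["alexander", "alex", "al", None],
    ["william", "bill", None, None],
])

# Build a dict that maps every spelling found in a row to that row's first column
lookup = {}
for row in df2.itertuples(index=False):
    full_name = row[0]
    for alias in row:
        if pd.notna(alias):
            # setdefault keeps the first match if a spelling appears twice
            lookup.setdefault(alias, full_name)

# One pass over df1; names without a match become NaN
result_df = df1.assign(matched_name=df1["name"].map(lookup))
print(result_df)
```

Unlike the concat/merge version, this keeps the rows in the original df1 order and keeps unmatched names instead of dropping them.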

Is there a way to groupby concatenating multiple strings?

I have the following df:
Name Role Company [other columns]
John Admin GM
John Director Kodak
John Partner McDonalds
Mark Director Gerdau
Mark Partner Kibon
I want to turn it into:
Name Companies [other columns]
John GM (Admin), Kodak (Director), McDonalds (Partner)
Mark Gerdau (Director), Kibon (Partner)
I think the answer involves groupby. This question is almost there, but I need a way to combine two columns and put the parentheses in place.

If I understand correctly, use assign and groupby, joining with ', ' to match the desired output:
df1 = df.assign(companies=df['Company'] + ' (' + df['Role'] + ')')\
        .groupby('Name')['companies'].agg(', '.join)
print(df1)
Name
John GM (Admin), Kodak (Director), McDonalds (Partner)
Mark Gerdau (Director), Kibon (Partner)
Name: companies, dtype: object
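If you want a DataFrame with Name as a regular column rather than a Series indexed by Name, a sketch along these lines may help. The sample data is rebuilt from the question, and the "[other columns]" are left out because their contents are unknown. They could be added to the same agg call with whatever aggregation suits them.

```python
import pandas as pd

# Sample data rebuilt from the question (other columns omitted)
df = pd.DataFrame({
    "Name": ["John", "John", "John", "Mark", "Mark"],
    "Role": ["Admin", "Director", "Partner", "Director", "Partner"],
    "Company": ["GM", "Kodak", "McDonalds", "Gerdau", "Kibon"],
})

companies = (
    # Build the "Company (Role)" label for each row
    df.assign(Companies=df["Company"] + " (" + df["Role"] + ")")
      # as_index=False keeps Name as a column in the result
      .groupby("Name", as_index=False)
      # Named aggregation: join the labels of each group with ", "
      .agg(Companies=("Companies", ", ".join))
)
print(companies)
```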

Break up a data-set into separate excel files based on a certain row value in a given column in Pandas?

I have a fairly large dataset that I would like to split into separate excel files based on the names in column A ("Agent" column in the example provided below). I've provided a rough example of what this data-set looks like in Ex1 below.
Using pandas, what is the most efficient way to create a new excel file for each of the names in column A, or the Agent column in this example, preferably with the name found in column A used in the file title?
For example, in the given example, I would like separate files for John Doe, Jane Doe, and Steve Smith containing the information that follows their names (Business Name, Business ID, etc.).
Ex1
Agent Business Name Business ID Revenue
John Doe Bobs Ice Cream 12234 $400
John Doe Car Repair 445848 $2331
John Doe Corner Store 243123 $213
John Doe Cool Taco Stand 2141244 $8912
Jane Doe Fresh Ice Cream 9271499 $2143
Jane Doe Breezy Air 0123801 $3412
Steve Smith Big Golf Range 12938192 $9912
Steve Smith Iron Gyms 1231233 $4133
Steve Smith Tims Tires 82489233 $781
I believe python / pandas would be an efficient tool for this, but I'm still fairly new to pandas, so I'm having trouble getting started.

I would loop over the groups of names, then save each group to its own Excel file:
s = df.groupby('Agent')
for name, group in s:
    group.to_excel(f"{name}.xlsx")

Use a list comprehension with groupby on the Agent column:
dfs = [d for _, d in df.groupby('Agent')]
for d in dfs:
    print(d, '\n')
Output
Agent Business Name Business ID Revenue
4 Jane Doe Fresh Ice Cream 9271499 $2143
5 Jane Doe Breezy Air 123801 $3412
Agent Business Name Business ID Revenue
0 John Doe Bobs Ice Cream 12234 $400
1 John Doe Car Repair 445848 $2331
2 John Doe Corner Store 243123 $213
3 John Doe Cool Taco Stand 2141244 $8912
Agent Business Name Business ID Revenue
6 Steve Smith Big Golf Range 12938192 $9912
7 Steve Smith Iron Gyms 1231233 $4133
8 Steve Smith Tims Tires 82489233 $781

Grouping is what you are looking for here. You can iterate over the groups, which gives you the grouping attributes and the data associated with that group. In your case, the Agent name and the associated business columns.
Code:
import pandas as pd
# make up some data
ex1 = pd.DataFrame([['A', 1], ['A', 2], ['B', 3], ['B', 4]], columns=['letter', 'number'])
# iterate over the grouped data and export the data frames to excel workbooks
for group_name, data in ex1.groupby('letter'):
    # you probably have more complicated naming logic
    # use index=False if you have not set an index on the dataframe to avoid an extra column of indices
    data.to_excel(group_name + '.xlsx', index=False)

Use the unique values in the column to subset the data and write each subset to CSV, using the value as the file name:
import pandas as pd
for unique_val in df['Agent'].unique():
    df[df['Agent'] == unique_val].to_csv(f"{unique_val}.csv")
If you need Excel:
for unique_val in df['Agent'].unique():
    df[df['Agent'] == unique_val].to_excel(f"{unique_val}.xlsx")
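Agent names can contain spaces or characters that are awkward in file names, so it may be worth cleaning them first. Below is a rough sketch of the groupby loop with that step added. The safe_filename helper and the temporary output directory are my own assumptions, and only a few rows of the sample data are rebuilt. It writes CSV so that it runs without an Excel engine installed. Swapping in group.to_excel(path.with_suffix(".xlsx"), index=False) should give Excel files if an engine such as openpyxl is available.

```python
import re
import tempfile
from pathlib import Path

import pandas as pd

# A few rows rebuilt from the question's example
df = pd.DataFrame({
    "Agent": ["John Doe", "John Doe", "Jane Doe", "Steve Smith"],
    "Business Name": ["Bobs Ice Cream", "Car Repair", "Fresh Ice Cream", "Iron Gyms"],
    "Business ID": ["12234", "445848", "9271499", "1231233"],
    "Revenue": ["$400", "$2331", "$2143", "$4133"],
})


def safe_filename(name):
    """Hypothetical helper: replace runs of unsafe characters with underscores."""
    return re.sub(r"[^\w\-]+", "_", str(name)).strip("_")


# Assumption: write into a fresh temporary directory; use any folder you like
out_dir = Path(tempfile.mkdtemp())
written = []

# groupby yields (agent name, rows for that agent), sorted by agent name
for agent, group in df.groupby("Agent"):
    path = out_dir / f"{safe_filename(agent)}.csv"
    group.to_csv(path, index=False)
    written.append(path.name)

print(written)
```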

Matching and Joining Two Inconsistent DataFrames

I have two dataframes that are being queried off two separate databases that share common characteristics, but not always the same characteristics, and I need to find a way to reliably join the two together.
As an example:
import pandas as pd
inp = [{'Name':'Jose', 'Age':12,'Location':'Frankfurt','Occupation':'Student','Mothers Name':'Rosy'}, {'Name':'Katherine','Age':23,'Location':'Maui','Occupation':'Lawyer','Mothers Name':'Amy'}, {'Name':'Larry','Age':22,'Location':'Dallas','Occupation':'Nurse','Mothers Name':'Monica'}]
df = pd.DataFrame(inp)
print (df)
Age Location Mothers Name Name Occupation
0 12 Frankfurt Rosy Jose Student
1 23 Maui Amy Katherine Lawyer
2 22 Dallas Monica Larry Nurse
inp2 = [{'Name': '','Occupation':'Nurse','Favorite Hobby':'Basketball','Mothers Name':'Monica'},{'Name':'Jose','Occupation':'','Favorite Hobby':'Sewing','Mothers Name':'Rosy'},{'Name':'Katherine','Occupation':'Lawyer','Favorite Hobby':'Reading','Mothers Name':''}]
df2 = pd.DataFrame(inp2)
print(df2)
Favorite Hobby Mothers Name Name Occupation
0 Basketball Monica Nurse
1 Sewing Rosy Jose
2 Reading Katherine Lawyer
I need to figure out a way to reliably join these two dataframes even though the data is not always consistent. To complicate the problem further, the two dataframes do not always have the same length. Any ideas?

You can perform the merge on each possible pair of key columns, concat the results, then merge that new dataframe onto the first (complete) dataframe:
# do your three possible merges on 'Mothers Name', 'Name', and 'Occupation'
# then concat your dataframes
new_df = pd.concat([df.merge(df2, on=['Mothers Name', 'Name']),
                    df.merge(df2, on=['Name', 'Occupation']),
                    df.merge(df2, on=['Mothers Name', 'Occupation'])], sort=False)
# take the first dataframe, which is complete, and merge with your new_df and drop dups
df.merge(new_df[['Age', 'Location', 'Favorite Hobby']], on=['Age', 'Location']).drop_duplicates()
Age Location Mothers Name Name Occupation Favorite Hobby
0 12 Frankfurt Rosy Jose Student Sewing
2 23 Maui Amy Katherine Lawyer Reading
4 22 Dallas Monica Larry Nurse Basketball
This assumes that each row's combination of Age and Location is unique.

Fill pandas dataframe rows from values in another dataframe rows

I have two pandas dataframes as given below:
df1
Name City Postal_Code State
James Phoenix 85003 AZ
John Scottsdale 85259 AZ
Jeff Phoenix 85003 AZ
Jane Scottsdale 85259 AZ
df2
Postal_Code Income Category
85003 41038 Two
85259 104631 Four
I would like to insert two columns, Income and Category, to df1 by capturing the values for Income and Category from df2 corresponding to the postal_code for each row in df1.
The closest question that I could find in SO was this - Fill DataFrame row values based on another dataframe row's values pandas. But, the pd.merge solution does not solve the problem for me. Specifically, I used
pd.merge(df1,df2,on='postal_code',how='outer')
All I got was NaN values in the two new columns. I am not sure whether this is because df1 and df2 have different numbers of rows. Any suggestions to solve this problem?

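Use how='inner', which keeps only the keys that exist in both dataframes. NaN values in the new columns usually mean the keys did not match, for example because Postal_Code is stored as a string in one dataframe and as an integer in the other, so convert both to the same dtype first. Note also that the column is named Postal_Code, not postal_code: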
df1.Postal_Code = df1.Postal_Code.astype(int)
df2.Postal_Code = df2.Postal_Code.astype(int)
df1.merge(df2,on='Postal_Code',how='inner')
Name City Postal_Code State Income Category
0 James Phoenix 85003 AZ 41038 Two
1 Jeff Phoenix 85003 AZ 41038 Two
2 John Scottsdale 85259 AZ 104631 Four
3 Jane Scottsdale 85259 AZ 104631 Four
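If you need every row of df1 kept in its original order, including postal codes that have no entry in df2, how='left' is another option. The sketch below rebuilds the sample data from the question and assumes, purely for illustration, that Postal_Code was read as a string in df1 and as an integer in df2.

```python
import pandas as pd

# Sample data rebuilt from the question.
# Assumption for illustration: df1 holds the postal code as text, df2 as integers.
df1 = pd.DataFrame({
    "Name": ["James", "John", "Jeff", "Jane"],
    "City": ["Phoenix", "Scottsdale", "Phoenix", "Scottsdale"],
    "Postal_Code": ["85003", "85259", "85003", "85259"],
    "State": ["AZ", "AZ", "AZ", "AZ"],
})
df2 = pd.DataFrame({
    "Postal_Code": [85003, 85259],
    "Income": [41038, 104631],
    "Category": ["Two", "Four"],
})

# Align the key dtype before merging so that equal postal codes actually match
df1["Postal_Code"] = df1["Postal_Code"].astype(int)

# how='left' keeps all df1 rows in their original order;
# postal codes missing from df2 would get NaN in Income and Category
result = df1.merge(df2, on="Postal_Code", how="left")
print(result)
```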
