Matching and Joining Two Inconsistent DataFrames - python

I have two dataframes queried from two separate databases. They share common characteristics, but not always the same ones, and I need to find a way to reliably join the two together.
As an example:
import pandas as pd
inp = [{'Name':'Jose', 'Age':12,'Location':'Frankfurt','Occupation':'Student','Mothers Name':'Rosy'}, {'Name':'Katherine','Age':23,'Location':'Maui','Occupation':'Lawyer','Mothers Name':'Amy'}, {'Name':'Larry','Age':22,'Location':'Dallas','Occupation':'Nurse','Mothers Name':'Monica'}]
df = pd.DataFrame(inp)
print (df)
   Age   Location  Mothers Name       Name  Occupation
0   12  Frankfurt          Rosy       Jose     Student
1   23       Maui           Amy  Katherine      Lawyer
2   22     Dallas        Monica      Larry       Nurse
inp2 = [{'Name': '','Occupation':'Nurse','Favorite Hobby':'Basketball','Mothers Name':'Monica'},{'Name':'Jose','Occupation':'','Favorite Hobby':'Sewing','Mothers Name':'Rosy'},{'Name':'Katherine','Occupation':'Lawyer','Favorite Hobby':'Reading','Mothers Name':''}]
df2 = pd.DataFrame(inp2)
print(df2)
   Favorite Hobby  Mothers Name       Name  Occupation
0      Basketball        Monica                  Nurse
1          Sewing          Rosy       Jose
2         Reading                Katherine      Lawyer
I need to figure out a way to reliably join these two dataframes even though the data is not always consistent. To further complicate the problem, the two databases are not always the same length. Any ideas?

You can perform your merge on the possible column combinations, concat those dataframes, and then merge the resulting dataframe with the first (complete) dataframe:
# do your three possible merges on 'Mothers Name', 'Name', and 'Occupation'
# then concat your dataframes
new_df = pd.concat([df.merge(df2, on=['Mothers Name', 'Name']),
                    df.merge(df2, on=['Name', 'Occupation']),
                    df.merge(df2, on=['Mothers Name', 'Occupation'])], sort=False)
# take the first dataframe, which is complete, and merge with your new_df and drop dups
df.merge(new_df[['Age', 'Location', 'Favorite Hobby']], on=['Age', 'Location']).drop_duplicates()
Age Location Mothers Name Name Occupation Favorite Hobby
0 12 Frankfurt Rosy Jose Student Sewing
2 23 Maui Amy Katherine Lawyer Reading
4 22 Dallas Monica Larry Nurse Basketball
This assumes that each row's age and location combination is unique.
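If the blanks in df2 are empty strings (as in the example) rather than NaN, it may be safer to convert them first so that a partial merge can never match on an empty key. A minimal sketch of that idea, reusing the same three key pairs:

import numpy as np

# treat empty strings as missing so they can never act as join keys
df2_clean = df2.replace('', np.nan)

key_pairs = [['Mothers Name', 'Name'],
             ['Name', 'Occupation'],
             ['Mothers Name', 'Occupation']]

# only merge on a key pair where both keys are actually present in the df2 row
partial_merges = [df.merge(df2_clean.dropna(subset=pair), on=pair)
                  for pair in key_pairs]
new_df = pd.concat(partial_merges, sort=False)

The final merge back onto df on ['Age', 'Location'] then works exactly as above.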

Related

Merge two tables and create a column based on the similarity between two columns

I have two tables, A and B. I want to merge table A into table B in a loop and create new columns.
Input
Table A:
name     year
James    one
Jackson  one
Jenny    two
Himeth   three
Carlos   three
Mendy    one
James    two
Himeth   one
Table B:
name     subject    group
Jenny    Physics
Jackson  Maths
Jenny    PE
Himeth   Chemistry
Carlos   Physics
Mendy    German
James    Physics
Himeth   Chemistry
Output
Name     subject    group   year_one  year_two  year_three
Jenny    PE         Yellow            two
Jackson  Maths      Green   one
James    Physics    Yellow  one       two
Himeth   Chemistry  Yellow  one                 three
Carlos   Physics    Green                       three
Mendy    German     Yellow  one
I want to add the year column from Table A, but create separate columns for each year (one, two, three) because there are duplicate names.
years = ["one", "two", "three"]
for i in years:
    df.to_csv("student.csv")
    dfs = pd.merge(df, students,
                   on="Name",
                   how="left")
    dfs[i] = i
    dfs.to_csv("class_setup.csv")
I'm not sure how to loop over the merge and create a new column for each year.
Use DataFrame.pivot with DataFrame.join:
df = df2.join(df1.pivot('name','year','year'), on='name').fillna('')
print (df)
      name  one  three  two
0    Jenny              two
1  Jackson  one
2    Jenny              two
3   Himeth  one  three
4   Carlos       three
5    Mendy  one
6    James  one         two
7   Himeth  one  three
If order is important:
years = ["one", "two", "three"]
df = df2.join(df1.pivot('name','year','year').reindex(years, axis=1), on='name').fillna('')
print (df)
      name  one  two  three
0    Jenny       two
1  Jackson  one
2    Jenny       two
3   Himeth  one       three
4   Carlos            three
5    Mendy  one
6    James  one  two
7   Himeth  one       three
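One caveat: in recent pandas versions the index, columns and values arguments of DataFrame.pivot are keyword-only, so the positional calls above may raise a TypeError there. An equivalent keyword form of the ordered variant would be:

wide = df1.pivot(index='name', columns='year', values='year')
df = df2.join(wide.reindex(["one", "two", "three"], axis=1), on='name').fillna('')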

Faster way to query & compute in Pandas [duplicate]

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 2 years ago.
I have two dataframes in Pandas. What I want achieve is, grab every 'Name' from DF1 and get the corresponding 'City' and 'State' present in DF2.
For example, 'Dwight' from DF1 should return corresponding values 'Miami' and 'Florida' from DF2.
DF1
Name Age Student
0 Dwight 20 Yes
1 Michael 30 No
2 Pam 55 No
. . . .
70000 Jim 27 Yes
DF1 has approx 70,000 rows with 3 columns
Second Dataframe, DF2 has approx 320,000 rows.
Name City State
0 Dwight Miami Florida
1 Michael Scranton Pennsylvania
2 Pam Austin Texas
. . . . .
325082 Jim Scranton Pennsylvania
Currently I have two functions, which return the values of 'City' and 'State' using a filter.
def read_city(id):
    filt = (df2['Name'] == id)
    if filt.any():
        field = (df2[filt]['City'].values[0])
    else:
        field = ""
    return field

def read_state(id):
    filt = (df2['Name'] == id)
    if filt.any():
        field = (df2[filt]['State'].values[0])
    else:
        field = ""
    return field
I am using the apply function to process all the values.
df['city_list'] = df['Name'].apply(read_city)
df['State_list'] = df['Name'].apply(read_state)
The result takes a long time to compute in the above way. It roughly takes me around 18 minutes to get back the df['city_list'] and df['State_list'].
Is there a faster way to compute this? Since I am completely new to pandas, I would like to know if there is an efficient way to do it.
I believe you can do a map:
s = df2.groupby('Name')[['City','State']].agg(list)
df['city_list'] = df['Name'].map(s['City'])
df['State_list'] = df['Name'].map(s['State'])
Or a left merge after you got s:
df = df.merge(s.add_suffix('_list'), left_on='Name', right_index=True, how='left')
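If every Name occurs at most once in df2, the groupby/agg(list) step can be skipped entirely and the lookup becomes a plain map against an indexed frame. A sketch under that uniqueness assumption (keeping the first occurrence, as the original filter-based functions did):

# index df2 by Name once, then map both columns
lookup = df2.drop_duplicates('Name').set_index('Name')
df['city_list'] = df['Name'].map(lookup['City'])
df['State_list'] = df['Name'].map(lookup['State'])

Names missing from df2 come back as NaN rather than the empty string the original functions returned.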
I think you can do something like this:
# Dataframe DF1 (dummy data)
DF1 = pd.DataFrame(columns=['Name', 'Age', 'Student'], data=[['Dwight', 20, 'Yes'], ['Michael', 30, 'No'], ['Pam', 55, 'No'], ['Jim', 27, 'Yes']])
print("DataFrame DF1")
print(DF1)
# Dataframe DF2 (dummy data)
DF2 = pd.DataFrame(columns=['Name', 'City', 'State'], data=[['Dwight', 'Miami', 'Florida'], ['Michael', 'Scranton', 'Pennsylvania'], ['Pam', 'Austin', 'Texas'], ['Jim', 'Scranton', 'Pennsylvania']])
print("DataFrame DF2")
print(DF2)
# You do a merge on 'Name' column and then, you change the name of columns 'City' and 'State'
df = pd.merge(DF1, DF2, on=['Name']).rename(columns={'City': 'city_list', 'State': 'State_list'})
print("DataFrame final")
print(df)
Output:
DataFrame DF1
Name Age Student
0 Dwight 20 Yes
1 Michael 30 No
2 Pam 55 No
3 Jim 27 Yes
DataFrame DF2
Name City State
0 Dwight Miami Florida
1 Michael Scranton Pennsylvania
2 Pam Austin Texas
3 Jim Scranton Pennsylvania
DataFrame final
Name Age Student city_list State_list
0 Dwight 20 Yes Miami Florida
1 Michael 30 No Scranton Pennsylvania
2 Pam 55 No Austin Texas
3 Jim 27 Yes Scranton Pennsylvania
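One thing to watch with a plain merge here: DF2 is much larger than DF1, and if a Name appears more than once in DF2 the merge will duplicate that DF1 row. Dropping duplicate names first (keeping the first match, which is what the original apply-based functions effectively did) avoids that. A sketch under that assumption:

# keep only the first City/State seen for each Name, then left-merge
DF2_first = DF2.drop_duplicates(subset='Name', keep='first')
df = DF1.merge(DF2_first, on='Name', how='left').rename(
    columns={'City': 'city_list', 'State': 'State_list'})

With how='left', names that never appear in DF2 are kept with NaN instead of being dropped.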

Replacing values from one dataframe to another

I'm trying to fix discrepancies in a column of one df using a column in another.
The tables are not sorted in the same order, either.
How can I do this using Python? Example:
df1
Age Name
40 Sid Jones
50 Alex, Bot
32 Tony Jar
65 Fred, Smith
24 Brad, Mans
df2
Age Name
24 Brad Mans
32 Tony Jar
40 Sid Jones
65 Fred Smith
50 Alex Bot
I need to replace the values in df2 to match those in df1. As you can see in my example, the discrepancies are the commas in the names.
Expected outcome for df2:
Age Name
24 Brad, Mans
32 Tony Jar
40 Sid Jones
65 Fred, Smith
50 Alex, Bot
The values in df2 should be changed to match df1's values.
Create a column in df1 with commas removed from the Name column
df1['Name_nocomma'] = df1.Name.str.replace(',', '')
Merge df1 into df2 using Name_nocomma and Name to get the corrected Name, creating a new version of df2:
df2_out = df2.merge(df1, left_on='Name', right_on='Name_nocomma', how='left')[['Age_x', 'Name_x', 'Name_y']]
use combine_first to coalesce Name_y & Name_x into a new column Name
df2_out['Name'] = df2_out.Name_y.combine_first(df2_out.Name_x)
drop / rename the intermediate columns
del df1['Name_nocomma']
del df2_out['Name_x']
del df2_out['Name_y']
df2_out.rename({'Age_x': 'Age'}, axis=1, inplace=True)
df2_out
#outputs:
Age Name
0 24 Brad Mans
1 32 Tony Jar
2 40 Sid Jones
3 65 Fred Smith
4 50 Alex Bot
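A more compact alternative, assuming the only discrepancy really is the commas, is to build a lookup from the comma-stripped name to the original df1 name and map df2 through it:

# map each comma-stripped name back to its original form in df1
stripped = df1['Name'].str.replace(',', '', regex=False)
name_lookup = df1.set_index(stripped)['Name']
df2['Name'] = df2['Name'].str.replace(',', '', regex=False).map(name_lookup)

Names in df2 that have no counterpart in df1 would come back as NaN, so this assumes the two frames cover the same people.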
You need to sort and append:
df1.sort_values(by=['Age'], inplace=True)
df2.sort_values(by=['Age'], inplace=True)
result_df = df1.append(df2)
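Note that DataFrame.append was deprecated and removed in pandas 2.0, so on current versions the equivalent is pd.concat:

result_df = pd.concat([df1, df2], ignore_index=True)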

Make Pandas Dataframe column equal to value in another Dataframe based on index

I have 3 dataframes as below
df1
id first_name surname state
1
88
190
2509
....
df2
id given_name surname state street_num
17 John Doe NY 5
88 Tom Murphy CA 423
190 Dave Casey KY 250
....
df3
id first_name family_name state car
1 John Woods NY ford
74 Tom Kite FL vw
2509 Mike Johnson KY toyota
Some id's from df1 are in df2 and others are in df3. There are also id's in df2 and df3 that are not in df1.
EDIT: there are also some id's in df1 that are not in either df2 or df3.
I want to fill the columns in df1 with the values from the dataframe containing that id. However, I do not want all columns (so I think merge is not suitable). I have tried to use the isin function, but that way I could not update records individually and got an error. This was my attempt using isin:
df1.loc[df1.index.isin(df2.index), 'first_name'] = df2.given_name
Is there an easy way to do this without iterating through the dataframes checking if index matches?
I think you first need to rename your columns to align the DataFrames in concat and then reindex to filter by df1.index and df1.columns:
df21 = df2.rename(columns={'given_name':'first_name'})
df31 = df3.rename(columns={'family_name':'surname'})
df = pd.concat([df21, df31]).reindex(index=df1.index, columns=df1.columns)
print (df)
first_name surname state
id
1 John Woods NY
88 Tom Murphy CA
190 Dave Casey KY
2509 Mike Johnson KY
EDIT: If need intersection of indices only:
df4 = pd.concat([df21, df31])
df = df4.reindex(index=df1.index.intersection(df4.index), columns=df1.columns)
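If the goal is to fill df1 in place rather than build a new frame, DataFrame.update can also be used once the frames are aligned. A sketch, assuming (as in the answer above) that all three frames are indexed by id and that no id appears in both df2 and df3:

# update aligns on index and columns, ignores extra columns such as
# street_num / car, and leaves ids that appear only in df1 untouched
df1.update(pd.concat([df21, df31]))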

Fill pandas dataframe rows from values in another dataframe rows

I have two pandas dataframes as given below:
df1
Name City Postal_Code State
James Phoenix 85003 AZ
John Scottsdale 85259 AZ
Jeff Phoenix 85003 AZ
Jane Scottsdale 85259 AZ
df2
Postal_Code Income Category
85003 41038 Two
85259 104631 Four
I would like to insert two columns, Income and Category, to df1 by capturing the values for Income and Category from df2 corresponding to the postal_code for each row in df1.
The closest question that I could find in SO was this - Fill DataFrame row values based on another dataframe row's values pandas. But, the pd.merge solution does not solve the problem for me. Specifically, I used
pd.merge(df1,df2,on='postal_code',how='outer')
All I got was NaN values in the two new columns. Not sure whether this is because the number of rows in df1 and df2 differs. Any suggestions to solve this problem?
You just have the wrong how; use 'inner' instead, which matches only where keys exist in both dataframes:
df1.Postal_Code = df1.Postal_Code.astype(int)
df2.Postal_Code = df2.Postal_Code.astype(int)
df1.merge(df2,on='Postal_Code',how='inner')
Name City Postal_Code State Income Category
0 James Phoenix 85003 AZ 41038 Two
1 Jeff Phoenix 85003 AZ 41038 Two
2 John Scottsdale 85259 AZ 104631 Four
3 Jane Scottsdale 85259 AZ 104631 Four
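If you also want to keep the df1 rows whose Postal_Code has no match in df2 (with NaN in the two new columns), use a left join instead:

# keep every row of df1; unmatched postal codes get NaN for Income/Category
df1.merge(df2, on='Postal_Code', how='left')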
