Pandas: Merge two dataframes with duplicate rows - python

Short Question
Within Pandas, what's the most convenient way to merge two dataframes, such that all entries in the left dataframe receive the first matching value from the right dataframe?
Longer Question
Say I have two spreadsheets: people.csv and orders.csv. people.csv contains several columns of information about each person, whereas orders.csv contains the person's full name and the number of orders that person placed.
I need to create a third csv, output.csv, which contains all of the columns from people.csv plus a column from orders.csv, matched on a name column present in both spreadsheets (called "FULL_NAME" in one and "CUSTOMER_FULL_NAME" in the other).
people.csv is sorted on the FULL_NAME field, but contains duplicate rows, such that there are multiple rows with "John Smith" in the FULL_NAME column. There are also duplicate rows within orders.csv but not the same number of duplicates (for example, people.csv may have 4 John Smith entries, but orders.csv may only have two).
If I use the following code:
import pandas as pd

people = pd.read_csv('people.csv')
orders = pd.read_csv('orders.csv')
full = pd.merge(
    people,
    orders,
    left_on='FULL_NAME',
    right_on='CUSTOMER_FULL_NAME',
)
full.to_csv("output.csv")
... I get a CSV where only two of the rows with "John Smith" in the FULL_NAME field have John Smith's number of orders. The rows directly below have no value in that field. That's because orders.csv only contained two matching rows for John Smith, whereas people.csv had 4.
Is there a convenient method within Pandas to set the value of one column to the first matching column in the other dataframe, such that all 4 entries contain the first matching value from orders.csv?
EDIT
Full current version of my script, which still returns a CSV containing rows not filled with the expected values:
import pandas as pd
community = pd.read_csv("orders.csv")
full = pd.read_csv("people.csv")
result = pd.merge(
    full,
    community.drop_duplicates(subset=['FULL_NAME'], keep='first'),
    left_on="CUSTOMER_FULL_NAME",
    right_on="FULL_NAME",
    how='left',
)
result.to_csv("output.csv")
So I think I'm missing something else here, because some of the rows are matching in the expected way. Here's an example from the output file:
ID    FULL_NAME   EMPLOYER  DIVISION           ORDER #
7350  John Smith  RiteAid   Clinical Research  25
7351  John Smith  RiteAid   Clinical Research  25
7352  John Smith  Costco    Sales
7353  John Smith  Costco    Sales
This John Smith doesn't have duplicate rows within the orders.csv file, so I do think the approach is partly working, since two of the rows got the value. However, I didn't get a match on the John Smith rows that list Costco rather than RiteAid (or that differ in other fields). That surprises me, since I thought the match was only on the FULL_NAME field.
Any ideas on why the other rows might not be filled in?

You can use drop_duplicates with subset=['CUSTOMER_FULL_NAME'] on the right dataframe, together with how='left', to keep all rows from people:
full = pd.merge(
    people,
    orders.drop_duplicates(subset=['CUSTOMER_FULL_NAME'], keep='first'),  # here is the difference
    left_on='FULL_NAME',
    right_on='CUSTOMER_FULL_NAME',
    how='left'  # and add how='left'
)
So orders.drop_duplicates(subset=['CUSTOMER_FULL_NAME'], keep='first') contains each name only once, and during the merge every row from people matches against that single row.
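If that single order column is all you need from orders.csv, an alternative sketch is to build a de-duplicated lookup Series and map it onto people, which also gives every people row the first matching value. This assumes the order count column is named 'ORDER #', as in the sample output; adjust to your real column name:

import pandas as pd

people = pd.read_csv('people.csv')
orders = pd.read_csv('orders.csv')

# Name -> first order count, keeping only the first row per customer name.
order_lookup = (
    orders.drop_duplicates(subset=['CUSTOMER_FULL_NAME'], keep='first')
          .set_index('CUSTOMER_FULL_NAME')['ORDER #']
)

# Every people row gets the first matching value; names with no order stay NaN.
people['ORDER #'] = people['FULL_NAME'].map(order_lookup)
people.to_csv('output.csv', index=False)

Unlike a plain merge, this never adds rows, so duplicates on both sides cannot multiply.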

Related

How do I coalesce Pandas columns only where the beginnings of the columns don't match?

I have a table with some company information that we're trying to clean up. In the first column is a clean company name, but not necessarily the correct one. In the second column, there is the correct company name, but often not very clean / missing. Here is an example.
Name       Info
Nike       Nike, a footwear manufacturer is headquartered in Oregon.
ASG Shoes  Reebok
Adidas     None
We're working with this dataset in Pandas. We'd like to follow the rules below.
1. If the Name column is equal to the left side of the Info column, keep the Name column. We would like this to be dynamic with the length of the Name value: for "Nike" it should check the first 4 characters of the Info column, for "ASG Shoes" it should check the first 9 characters.
2. If rule 1 is false, use the Info column.
3. If Info is None, use the Name column.
The output we seek is a 3rd column that is the output of these rules. I am hoping someone can help me with writing this code in an efficient manner. There's a lot going on here and I want to ensure I'm doing this properly. How can I achieve this output with the most efficient Python code possible?
Name       Info                                                        Clean
Nike       Nike, a footwear manufacturer is headquartered in Oregon.  Nike
ASG Shoes  Reebok                                                      Reebok
Adidas     None                                                        Adidas
You can start by creating another column that contains the length of your Name column; this is really straightforward. Let us call the new column Slicers. You can then create a function that slices a string by a certain number and map it over your columns Info and Slicers, where Info is the string column to be sliced and Slicers defines the slicing length (there may even be a pandas implementation for this, but I do not know of one). After that, you can compare the sliced Info with your Name column and assign all matches to your Clean column. Then just apply a pandas coalesce over your desired columns.
The code implementation is given below:
import pandas as pd

def slicer(strings, slicers):
    return strings[:slicers] if isinstance(strings, str) else strings

df = pd.DataFrame({
    "Name": ["Nike", "ASG Shoes", "Adidas"],
    "Info": ["Nike, a footwear manufacturer is headquartered in Oregon.", "Reebok", None]
})

# Define length column
df["Slicers"] = df["Name"].str.len()
# Slice Info column by length column and overwrite
df["Slicers"] = list(map(slicer, df["Info"], df["Slicers"]))
# Check whether sliced Info column and Name column are equal
mask = df["Name"].eq(df["Slicers"])
# Overwrite if they are equal
df.loc[mask, "Clean"] = df.loc[mask, "Name"]
# Apply coalesce
coalesce_rules = ["Clean", "Info", "Name"]
df.drop(columns=["Slicers"]).assign(Clean=df[coalesce_rules].fillna(method="bfill", axis=1).iloc[:, 0])
Output:
Name Info Clean
0 Nike Nike, a footwear manufacturer is headquartered... Nike
1 ASG Shoes Reebok Reebok
2 Adidas None Adidas
It only needs around five seconds for 3 million rows. Obviously, I do not know whether this is the most efficient way to solve your problem, but I think it is an efficient one.
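For comparison, a shorter sketch of the same three rules that skips the intermediate Slicers column, assuming the same df as above. It uses str.startswith for the length-aware prefix check, so it is an alternative rather than the exact method shown above:

import numpy as np

# True where Info starts with the row's Name (a missing Info counts as no match).
starts = [
    isinstance(info, str) and info.startswith(name)
    for name, info in zip(df["Name"], df["Info"])
]

# Rule 1: keep Name on a prefix match; otherwise use Info, falling back to Name when Info is missing.
df["Clean"] = np.where(starts, df["Name"], df["Info"].fillna(df["Name"]))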

How can I compare two pandas dataframes when the rows of one dataframe are a substring of the rows of another dataframe?

I have two dataframes whose rows correspond to each other: each name in one dataframe is a substring of the matching name in the other. An example can be seen with this:
Donald  Bobby  Christian
7:15    6:43   5:05

Don   Bob   Chris
7:13  6:46  5:00
Essentially, Don is a substring of Donald, Bob a substring of Bobby, etc., and these rows correspond to each other. I want to figure out a way to compare the corresponding rows (like how 7:15 for Donald is different from 7:13 for Don). Ideally, I would get a new dataframe that puts each string and substring next to each other along with the corresponding values (it would look like Donald, Don, 7:15, 7:13). To be clear, Donald/Don, Bobby/Bob, Christian/Chris are all rows and not columns; the columns would be something like Name and Mile Time, and each row is a name and a time. I am currently thinking of iterating through the dataframe with the shorter names by doing
for index, rows in df2.iterrows():
    substring = rows['Name']
    data = df1.loc[df1['Name'].str.contains(substring, case=False)]
    if data is not None:
        name1 = data['Name']
        time1 = data['Mile Time']
        name2 = rows['Name']
        time2 = rows['Mile Time']
But I'm not entirely sure how many columns each row will have, because there might be multiple attributes like age, sex, etc., so I'm not sure how to go about collecting that data and putting it into something meaningful to compare the two visually. Ideally it goes into a new dataframe and is converted to CSV so I can see the differences side by side.
EDIT: Posting with DataFrames and Desired Output
full_name_dict = {'Name': ['Donald', 'Bobby', 'Christian'], 'Mile Time': ['7:15', '6:43', '5:05']}
substring_dict = {'Name': ['Don', 'Bob', 'Chris'], 'Mile Time': ['7:13', '6:46', '5:00']}
df1 = pd.DataFrame(full_name_dict)
df2 = pd.DataFrame(substring_dict)
Desired Output...
output_wanted = {'Name1': ['Donald', 'Bobby', 'Christian'], 'Name2': ['Don', 'Bob', 'Chris'],
                 'Time1': ['7:15', '6:43', '5:05'], 'Time2': ['7:13', '6:46', '5:00']}
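One way to get that shape, sketched under the assumption that each short name matches exactly one full name by prefix (any extra columns such as age or sex could be added to the row dict the same way):

import pandas as pd

df1 = pd.DataFrame({'Name': ['Donald', 'Bobby', 'Christian'],
                    'Mile Time': ['7:15', '6:43', '5:05']})
df2 = pd.DataFrame({'Name': ['Don', 'Bob', 'Chris'],
                    'Mile Time': ['7:13', '6:46', '5:00']})

rows = []
for _, short in df2.iterrows():
    # First full-name row that starts with the short name (case-insensitive).
    match = df1[df1['Name'].str.lower().str.startswith(short['Name'].lower())]
    if not match.empty:
        full = match.iloc[0]
        rows.append({'Name1': full['Name'], 'Name2': short['Name'],
                     'Time1': full['Mile Time'], 'Time2': short['Mile Time']})

comparison = pd.DataFrame(rows)
comparison.to_csv('comparison.csv', index=False)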

Find which column has unique values that can help distinguish the rows with Pandas

I have the following dataframe, which contains 2 rows:
index name food color number year hobby music
0 Lorenzo pasta blue 5 1995 art jazz
1 Lorenzo pasta blue 3 1995 art jazz
I want to write code that will be able to tell me which column is the one that can distinguish between these two rows.
For example, in this dataframe, the column "number" is the one that distinguishes between the two rows.
Until now I have done this very simply by just going over the columns one by one using iloc and looking at the values:
duplicates.iloc[:,3]
>>>
0 blue
1 blue
It's important to take into account that:
This should be in a for loop, each time checking a newly generated dataframe.
There may be more than 2 rows which I need to check.
There may be more than 1 column that can distinguish between the rows.
I thought that the way to check such a thing would be something like taking one column at a time, getting its unique values and checking whether they are all equal to each other, similar to this:
for n in np.arange(0, len(df.columns)):
    tmp = df.iloc[:, n]
and then comparing whether all the values in the temporary column are equal to each other, but here I got stuck, because sometimes I have many rows to compare.
My end goal: to be able to check, inside a for loop, which column has different values across the rows of the temporary dataframe, and hence can help distinguish between the rows.
You can apply the duplicated method on all columns:
s = df.apply(pd.Series.duplicated).any()
s[~s].index
Output: ['number']
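An equivalent sketch is to count distinct values per column; any column with more than one unique value distinguishes the rows (note that nunique skips NaN by default):

# Columns whose values are not all identical across the rows being compared.
distinguishing = df.columns[df.nunique() > 1]
print(list(distinguishing))  # ['number']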

Vlookup based on multiple columns in Python DataFrame

I have two dataframes. I am trying to Vlookup the 'Hobby' column from the 2nd dataframe and update the 'Interests' column of the 1st dataframe. Please note that the columns Key, Employee and Industry should match exactly between the two dataframes, but for City it is acceptable if only the first part of the city name matches. Though it is straightforward in Excel, it looks a bit complicated to implement in Python. Any pointers on how to proceed would be really helpful. (Please see the expected output below.)
import pandas as pd

data1 = [['AC32456', 'NYC-URBAN', 'JON', 'BANKING', 'SINGING'],
         ['AD45678', 'WDC-RURAL', 'XING', 'FINANCE', 'DANCING'],
         ['DE43216', 'LONDON-URBAN', 'EDWARDS', 'IT', 'READING'],
         ['RT45327', 'SINGAPORE-URBAN', 'WOLF', 'SPORTS', 'WALKING'],
         ['Rs454457', 'MUMBAI-RURAL', 'NEMBIAR', 'IT', 'ZUDO']]
data2 = [['AH56245', 'NYC', 'MIKE', 'BANKING', 'BIKING'],
         ['AD45678', 'WDC', 'XING', 'FINANCE', 'TREKKING'],
         ['DE43216', 'LONDON-URBAN', 'EDWARDS', 'FINANCE', 'SLEEPING'],
         ['RT45327', 'SINGAPORE', 'WOLF', 'SPORTS', 'DANCING'],
         ['RS454457', 'MUMBAI', 'NEMBIAR', 'IT', 'ZUDO']]
List1 = ['Key', 'City', 'Employee', 'Industry', 'Interests']
List2 = ['Key', 'City', 'Employee', 'Industry', 'Hobby']
df1 = pd.DataFrame(data1, columns=List1)
df2 = pd.DataFrame(data2, columns=List2)
Set the index of df1 to the key columns (you can set the index to whatever you want to match on) and then use update:
# get the first part of the city
df1['City_key'] = df1['City'].str.split('-', expand=True)[0]
df2['City_key'] = df2['City'].str.split('-', expand=True)[0]
# set index
df1 = df1.set_index(['Key', 'Employee', 'Industry', 'City_key'])
# update
df1['Interests'].update(df2.set_index(['Key', 'Employee', 'Industry', 'City_key'])['Hobby'])
# reset index and drop the City_key column
new_df = df1.reset_index().drop(columns=['City_key'])
Key Employee Industry City Interests
0 AC32456 JON BANKING NYC-URBAN SINGING
1 AD45678 XING FINANCE WDC-RURAL TREKKING
2 DE43216 EDWARDS IT LONDON-URBAN READING
3 RT45327 WOLF SPORTS SINGAPORE-URBAN DANCING
4 Rs454457 NEMBIAR IT MUMBAI-RURAL ZUDO

Pandas - Split and refactor an overloaded ID column

I have a pandas DataFrame with columns patient_id, patient_sex, patient_dob (and other less relevant columns). Rows can have duplicate patient_ids, as each patient may have more than one entry in the data for multiple medical procedures. I discovered, however, that a great many of the patient_ids are overloaded, i.e. more than one patient has been assigned to the same id (evidenced by many instances of a single patient_id being associated with multiple sexes and multiple days of birth).
To refactor the ids so that each patient has a unique one, my plan was to group the data not only by patient_id, but by patient_sex and patient_dob as well. I figure this must be sufficient to separate the data into individual patients (and if two patients with the same sex and dob just happened to be assigned the same id, then so be it).
Here is the code I currently use:
# I just use first() here as a way to aggregate the groups into a DataFrame.
# Bonus points if you have a better solution!
indv_patients = patients.groupby(['patient_id', 'patient_sex', 'patient_dob']).first()

# Create unique ids
new_patient_id = 'new_patient_id'
for index, row in indv_patients.iterrows():
    # index is a tuple of the three column values, so this should get me a unique
    # patient id for each patient
    indv_patients.loc[index, new_patient_id] = str(hash(index))

# Merge new ids into original patients frame
patients_with_new_ids = patients.merge(indv_patients, left_on=['patient_id', 'patient_sex', 'patient_dob'], right_index=True)

# Remove byproduct columns, and the original id column
drop_columns = [col for col in patients_with_new_ids.columns
                if col not in patients.columns and col != new_patient_id]
drop_columns.append('patient_id')
patients_with_new_ids = patients_with_new_ids.drop(columns=drop_columns)
patients = patients_with_new_ids.rename(columns={new_patient_id: 'patient_id'})
The problem is that with over 7 million patients, this is way too slow a solution, the biggest bottleneck being the for-loop. So my question is: is there a better way to fix these overloaded ids? (The actual id doesn't matter, so long as it's unique for each patient.)
I don't know what the values for the columns are but have you tried something like this?
patients['new_patient_id'] = patients.apply(lambda x: x['patient_id'] + x['patient_sex'] + x['patient_dob'],axis=1)
This should create a new column, and you can then use groupby with the new_patient_id.
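If the row-wise apply is too slow at that scale, a vectorized sketch is groupby(...).ngroup(), which assigns one integer label per unique key combination without a Python-level loop (rows with missing values in the key columns would need extra handling):

# One integer id per unique (patient_id, patient_sex, patient_dob) combination.
patients['new_patient_id'] = (
    patients.groupby(['patient_id', 'patient_sex', 'patient_dob']).ngroup()
)

# Optionally replace the old id outright.
patients = (
    patients.drop(columns=['patient_id'])
            .rename(columns={'new_patient_id': 'patient_id'})
)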
