compare two dataframes and get nearest matching dataframe - python

I have two dataframes with the following columns:

df1

name  cell  marks
tom   2     21862

df2

name  cell  marks   passwd
tom   2     11111   2548
matt  2     158416  2483
      2     21862   26846

How do I compare df2 with df1 and get the nearest matching rows?

Expected output:

df2

name  cell  marks  passwd
tom   2     11111  2548
      2     21862  26846

I tried merge, but the data is dynamic: in one case the name might change, and in another the marks might change.

You can try the following:

import pandas as pd

dict1 = {'name': ['tom'], 'cell': [2], 'marks': [21862]}
dict2 = {'name': ['tom', 'matt'], 'cell': [2, 2],
         'marks': [21862, 158416], 'passwd': [2548, 2483]}
df1 = pd.DataFrame(dict1)
df2 = pd.DataFrame(dict2)

# True where a value in df2 equals the value at the same index/column in df1.
compare = df2.isin(df1)
# Keep only the rows of df2 where at least one value matched.
df2 = df2.iloc[df2.where(compare).dropna(how='all').index]
print(df2)

Output:

  name  cell  marks  passwd
0  tom     2  21862    2548

You can use pandas.merge with the option indicator=True, filtering the result for 'both':
import pandas as pd

df1 = pd.DataFrame([['tom', 2, 11111]], columns=["name", "cell", "marks"])
df2 = pd.DataFrame([['tom', 2, 11111, 2548],
                    ['matt', 2, 158416, 2483]],
                   columns=["name", "cell", "marks", "passwd"])

def compare_dataframes(df1, df2):
    """Find rows which are shared between two DataFrames."""
    comparison_df = df1.merge(df2, indicator=True, how='outer')
    return comparison_df[comparison_df['_merge'] == 'both'].drop(columns=["_merge"])

print(compare_dataframes(df1, df2))

Returns:

  name  cell  marks  passwd
0  tom     2  11111    2548
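Neither snippet above handles the "nearest" part of the question, where a row may differ in one column. A minimal sketch, under the assumption that "nearest" means the rows of df2 that agree with df1's row in the most shared columns (the threshold of two matches is illustrative, not from the original answers):

```python
import pandas as pd

df1 = pd.DataFrame([['tom', 2, 21862]], columns=['name', 'cell', 'marks'])
df2 = pd.DataFrame([['tom', 2, 11111, 2548],
                    ['matt', 2, 158416, 2483],
                    ['', 2, 21862, 26846]],
                   columns=['name', 'cell', 'marks', 'passwd'])

# Count, for each row of df2, how many values agree with df1's row
# over the columns the two frames have in common.
shared = df1.columns.intersection(df2.columns)
scores = df2[shared].eq(df1.iloc[0][shared]).sum(axis=1)

# Keep the rows that match df1 in at least two of the shared columns.
nearest = df2[scores >= 2]
print(nearest)
```

With the question's data this keeps the tom/11111 row and the blank-name/21862 row, matching the expected output.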

Related

Comparison of two data frames and getting the differences by two columns - Pandas

I would like to compare two data frames, for example
import pandas as pd
az_df = pd.DataFrame({'name': ['CR1', 'CR2'], 'age': [1, 5], 'dr':[1, 2]})[['name', 'age']]
za_df = pd.DataFrame({'name': ['CR2', 'CR1'], 'age': [2, 1], 'dr':[2, 4]})[['name', 'age']]
AZ_DF table:

name  age  dr
CR1   1    1
CR2   5    2

ZA_DF table:

name  age  dr
CR1   1    4
CR2   2    2
And I want to get a summary table of the differing values between az_df and za_df, grouped by the 'name' and 'age' columns, like:

name  only in AZ  only in ZA
CR2   5           2
So far, I have merged them:
merge = pd.merge(az_df, za_df, how='outer', indicator=True)
For az_df, different values are:
only_in_az = merge[merge['_merge'] == 'left_only']
And for za_df:
only_in_za = merge[merge['_merge'] == 'right_only']
However, I don't know how to build the summary table I mentioned above, showing the differing ages per name for the az and za data frames.
Thank you.
Try this:
import pandas as pd
az_df = pd.DataFrame({'name': ['CR1', 'CR2'], 'age': [1, 5], 'dr':[1, 2]})[['name', 'age']]
za_df = pd.DataFrame({'name': ['CR2', 'CR1'], 'age': [2, 1], 'dr':[2, 4]})[['name', 'age']]
merge = pd.merge(az_df, za_df, on='name', how='outer')
merge.rename(columns={'age_x': 'only in AZ', 'age_y': 'only in ZA'}, inplace=True)
merge
name only in AZ only in ZA
0 CR1 1 1
1 CR2 5 2
If you want to remove duplicates:
merge = merge[merge['only in AZ'] != merge['only in ZA']]
merge
name only in AZ only in ZA
1 CR2 5 2
I think you can get what you want from the merge dataframe (the one built with indicator=True in the question) using pivot_table:

(pd.pivot_table(merge.query('_merge != "both"'),
                values='age', index='name', columns='_merge')
   .reset_index()
   .rename_axis(None, axis=1)
   .rename(columns={'left_only': 'only in AZ', 'right_only': 'only in ZA'}))
name only in AZ only in ZA
0 CR2 5 2

Drop rows in a pandas dataframe by criteria from another dataframe

I have the following dataframe containing scores for a competition as well as a column that counts what number entry for each person.
import pandas as pd
df = pd.DataFrame({'Name': ['John', 'Jim', 'John','Jim', 'John','Jim','John','Jim','John','Jim','Jack','Jack','Jack','Jack'],'Score': [10,8,9,3,5,0, 1, 2,3, 4,5,6,8,9]})
df['Entry_No'] = df.groupby(['Name']).cumcount() + 1
df
Then I have another table that stores data on the maximum number of entries that each person can have:
df2 = pd.DataFrame({'Name': ['John', 'Jim', 'Jack'],'Limit': [2,3,1]})
df2
I am trying to drop rows from df where the entry number is greater than that person's Limit in df2.
If there are any ideas on how to help me achieve this, that would be fantastic! Thanks
You can use pandas.merge to create another dataframe and then filter the rows by your condition:
df3 = pd.merge(df, df2, on="Name", how="left")
df3[df3["Entry_No"] <= df3["Limit"]][df.columns].reset_index(drop=True)
Name Score Entry_No
0 John 10 1
1 Jim 8 1
2 John 9 2
3 Jim 3 2
4 Jim 0 3
5 Jack 5 1
I used how="left" to keep the order of df and reset_index(drop=True) to reset the index of the resulting dataframe.
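An alternative sketch that skips the merge entirely: map each Name to its Limit and compare against Entry_No directly. This assumes every Name in df also appears in df2; otherwise the mapped limit is NaN and the comparison drops that row.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Jim', 'John', 'Jim', 'John', 'Jim',
                            'John', 'Jim', 'John', 'Jim',
                            'Jack', 'Jack', 'Jack', 'Jack'],
                   'Score': [10, 8, 9, 3, 5, 0, 1, 2, 3, 4, 5, 6, 8, 9]})
df['Entry_No'] = df.groupby('Name').cumcount() + 1
df2 = pd.DataFrame({'Name': ['John', 'Jim', 'Jack'], 'Limit': [2, 3, 1]})

# Look up each row's limit by name, then keep only rows within it.
limits = df['Name'].map(df2.set_index('Name')['Limit'])
result = df[df['Entry_No'] <= limits].reset_index(drop=True)
print(result)
```

This produces the same six rows as the merge-based answer, in the original order of df.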
You could join the 2 dataframes, and then drop with a condition:
import pandas as pd
df = pd.DataFrame({'Name': ['John', 'Jim', 'John','Jim', 'John','Jim','John','Jim','John','Jim','Jack','Jack','Jack','Jack'],'Score': [10,8,9,3,5,0, 1, 2,3, 4,5,6,8,9]})
df['Entry_No'] = df.groupby(['Name']).cumcount() + 1
df2 = pd.DataFrame({'Name': ['John', 'Jim', 'Jack'],'Limit': [2,3,1]})
df2 = df2.set_index('Name')
df = df.join(df2, on='Name')
df.drop(df[df.Entry_No > df.Limit].index, inplace=True)
which gives the expected output.

Combining columns for each row in a data frame

I would like to combine two columns (Column 1 + Column 2), and do that for each row individually. Unfortunately it didn't work for me. How do I solve this?
import pandas as pd
import numpy as np
d = {'Nameid': [1, 2, 3, 1], 'Name': ['Michael', 'Max', 'Susan', 'Michael'], 'Project': ['S455', 'G874', 'B7445', 'Z874']}
df = pd.DataFrame(data=d)
display(df.head(10))
df['Dataframe']='df'
d2 = {'Nameid': [4, 2, 5, 1], 'Name': ['Petrova', 'Michael', 'Mike', 'Gandalf'], 'Project': ['Z845', 'Q985', 'P512', 'Y541']}
df2 = pd.DataFrame(data=d2)
display(df2.head(10))
df2['Dataframe']='df2'
What I tried
df_merged = pd.concat([df,df2])
df_merged.head(10)
df3 = pd.concat([df,df2])
df3['unique_string'] = df['Nameid'].astype(str) + df['Dataframe'].astype(str)
df3.head(10)
As you can see, it didn't combine the rows correctly: the values from df were reused for the rows of df2, because I built the new column from df instead of df3. How can I combine the two columns row by row?
What I want
You can simply concatenate the strings like this. Note that you don't need .astype(str) on the Dataframe column, since it already holds strings:
In [363]: df_merged['unique_string'] = df_merged.Nameid.astype(str) + df_merged.Dataframe
In [365]: df_merged
Out[365]:
Nameid Name Project Dataframe unique_string
0 1 Michael S455 df 1df
1 2 Max G874 df 2df
2 3 Susan B7445 df 3df
3 1 Michael Z874 df 1df
0 4 Petrova Z845 df2 4df2
1 2 Michael Q985 df2 2df2
2 5 Mike P512 df2 5df2
3 1 Gandalf Y541 df2 1df2
Make sure you build the new column from df3 and assign back to df3, and also reset the index first:
df3 = df3.reset_index()
df3['unique_string'] = df3['Nameid'].astype(str) + df3['Dataframe'].astype(str)
Use df3 instead of df; also pass ignore_index=True so a default index is created:
df3 = pd.concat([df,df2], ignore_index=True)
df3['unique_string'] = df3['Nameid'].astype(str) + df3['Dataframe']
print (df3)
Nameid Name Project Dataframe unique_string
0 1 Michael S455 df 1df
1 2 Max G874 df 2df
2 3 Susan B7445 df 3df
3 1 Michael Z874 df 1df
4 4 Petrova Z845 df2 4df2
5 2 Michael Q985 df2 2df2
6 5 Mike P512 df2 5df2
7 1 Gandalf Y541 df2 1df2

Processing a dataframe with another dataframe

I have two data frames, df1 and df2. Both include columns like 'ID', 'Name', 'Score' and 'Status'. I need to update the 'Score' in df1 if that person's status in df2 is "Edit", and to drop the row from df1 if that person's status in df2 is "Cancel".
For example:
dic1 = {'ID': [1, 2, 3],
        'Name': ['Jack', 'Tom', 'Annie'],
        'Score': [20, 10, 25],
        'Status': ['New', 'New', 'New']}
dic2 = {'ID': [1, 2],
        'Name': ['Jack', 'Tom'],
        'Score': [28, 10],
        'Status': ['Edit', 'Cancel']}
df1 = pd.DataFrame(dic1)
df2 = pd.DataFrame(dic2)
The output should be like:
ID Name Score Status
1 Jack 28 Edit
3 Annie 25 New
Any pointers or hints?
Use DataFrame.merge with a left join first, and then filter out the 'Cancel' rows as well as the original DataFrame's columns (the ones suffixed with _):
df = df1.merge(df2, on=['ID','Name'], how='left', suffixes=('_', ''))
df = df.loc[df['Status'] != 'Cancel', ~df.columns.str.endswith('_')]
print (df)
ID Name Score Status
0 1 Jack 28 Edit
EDIT: add DataFrame.combine_first to bring back the missing rows:
df = df1.merge(df2, on=['ID','Name'], how='left', suffixes=('', '_'))
df = df.loc[df['Status_'] != 'Cancel']
df1 = df.loc[:, df.columns.str.endswith('_')]
df = df1.rename(columns=lambda x: x.rstrip('_')).combine_first(df).drop(df1.columns, axis=1)
print (df)
ID Name Score Status
0 1.0 Jack 28.0 Edit
2 3.0 Annie 25.0 New
Use the pandas.DataFrame.update command of the pandas package:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.update.html
df1.update(df2)
print(df1)
df1 = df1[df1.Status != "Cancel"]
print(df1)
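Note that DataFrame.update aligns on the index, so the snippet above only works because both frames happen to share the same default 0..n index. A sketch that aligns on ID explicitly, which is safer when row order differs between the frames:

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3],
                    'Name': ['Jack', 'Tom', 'Annie'],
                    'Score': [20, 10, 25],
                    'Status': ['New', 'New', 'New']})
df2 = pd.DataFrame({'ID': [1, 2],
                    'Name': ['Jack', 'Tom'],
                    'Score': [28, 10],
                    'Status': ['Edit', 'Cancel']})

# Index both frames by ID so update matches rows by key, not by position.
df1 = df1.set_index('ID')
df1.update(df2.set_index('ID'))
df1 = df1[df1.Status != 'Cancel'].reset_index()
print(df1)
```

This leaves Jack with the edited score of 28 and keeps Annie's untouched row, while Tom's cancelled row is dropped.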

pandas replace items in df from mappings defined in separate df

Let's say I have the following two DataFrames:
df1 = pd.DataFrame({'id': [1, 2, 3], 'name': ['Johnny', 'Sara', 'Mike']})
df2 = pd.DataFrame({'name': [2, 1, 2]})
How can I update df2 from the mappings defined in df1 to obtain:
df2 = pd.DataFrame({'name': ['Sara', 'Johnny', 'Sara']})
I've done the following but there has to be a better way to do it:
id_to_name = {i: name for i, name in zip(df1['id'].tolist(), df1['name'].tolist())}
df2['name'] = df2['name'].map(id_to_name)
Another way :-) I learned it recently:
df1.set_index('id').name.get(df2.name)
Out[381]:
id
2 Sara
1 Johnny
2 Sara
Name: name, dtype: object
You can pass a Series/dict to map (thanks to piR for the improvement!) -
df2.name.map(dict(df1.values))
Or, replace, but this is slower -
df2.name.replace(df1.set_index('id').name)
0 Sara
1 Johnny
2 Sara
Name: name, dtype: object
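Note that dict(df1.values) only works because df1 has exactly two columns with id first. A slightly more explicit sketch passes a Series (indexed by id) to map, which keeps working even if df1 gains extra columns:

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3], 'name': ['Johnny', 'Sara', 'Mike']})
df2 = pd.DataFrame({'name': [2, 1, 2]})

# A Series indexed by id acts as the mapping for map().
mapping = df1.set_index('id')['name']
df2['name'] = df2['name'].map(mapping)
print(df2)
```

This yields the same Sara/Johnny/Sara result as the dict-based version.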
