So, I have 2 dataframes like:
DataframeA:
ID,CLASS,DIVISION
1,123,3G
2,456,5G
3,123,4G
DataframeB:
ID,CLASS,DIVISION
1,123,3G
2,456,4G
I would like to subtract DataframeB from DataframeA so that only the records that are in DataframeA and not in DataframeB remain. The comparison should be on the CLASS and DIVISION columns only.
Expected Output:
ID,CLASS,DIVISION
2,456,5G
3,123,4G
Now I can do a left join between DataframeA and DataframeB on [CLASS, DIVISION] and then keep only the rows where DataframeB's columns came back null, like so:
new_df = pd.merge(DataframeA, DataframeB, how='left', on=['CLASS', 'DIVISION'], suffixes=('', '_B'))
new_df = new_df[new_df['ID_B'].isnull()].drop(columns='ID_B')
But I would like to know whether there's a more elegant or Pythonic way to do this.
You can use pd.merge() with indicator=True:
res = pd.merge(df1, df2[['CLASS', 'DIVISION']], on=['CLASS', 'DIVISION'], how='outer', indicator=True)
res = res[res['_merge'] == 'left_only'].drop(['_merge'], axis=1)
print(res)
ID CLASS DIVISION
1 2.0 456 5G
2 3.0 123 4G
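A self-contained version of this answer, rebuilding the two sample frames from the question:

```python
import pandas as pd

# Sample frames from the question
df1 = pd.DataFrame({'ID': [1, 2, 3],
                    'CLASS': [123, 456, 123],
                    'DIVISION': ['3G', '5G', '4G']})
df2 = pd.DataFrame({'ID': [1, 2],
                    'CLASS': [123, 456],
                    'DIVISION': ['3G', '4G']})

# Anti-join on CLASS and DIVISION: the merge indicator marks each row as
# 'both', 'left_only', or 'right_only'; keep only the left-only rows.
res = pd.merge(df1, df2[['CLASS', 'DIVISION']],
               on=['CLASS', 'DIVISION'], how='outer', indicator=True)
res = res[res['_merge'] == 'left_only'].drop(columns='_merge')
print(res)
```

Note that ID comes out as a float (2.0, 3.0) because the outer merge introduces a NaN ID for the unmatched df2 row before filtering.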
With left join (df1 - left frame, df2 - right frame) and filtering out matched rows:
In [1157]: df3 = df1.merge(df2, on=df1.columns.drop('ID').tolist(), how='left', suffixes=('', '_'))
In [1158]: df3[df3['ID_'].isna()].drop('ID_', axis=1)
Out[1158]:
ID CLASS DIVISION
1 2 456 5G
2 3 123 4G
I have the following two dataframes, df1 and df2:
final raw st
abc 12 10
abc 17 15
abc 14 17
and
final raw
abc 12
abc 14
My expected output is
final raw st
abc 17 15
I would like to delete from df1 the rows whose values in the common columns also appear in df2.
My try:
df1.isin(df2)
This gives me a Boolean frame. I also tried
df3 = pd.merge(df1, df2, on=['final', 'raw'], how='inner')
but that gives me the rows common to df1 and df2, not the ones I want to keep.
You are close with merge; you just need an extra step. First perform an outer join to keep all rows from both dataframes and enable the merge indicator, then filter on this indicator to keep only the rows that came from df1 (left_only). Finally, keep only the columns from df1:
df3 = pd.merge(df1, df2, on = ['final', 'raw'], how='outer', indicator=True) \
.query("_merge == 'left_only'")[df1.columns]
print(df3)
# Output
final raw st
1 abc 17 15
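A self-contained version, rebuilding df1 and df2 from the tables in the question:

```python
import pandas as pd

# Frames matching the tables in the question
df1 = pd.DataFrame({'final': ['abc', 'abc', 'abc'],
                    'raw': [12, 17, 14],
                    'st': [10, 15, 17]})
df2 = pd.DataFrame({'final': ['abc', 'abc'],
                    'raw': [12, 14]})

# Outer merge with indicator, keep rows found only in df1,
# then restrict to df1's original columns.
df3 = (pd.merge(df1, df2, on=['final', 'raw'], how='outer', indicator=True)
         .query("_merge == 'left_only'")[df1.columns])
print(df3)
```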
You need to refer to the correct column when using isin.
result = df1[~df1['raw'].isin(df2['raw'])]
I have two DataFrames as follow:
df1 = pd.DataFrame({'Group1': [0.5,5,3], 'Group2' : [2,.06,0.9]}, index=['Dan','Max','Joe'])
df2 = pd.DataFrame({'Name' : ['Joe','Max'], 'Team' : ['Group2','Group1']})
My goal is to get the right value for each person's Name, using the column 'Team' to pick the matching column of df1. So the result should contain one row per name with the looked-up value.
I tried it with a merge but I failed because I don't know how to merge on these conditions.
What's the best way in Python to reach my goal?
You can unstack df1, reset its index, rename the columns, and merge on Name and Team:
out = (df1.unstack()
.reset_index()
.rename({'level_0':'Team', 'level_1':'Name', 0:'Value'}, axis=1)
.merge(df2, on=['Name','Team']))
Output:
Team Name Value
0 Group1 Max 5.0
1 Group2 Joe 0.9
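Putting the answer together with the frames defined in the question:

```python
import pandas as pd

df1 = pd.DataFrame({'Group1': [0.5, 5, 3], 'Group2': [2, .06, 0.9]},
                   index=['Dan', 'Max', 'Joe'])
df2 = pd.DataFrame({'Name': ['Joe', 'Max'], 'Team': ['Group2', 'Group1']})

# unstack() turns df1 into a Series indexed by (column, row label);
# reset_index() exposes those as columns level_0/level_1 plus the value
# column 0, which rename() relabels before the inner merge with df2.
out = (df1.unstack()
          .reset_index()
          .rename({'level_0': 'Team', 'level_1': 'Name', 0: 'Value'}, axis=1)
          .merge(df2, on=['Name', 'Team']))
print(out)
```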
I have two dataframes.
feelingsDF with columns 'feeling', 'count', 'code'.
countryDF with columns 'feeling', 'countryCount'.
How do I make another dataframe that takes the columns from countryDF and combines it with the code column in feelingsDF?
I'm guessing you would need to use the shared 'feeling' column to combine them and make sure each code is matched to the same feeling.
I want the three columns to appear as:
[feeling][countryCount][code]
You are joining the two dataframes by the column 'feeling'. Assuming you only want the entries in 'feeling' that are common to both dataframes, you would want to do an inner join.
Here is a similar example with two dfs:
x = pd.DataFrame({'feeling': ['happy', 'sad', 'angry', 'upset', 'wow'], 'col1': [1,2,3,4,5]})
y = pd.DataFrame({'feeling': ['okay', 'happy', 'sad', 'not', 'wow'], 'col2': [20,23,44,10,15]})
x.merge(y,how='inner', on='feeling')
Output:
feeling col1 col2
0 happy 1 23
1 sad 2 44
2 wow 5 15
To drop the 'count' column, select only the other columns of feelingsDF, then sort by the 'countryCount' column. Note that this will leave your index out of order, but you can reset the index of combined_df afterwards.
combined_df = feelingsDF[['feeling', 'code']].merge(countryDF, how='inner', on='feeling').sort_values('countryCount')
# To reset the index after sorting:
combined_df = combined_df.reset_index(drop=True)
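A runnable sketch of these two steps, with made-up feelings, counts, and codes standing in for the asker's data:

```python
import pandas as pd

# Hypothetical data; column names follow the question
feelingsDF = pd.DataFrame({'feeling': ['happy', 'sad', 'wow'],
                           'count': [3, 1, 2],
                           'code': ['H', 'S', 'W']})
countryDF = pd.DataFrame({'feeling': ['happy', 'wow'],
                          'countryCount': [23, 15]})

# Inner join on 'feeling' (dropping 'count'), sort, and reset the index
combined_df = (feelingsDF[['feeling', 'code']]
               .merge(countryDF, how='inner', on='feeling')
               .sort_values('countryCount')
               .reset_index(drop=True))
print(combined_df)
```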
You can join two dataframes using pd.merge. Assuming that you want to join on the feeling column, you can use:
df = pd.merge(feelingsDF, countryDF, on='feeling', how='left')
See documentation for pd.merge to understand how to use the on and how parameters.
feelingsDF = pd.DataFrame([{'feeling':1,'count':10,'code':'X'},
{'feeling':2,'count':5,'code':'Y'},{'feeling':3,'count':1,'code':'Z'}])
feeling count code
0 1 10 X
1 2 5 Y
2 3 1 Z
countryDF = pd.DataFrame([{'feeling':1,'country':'US'},{'feeling':2,'country':'UK'},{'feeling':3,'country':'DE'}])
feeling country
0 1 US
1 2 UK
2 3 DE
df = pd.merge(feelingsDF, countryDF, on='feeling', how='left')
feeling count code country
0 1 10 X US
1 2 5 Y UK
2 3 1 Z DE
I have two separate pandas dataframes (df1 and df2) which have multiple columns, but only one in common ('text').
I would like to find every row in df2 whose value in the shared column does not appear anywhere in df1.
df1
A B text
45 2 score
33 5 miss
20 1 score
df2
C D text
.5 2 shot
.3 2 shot
.3 1 miss
Result df (the 'miss' row is removed since 'miss' occurs in df1):
C D text
.5 2 shot
.3 2 shot
Is it possible to use the isin method in this scenario?
As you asked, you can do this efficiently using isin (without resorting to expensive merges).
>>> df2[~df2.text.isin(df1.text.values)]
C D text
0 0.5 2 shot
1 0.3 2 shot
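The same anti-join, self-contained with the frames from the question:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [45, 33, 20],
                    'B': [2, 5, 1],
                    'text': ['score', 'miss', 'score']})
df2 = pd.DataFrame({'C': [.5, .3, .3],
                    'D': [2, 2, 1],
                    'text': ['shot', 'shot', 'miss']})

# Keep only the df2 rows whose 'text' never appears in df1
result = df2[~df2['text'].isin(df1['text'])]
print(result)
```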
You can left-merge df2 with df1 and keep only the lines where a column from df1 came back NaN (df2 must be the left frame so the boolean mask aligns with df2's index):
df2[pd.merge(df2, df1, on='text', how='left')['A'].isnull()]
or you can use isin:
df2[~df2.text.isin(df1.text)]
EDIT:
mergeddf = pd.merge(df2, df1, how="left")
result = mergeddf[mergeddf['A'].isna()][['C', 'D', 'text']]
I have 2 data frames:
df1 has ID and count of white products
product_id, count_white
12345,4
23456,7
34567,1
df2 has IDs and counts of all products
product_id,total_count
0009878,14
7862345,20
12345,10
456346,40
23456,30
0987352,10
34567,90
df2 has more products than df1. I need to search df2 for products that are in df1 and add total_count column to df1:
product_id,count_white,total_count
12345,4,10
23456,7,30
34567,1,90
I could do a left merge, but I would end up with a huge file. Is there any way to add specific rows from df2 to df1 using merge?
Just perform a left merge on 'product_id' column:
In [12]:
df1.merge(df2, on='product_id', how='left')
Out[12]:
product_id count_white total_count
0 12345 4 10
1 23456 7 30
2 34567 1 90
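A self-contained version, using plain integer product IDs (the leading-zero IDs from the CSV would need string dtype to survive):

```python
import pandas as pd

df1 = pd.DataFrame({'product_id': [12345, 23456, 34567],
                    'count_white': [4, 7, 1]})
df2 = pd.DataFrame({'product_id': [9878, 7862345, 12345, 456346,
                                   23456, 987352, 34567],
                    'total_count': [14, 20, 10, 40, 30, 10, 90]})

# Left merge keeps only df1's rows, pulling total_count in from df2,
# so the result stays as small as df1.
out = df1.merge(df2, on='product_id', how='left')
print(out)
```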
Perform a left join/merge:
df1 = df1.merge(df2, on='product_id', how='left')
The output will match the expected table above.