Pandas: combine data frames of different sizes - python

I have 2 data frames:
df1 has ID and count of white products
product_id, count_white
12345,4
23456,7
34567,1
df2 has IDs and counts of all products
product_id,total_count
0009878,14
7862345,20
12345,10
456346,40
23456,30
0987352,10
34567,90
df2 has more products than df1. I need to search df2 for products that are in df1 and add total_count column to df1:
product_id,count_white,total_count
12345,4,10
23456,7,30
34567,1,90
I could do a left merge, but I would end up with a huge file. Is there any way to add specific rows from df2 to df1 using merge?

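Just perform a left merge on the 'product_id' column. A left merge keeps only the rows of df1, so as long as product_id is unique in df2 the result has the same number of rows as df1 and does not grow to the size of df2:
In [12]:
df1.merge(df2, on='product_id', how='left')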
Out[12]:
product_id count_white total_count
0 12345 4 10
1 23456 7 30
2 34567 1 90
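As a quick check of the size concern, here is a minimal sketch that rebuilds the two frames from the question and runs the left merge. It assumes product_id is stored as a string (so the leading zeros survive) and that each product_id appears only once in df2; the validate argument is an optional safeguard that raises an error if that second assumption is false.

```python
import pandas as pd

# Sample data copied from the question; IDs kept as strings to preserve leading zeros
df1 = pd.DataFrame({
    "product_id": ["12345", "23456", "34567"],
    "count_white": [4, 7, 1],
})
df2 = pd.DataFrame({
    "product_id": ["0009878", "7862345", "12345", "456346", "23456", "0987352", "34567"],
    "total_count": [14, 20, 10, 40, 30, 10, 90],
})

# Left merge: every row of df1 is kept, only matching rows of df2 are pulled in.
# validate="many_to_one" raises MergeError if product_id is duplicated in df2.
merged = df1.merge(df2, on="product_id", how="left", validate="many_to_one")

print(merged)
print(len(merged) == len(df1))
```

The four df2 products that do not occur in df1 never reach the result, so the output stays at three rows.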

Perform a left join/merge and assign the result back to df1:
df1 = df1.merge(df2, on='product_id', how='left')

Related

How to compare two columns in different pandas dataframes, store the differences in a 3rd dataframe

I need to compare two data frames, df1 (blue) and df2 (orange), store only the rows of df2 (orange) that are not in df1 in a separate data frame, and then add those rows to df1, assigning function 6 and sector 20 to the employees that were not present in df1 (blue).
I know how to find the differences between the data frames and store that in a third data frame, but I'm stuck trying to figure out how to store only the rows of df2 that are not in df1.
You can try this:
Get a list of the IDs from orange that you want to keep
Filter df2 with that list
Append the result to df1
df1 --> blue, df2 --> orange
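import pandas as pd
# Default function and sector for employees that are new in df2
df2['Function'] = 6
df2['Sector'] = 20
# IDs that appear in df2 but not in df1
ids_df2_keep = df2.loc[~df2['ID'].isin(df1['ID']), 'ID']
df2 = df2[df2['ID'].isin(ids_df2_keep)]
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df1 = pd.concat([df1, df2])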
This has been answered in pandas get rows which are NOT in other dataframe
Store it as a merge and simply select the rows that do not share common values.
~ negates the expression, select all that are NOT IN instead of IN.
common = df1.merge(df2,on=['ID','Name'])
df = df2[(~df2['ID'].isin(common['ID']))&(~df2['Name'].isin(common['Name']))]
This was tested using some of your data:
df1 = pd.DataFrame({'ID':[125,134,156],'Name':['John','Mary','Bill'],'func':[1,2,2]})
df2 = pd.DataFrame({'ID':[125,139,133],'Name':['John','Joana','Linda']})
Output is:
ID Name
1 139 Joana
2 133 Linda
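Putting the steps from this thread together, the sketch below filters df2 down to the IDs missing from df1, fills in the defaults, and concatenates. The frames reuse the test data above; the Function and Sector column names follow the question, and the Sector values given to df1 are invented purely for illustration.

```python
import pandas as pd

# df1 (blue): existing employees; the Sector values here are made up for the example
df1 = pd.DataFrame({
    "ID": [125, 134, 156],
    "Name": ["John", "Mary", "Bill"],
    "Function": [1, 2, 2],
    "Sector": [10, 10, 30],
})
# df2 (orange): may contain employees that df1 does not have
df2 = pd.DataFrame({"ID": [125, 139, 133], "Name": ["John", "Joana", "Linda"]})

# Keep only the df2 rows whose ID is absent from df1, then set the defaults
new_rows = df2[~df2["ID"].isin(df1["ID"])].assign(Function=6, Sector=20)

# Stack the new employees under the existing ones
combined = pd.concat([df1, new_rows], ignore_index=True)
print(combined)
```

This version compares on ID only, which assumes ID alone identifies an employee.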

Delete row indices based on common columns in a Dataframe

I have following two dataframes df1 and df2
final raw st
abc 12 10
abc 17 15
abc 14 17
and
final raw
abc 12
abc 14
My expected output is
final raw st
abc 17 15
I would like to delete rows based on common column value.
My try:
df1.isin(df2)
This is giving me Boolean result. Another thing, I tried
df3 = pd.merge(df1, df2, on = ['final', 'raw'], how = 'inner') so that we get all the rows common to df1 and df2.
You are close with merge; you just need an extra step. First perform an outer join to keep all rows from both dataframes and enable the merge indicator, then filter on this indicator to keep the rows that appear only in df1 ('left_only'). Finally, keep only the columns from df1:
df3 = pd.merge(df1, df2, on = ['final', 'raw'], how='outer', indicator=True) \
.query("_merge == 'left_only'")[df1.columns]
print(df3)
# Output
final raw st
1 abc 17 15
You need to refer to the correct column when using isin.
result = df1[~df1['raw'].isin(df2['raw'])]
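The isin version above compares the 'raw' column only, while the merge version compares 'final' and 'raw' together. A minimal sketch of the two-column variant with a left merge follows; the frames are rebuilt from the question, and drop_duplicates is an added precaution (not part of the original answers) so that repeated keys in df2 cannot duplicate rows of df1.

```python
import pandas as pd

df1 = pd.DataFrame({"final": ["abc", "abc", "abc"], "raw": [12, 17, 14], "st": [10, 15, 17]})
df2 = pd.DataFrame({"final": ["abc", "abc"], "raw": [12, 14]})

# Left merge keeps every df1 row; _merge says whether a match was found in df2
marked = df1.merge(df2.drop_duplicates(), on=["final", "raw"], how="left", indicator=True)

# Rows found only in df1 are the ones to keep; restore df1's original columns
result = marked.loc[marked["_merge"] == "left_only", df1.columns]
print(result)
```

With this data both approaches return the same single row; they differ only when two rows share a 'raw' value but have different 'final' values.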

How to merge two different dataframe with different columns

As someone who is super new in merge/append on Python, I am trying to merge two different DF together.
DF1 has 2 columns with Text and ID columns and 100 rows
DF2 has 3 columns with Text, ID, and Match columns and has 20 rows
My goal is to combine the two DFs together so the "Match" column from DF2 can be merged into DF1.
The Match column contains only "True" values, so after the merge the other 80 rows of DF1 can be NaN and I can fix that later.
Thank you to everyone for the help and support!
Try a left merge using .merge(), like this:
DF_out = DF1.merge(DF2, on=['Text', 'ID'], how='left')
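A small sketch of that left merge, including one possible way to handle the NaN values afterwards. The frames are tiny stand-ins for the 100-row and 20-row frames in the question, and turning the missing values into False is an assumption about what "fix it later" means.

```python
import pandas as pd

# Stand-in data: DF1 has 4 rows, DF2 flags 2 of them as matches
DF1 = pd.DataFrame({"Text": ["a", "b", "c", "d"], "ID": [1, 2, 3, 4]})
DF2 = pd.DataFrame({"Text": ["b", "d"], "ID": [2, 4], "Match": [True, True]})

# Left merge: all DF1 rows survive; Match is NaN where DF2 has no such row
DF_out = DF1.merge(DF2, on=["Text", "ID"], how="left")

# Compare against True so that NaN becomes False and the column ends up boolean
DF_out["Match"] = DF_out["Match"].eq(True)
print(DF_out)
```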

Compare 2 dataframes on specific columns

So, I have 2 dataframes like:
DataframeA:
ID,CLASS,DIVISION
1,123,3G
2,456,5G
3,123,4G
DataframeB:
ID,CLASS,DIVISION
1,123,3G
2,456,4G
I would like to subtract DataframeB from DataframeA, so that only the records that are in DataframeA and not in DataframeB remain. The comparison should be on the CLASS and DIVISION columns only.
Expected Output:
ID,CLASS,DIVISION
2,456,5G
3,123,4G
Now I can do a Left-Join between DataframeA and DataframeB on [CLASS, DIVISION] and then select only the isNull values of CLASS, DIVISION columns of DataframeB like so:
new_df = pd.merge(DataframeA, DataframeB, how='left', left_on=fileA_headerList, right_on=fileB_headerList)
new_df = new_df[new_df[fileB_headerList].isnull().all(axis=1)]
But I would like to know whether there's a more Elegant or Pythonic way to do this.
You can use pd.merge() with indicator=True:
res = pd.merge(df1,df2[['CLASS','DIVISION']],on=['CLASS','DIVISION'],how='outer',indicator=True)
res =res[res['_merge']=='left_only'].drop(['_merge'],axis=1)
print(res)
ID CLASS DIVISION
1 2.0 456 5G
2 3.0 123 4G
With left join (df1 - left frame, df2 - right frame) and filtering out matched rows:
In [1157]: df3 = df1.merge(df2, on=df1.columns.drop('ID').tolist(), how='left', suffixes=('', '_'))
In [1158]: df3[df3['ID_'].isna()].drop('ID_', axis=1)
Out[1158]:
ID CLASS DIVISION
1 2 456 5G
2 3 123 4G
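For comparison, the same subtraction can be sketched without a merge by turning the key columns into a MultiIndex and testing membership. This is an alternative to the answers above rather than a restatement of them; the names a, b and keys are chosen here for illustration. Because no join happens, the ID column keeps its integer dtype.

```python
import pandas as pd

a = pd.DataFrame({"ID": [1, 2, 3], "CLASS": [123, 456, 123], "DIVISION": ["3G", "5G", "4G"]})
b = pd.DataFrame({"ID": [1, 2], "CLASS": [123, 456], "DIVISION": ["3G", "4G"]})

keys = ["CLASS", "DIVISION"]

# Treat each (CLASS, DIVISION) pair as one composite key
a_keys = pd.MultiIndex.from_frame(a[keys])
b_keys = pd.MultiIndex.from_frame(b[keys])

# Keep the rows of a whose composite key does not occur in b
only_a = a[~a_keys.isin(b_keys)]
print(only_a)
```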

Compare two separate pandas dataframes row by row and return matching values

I have two pandas data frames df1 and df2. df1 contains 2 columns and 750 rows, df2 has 2 columns and 88 rows. I want to compare the two data frames and return the values from df1 that are present in df2 and store the matching values in a new column in df2.
Ex.
df1
A B
emp_table emp_id
emp_table emp_name
pay_table basic_amount
pay_table da_amount
df2
A B
emp_table emp_id
emp_table emp_department
pay_table da_amount
I want to add another column in df2 which has the matching values.
df2
A B
emp_table emp_id
pay_table da_amount
I want to perform one to many comparison of each element of df1 with each element of df2.
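I think you need merge without the on parameter, so that the join uses all columns the two frames have in common (the default is an inner join, which keeps only the matching rows):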
df = pd.merge(df1, df2)
print (df)
A B
0 emp_table emp_id
1 pay_table da_amount
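If the goal is to keep every row of df2 and record in a new column whether it also exists in df1, one option is a left merge from df2 with indicator=True. This is a sketch under that reading of the question; the column name in_df1 is invented for the example.

```python
import pandas as pd

df1 = pd.DataFrame({
    "A": ["emp_table", "emp_table", "pay_table", "pay_table"],
    "B": ["emp_id", "emp_name", "basic_amount", "da_amount"],
})
df2 = pd.DataFrame({
    "A": ["emp_table", "emp_table", "pay_table"],
    "B": ["emp_id", "emp_department", "da_amount"],
})

# Left merge on all shared columns (A and B); _merge is "both" when df1 has the row too
flagged = df2.merge(df1.drop_duplicates(), how="left", indicator=True)

# Convert the indicator into a plain boolean column and drop the helper
flagged["in_df1"] = flagged["_merge"].eq("both")
flagged = flagged.drop(columns="_merge")
print(flagged)
```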
