Adding new column to merged DataFrame based on pre-merged DataFrames - python

I have two DataFrames, df1 and df2. In my code I used the pandas.concat function to find the differences between them.
df1 = pd.read_excel(latest_file, 0)
df2 = pd.read_excel(latest_file, 1)
#Reads first and second sheet inside spreadsheet.
new_dataframe = pd.concat([df1,df2]).drop_duplicates(keep=False)
This works perfectly. However, I want to know which rows come from df1 and which come from df2. To show this, I want to add a column to new_dataframe that says "Removed" if the row is from df1 and "Added" if it is from df2. I can't seem to find any documentation on how to do this. Thanks in advance for any help.
Edit: My current code removes all rows that are identical in both DataFrames. The solution still has to remove the common rows.

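Consider using pd.merge with indicator=True instead. This creates a new column named _merge that indicates whether each row came from the left DataFrame only, the right DataFrame only, or both. You can then filter out the rows marked "both" and relabel the remaining values as Removed and Added: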
df1 = pd.DataFrame({'col1': [1,2,3,4,5]})
df2 = pd.DataFrame({'col1': [3,4,5,6,7]})
m = {'left_only': 'Removed', 'right_only': 'Added'}
new_dataframe = pd.merge(df1, df2, how='outer', indicator=True) \
.query('_merge != "both"') \
.replace({'_merge': m})
Output:
col1 _merge
0 1 Removed
1 2 Removed
5 6 Added
6 7 Added
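If you would rather stay with the concat approach from the question, a minimal sketch is to tag each frame before concatenating and then drop duplicates on the original columns only. The column name status is an assumption made for illustration. As with the original one-liner, rows duplicated inside a single frame are also dropped.

```python
import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2, 3, 4, 5]})
df2 = pd.DataFrame({'col1': [3, 4, 5, 6, 7]})

# Tag every row with its origin before concatenating.
# 'status' is a hypothetical column name, not part of the original data.
tagged = pd.concat([
    df1.assign(status='Removed'),
    df2.assign(status='Added'),
])

# Compare only the original columns, so the tag itself
# does not make otherwise identical rows look different.
new_dataframe = tagged.drop_duplicates(subset=list(df1.columns), keep=False)
print(new_dataframe)
```

Passing subset is the important part: without it, the status column would make every row unique and nothing would be removed.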

Related

Delete row indices based on common columns in a Dataframe

I have following two dataframes df1 and df2
final raw st
abc 12 10
abc 17 15
abc 14 17
and
final raw
abc 12
abc 14
My expected output is
final raw st
abc 17 15
I would like to delete rows based on common column value.
My try:
df1.isin(df2)
This is giving me Boolean result. Another thing, I tried
df3 = pd.merge(df1, df2, on = ['final', 'raw'], how = 'inner') so that we get all the common columns for df1 and df3.
You are close with merge; you just need an extra step. First, perform an outer join to keep all rows from both DataFrames and enable the merge indicator. Then filter on this indicator to keep only the rows found in df1 alone (left_only). Finally, keep only the columns from df1:
df3 = pd.merge(df1, df2, on = ['final', 'raw'], how='outer', indicator=True) \
.query("_merge == 'left_only'")[df1.columns]
print(df3)
# Output
final raw st
1 abc 17 15
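You need to refer to the correct column when using isin. Note that this checks only the raw column, which is enough here because final is the same in every row: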
result = df1[~df1['raw'].isin(df2['raw'])]
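For reference, here is a self-contained sketch of the same anti-join on both key columns, with the sample data from the question rebuilt by hand. It uses a left join rather than an outer join, which is sufficient when only the rows unique to df1 are wanted.

```python
import pandas as pd

# Sample data reconstructed from the question.
df1 = pd.DataFrame({'final': ['abc', 'abc', 'abc'],
                    'raw': [12, 17, 14],
                    'st': [10, 15, 17]})
df2 = pd.DataFrame({'final': ['abc', 'abc'],
                    'raw': [12, 14]})

# Left join keeps every row of df1; _merge records whether a match was found in df2.
merged = df1.merge(df2, on=['final', 'raw'], how='left', indicator=True)

# Keep rows with no match in df2 and restore the original df1 columns.
df3 = merged.loc[merged['_merge'] == 'left_only', df1.columns]
print(df3)
```

Unlike isin on a single column, this matches on the (final, raw) pair, so it stays correct if final takes more than one value.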

Generating a dataframe based off the diff between two dataframes

I have 2 data frames that look like this
Df1
City Code ColA Col..Z
LA LAA
LA LAB
LA LAC
Df2
Code ColA Col..Z
LA LAA
NY NYA
CH CH1
What I'm trying to do is get this result:
df3
Code ColA Col..Z
NY NYA
CH CH1
Normally I would loop through each row in df2 and say:
Df3 = If df2.row['Code'] in df1 then drop it.
But I want to find a pythonic pandas way to do it instead of looping through the dataframe. I was looking at examples using joins or merging but I can't seem to work it out.
The pseudocode Df3 = If df2.row['Code'] in df1 then drop it. translates to
df3 = df2[~df2['Code'].isin(df1['City'])]
To keep only the items in df2 whose Code value does not also appear in df1, you can do something like this, using drop_duplicates:
df2[df2['Code'].isin(
    # Code values that occur exactly once across both frames
    pd.concat([df1['Code'], df2['Code']]).drop_duplicates(keep=False)
)]
There is also a pandas DataFrame.compare method, which might be relevant:
df1 = pd.read_clipboard()
df1
df2 = pd.read_clipboard()
df2
df1.compare(df2).drop('self', axis=1, level=1).droplevel(1, axis=1)
(And I'm making an assumption you had a typo in your dataframes with the City col missing from df2?)
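A runnable sketch of the isin approach follows. The sample frames are an assumption: the question's tables are ambiguous, so this reads the first column of both frames as Code and the second as ColA.

```python
import pandas as pd

# Hypothetical reconstruction of the question's data:
# both frames are assumed to share the 'Code' and 'ColA' columns.
df1 = pd.DataFrame({'Code': ['LA', 'LA', 'LA'],
                    'ColA': ['LAA', 'LAB', 'LAC']})
df2 = pd.DataFrame({'Code': ['LA', 'NY', 'CH'],
                    'ColA': ['LAA', 'NYA', 'CH1']})

# Keep the rows of df2 whose Code never occurs in df1.
df3 = df2[~df2['Code'].isin(df1['Code'])]
print(df3)
```

This avoids the explicit loop entirely: isin builds a boolean mask in one vectorized pass, and ~ inverts it.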

How to merge two different dataframe with different columns

As someone who is super new in merge/append on Python, I am trying to merge two different DF together.
DF1 has 2 columns with Text and ID columns and 100 rows
DF2 has 3 columns with Text, ID, and Match columns and has 20 rows
My goal is to combine the two DFs together so the "Match" column from DF2 can be merged into DF1.
The Match column is all "True" value, so when it gets merged over the other 80 rows on DF1 can be NaN and I can fix it later.
Thank you to everyone for the help and support!
Try a left merge using .merge(), like this:
DF_out = DF1.merge(DF2, on=['Text', 'ID'], how='left')
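A small sketch of what the left merge produces, using made-up three-row and one-row frames in place of the 100-row and 20-row ones. The Matched column is a hypothetical addition showing one way to turn the NaN values into a clean boolean flag.

```python
import pandas as pd

# Made-up stand-ins for DF1 (Text, ID) and DF2 (Text, ID, Match).
DF1 = pd.DataFrame({'Text': ['a', 'b', 'c'], 'ID': [1, 2, 3]})
DF2 = pd.DataFrame({'Text': ['b'], 'ID': [2], 'Match': [True]})

# Every row of DF1 is kept; Match is NaN where DF2 has no matching (Text, ID).
DF_out = DF1.merge(DF2, on=['Text', 'ID'], how='left')

# Hypothetical follow-up: a plain boolean column instead of True/NaN.
DF_out['Matched'] = DF_out['Match'].notna()
print(DF_out)
```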

Check Series label does not exist in a separate DataFrame

I'm iterating over two separate dataframes, where one dataframe is a subset of the other. I need to ensure that only the columns in the set (df1) which are not contained in the subset (df2) pass the conditional statement.
In this case, it would be comparing the Series object during each iteration in df1 to the dataframe, df2. Ideally I would like to compare just the labels associated with each column, not the values contained in the columns. My code below. Any help would be greatly appreciated!
for i in df1:
    for j in df2:
        if i not in df2:  # intended check: label i is not a column of df2
            ...  # do some stuff between df1[i] and df2[j]
To find out if the values of df1 are in df2 you can use:
df1.isin(df2)
To find all values in df1 that are not in df2 you can use:
df1[~df1.isin(df2)]
Cells of df1 that also appear in df2 at the same index and column label are replaced with NaN in this case.
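Since the question asks about column labels rather than cell values, a sketch that compares only the labels may be closer to what is needed. The sample frames below are made up for illustration.

```python
import pandas as pd

# Made-up data: df2 holds a subset of df1's columns.
df1 = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})
df2 = df1[['a', 'c']]

# Labels present in df1 but absent from df2; no cell values are compared.
only_in_df1 = df1.columns.difference(df2.columns)

for label in only_in_df1:
    series = df1[label]  # the column of df1 that df2 does not contain
    print(label, series.tolist())
```

Index.difference returns the labels sorted, so use a list comprehension over df1.columns instead if the original column order matters.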

Why does Pandas inner join give ValueError: len(left_on) must equal the number of levels in the index of "right"?

I'm trying to inner join DataFrame A to DataFrame B and am running into an error.
Here's my join statement:
merged = DataFrameA.join(DataFrameB, on=['Code','Date'])
And here's the error:
ValueError: len(left_on) must equal the number of levels in the index of "right"
I'm not sure the column order matters (they aren't truly "ordered" are they?), but just in case, the DataFrames are organized like this:
DataFrameA: Code, Date, ColA, ColB, ColC, ..., ColG, ColH (shape: 80514, 8 - no index)
DataFrameB: Date, Code, Col1, Col2, Col3, ..., Col15, Col16 (shape: 859, 16 - no index)
Do I need to correct my join statement? Or is there another, better way to get the intersection (or inner join) of these two DataFrames?
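DataFrame.join matches the on columns of the calling DataFrame against the index of the other DataFrame. DataFrameB has a default single-level index, so it cannot be matched against two columns. Use merge if you are not joining on the index: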
merged = pd.merge(DataFrameA,DataFrameB, on=['Code','Date'])
Follow up to question below:
Here is a reproducible example:
import pandas as pd
# create some timestamps for date column
i = pd.to_datetime(pd.date_range('20140601',periods=2))
#create two dataframes to merge
df = pd.DataFrame({'code': ['ABC','EFG'], 'date':i,'col1': [10,100]})
df2 = pd.DataFrame({'code': ['ABC','EFG'], 'date':i,'col2': [10,200]})
#merge on columns (default join is inner)
pd.merge(df, df2, on =['code','date'])
This results in:
code col1 date col2
0 ABC 10 2014-06-01 10
1 EFG 100 2014-06-02 200
What happens when you run this code?
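For completeness, join can also be made to work if the right-hand frame is first indexed by the key columns, in the same order as the on list. This is a sketch with small made-up frames standing in for DataFrameA and DataFrameB.

```python
import pandas as pd

dates = pd.date_range('20140601', periods=2)

# Made-up stand-ins for DataFrameA and DataFrameB.
a = pd.DataFrame({'Code': ['ABC', 'EFG'], 'Date': dates, 'ColA': [10, 100]})
b = pd.DataFrame({'Date': dates, 'Code': ['ABC', 'EFG'], 'Col1': [10, 200]})

# join matches a's 'Code' and 'Date' columns against b's two-level index,
# so the index levels must be listed in the same order as `on`.
joined = a.join(b.set_index(['Code', 'Date']), on=['Code', 'Date'], how='inner')
print(joined)
```

Note that join defaults to how='left', so how='inner' has to be passed explicitly to get the intersection.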
Here is another way of performing the join. Unlike the accepted answer, this is a more general answer that applies to all other types of join.
Inner Join
An inner join can also be performed by explicitly specifying it in how:
pd.merge(df1, df2, on='filename', how='inner')
The same methodology applies to the other types of join:
Outer Join
pd.merge(df1, df2, on='filename', how='outer')
Left Join
pd.merge(df1, df2, on='filename', how='left')
Right Join
pd.merge(df1, df2, on='filename', how='right')
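To see how the four join types differ, here is a small sketch that runs each one on two made-up frames sharing a filename column and counts the resulting rows. The sample data and the x and y column names are assumptions.

```python
import pandas as pd

# Made-up frames: 'b' and 'c' are common, 'a' is only in df1, 'd' only in df2.
df1 = pd.DataFrame({'filename': ['a', 'b', 'c'], 'x': [1, 2, 3]})
df2 = pd.DataFrame({'filename': ['b', 'c', 'd'], 'y': [20, 30, 40]})

rows = {}
for how in ['inner', 'outer', 'left', 'right']:
    result = pd.merge(df1, df2, on='filename', how=how)
    rows[how] = len(result)  # number of rows produced by this join type

print(rows)
```

The inner join keeps only the two shared keys, the outer join keeps all four, and the left and right joins each keep the three keys of their respective frame.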
