Generating a dataframe based off the diff between two dataframes - python

I have 2 data frames that look like this
Df1
City Code ColA Col..Z
LA LAA
LA LAB
LA LAC
Df2
Code ColA Col..Z
LA LAA
NY NYA
CH CH1
What I'm trying to get is this result:
df3
Code ColA Col..Z
NY NYA
CH CH1
Normally I would loop through each row in df2 and say:
Df3 = If df2.row['Code'] in df1 then drop it.
But I want to find a pythonic pandas way to do it instead of looping through the dataframe. I was looking at examples using joins or merging, but I can't seem to work it out.

The pseudocode Df3 = If df2.row['Code'] in df1 then drop it. translates to
df3 = df2[~df2['Code'].isin(df1['City'])]
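A minimal sketch of that one-liner, run against the frames as they are literally posted in the question, so df2's 'Code' values are matched against df1's 'City' column. The column layout is an assumption; if the 'City' header is a typo, swap in df1['Code'].

```python
import pandas as pd

# Frames rebuilt from the question as posted (the column layout is an assumption).
df1 = pd.DataFrame({"City": ["LA", "LA", "LA"], "Code": ["LAA", "LAB", "LAC"]})
df2 = pd.DataFrame({"Code": ["LA", "NY", "CH"], "ColA": ["LAA", "NYA", "CH1"]})

# Boolean mask: True where df2's Code also appears in df1's City column.
in_df1 = df2["Code"].isin(df1["City"])

# ~ inverts the mask, keeping only the rows of df2 that are absent from df1.
df3 = df2[~in_df1].reset_index(drop=True)
print(df3)
```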

To keep only the rows of df2 whose Code does not appear in df1, you can do something like this, using drop_duplicates:
df2[df2['Code'].isin(
    # values that occur exactly once across both 'Code' columns
    pd.concat([df1['Code'], df2['Code']]).drop_duplicates(keep=False)
)]
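A small runnable sketch of the drop_duplicates approach. It assumes both frames share a 'Code' column (that is, that the 'City' header in the question is a typo); the values are taken from the question.

```python
import pandas as pd

# Assumes both frames share a 'Code' column (the 'City' header being a typo).
df1 = pd.DataFrame({"Code": ["LA", "LA", "LA"], "ColA": ["LAA", "LAB", "LAC"]})
df2 = pd.DataFrame({"Code": ["LA", "NY", "CH"], "ColA": ["LAA", "NYA", "CH1"]})

# Codes that occur exactly once across both frames combined.
once_only = pd.concat([df1["Code"], df2["Code"]]).drop_duplicates(keep=False)

# Keep the df2 rows whose Code is in that set.
df3 = df2[df2["Code"].isin(once_only)]
print(df3)
```

One caveat: keep=False drops every value that is duplicated anywhere, so a code repeated within df2 alone would also disappear. Filtering with isin against df1 directly does not have that side effect.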

There is a pandas DataFrame.compare method that might be relevant (it requires both frames to have identical row and column labels):
df1 = pd.read_clipboard()
df1
df2 = pd.read_clipboard()
df2
df1.compare(df2).drop('self', axis=1, level=1).droplevel(1, axis=1)
(And I'm making an assumption you had a typo in your dataframes with the City col missing from df2?)

Related

Adding new column to merged DataFrame based on pre-merged DataFrames

I have two DataFrames, df1 and df2. In my code I used the pandas.concat function to find the differences between them.
# Read the first and second sheets of the spreadsheet.
df1 = pd.read_excel(latest_file, 0)
df2 = pd.read_excel(latest_file, 1)
new_dataframe = pd.concat([df1,df2]).drop_duplicates(keep=False)
This works perfectly. However, I want to know which rows come from df1 and which come from df2. To show this, I want to add a column to new_dataframe that says 'Removed' if the row is from df1 and 'Added' if it is from df2. I can't seem to find any documentation on how to do this. Thanks in advance for any help.
Edit: My current code removes all rows that are identical in both DataFrames. The solution still has to remove the common rows.
Consider using pd.merge with indicator=True instead. This creates a new column named _merge that records whether each row came from the left frame only, the right frame only, or both. Drop the rows marked both, then map the remaining values to Removed and Added:
df1 = pd.DataFrame({'col1': [1,2,3,4,5]})
df2 = pd.DataFrame({'col1': [3,4,5,6,7]})
m = {'left_only': 'Removed', 'right_only': 'Added'}
new_dataframe = pd.merge(df1, df2, how='outer', indicator=True) \
.query('_merge != "both"') \
.replace({'_merge': m})
Output:
col1 _merge
0 1 Removed
1 2 Removed
5 6 Added
6 7 Added
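A variation on the same idea, sketched under the assumption that you would rather end up with a plain string column than a relabelled _merge column. The column name status is made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"col1": [1, 2, 3, 4, 5]})
df2 = pd.DataFrame({"col1": [3, 4, 5, 6, 7]})

labels = {"left_only": "Removed", "right_only": "Added"}

# Outer merge keeps every row; indicator=True adds the _merge column.
merged = pd.merge(df1, df2, how="outer", indicator=True)

# Drop rows present in both frames, then turn the categorical _merge
# column into plain strings before relabelling it.
diff = merged[merged["_merge"] != "both"].copy()
diff["status"] = diff["_merge"].astype(str).map(labels)  # 'status' is a made-up name
new_dataframe = diff.drop(columns="_merge").reset_index(drop=True)
print(new_dataframe)
```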

How to compare two columns in different pandas dataframes, store the differences in a 3rd dataframe

I need to compare two data frames, df1 (blue) and df2 (orange), store only the rows of df2 (orange) that are not in df1 in a separate data frame, and then add those rows to df1 while assigning function 6 and sector 20 to the employees that were not present in df1 (blue).
I know how to find the differences between the data frames and store that in a third data frame, but I'm stuck trying to figure out how to store only the rows of df2 that are not in df1.
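You can try this:
Get a list of the IDs in orange (df2) that you want to keep
Filter df2 with that list
Append the result to df1
df1 --> blue, df2 --> orange
import pandas as pd
# Give every df2 row the default function and sector
df2['Function'] = 6
df2['Sector'] = 20
# IDs that are in df2 but not in df1
df1_ids = set(df1['ID'])
ids_df2_keep = [e for e in df2['ID'] if e not in df1_ids]
df2 = df2[df2['ID'].isin(ids_df2_keep)]
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df1 = pd.concat([df1, df2])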
This has been answered in "pandas get rows which are NOT in other dataframe".
Store the rows the two frames have in common as a merge, then select the rows of df2 that do not share those values.
~ negates the expression, so it selects everything that is NOT IN instead of IN.
common = df1.merge(df2,on=['ID','Name'])
df = df2[(~df2['ID'].isin(common['ID']))&(~df2['Name'].isin(common['Name']))]
This was tested using some of your data:
df1 = pd.DataFrame({'ID':[125,134,156],'Name':['John','Mary','Bill'],'func':[1,2,2]})
df2 = pd.DataFrame({'ID':[125,139,133],'Name':['John','Joana','Linda']})
Output is:
ID Name
1 139 Joana
2 133 Linda
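For the full task in the question (find the new rows, fill in the defaults, add them to df1), here is a sketch that uses a left merge with indicator=True rather than two isin checks, so the ID and Name pair is matched as a unit. The Function and Sector column names follow the question's wording, and df1's Sector values are made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"ID": [125, 134, 156],
                    "Name": ["John", "Mary", "Bill"],
                    "Function": [1, 2, 2],
                    "Sector": [10, 10, 30]})  # Sector values are made up
df2 = pd.DataFrame({"ID": [125, 139, 133],
                    "Name": ["John", "Joana", "Linda"]})

# Left join from df2 onto df1's key columns; _merge marks where each row matched.
keys = ["ID", "Name"]
marked = df2.merge(df1[keys].drop_duplicates(), on=keys, how="left", indicator=True)

# Keep rows found only in df2, then fill in the defaults from the question.
new_rows = (marked[marked["_merge"] == "left_only"]
            .drop(columns="_merge")
            .assign(Function=6, Sector=20))

# Add the new employees to df1.
df1 = pd.concat([df1, new_rows], ignore_index=True)
print(df1)
```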

Joining dataframe by index. but no data on output, however the column names are joined

I am trying to join 2 dataframes by same index as the first column in both dataframes using python. The code is below:
combined_data = pd.merge(df1, df2, right_index=True, left_index=True)
df1 has columns:
colA, colB
And df2 has:
colA, colC, colD, colE
the output is:
colA, colB, colC, colD, colE
with no data below it. It just gives the joined columns
NOTE: df1 has about 4800 rows and df2 has 4600 rows.
Could the size of the data be a problem, or is something else wrong?
The problem was due to the common column having a different data type in each DataFrame.
This can be resolved by:
df1['colA'] = df1['colA'].astype(int)
df2['colA'] = df2['colA'].astype(int)  # ensure both are int type
After this the code works like a charm.
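A hedged sketch of how to spot and fix this kind of mismatch when joining on the index. The data is hypothetical, and it assumes colA has been set as the index in both frames, with integers in one and strings in the other.

```python
import pandas as pd

# Hypothetical data: the same keys, stored as integers in one frame and as strings in the other.
df1 = pd.DataFrame({"colA": [1, 2, 3], "colB": ["a", "b", "c"]}).set_index("colA")
df2 = pd.DataFrame({"colA": ["1", "2", "3"], "colC": ["x", "y", "z"]}).set_index("colA")

# Differing index dtypes are the tell-tale sign: 1 and "1" never compare equal.
print(df1.index.dtype, df2.index.dtype)

# Cast the string keys to integers so both indexes hold comparable values.
df2.index = df2.index.astype("int64")

combined_data = pd.merge(df1, df2, left_index=True, right_index=True)
print(combined_data)
```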

Repeatedly Change Index in DataFrames In Pandas

I have 3 dataframes:
df1 :
ip name
df2 :
name country
df3:
country city
I have to match them by IP. What is the correct way to do this? Currently we match df1 with df2, then change the index and match that result with df3. I don't think that is the correct way.
It seems you need a double merge. The on parameter can be omitted if the only columns the DataFrames share are the join columns:
df = df1.merge(df2).merge(df3)
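A small sketch of the chained merge with the join keys spelled out. Only the column names come from the question; the rows are hypothetical.

```python
import pandas as pd

# Hypothetical rows; only the column names come from the question.
df1 = pd.DataFrame({"ip": ["10.0.0.1", "10.0.0.2"], "name": ["alice", "bob"]})
df2 = pd.DataFrame({"name": ["alice", "bob"], "country": ["FR", "DE"]})
df3 = pd.DataFrame({"country": ["FR", "DE"], "city": ["Paris", "Berlin"]})

# Spelling out `on` documents the join key used at each step.
df = df1.merge(df2, on="name").merge(df3, on="country")
print(df)
```

No index changes are needed, because merge joins on columns by default.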

How may I merge a dataframe with repeated index entries with one with unique index entries?

I have two data frames. One of them has an index with repeated entries, and I would like to join it with another data frame whose index is unique. For example:
Dataframe I=
[ index column1]
leb Lebanon
iso iso1
CAN Canada
DataFrame I2=
[ index column1]
leb ra
CAN ba
CAN gell
I want to merge them such that
Dataframe Itot=
[ index column1 column2]
leb ra Lebanon
CAN ba Canada
CAN gell Canada
It is a many-to-one merge in Stata, as can be seen in
http://www.stata.com/manuals13/dmerge.pdf p.7.
Consider this DataFrame
df = pd.DataFrame({'Year': [2010,2009,2008],
'population_A': ['101597.0', '101416.0', '101342.0'],
'Country':['Aruba', 'Aruba', 'Aruba']})
df = df.set_index(['Country'])
df
Consider the other non-repetitive dataframe to be
df1 = pd.DataFrame({'Country':['Aruba','Afghanistan','Africa','Lebanon'], 'iso3c':['ABW','AFG','AFR','LEB']})
df1 = df1.set_index(['Country'])
df1
To join them, the first DataFrame needs an index other than Country (for example the default 0, 1, 2, ...), so move Country back into a column and join on it:
df.reset_index(level=0, inplace=True)
df
df.join(df1, on='Country')
That gives the merged result.
