Find difference between two data frames - python

I have two data frames df1 and df2, where df2 is a subset of df1. How do I get a new data frame (df3) which is the difference between the two data frames?
In other words, a data frame that has all the rows/columns of df1 that are not in df2?

By using drop_duplicates
pd.concat([df1,df2]).drop_duplicates(keep=False)
Update:
The above method only works for those data frames that don't already have duplicates themselves. For example:
df1=pd.DataFrame({'A':[1,2,3,3],'B':[2,3,4,4]})
df2=pd.DataFrame({'A':[1],'B':[2]})
It will produce the output below, which is wrong:
Wrong output:
pd.concat([df1, df2]).drop_duplicates(keep=False)
Out[655]:
A B
1 2 3
Correct output:
Out[656]:
A B
1 2 3
2 3 4
3 3 4
How to achieve that?
Method 1: Using isin with tuple
df1[~df1.apply(tuple,1).isin(df2.apply(tuple,1))]
Out[657]:
A B
1 2 3
2 3 4
3 3 4
Method 2: merge with indicator
df1.merge(df2, indicator=True, how='left').loc[lambda x: x['_merge'] != 'both']
Out[421]:
A B _merge
1 2 3 left_only
2 3 4 left_only
3 3 4 left_only
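If you want the result to look like df1 again, you can drop the helper column afterwards; a small follow-up sketch using the same frames:
(df1.merge(df2, indicator=True, how='left')
    .loc[lambda x: x['_merge'] != 'both']
    .drop(columns='_merge'))
#    A  B
# 1  2  3
# 2  3  4
# 3  3  4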

For rows, try this, where Name is the common join key column (it can be a list for multiple common columns, or you can specify left_on and right_on):
m = df1.merge(df2, on='Name', how='outer', suffixes=['', '_'], indicator=True)
The indicator=True setting is useful as it adds a column called _merge, with all changes between df1 and df2 categorized into three possible kinds: "left_only", "right_only", or "both".
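For example, the rows present on only one side can then be pulled out of m (a small sketch using the merge above):
only_in_df1 = m.loc[m['_merge'] == 'left_only']
only_in_df2 = m.loc[m['_merge'] == 'right_only']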
For columns, try this:
set(df1.columns).symmetric_difference(df2.columns)

Method 1 from the accepted answer will not work for data frames with NaNs inside, as np.nan != np.nan (pd.np was a since-removed alias for numpy). I am not sure if this is the best way, but it can be avoided by
df1[~df1.astype(str).apply(tuple, 1).isin(df2.astype(str).apply(tuple, 1))]
It's slower, because it needs to cast the data to string, but thanks to this casting the NaN values compare equal (as the string 'nan').
Let's go through the code. First we cast the values to string and apply the tuple function to each row.
df1.astype(str).apply(tuple, 1)
df2.astype(str).apply(tuple, 1)
Thanks to that, we get a pd.Series of tuples, where each tuple contains a whole row from df1/df2.
Then we apply the isin method on df1 to check if each tuple "is in" df2.
The result is a pd.Series of booleans: True if the tuple from df1 is in df2. In the end, we negate the result with the ~ sign and use it to filter df1. Long story short, we get only those rows of df1 that are not in df2.
To make it more readable, we may write it as:
df1_str_tuples = df1.astype(str).apply(tuple, 1)
df2_str_tuples = df2.astype(str).apply(tuple, 1)
df1_values_in_df2_filter = df1_str_tuples.isin(df2_str_tuples)
df1_values_not_in_df2 = df1[~df1_values_in_df2_filter]
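A quick check that the string cast neutralizes the NaN problem (toy frames; assumes import numpy as np):
df1 = pd.DataFrame({'A': [1, np.nan, 3], 'B': [2, np.nan, 4]})
df2 = pd.DataFrame({'A': [np.nan], 'B': [np.nan]})
df1[~df1.astype(str).apply(tuple, 1).isin(df2.astype(str).apply(tuple, 1))]
#      A    B
# 0  1.0  2.0
# 2  3.0  4.0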

import pandas as pd
# given
df1 = pd.DataFrame({'Name': ['John','Mike','Smith','Wale','Marry','Tom','Menda','Bolt','Yuswa'],
                    'Age': [23,45,12,34,27,44,28,39,40]})
df2 = pd.DataFrame({'Name': ['John','Smith','Wale','Tom','Menda','Yuswa'],
                    'Age': [23,12,34,44,28,40]})
# find elements in df1 that are not in df2 (note: Name and Age are tested
# independently, which works for this data but can over-match in general)
df_1notin2 = df1[~(df1['Name'].isin(df2['Name']) & df1['Age'].isin(df2['Age']))].reset_index(drop=True)
# output:
print('df1\n', df1)
print('df2\n', df2)
print('df_1notin2\n', df_1notin2)
# df1
# Age Name
# 0 23 John
# 1 45 Mike
# 2 12 Smith
# 3 34 Wale
# 4 27 Marry
# 5 44 Tom
# 6 28 Menda
# 7 39 Bolt
# 8 40 Yuswa
# df2
# Age Name
# 0 23 John
# 1 12 Smith
# 2 34 Wale
# 3 44 Tom
# 4 28 Menda
# 5 40 Yuswa
# df_1notin2
# Age Name
# 0 45 Mike
# 1 27 Marry
# 2 39 Bolt

Perhaps a simpler one-liner, which works with identical or different column names. It worked even when df2['Name2'] contained duplicate values.
newDf = (df1.set_index('Name1')
            .drop(df2['Name2'], errors='ignore')
            .reset_index(drop=False))

Edit 2: I figured out a new solution without the need of setting an index:
newdf = pd.concat([df1, df2]).drop_duplicates(keep=False)
Edit 1: It turns out the highest-voted answer already contains what I figured out. We can only use this code on the condition that there are no duplicates in either df.
Original answer: I have a tricky method. First we set 'Name' as the index of the two dataframes given by the question. Since we have the same 'Name' values in both dfs, we can just drop the 'smaller' df's index from the 'bigger' df.
Here is the code.
df1.set_index('Name',inplace=True)
df2.set_index('Name',inplace=True)
newdf=df1.drop(df2.index)

Pandas now offers a new API for computing a data frame diff: pandas.DataFrame.compare. The snippet below follows the example in its documentation:
df.compare(df2)
col1 col3
self other self other
0 a c NaN NaN
2 NaN NaN 3.0 4.0

In addition to the accepted answer, I would like to propose one more, wider solution that can find a 2D set difference of two dataframes with any index/columns (they might not coincide for both dataframes). The method also allows setting the tolerance for float elements of the comparison (it uses np.isclose).
import numpy as np
import pandas as pd

def get_dataframe_setdiff2d(df_new: pd.DataFrame,
                            df_old: pd.DataFrame,
                            rtol=1e-03, atol=1e-05) -> pd.DataFrame:
    """Returns set difference of two pandas DataFrames"""
    union_index = np.union1d(df_new.index, df_old.index)
    union_columns = np.union1d(df_new.columns, df_old.columns)

    new = df_new.reindex(index=union_index, columns=union_columns)
    old = df_old.reindex(index=union_index, columns=union_columns)

    mask_diff = ~np.isclose(new, old, rtol, atol)

    df_bool = pd.DataFrame(mask_diff, union_index, union_columns)

    df_diff = pd.concat([new[df_bool].stack(),
                         old[df_bool].stack()], axis=1)
    df_diff.columns = ["New", "Old"]

    return df_diff
Example:
In [1]
df1 = pd.DataFrame({'A':[2,1,2],'C':[2,1,2]})
df2 = pd.DataFrame({'A':[1,1],'B':[1,1]})
print("df1:\n", df1, "\n")
print("df2:\n", df2, "\n")
diff = get_dataframe_setdiff2d(df1, df2)
print("diff:\n", diff, "\n")
Out [1]
df1:
A C
0 2 2
1 1 1
2 2 2
df2:
A B
0 1 1
1 1 1
diff:
New Old
0 A 2.0 1.0
B NaN 1.0
C 2.0 NaN
1 B NaN 1.0
C 1.0 NaN
2 A 2.0 NaN
C 2.0 NaN
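Note that np.isclose only handles numeric data. If the frames can contain strings, one hedged alternative for the mask line is a plain pandas comparison that treats two NaNs in the same cell as equal:
# string-safe variant of the mask inside get_dataframe_setdiff2d
mask_diff = (new != old) & ~(new.isna() & old.isna())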

As mentioned here,
df1[~df1.apply(tuple,1).isin(df2.apply(tuple,1))]
is a correct solution, but it will produce wrong output if
df1=pd.DataFrame({'A':[1],'B':[2]})
df2=pd.DataFrame({'A':[1,2,3,3],'B':[2,3,4,4]})
In that case the above solution will give an empty DataFrame. Instead, you should use the concat method after removing duplicates from each dataframe.
Use concat with drop_duplicates:
df1=df1.drop_duplicates(keep="first")
df2=df2.drop_duplicates(keep="first")
pd.concat([df1,df2]).drop_duplicates(keep=False)
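For the frames above this yields the two rows unique to df2:
pd.concat([df1, df2]).drop_duplicates(keep=False)
#    A  B
# 1  2  3
# 2  3  4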

I had issues handling duplicates when there were duplicates on one side and at least one matching row on the other side, so I used collections.Counter to do a better diff, ensuring both sides have the same count. The result lists each differing row once; rows whose counts match on both sides are not returned at all.
from collections import Counter

import pandas as pd

def diff(df1, df2, on=None):
    """
    :param on: same as pandas.df.merge(on) (a list of columns)
    """
    on = on if on else df1.columns
    df1on = df1[on]
    df2on = df2[on]
    c1 = Counter(df1on.apply(tuple, 'columns'))
    c2 = Counter(df2on.apply(tuple, 'columns'))
    c1c2 = c1 - c2
    c2c1 = c2 - c1
    df1ondf2on = pd.DataFrame(list(c1c2.elements()), columns=on)
    df2ondf1on = pd.DataFrame(list(c2c1.elements()), columns=on)
    df1df2 = df1.merge(df1ondf2on).drop_duplicates(subset=on)
    df2df1 = df2.merge(df2ondf1on).drop_duplicates(subset=on)
    return pd.concat([df1df2, df2df1])
> df1 = pd.DataFrame({'a': [1, 1, 3, 4, 4]})
> df2 = pd.DataFrame({'a': [1, 2, 3, 4, 4]})
> diff(df1, df2)
a
0 1
0 2

There is a newer method in pandas, DataFrame.compare, that compares two dataframes and returns which values changed in each column for the given records.
Example
First Dataframe
Id Customer Status Date
1 ABC Good Mar 2023
2 BAC Good Feb 2024
3 CBA Bad Apr 2022
Second Dataframe
Id Customer Status Date
1 ABC Bad Mar 2023
2 BAC Good Feb 2024
5 CBA Good Apr 2024
Comparing Dataframes
print("Dataframe difference -- \n")
print(df1.compare(df2))
print("Dataframe difference keeping equal values -- \n")
print(df1.compare(df2, keep_equal=True))
print("Dataframe difference keeping same shape -- \n")
print(df1.compare(df2, keep_shape=True))
print("Dataframe difference keeping same shape and equal values -- \n")
print(df1.compare(df2, keep_shape=True, keep_equal=True))
Result
Dataframe difference --
Id Status Date
self other self other self other
0 NaN NaN Good Bad NaN NaN
2 3.0 5.0 Bad Good Apr 2022 Apr 2024
Dataframe difference keeping equal values --
Id Status Date
self other self other self other
0 1 1 Good Bad Mar 2023 Mar 2023
2 3 5 Bad Good Apr 2022 Apr 2024
Dataframe difference keeping same shape --
Id Customer Status Date
self other self other self other self other
0 NaN NaN NaN NaN Good Bad NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN
2 3.0 5.0 NaN NaN Bad Good Apr 2022 Apr 2024
Dataframe difference keeping same shape and equal values --
Id Customer Status Date
self other self other self other self other
0 1 1 ABC ABC Good Bad Mar 2023 Mar 2023
1 2 2 BAC BAC Good Good Feb 2024 Feb 2024
2 3 5 CBA CBA Bad Good Apr 2022 Apr 2024
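One caveat: compare raises a ValueError along the lines of "Can only compare identically-labeled DataFrame objects" when the two frames differ in shape or labels. A minimal sketch for aligning them first, assuming NaN-padded rows are acceptable:
a, b = df1.align(df2, join='outer')
print(a.compare(b))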

A slight variation of @liangli's nice solution that does not require changing the index of the existing dataframes (the rsuffix is needed because both frames carry an Age column):
newdf = df1.drop(df1.join(df2.set_index('Name'), on='Name', how='inner', rsuffix='_2').index)
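A quick check with the Name/Age frames used earlier in the thread:
print(newdf)
#     Name  Age
# 1   Mike   45
# 4  Marry   27
# 7   Bolt   39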

Finding the difference by index, assuming one dataframe is a subset of the other and that the indexes are carried forward when subsetting (recent pandas versions no longer accept a raw set as an indexer, hence the list() call):
df1.loc[list(set(df1.index).symmetric_difference(set(df2.index)))].dropna()
# Example
import numpy as np
df1 = pd.DataFrame({"gender": np.random.choice(['m','f'], size=5),
                    "subject": np.random.choice(["bio","phy","chem"], size=5)},
                   index=[1,2,3,4,5])
df2 = df1.loc[[1,3,5]]
df1
gender subject
1 f bio
2 m chem
3 f phy
4 m bio
5 f bio
df2
gender subject
1 f bio
3 f phy
5 f bio
df3 = df1.loc[list(set(df1.index).symmetric_difference(set(df2.index)))].dropna()
df3
gender subject
2 m chem
4 m bio

Defining our dataframes:
df1 = pd.DataFrame({
'Name':
['John','Mike','Smith','Wale','Marry','Tom','Menda','Bolt','Yuswa'],
'Age':
[23,45,12,34,27,44,28,39,40]
})
df2 = df1[df1.Name.isin(['John','Smith','Wale','Tom','Menda','Yuswa'])]
df1
Name Age
0 John 23
1 Mike 45
2 Smith 12
3 Wale 34
4 Marry 27
5 Tom 44
6 Menda 28
7 Bolt 39
8 Yuswa 40
df2
Name Age
0 John 23
2 Smith 12
3 Wale 34
5 Tom 44
6 Menda 28
8 Yuswa 40
The difference between the two would be:
df1[~df1.isin(df2)].dropna()
Name Age
1 Mike 45.0
4 Marry 27.0
7 Bolt 39.0
Where:
df1.isin(df2) marks the elements of df1 that also appear in df2. It aligns on index and columns, which works here because df2 keeps df1's index.
~ (element-wise logical NOT) in front of the expression negates the results, so we get the elements in df1 that are NOT in df2, i.e. the difference between the two.
.dropna() drops the rows with NaN, presenting the desired output.
Note: this only works if len(df1) >= len(df2). If df2 is longer than df1 you can reverse the expression: df2[~df2.isin(df1)].dropna()

I found the deepdiff library to be a wonderful tool that also extends well to dataframes when a different level of detail is required or ordering matters. You can experiment with diffing to_dict('records'), to_numpy(), and other exports:
import pandas as pd
from deepdiff import DeepDiff
df1 = pd.DataFrame({
'Name':
['John','Mike','Smith','Wale','Marry','Tom','Menda','Bolt','Yuswa'],
'Age':
[23,45,12,34,27,44,28,39,40]
})
df2 = df1[df1.Name.isin(['John','Smith','Wale','Tom','Menda','Yuswa'])]
DeepDiff(df1.to_dict(), df2.to_dict())
# {'dictionary_item_removed': [root['Name'][1], root['Name'][4], root['Name'][7], root['Age'][1], root['Age'][4], root['Age'][7]]}

Symmetric Difference
If you are interested in the rows that are only in one of the dataframes, but not in both, you are looking for the symmetric difference:
pd.concat([df1,df2]).drop_duplicates(keep=False)
⚠️ Only works if neither dataframe contains any duplicates.
Set Difference / Relational Algebra Difference
If you are interested in the relational algebra difference / set difference, i.e. df1-df2 or df1\df2:
pd.concat([df1,df2,df2]).drop_duplicates(keep=False)
⚠️ Only works if neither dataframe contains any duplicates.
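A quick sanity check of the df1-df2 variant (toy frames): rows of df2 appear at least twice in the concat, so keep=False always drops them, while rows appearing only in df1 survive.
df1 = pd.DataFrame({'A': [1, 2, 3]})
df2 = pd.DataFrame({'A': [1]})
pd.concat([df1, df2, df2]).drop_duplicates(keep=False)
#    A
# 1  2
# 2  3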

Another possible solution is to use numpy broadcasting:
df1[np.all(~np.all(df1.values == df2.values[:, None], axis=2), axis=0)]
Output:
Name Age
1 Mike 45
4 Marry 27
7 Bolt 39
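Note that this needs import numpy as np, identical column order in both frames, and that == treats NaN as unequal to NaN. If NaNs are possible, one hedged variant is to compare string-cast values, as in an earlier answer:
import numpy as np
a = df1.astype(str).values
b = df2.astype(str).values
df1[np.all(~np.all(a == b[:, None], axis=2), axis=0)]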

Using a lambda function you can filter the rows with _merge value "left_only" to get all the rows in df1 which are missing from df2:
df3 = df1.merge(df2, how='outer', indicator=True).loc[lambda x: x['_merge'] == 'left_only']

Try this one:
df_new = df1.merge(df2, how='outer', indicator=True).query('_merge == "left_only"').drop(columns='_merge')
It will produce a new dataframe with the differences: the rows that exist in df1 but not in df2.

Related

Pandas: Aggregating and transposing dataframe based on string

I have a dataframe that tracks the changes of an object which is identified by an id. Instead of each row representing a change of state, I want one row for each object, with all of the changes tracked in columns.
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'ID': ['1','2','3','1','2','1','4'],
                    'Original_Status': ['Admitted','Admitted','Admitted','Probation','LateAdmission','Admitted','Admitted'],
                    'New_Status': ['Probation','LateAdmission','Pass','Admitted','Pass','Pass','Fail']})
df2 = pd.DataFrame({'ID': ['1','2','3','4'],
                    'Original_Status_1': ['Admitted','Admitted','Admitted','Admitted'],
                    'New_Status_1': ['Probation','LateAdmission','Pass','Fail'],
                    'Original_Status_2': ['Probation','LateAdmission',np.nan,np.nan],
                    'New_Status_2': ['Admitted','Pass',np.nan,np.nan],
                    'Original_Status_3': ['Admitted',np.nan,np.nan,np.nan],
                    'New_Status_3': ['Pass',np.nan,np.nan,np.nan]})
ID Original_Status New_Status
0 1 Admitted Probation
1 2 Admitted LateAdmission
2 3 Admitted Pass
3 1 Probation Admitted
4 2 LateAdmission Pass
5 1 Admitted Pass
6 4 Admitted Fail
Original Dataframe
Change to:
ID Original_Status_1 New_Status_1 Original_Status_2 New_Status_2 Original_Status_3 New_Status_3
0 1 Admitted Probation Probation Admitted Admitted Pass
1 2 Admitted LateAdmission LateAdmission Pass NaN NaN
2 3 Admitted Pass NaN NaN NaN NaN
3 4 Admitted Fail NaN NaN NaN NaN
New Dataframe
I was able to achieve this outcome using loops, but I'd prefer a more succinct solution if possible.
This adds a column to df1 to count the occurrences of each 'ID', then uses pd.pivot to make the wide df with multi-index columns. The steps after the pivot flatten the column names and put them in the correct order.
df1['occurrence'] = df1.groupby('ID').cumcount()
df2 = df1.pivot(
    index='ID',
    values=['Original_Status','New_Status'],
    columns='occurrence',
)
df2.columns = [s+'_'+str(o+1) for s,o in df2.columns]
c_order = sorted(df2.columns, key = lambda c: c[-1]) #re-order the columns
df2 = df2[c_order]
df2

using pandas, extract data from long format df and add it to wide format df

I have two dataframes, df1 and df2. df1 has repeat observations arranged in wide format, and df2 in long format.
import pandas as pd
df1 = pd.DataFrame({"ID":[1,2,3],"colA_1":[1,2,3],"date1":["1.1.2001", "2.1.2001","3.1.2001"],"colA_2":[4,5,6],"date2":["1.1.2002", "2.1.2002","3.1.2002"]})
df2 = pd.DataFrame({"ID":[1,1,2,2,3,3],"col1":[1,1.5,2,2.5,3,3.5],"date":["1.1.2001", "1.1.2002","2.1.2001","2.1.2002","3.1.2001","3.1.2002"], "col3":[11,12,13,14,15,16],"col4":[21,22,23,24,25,26]})
df1 looks like:
ID colA_1 date1 colA_2 date2
0 1 1 1.1.2001 4 1.1.2002
1 2 2 2.1.2001 5 2.1.2002
2 3 3 3.1.2001 6 3.1.2002
df2 looks like:
ID col1 date col3 col4
0 1 1.0 1.1.2001 11 21
1 1 1.5 1.1.2002 12 22
2 2 2.0 2.1.2001 13 23
3 2 2.5 2.1.2002 14 24
4 3 3.0 3.1.2001 15 25
5 3 3.5 3.1.2002 16 26
6 3 4.0 4.1.2002 17 27
I want to take a given column from df2, "col3", and then:
(1) if the columns "ID" and "date" in df2 match with the columns "ID" and "date1" in df1, I want to put the value in a new column in df1 called "colB_1".
(2) else if the columns "ID" and "date" in df2 match with the columns "ID" and "date2" in df1, I want to put the value in a new column in df1 called "colB_2".
(3) else if the columns "ID" and "date" in df2 have no match with either ("ID" and "date1") or ("ID" and "date2"), I want to ignore these rows.
So, the output of this output dataframe, df3, should look like this:
ID colA_1 date1 colA_2 date2 colB_1 colB_2
0 1 1 1.1.2001 4 1.1.2002 11 12
1 2 2 2.1.2001 5 2.1.2002 13 14
2 3 3 3.1.2001 6 3.1.2002 15 16
What is the best way to do this?
I found this link, but the answer doesn't work for my case. I would like a really explicit way to specify column matching. I think it's possible that df.mask might be able to help me, but I am not sure how to implement it.
e.g.: the following code
df3 = df1.copy()
df3["colB_1"] = ""
df3["colB_2"] = ""
filter1 = (df1["ID"] == df2["ID"]) & (df1["date1"] == df2["date"])
filter2 = (df1["ID"] == df2["ID"]) & (df1["date2"] == df2["date"])
df3["colB_1"] = df.mask(filter1, other=df2["col3"])
df3["colB_2"] = df.mask(filter2, other=df2["col3"])
gives the error
ValueError: Can only compare identically-labeled Series objects
I asked this question previously, and it was marked as closed; my question was marked as a duplicate of this one. However, this is not the case. The answers in the linked question suggest the use of either map or df.merge. Map does not work with multiple conditions (in my case, ID and date). And df.merge (the answer given for matching multiple columns) does not work in my case when one of the column names in df1 and df2 that are to be merged are different ("date" and "date1", for example).
For example, the below code:
df3 = df1.merge(df2[["ID","date","col3"]], on=['ID','date1'], how='left')
fails with a Key Error.
Also noteworthy is that I will be dealing with many different files, with many different column naming schemes, and I will need a different subset each time. This is why I would like an answer that explicitly names the columns and conditions.
Any help with this would be much appreciated.
You can use pd.wide_to_long after removing the underscores; this will unpivot the dataframe, which you can then merge with df2 and pivot back using unstack:
m = df1.rename(columns=lambda x: x.replace('_', ''))
unpiv = pd.wide_to_long(m, ['colA', 'date'], 'ID', 'v').reset_index()
merge_piv = (unpiv.merge(df2[['ID', 'date', 'col3']], on=['ID', 'date'], how='left')
                  .set_index(['ID', 'v'])['col3'].unstack().add_prefix('colB_'))
final = df1.merge(merge_piv, left_on='ID', right_index=True)
ID colA_1 date1 colA_2 date2 colB_1 colB_2
0 1 1 1.1.2001 4 1.1.2002 11 12
1 2 2 2.1.2001 5 2.1.2002 13 14
2 3 3 3.1.2001 6 3.1.2002 15 16

Renaming columns in pandas dataframe error [duplicate]

I have the following data frames:
print(df_a)
mukey DI PI
0 100000 35 14
1 1000005 44 14
2 1000006 44 14
3 1000007 43 13
4 1000008 43 13
print(df_b)
mukey niccdcd
0 190236 4
1 190237 6
2 190238 7
3 190239 4
4 190240 7
When I try to join these data frames:
join_df = df_a.join(df_b, on='mukey', how='left')
I get the error:
*** ValueError: columns overlap but no suffix specified: Index([u'mukey'], dtype='object')
Why is this so? The data frames do have common 'mukey' values.
Your error on the snippet of data you posted is a little cryptic. Because both frames carry a 'mukey' column, the join requires you to supply a suffix for the left- and right-hand sides, and since there are no common index values, the joined columns come back as NaN:
In [173]:
df_a.join(df_b, on='mukey', how='left', lsuffix='_left', rsuffix='_right')
Out[173]:
mukey_left DI PI mukey_right niccdcd
index
0 100000 35 14 NaN NaN
1 1000005 44 14 NaN NaN
2 1000006 44 14 NaN NaN
3 1000007 43 13 NaN NaN
4 1000008 43 13 NaN NaN
merge works because it doesn't have this restriction:
In [176]:
df_a.merge(df_b, on='mukey', how='left')
Out[176]:
mukey DI PI niccdcd
0 100000 35 14 NaN
1 1000005 44 14 NaN
2 1000006 44 14 NaN
3 1000007 43 13 NaN
4 1000008 43 13 NaN
The .join() function uses the index of the dataset passed as an argument, so you should use set_index, or use the .merge function instead.
Please find the two examples that should work in your case:
join_df = LS_sgo.join(MSU_pi.set_index('mukey'), on='mukey', how='left')
or
join_df = df_a.merge(df_b, on='mukey', how='left')
This error indicates that the two tables share one or more column names.
The error message translates to: "I can see the same column in both tables, but you haven't told me to rename either one before bringing them into the same table."
You either want to delete one of the columns before bringing it in from the other, using del df['column name'], or use lsuffix to rename the original column, or rsuffix to rename the one being brought in.
df_a.join(df_b, on='mukey', how='left', lsuffix='_left', rsuffix='_right')
The error indicates that the two tables share one or more column names.
Anyone with the same error who doesn't want to provide a suffix can rename the columns instead. Also make sure the index of both DataFrames match in type and value if you don't want to provide the on='mukey' setting.
# rename example
df_a = df_a.rename(columns={'a_old': 'a_new', 'a2_old': 'a2_new'})
# set the index
df_a = df_a.set_index(['mukus'])
df_b = df_b.set_index(['mukus'])
df_a.join(df_b)
Mainly, join is used to join based on the index, not on the attribute names. So make the attribute names in the two dataframes different, then try to join; they will be joined, otherwise this error is raised.

NaNs after merging two dataframes

I have two dataframes like the following:
df1
id name
-------------------------
0 43 c
1 23 t
2 38 j
3 9 s
df2
user_id id
--------------------------------------------------
0 222087 27,26
1 1343649 6,47,17
2 404134 18,12,23,22,27,43,38,20,35,1
3 1110200 9,23,2,20,26,47,37
I want to split all the ids in df2 into multiple rows and join the resultant dataframe to df1 on "id".
I do the following:
b = pd.DataFrame(df2['id'].str.split(',').tolist(), index=df2.user_id).stack()
b = b.reset_index()[[0, 'user_id']] # var1 variable is currently labeled 0
b.columns = ['id', 'user_id']
When I try to merge, I get NaNs in the resultant dataframe.
pd.merge(b, df1, on = "id", how="left")
id user_id name
-------------------------------------
0 27 222087 NaN
1 26 222087 NaN
2 6 1343649 NaN
3 47 1343649 NaN
4 17 1343649 NaN
So, I tried doing the following:
b['name']=np.nan
for i in range(0, len(df1)):
    b['name'][(b['id'] == df1['id'][i])] = df1['name'][i]
It still gives the same result as above. I am confused as to what could cause this because I am sure both of them should work!
Any help would be much appreciated!
I read similar posts on SO but none seemed to have a concrete answer. I am also not sure whether this is related to coding at all.
Thanks in advance!
The problem is that you need to convert column id in df2 to int, because the output of string functions is always string, even when the values look numeric.
df2.id = df2.id.astype(int)
Another solution is to convert df1.id to string:
df1.id = df1.id.astype(str)
You get NaNs because there is no match: str values don't match int values.
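A minimal end-to-end sketch with the fix applied (frame contents trimmed from the question; Series.explode needs pandas >= 0.25):
import pandas as pd
df1 = pd.DataFrame({'id': [43, 23, 38, 9], 'name': ['c', 't', 'j', 's']})
df2 = pd.DataFrame({'user_id': [404134, 1110200],
                    'id': ['27,43,38', '9,23,37']})
b = df2.set_index('user_id')['id'].str.split(',').explode().reset_index()
b['id'] = b['id'].astype(int)   # the fix: match df1's int dtype before merging
print(b.merge(df1, on='id', how='left'))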

How to df.groupby(cols).apply(my_func) for some columns, while leave a few columns not tackled?

Suppose I have a Pandas dataframe df with columns a, b, c, d ... z. I want to df.groupby('a').apply(my_func) for columns d-z, while leaving columns 'b' and 'c' unchanged. How can I do that?
I notice Pandas can apply different functions to different columns by passing a dict. But I have a long column list and just want a parameter or trick to tell Pandas to bypass some columns and apply my_func() to the rest (otherwise I'd have to build a long dict).
One simple (and general) approach is to create a view of the dataframe with the subset you are interested in (or, stated for your case, a view with all columns except the ones you want to ignore), and then use apply on that view.
In [116]: df
Out[116]:
     a  b         c   d        f
0  one  3  0.493808  40      bob
1  two  8  0.150585  50    alice
2  one  6  0.641816  56  michael
3  two  5  0.935653  56      joe
4  one  1  0.521159  48    kate
Use your favorite method to create the view you need. You could select a range of columns like so: df_view = df.loc[:, 'b':'d'] (the .ix indexer used in the original answer has since been removed from pandas), but the following might be more useful for your scenario:
# I want all columns except two
cols = df.columns.tolist()
mycols = [x for x in cols if x not in ['a', 'f']]
df_view = df[mycols]
Apply your function to that view. (Note this doesn't yet change anything in df.)
In [158]: df_view.apply(lambda x: x / 2)
Out[158]:
b c d
0 1 0.246904 20
1 4 0.075293 25
2 3 0.320908 28
3 2 0.467827 28
4 0 0.260579 24
Update the df using update()
In [156]: df.update(df_view.apply(lambda x: x/2))
In [157]: df
Out[157]:
a b c d f
0 one 1 0.246904 20 bob
1 two 4 0.075293 25 alice
2 one 3 0.320908 28 michael
3 two 2 0.467827 28 joe
4 one 0 0.260579 24 kate
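On recent pandas versions the same idea can be written with groupby().transform(); a minimal sketch, with my_func and the column choices as placeholder assumptions:
import pandas as pd

df = pd.DataFrame({'a': ['one', 'two', 'one', 'two', 'one'],
                   'b': [3, 8, 6, 5, 1],
                   'c': [0.49, 0.15, 0.64, 0.93, 0.52],
                   'd': [40, 50, 56, 56, 48],
                   'f': ['bob', 'alice', 'michael', 'joe', 'kate']})

def my_func(g):
    return g / 2  # placeholder per-group transformation

# every column except the group key 'a' and the untouched column 'f'
cols = [c for c in df.columns if c not in ('a', 'f')]
df[cols] = df.groupby('a')[cols].transform(my_func)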
