Python: How to compare two data frames

I have two data frames:
df1
A1 B1
1 a
2 s
3 d
and
df2
A1 B1
1 a
2 x
3 d
I want to compare df1 and df2 on column B1; column A1 can be used to join. I want to know:
Which rows are different in df1 and df2 with respect to column B1?
Whether there is a mismatch in the values of column A1 (for example, whether df2 is missing some values that are present in df1, and vice versa). If so, which ones?
I tried using merge and join but that is not what I am looking for.

I've edited the raw data to illustrate the case of A1 keys in one dataframe but not the other.
When doing your merge, you want to specify an 'outer' merge so that you can see these items with an A1 key in one dataframe but not the other.
I've included the suffixes '_1' and '_2' to indicate the dataframe source (_1 = df1 and _2 = df2) of column B1.
df1 = pd.DataFrame({'A1': [1, 2, 3, 4], 'B1': ['a', 'b', 'c', 'd']})
df2 = pd.DataFrame({'A1': [1, 2, 3, 5], 'B1': ['a', 'd', 'c', 'e']})
df3 = df1.merge(df2, how='outer', on='A1', suffixes=['_1', '_2'])
df3['check'] = df3.B1_1 == df3.B1_2
>>> df3
A1 B1_1 B1_2 check
0 1 a a True
1 2 b d False
2 3 c c True
3 4 d NaN False
4 5 NaN e False
To check for missing A1 keys in df1 and df2:
# A1 value missing in `df1`
>>> df3[df3.B1_1.isnull()]
A1 B1_1 B1_2 check
4 5 NaN e False
# A1 value missing in `df2`
>>> df3[df3.B1_2.isnull()]
A1 B1_1 B1_2 check
3 4 d NaN False
EDIT
Thanks to @EdChum (the source of all Pandas knowledge...), the indicator parameter records directly which dataframe each A1 key came from:
df3 = df1.merge(df2, how='outer', on='A1', suffixes=['_1', '_2'], indicator=True)
df3['check'] = df3.B1_1 == df3.B1_2
>>> df3
A1 B1_1 B1_2 _merge check
0 1 a a both True
1 2 b d both False
2 3 c c both True
3 4 d NaN left_only False
4 5 NaN e right_only False
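To answer the two original questions directly from df3, you can filter on the check and _merge columns (a small sketch using the frame built above):
# rows whose B1 values differ (missing keys also show up here, since NaN never equals anything)
mismatches = df3[~df3['check']]
# keys present in only one of the two frames
only_in_df1 = df3[df3['_merge'] == 'left_only']
only_in_df2 = df3[df3['_merge'] == 'right_only']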


Replace values in pandas.DataFrame using MultiIndex [duplicate]

This question already has answers here:
Replacing Rows in Pandas DataFrame with Other DataFrame Based on Index
(3 answers)
What I want to do
I have two pandas.DataFrames, df1 and df2. Both have the same columns.
All indices in df2 are also found in df1, but there are some indices that only df1 has.
For rows whose index appears in both df1 and df2, use the rows of df2.
For rows whose index appears only in df1, use the rows of df1.
In short, "replace values of df1 with values of df2 based on the MultiIndex".
import pandas as pd
index_names = ['index1', 'index2']
columns = ['column1', 'column2']
data1 = [[1, 2], [2, 3], [3, 4], [4, 5], [5, 6]]
index1 = [['i1', 'i1', 'i1', 'i2', 'i2'], ['A', 'B', 'C', 'B', 'C']]
df1 = pd.DataFrame(data1, index=pd.MultiIndex.from_arrays(index1, names=index_names), columns=columns)
print(df1)
## OUTPUT
# column1 column2
#index1 index2
#i1 A 1 2
# B 2 3
# C 3 4
#i2 B 4 5
# C 5 6
data2 = [[11, 12], [12, 13]]
index2 = [['i2', 'i1'], ['C', 'C']]
df2 = pd.DataFrame(data2, index=pd.MultiIndex.from_arrays(index2, names=index_names), columns=columns)
print(df2)
## OUTPUT
# column1 column2
#index1 index2
#i2 C 11 12
#i1 C 12 13
## DO SOMETHING!
## EXPECTED OUTPUT
# column1 column2
#index1 index2
#i1 A 1 2
# B 2 3
# C 12 13 # REPLACED!
#i2 B 4 5
# C 11 12 # REPLACED!
Environment
Python 3.10.5
Pandas 1.4.3
You can use direct assignment via .loc or a call to .update:
>>> df3 = df1.copy()
>>> df3.update(df2)
>>> df3
column1 column2
index1 index2
i1 A 1.0 2.0
B 2.0 3.0
C 12.0 13.0
i2 B 4.0 5.0
C 11.0 12.0
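For completeness, the direct-assignment route mentioned above might look like the following (a minimal sketch; .loc aligns the assigned frame on both the MultiIndex and the columns):
>>> df3 = df1.copy()
>>> df3.loc[df2.index, :] = df2  # rows of df2 overwrite the matching rows of df3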
You can try pd.merge() on df2 and then fill the missing values from df1 with .fillna():
pd.merge(df2, df1, left_index=True, right_index=True, how='right', suffixes=('', '_df1')).fillna(df1).iloc[:, :2]
(This assumes index2 = [['i2', 'i1'], ['C', 'C']], as shown in the question above.)
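Another compact option, not shown above, is combine_first, which takes df2's values where its index has them and falls back to df1 elsewhere (a minimal sketch; note the result comes back as float here, just as with .update):
result = df2.combine_first(df1)  # df2 wins where the indexes overlap, df1 fills the rest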

Replace empty strings in a dataframe with values from another dataframe with a different index

two sample dataframes with different index values, but identical column names and order:
df1 = pd.DataFrame([[1, '', 3], ['', 2, '']], columns=['A', 'B', 'C'], index=[2,4])
df2 = pd.DataFrame([['', 4, ''], [5, '', 6]], columns=['A', 'B', 'C'], index=[7,9])
df1
   A  B  C
2  1     3
4     2
df2
   A  B  C
7     4
9  5     6
I know how to concat the two dataframes, but that gives this:
   A  B  C
2  1     3
4     2
omitting the non-matching indexes from the other df.
The result I am trying to achieve is:
A B C
0 1 4 3
1 5 2 6
I want to combine the rows with the same index values from each df, so that missing values in one df are replaced by the corresponding value in the other.
Concat and merge have not been up to the job, I have found.
I assume I need identical indexes in each df, corresponding to the rows I want to merge into one. But, so far, no luck getting it to come out correctly. Any pandas transformational wisdom is appreciated.
This merge attempt did not do the trick:
df1.merge(df2, on='A', how='outer')
The solutions below were all offered before I edited the question. My fault there: I neglected to point out that my actual data has different indexes in the two dataframes.
Let us try mask:
out = df1.mask(df1 == '', df2)
Out[428]:
A B C
0 1 4 3
1 5 2 6
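Note that mask aligns on the index, so with the differing indexes from the edited question you would first align the two frames positionally (a small sketch of that adjustment):
a = df1.reset_index(drop=True)
b = df2.reset_index(drop=True)
out = a.mask(a == '', b)  # take b's value wherever a holds an empty string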
for i in range(df1.shape[0]):
    for j in range(df1.shape[1]):
        if df1.iloc[i, j] == "":
            df1.iloc[i, j] = df2.iloc[i, j]
print(df1)
A B C
0 1 4 3
1 5 2 6
Since the indexes of your two dataframes are different, it's easier to give them the same index first.
import numpy as np
index = [i for i in range(len(df1))]
df1.index = index
df2.index = index
ddf = df1.replace('', np.nan).fillna(df2)
Even if df1 and df2 have different amounts of data, this still works.
df1 = pd.DataFrame([[1, '', 3], ['', 2, ''],[7,8,9],[10,11,12]], columns=['A', 'B', 'C'],index=[7,8,9,10])
index1 = [i for i in range(len(df1))]
index2 = [i for i in range(len(df2))]
df1.index = index1
df2.index = index2
df1.replace('',np.nan).fillna(df2)
You can get:
Out[17]:
A B C
0 1.0 5 3.0
1 4 2.0 6
2 7.0 8.0 9.0
3 10.0 11.0 12.0

How to join Pandas DataFrames based on list values in a column [duplicate]

This question already has answers here:
How to unnest (explode) a column in a pandas DataFrame, into multiple rows
(16 answers)
There are two Pandas DataFrames:
df_A = pd.DataFrame([['r1', ['a','b']], ['r2',['aabb','b']], ['r3', ['xyz']]], columns=['col1', 'col2'])
col1 col2
r1 [a, b]
r2 [aabb, b]
r3 [xyz]
df_B = pd.DataFrame([['a', 10], ['b',2]], columns=['C1', 'C2'])
C1 C2
a 10
b 2
I want to join both dataframes such that df_C is:
col1 C1 C2
r1 a 10
r1 b 2
r2 aabb 0
r2 b 2
r3 xyz 0
You need:
import numpy as np
df = pd.DataFrame([['r1', ['a','b']], ['r2', ['aabb','b']], ['r3', ['xyz']]], columns=['col1', 'col2'])
df = pd.DataFrame({'col1': np.repeat(df.col1.values, df.col2.str.len()),
                   'C1': np.concatenate(df.col2.values)})
df_B = pd.DataFrame([['a', 10], ['b', 2]], columns=['C1', 'C2'])
df_B = dict(zip(df_B.C1, df_B.C2))
# {'a': 10, 'b': 2}
df['C2'] = df['C1'].apply(lambda x: df_B[x] if x in df_B else 0)
print(df)
Output:
col1 C1 C2
0 r1 a 10
1 r1 b 2
2 r2 aabb 0
3 r2 b 2
4 r3 xyz 0
Edit
The code below gives you the length of the list in each row.
print(df.col2.str.len())
# 0 2
# 1 2
# 2 1
np.repeat repeats the values from col1 based on the lengths obtained above;
e.g. r1 and r2 are each repeated twice.
print(np.repeat(df.col1.values, df.col2.str.len()))
# ['r1' 'r1' 'r2' 'r2' 'r3']
Using np.concatenate on col2.values flattens the lists into a plain 1D array.
print(np.concatenate(df.col2.values))
# ['a' 'b' 'aabb' 'b' 'xyz']
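As the duplicate link at the top suggests, newer pandas (0.25+) can do the reshape in one step with DataFrame.explode. A minimal sketch, reusing df_A and df_B from the question:
df_C = (df_A.explode('col2')
            .rename(columns={'col2': 'C1'})
            .merge(df_B, on='C1', how='left')
            .fillna({'C2': 0}))
# one row per list element; C1 values with no match in df_B get C2 = 0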

How can I find the "set difference" of rows in two dataframes on a subset of columns in Pandas?

I have two dataframes, say df1 and df2, with the same column names.
Example:
df1
C1 | C2 | C3 | C4
A 1 2 AA
B 1 3 A
A 3 2 B
df2
C1 | C2 | C3 | C4
A 1 3 E
B 1 2 C
Q 4 1 Z
I would like to filter out rows in df1 based on common values in a fixed subset of columns between df1 and df2. In the above example, if the columns are C1 and C2, I would like the first two rows to be filtered out, as their values in both df1 and df2 for these columns are identical.
What would be a clean way to do this in Pandas?
So far, based on this answer, I have been able to find the common rows.
common_df = pandas.merge(df1, df2, how='inner', on=['C1','C2'])
This gives me a new dataframe with only those rows that have common values in the specified columns, i.e., the intersection.
I have also seen this thread, but the answers all seem to assume a difference on all the columns.
The expected result for the above example (rows common on specified columns removed):
C1 | C2 | C3 | C4
A 3 2 B
Maybe not the cleanest, but you could add a key column to df1 to check against.
Setting up the datasets
import pandas as pd
df1 = pd.DataFrame({'C1': ['A', 'B', 'A'],
                    'C2': [1, 1, 3],
                    'C3': [2, 3, 2],
                    'C4': ['AA', 'A', 'B']})
df2 = pd.DataFrame({'C1': ['A', 'B', 'Q'],
                    'C2': [1, 1, 4],
                    'C3': [3, 2, 1],
                    'C4': ['E', 'C', 'Z']})
Adding a key, then using your code to find the common rows:
df1['key'] = range(1, len(df1) + 1)
common_df = pd.merge(df1, df2, how='inner', on=['C1','C2'])
df_filter = df1[~df1['key'].isin(common_df['key'])].drop('key', axis=1)
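This leaves only the row not shared on C1 and C2, matching the expected result above:
print(df_filter)
  C1  C2  C3 C4
2  A   3   2  B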
You can use an anti-join method: do an outer join on the specified columns, with an indicator recording which side each row came from. The only downside is that you have to rename and drop the extra columns after the join.
>>> import pandas as pd
>>> df1 = pd.DataFrame({'C1':['A','B','A'],'C2':[1,1,3],'C3':[2,3,2],'C4':['AA','A','B']})
>>> df2 = pd.DataFrame({'C1':['A','B','Q'],'C2':[1,1,4],'C3':[3,2,1],'C4':['E','C','Z']})
>>> df_merged = df1.merge(df2, on=['C1','C2'], indicator=True, how='outer')
>>> df_merged
C1 C2 C3_x C4_x C3_y C4_y _merge
0 A 1 2.0 AA 3.0 E both
1 B 1 3.0 A 2.0 C both
2 A 3 2.0 B NaN NaN left_only
3 Q 4 NaN NaN 1.0 Z right_only
>>> df1_setdiff = df_merged[df_merged['_merge'] == 'left_only'].rename(columns={'C3_x': 'C3', 'C4_x': 'C4'}).drop(['C3_y', 'C4_y', '_merge'], axis=1)
>>> df1_setdiff
C1 C2 C3 C4
2 A 3 2.0 B
>>> df2_setdiff = df_merged[df_merged['_merge'] == 'right_only'].rename(columns={'C3_y': 'C3', 'C4_y': 'C4'}).drop(['C3_x', 'C4_x', '_merge'], axis=1)
>>> df2_setdiff
C1 C2 C3 C4
3 Q 4 1.0 Z
An alternative with isin (a caveat: isin tests each column independently, so a row whose C1 and C2 values match different rows of common could be filtered out by mistake; it happens to work on this data):
import pandas as pd
df1 = pd.DataFrame({'C1': ['A', 'B', 'A'], 'C2': [1, 1, 3], 'C3': [2, 3, 2], 'C4': ['AA', 'A', 'B']})
df2 = pd.DataFrame({'C1': ['A', 'B', 'Q'], 'C2': [1, 1, 4], 'C3': [3, 2, 1], 'C4': ['E', 'C', 'Z']})
common = pd.merge(df1, df2, on=['C1', 'C2'])
R1 = df1[~((df1.C1.isin(common.C1)) & (df1.C2.isin(common.C2)))]
R2 = df2[~((df2.C1.isin(common.C1)) & (df2.C2.isin(common.C2)))]
df1:
C1 C2 C3 C4
0 A 1 2 AA
1 B 1 3 A
2 A 3 2 B
df2:
C1 C2 C3 C4
0 A 1 3 E
1 B 1 2 C
2 Q 4 1 Z
common:
C1 C2 C3_x C4_x C3_y C4_y
0 A 1 2 AA 3 E
1 B 1 3 A 2 C
R1:
C1 C2 C3 C4
2 A 3 2 B
R2:
C1 C2 C3 C4
2 Q 4 1 Z
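To test the (C1, C2) pairs jointly rather than per column, one could instead build a MultiIndex from the key columns (a small sketch, assuming pandas 0.24+ for MultiIndex.from_frame):
idx1 = pd.MultiIndex.from_frame(df1[['C1', 'C2']])
idx2 = pd.MultiIndex.from_frame(df2[['C1', 'C2']])
R1 = df1[~idx1.isin(idx2)]  # rows of df1 whose (C1, C2) pair never appears in df2
R2 = df2[~idx2.isin(idx1)]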

How to keep column MultiIndex values when merging pandas DataFrames

I have two pandas DataFrames, as below:
df1 = pd.DataFrame({('Q1', 'SubQ1'):[1, 2, 3], ('Q1', 'SubQ2'):[1, 2, 3], ('Q2', 'SubQ1'):[1, 2, 3]})
df1['ID'] = ['a', 'b', 'c']
df2 = pd.DataFrame({'item_id': ['a', 'b', 'c'], 'url':['a.com', 'blah.com', 'company.com']})
df1:
Q1 Q2 ID
SubQ1 SubQ2 SubQ1
0 1 1 1 a
1 2 2 2 b
2 3 3 3 c
df2:
item_id url
0 a a.com
1 b blah.com
2 c company.com
Note that df1 has some columns with hierarchical indexing (e.g. ('Q1', 'SubQ1')) and some with just normal indexing (e.g. ID).
I want to merge these two data frames on the ID and item_id fields. Using:
result = pd.merge(df1, df2, left_on='ID', right_on='item_id')
gives:
(Q1, SubQ1) (Q1, SubQ2) (Q2, SubQ1) (ID, ) item_id url
0 1 1 1 a a a.com
1 2 2 2 b b blah.com
2 3 3 3 c c company.com
As you can see, the merge itself works fine, but the MultiIndex has been lost and has reverted to tuples. I've tried to recreate the MultiIndex by using pd.MultiIndex.from_tuples, as in:
result.columns = pd.MultiIndex.from_tuples(result)
but this causes problems with the item_id and url columns, taking just the first two characters of their names:
Q1 Q2 ID i u
SubQ1 SubQ2 SubQ1 t r
0 1 1 1 a a a.com
1 2 2 2 b b blah.com
2 3 3 3 c c company.com
Converting the columns in df2 to be one-element tuples (ie. ('item_id',) rather than just 'item_id') makes no difference.
How can I merge these two DataFrames and keep the MultiIndex properly? Or alternatively, how can I take the result of the merge and get back to columns with a proper MultiIndex without mucking up the names of the item_id and url columns?
If you can't beat 'em, join 'em. (Make both DataFrames have the same number of index levels before merging):
import pandas as pd
df1 = pd.DataFrame({('Q1', 'SubQ1'):[1, 2, 3], ('Q1', 'SubQ2'):[1, 2, 3], ('Q2', 'SubQ1'):[1, 2, 3]})
df1['ID'] = ['a', 'b', 'c']
df2 = pd.DataFrame({'item_id': ['a', 'b', 'c'], 'url':['a.com', 'blah.com', 'company.com']})
df2.columns = pd.MultiIndex.from_product([df2.columns, ['']])
result = pd.merge(df1, df2, left_on='ID', right_on='item_id')
print(result)
yields
Q1 Q2 ID item_id url
SubQ1 SubQ2 SubQ1
0 1 1 1 a a a.com
1 2 2 2 b b blah.com
2 3 3 3 c c company.com
This also avoids the UserWarning:
pandas/core/reshape/merge.py:551: UserWarning: merging between different levels can give an unintended result (2 levels on the left, 1 on the right)
The column for ID is not "non-hierarchical". It is signified by ('ID', ''). However, pandas allows you to reference just the first level of the columns in a way that looks like a single-level column structure: df1['ID'] works, as do df1[('ID', '')] and df1.loc[:, ('ID', '')]. But if the top level 'ID' happened to have more columns associated with it in the second level, df1['ID'] would return a dataframe. I feel more comfortable with this solution, which looks a lot like @JohnGalt's answer in the comments.
df1.assign(u=df1[('ID', '')].map(df2.set_index('item_id').url))
Q1 Q2 ID u
SubQ1 SubQ2 SubQ1
0 1 1 1 a a.com
1 2 2 2 b blah.com
2 3 3 3 c company.com
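If you would rather the new column be called url instead of u, a rename afterwards might do it (a small sketch; with a dict, rename applies to matching labels on any column level):
result = df1.assign(u=df1[('ID', '')].map(df2.set_index('item_id').url))
result = result.rename(columns={'u': 'url'})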
Joining a dataframe with single-level columns onto one with multi-level columns is difficult, so I artificially add another level:
def rnm(d):
    d = d.copy()
    d.columns = [d.columns, [''] * len(d.columns)]
    return d
df1.join(rnm(df2.set_index('item_id')), on=('ID',))
Q1 Q2 ID url
SubQ1 SubQ2 SubQ1
0 1 1 1 a a.com
1 2 2 2 b blah.com
2 3 3 3 c company.com
This solution is more flexible in the sense that you won't have to insert column levels before concat; you can use it to concat any number of levels:
import pandas as pd
df1 = pd.DataFrame({('A', 'b'): [1, 2], ('A', 'c'): [3, 4]})
df2 = pd.DataFrame({'Zaa': [1, 2]})
df3 = pd.DataFrame({('Maaa', 'k', 'l'): [1, 2]})
df = pd.concat([df1, df2, df3], axis=1)
cols = [col if isinstance(col, tuple) else (col, ) for col in df.columns]
df.columns = pd.MultiIndex.from_tuples(cols)
