Merge two different dataframes on different column names [duplicate] - python

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 4 years ago.
I have two dataframes:
import pandas as pd

df1 = pd.DataFrame({'A': ['A1', 'A1', 'A2', 'A3'],
                    'B': ['121', '345', '123', '146'],
                    'C': ['K0', 'K1', 'K0', 'K1']})
df2 = pd.DataFrame({'A': ['A1', 'A3'],
                    'BB': ['B0', 'B3'],
                    'CC': ['121', '345'],
                    'DD': ['D0', 'D1']})
Now I need to get the matching rows based on columns A and B from df1 and columns A and CC from df2.
And so I tried the possible merge options, such as:
both_DFS = pd.merge(df1, df2, how='left', left_on=['A','B'], right_on=['A','CC'])
and this does not give me the row information from df2, which is what I need. That is, I get all the column names from df2, but the rows are just empty or NaN.
And then I tried:
Both_DFs=pd.merge(df1,df2, how='left',left_on=['A','B'],right_on=['A','CC'])[['A','B','CC']]
And this gives me an error:
KeyError: "['B'] not in index"
I am aiming to have a merged DataFrame with all columns from both df1 and df2. Any suggestions would be great.
Desired output:
Both_DFs
A B C BB CC DD
0 A1 121 K0 B0 121 D0
So in my data frames (df1 and df2), only one row has an exact match on both columns of interest. That is, columns A and B from df1 have only one row that matches exactly the columns A and CC in df2.

Well, if you declare column A as the index, it works:
Both_DFs = pd.merge(df1.set_index('A'), df2.set_index('A'), how='left', left_index=True, right_index=True).dropna().reset_index()
This results in:
    A    B   C  BB   CC  DD
0  A1  121  K0  B0  121  D0
1  A1  345  K1  B0  121  D0
2  A3  146  K1  B3  345  D1
EDIT
You just needed:
Both_DFs = pd.merge(df1,df2, how='left',left_on=['A','B'],right_on=['A','CC']).dropna()
Which gives:
A B C BB CC DD
0 A1 121 K0 B0 121 D0

You can also use join (a left join by default) or merge, and then, if necessary, remove the rows with NaN by dropna:
print (df1.join(df2.set_index('A'), on='A').dropna())
    A    B   C  BB   CC  DD
0  A1  121  K0  B0  121  D0
1  A1  345  K1  B0  121  D0
3  A3  146  K1  B3  345  D1
print (pd.merge(df1, df2, on='A', how='left').dropna())
    A    B   C  BB   CC  DD
0  A1  121  K0  B0  121  D0
1  A1  345  K1  B0  121  D0
3  A3  146  K1  B3  345  D1
EDIT:
I think you need an inner join (the default, so how='inner' can be omitted):
Both_DFs = pd.merge(df1,df2, left_on=['A','B'],right_on=['A','CC'])
print (Both_DFs)
A B C BB CC DD
0 A1 121 K0 B0 121 D0

I don't know if your example shows exactly your problem, but:
If we try to merge on two keys, both keys have to match at the same time:
df1['A'] == df2['A'] and df1['B'] == df2['CC']
Here only one row (A = A1, B = 121) matches both keys.
If we merge just on column A, we get something like this:
Both_DFs=pd.merge(df1, df2, how='left', left_on=['A'], right_on=['A'])
    A    B   C   BB   CC   DD
0  A1  121  K0   B0  121   D0
1  A1  345  K1   B0  121   D0
2  A2  123  K0  NaN  NaN  NaN
3  A3  146  K1   B3  345   D1
If you want to remove the rows that are not in df2, change the 'how' method to 'inner':
Both_DFs = pd.merge(df1, df2, how='inner', left_on=['A'], right_on=['A'])
    A    B   C  BB   CC  DD
0  A1  121  K0  B0  121  D0
1  A1  345  K1  B0  121  D0
2  A3  146  K1  B3  345  D1
Does this approach do what you're looking for?

Related

Understanding the FutureWarning on using join_axes when concatenating with Pandas

I have two DataFrames:
df1:
A B C
1 A1 B1 C1
2 A2 B2 C2
df2:
B C D
3 B3 C3 D3
4 B4 C4 D4
Columns B and C are identical for both.
I'd like to concatenate them vertically and keep the columns of the first DataFrame:
pd.concat([df1, df2], join_axes=[df1.columns]):
A B C
1 A1 B1 C1
2 A2 B2 C2
3 NaN B3 C3
4 NaN B4 C4
This works, but raises a
FutureWarning: The join_axes-keyword is deprecated. Use .reindex or .reindex_like on the result to achieve the same functionality.
I couldn't find (either in the documentation or through Google) how to "Use .reindex or .reindex_like on the result to achieve the same functionality".
Colab notebook illustrating issue: https://colab.research.google.com/drive/13EBq2z0Nh05JY7ovrdnLGtfeqdKVvZq0
Just as the warning says, add reindex:
pd.concat([df1,df2.reindex(columns=df1.columns)])
Out[286]:
A B C
1 A1 B1 C1
2 A2 B2 C2
3 NaN B3 C3
4 NaN B4 C4
df1 = pd.DataFrame({'A': ['A1', 'A2'], 'B': ['B1', 'B2'], 'C': ['C1', 'C2']}, index=[1, 2])
df2 = pd.DataFrame({'B': ['B3', 'B4'], 'C': ['C3', 'C4'], 'D': ['D1', 'D2']}, index=[3, 4])
pd.concat([df1, df2], sort=False)[df1.columns]
yields the desired result.
OR...
pd.concat([df1, df2], sort=False).reindex(df1.columns, axis=1)
Output:
A B C
1 A1 B1 C1
2 A2 B2 C2
3 NaN B3 C3
4 NaN B4 C4

Concat two dataframes with common columns [duplicate]

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 3 years ago.
I have two dataframes with the same columns; only one column has different values. I want to concatenate the two without duplicating the common columns.
df2 = pd.DataFrame({'key': ['K0', 'K1', 'K2'],'cat': ['C0', 'C1', 'C2'],'B': ['B0', 'B1', 'B2']})
df1 = pd.DataFrame({'key': ['K0', 'K1', 'K2'],'cat': ['C0', 'C1', 'C2'],'B': ['A0', 'A1', 'A2']})
df1
Out[630]:
key cat B
0 K0 C0 A0
1 K1 C1 A1
2 K2 C2 A2
df2
Out[631]:
key cat B
0 K0 C0 B0
1 K1 C1 B1
2 K2 C2 B2
I tried:
result = pd.concat([df1, df2], axis=1)
result
Out[633]:
key cat B key cat B
0 K0 C0 A0 K0 C0 B0
1 K1 C1 A1 K1 C1 B1
2 K2 C2 A2 K2 C2 B2
The desired output:
key cat B_df1 B_df2
0 K0 C0 A0 B0
1 K1 C1 A1 B1
2 K2 C2 A2 B2
NOTE: I could drop duplicates afterwards and rename the columns, but that doesn't seem efficient.
pd.merge will do the job:
pd.merge(df1,df2, on=['key','cat'])
Output
key cat B_x B_y
0 K0 C0 A0 B0
1 K1 C1 A1 B1
2 K2 C2 A2 B2
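To get the exact B_df1 and B_df2 column names from the desired output, merge also accepts a suffixes argument; a minimal sketch (the suffix strings here are just a choice, not required names):
pd.merge(df1, df2, on=['key', 'cat'], suffixes=('_df1', '_df2'))
  key cat B_df1 B_df2
0  K0  C0    A0    B0
1  K1  C1    A1    B1
2  K2  C2    A2    B2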

Show common values in a column once in pandas

I have a dataframe that looks like this:
df = pd.DataFrame({'key': ['K0', 'K0', 'K0', 'K1'],'cat': ['C0', 'C0', 'C1', 'C1'],'B': ['A0', 'A1', 'A2', 'A3']})
df
Out[15]:
key cat B
0 K0 C0 A0
1 K0 C0 A1
2 K0 C1 A2
3 K1 C1 A3
Is it possible to convert it to:
  key cat   B
0  K0  C0  A0
1          A1
2  K0  C1  A2
3  K1  C1  A3
I want to avoid showing the same value of key & cat again and again; key should reappear once cat changes.
It's for Excel purposes, so I need it to be compatible with:
style.apply(f)
to_excel()
You can use duplicated over a subset of the columns to look for duplicate values:
cols = ['key', 'cat']
df.loc[df.duplicated(subset=cols), cols] = ''
  key cat   B
0  K0  C0  A0
1          A1
2  K0  C1  A2
3  K1  C1  A3
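Since the result is still a plain DataFrame, it stays compatible with style.apply and to_excel. A minimal sketch, assuming an Excel engine such as openpyxl is installed; the bold-when-shown rule and the file name are just placeholder choices:
import pandas as pd

df = pd.DataFrame({'key': ['K0', 'K0', 'K0', 'K1'],
                   'cat': ['C0', 'C0', 'C1', 'C1'],
                   'B': ['A0', 'A1', 'A2', 'A3']})
cols = ['key', 'cat']
# blank out the repeated key/cat values
df.loc[df.duplicated(subset=cols), cols] = ''

# still a regular DataFrame, so styling and Excel export work as usual
styled = df.style.apply(
    lambda col: ['font-weight: bold' if v != '' else '' for v in col],
    subset=cols)
styled.to_excel('report.xlsx', index=False)  # hypothetical file name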

How can I find the "set difference" of rows in two dataframes on a subset of columns in Pandas?

I have two dataframes, say df1 and df2, with the same column names.
Example:
df1
C1 | C2 | C3 | C4
A 1 2 AA
B 1 3 A
A 3 2 B
df2
C1 | C2 | C3 | C4
A 1 3 E
B 1 2 C
Q 4 1 Z
I would like to filter out rows in df1 based on common values in a fixed subset of columns between df1 and df2. In the above example, if the columns are C1 and C2, I would like the first two rows to be filtered out, as their values in both df1 and df2 for these columns are identical.
What would be a clean way to do this in Pandas?
So far, based on this answer, I have been able to find the common rows.
common_df = pandas.merge(df1, df2, how='inner', on=['C1','C2'])
This gives me a new dataframe with only those rows that have common values in the specified columns, i.e., the intersection.
I have also seen this thread, but the answers all seem to assume a difference on all the columns.
The expected result for the above example (rows common on specified columns removed):
C1 | C2 | C3 | C4
A 3 2 B
Maybe not the cleanest, but you could add a key column to df1 to check against.
Setting up the datasets
import pandas as pd

df1 = pd.DataFrame({'C1': ['A', 'B', 'A'],
                    'C2': [1, 1, 3],
                    'C3': [2, 3, 2],
                    'C4': ['AA', 'A', 'B']})
df2 = pd.DataFrame({'C1': ['A', 'B', 'Q'],
                    'C2': [1, 1, 4],
                    'C3': [3, 2, 1],
                    'C4': ['E', 'C', 'Z']})
Adding a key and using your code to find the common rows:
df1['key'] = range(1, len(df1) + 1)
common_df = pd.merge(df1, df2, how='inner', on=['C1','C2'])
df_filter = df1[~df1['key'].isin(common_df['key'])].drop('key', axis=1)
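With the example frames above, this leaves only the row of df1 that has no (C1, C2) match in df2:
print(df_filter)
  C1  C2  C3 C4
2  A   3   2  B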
You can use an anti-join method where you do an outer join on the specified columns while recording the source of each row with the indicator argument. The only downside is that you have to rename and drop the extra columns after the join.
>>> import pandas as pd
>>> df1 = pd.DataFrame({'C1':['A','B','A'],'C2':[1,1,3],'C3':[2,3,2],'C4':['AA','A','B']})
>>> df2 = pd.DataFrame({'C1':['A','B','Q'],'C2':[1,1,4],'C3':[3,2,1],'C4':['E','C','Z']})
>>> df_merged = df1.merge(df2, on=['C1','C2'], indicator=True, how='outer')
>>> df_merged
C1 C2 C3_x C4_x C3_y C4_y _merge
0 A 1 2.0 AA 3.0 E both
1 B 1 3.0 A 2.0 C both
2 A 3 2.0 B NaN NaN left_only
3 Q 4 NaN NaN 1.0 Z right_only
>>> df1_setdiff = df_merged[df_merged['_merge'] == 'left_only'].rename(columns={'C3_x': 'C3', 'C4_x': 'C4'}).drop(['C3_y', 'C4_y', '_merge'], axis=1)
>>> df1_setdiff
C1 C2 C3 C4
2 A 3 2.0 B
>>> df2_setdiff = df_merged[df_merged['_merge'] == 'right_only'].rename(columns={'C3_y': 'C3', 'C4_y': 'C4'}).drop(['C3_x', 'C4_x', '_merge'], axis=1)
>>> df2_setdiff
C1 C2 C3 C4
3 Q 4 1.0 Z
import pandas as pd
df1 = pd.DataFrame({'C1':['A','B','A'],'C2':[1,1,3],'C3':[2,3,2],'C4':['AA','A','B']})
df2 = pd.DataFrame({'C1':['A','B','Q'],'C2':[1,1,4],'C3':[3,2,1],'C4':['E','C','Z']})
common = pd.merge(df1, df2,on=['C1','C2'])
R1 = df1[~((df1.C1.isin(common.C1))&(df1.C2.isin(common.C2)))]
R2 = df2[~((df2.C1.isin(common.C1))&(df2.C2.isin(common.C2)))]
df1:
C1 C2 C3 C4
0 A 1 2 AA
1 B 1 3 A
2 A 3 2 B
df2:
C1 C2 C3 C4
0 A 1 3 E
1 B 1 2 C
2 Q 4 1 Z
common:
C1 C2 C3_x C4_x C3_y C4_y
0 A 1 2 AA 3 E
1 B 1 3 A 2 C
R1:
C1 C2 C3 C4
2 A 3 2 B
R2:
C1 C2 C3 C4
2 Q 4 1 Z
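A sketch of a more robust variant of the isin filter above: isin checks each column independently, which can over-match when a C1 value and a C2 value both appear in common but never in the same row. Merging on both key columns at once avoids that (drop_duplicates guards against row multiplication):
merged = df1.merge(df2[['C1', 'C2']].drop_duplicates(), on=['C1', 'C2'], how='left', indicator=True)
R1 = merged[merged['_merge'] == 'left_only'].drop('_merge', axis=1)
R1:
  C1  C2  C3 C4
2  A   3   2  B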

Pandas: How to expand data frame rows containing a dictionary with varying keys in a column?

I'm a little stuck; can you please help me with this? I've simplified the problem I'm facing to the following:
Input
    a   b                                            c
0  a0  b0                 {'c00': 'v00', 'c01': 'v01'}
1  a1  b1                               {'c10': 'v10'}
2  a2  b2  {'c20': 'v20', 'c21': 'v21', 'c22': 'v22'}
Desired Output
    a   b  key  val
0  a0  b0  c00  v00
1  a0  b0  c01  v01
2  a1  b1  c10  v10
3  a2  b2  c20  v20
4  a2  b2  c21  v21
5  a2  b2  c22  v22
I know how to handle the case where the dictionaries in column c all have the same keys.
You can create a DataFrame with the constructor, reshape by stack and finally join back to the original:
df1 = (pd.DataFrame(df.c.values.tolist())
         .stack()
         .reset_index(level=1)
         .rename(columns={0: 'val', 'level_1': 'key'}))
print (df1)
key val
0 c00 v00
0 c01 v01
1 c10 v10
2 c20 v20
2 c21 v21
2 c22 v22
df = df.drop('c', axis=1).join(df1).reset_index(drop=True)
print (df)
a b key val
0 a0 b0 c00 v00
1 a0 b0 c01 v01
2 a1 b1 c10 v10
3 a2 b2 c20 v20
4 a2 b2 c21 v21
5 a2 b2 c22 v22
Here is one way:
import pandas as pd
import numpy as np
from itertools import chain

df = pd.DataFrame([['a0', 'b0', {'c00': 'v00', 'c01': 'v01'}],
                   ['a1', 'b1', {'c10': 'v10'}],
                   ['a2', 'b2', {'c20': 'v20', 'c21': 'v21', 'c22': 'v22'}]],
                  columns=['a', 'b', 'c'])
# first convert 'c' to list of tuples
df['c'] = df['c'].apply(lambda x: list(x.items()))
lens = list(map(len, df['c']))
# create dataframe by repeating a and b to match the number of items in each dict
df_out = pd.DataFrame({'a': np.repeat(df['a'].values, lens),
                       'b': np.repeat(df['b'].values, lens),
                       'c': list(chain.from_iterable(df['c'].values))})
# unpack tuple
df_out = (df_out.join(df_out['c'].apply(pd.Series))
                .rename(columns={0: 'key', 1: 'val'})
                .drop('c', axis=1))
# a b key val
# 0 a0 b0 c00 v00
# 1 a0 b0 c01 v01
# 2 a1 b1 c10 v10
# 3 a2 b2 c20 v20
# 4 a2 b2 c21 v21
# 5 a2 b2 c22 v22
My solution is as follows:
import pandas as pd

t = pd.DataFrame([['a0', 'b0', {'c00': 'v00', 'c01': 'v01'}],
                  ['a1', 'b1', {'c10': 'v10'}],
                  ['a2', 'b2', {'c20': 'v20', 'c21': 'v21', 'c22': 'v22'}]],
                 columns=['a', 'b', 'c'])
l2 = []
for i in t.index:
    for j in t.loc[i, 'c']:
        l2 += [[t.loc[i, 'a'], t.loc[i, 'b'], j, t.loc[i, 'c'][j]]]
t2 = pd.DataFrame(l2, columns=['a', 'b', 'key', 'val'])
where t is your original DataFrame, however you obtained it.
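A more compact sketch of the same row-wise expansion, using itertuples instead of explicit .loc lookups (the names are taken from the example above):
rows = [(r.a, r.b, k, v) for r in t.itertuples() for k, v in r.c.items()]
t2 = pd.DataFrame(rows, columns=['a', 'b', 'key', 'val'])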
