I have two dataframes with same columns. Only one column has different values. I want to concatenate the two without duplication.
df2 = pd.DataFrame({'key': ['K0', 'K1', 'K2'],'cat': ['C0', 'C1', 'C2'],'B': ['B0', 'B1', 'B2']})
df1 = pd.DataFrame({'key': ['K0', 'K1', 'K2'],'cat': ['C0', 'C1', 'C2'],'B': ['A0', 'A1', 'A2']})
df1
Out[630]:
key cat B
0 K0 C0 A0
1 K1 C1 A1
2 K2 C2 A2
df2
Out[631]:
key cat B
0 K0 C0 B0
1 K1 C1 B1
2 K2 C2 B2
I tried:
result = pd.concat([df1, df2], axis=1)
result
Out[633]:
key cat B key cat B
0 K0 C0 A0 K0 C0 B0
1 K1 C1 A1 K1 C1 B1
2 K2 C2 A2 K2 C2 B2
The desired output:
key cat B_df1 B_df2
0 K0 C0 A0 B0
1 K1 C1 A1 B1
2 K2 C2 A2 B2
NOTE: I could drop duplicates afterwards and rename columns but that doesn't seem efficient
pd.merge will do the job. By default the overlapping column B gets the suffixes _x and _y:
pd.merge(df1,df2, on=['key','cat'])
Output
key cat B_x B_y
0 K0 C0 A0 B0
1 K1 C1 A1 B1
2 K2 C2 A2 B2
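To get the exact column names from the desired output (B_df1 and B_df2) rather than the default B_x and B_y, the suffixes argument of pd.merge can be used. A minimal sketch, reusing the frames from the question:

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['K0', 'K1', 'K2'],
                    'cat': ['C0', 'C1', 'C2'],
                    'B': ['A0', 'A1', 'A2']})
df2 = pd.DataFrame({'key': ['K0', 'K1', 'K2'],
                    'cat': ['C0', 'C1', 'C2'],
                    'B': ['B0', 'B1', 'B2']})

# suffixes replaces the default ('_x', '_y') on overlapping non-key columns
result = pd.merge(df1, df2, on=['key', 'cat'], suffixes=('_df1', '_df2'))
print(result)
```

This avoids the separate drop-duplicates and rename steps mentioned in the question.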
I have two DataFrames:
df1:
A B C
1 A1 B1 C1
2 A2 B2 C2
df2:
B C D
3 B3 C3 D3
4 B4 C4 D4
Columns B and C are identical for both.
I'd like to concatenate them vertically and keep the columns of the first DataFrame:
pd.concat([df1, df2], join_axes=[df1.columns]):
A B C
1 A1 B1 C1
2 A2 B2 C2
3 NaN B3 C3
4 NaN B4 C4
This works, but raises a
FutureWarning: The join_axes-keyword is deprecated. Use .reindex or .reindex_like on the result to achieve the same functionality.
I couldn't find (either in the documentation or through Google) how to "Use .reindex or .reindex_like on the result to achieve the same functionality".
Colab notebook illustrating issue: https://colab.research.google.com/drive/13EBq2z0Nh05JY7ovrdnLGtfeqdKVvZq0
As the warning suggests, use reindex:
pd.concat([df1,df2.reindex(columns=df1.columns)])
Out[286]:
A B C
1 A1 B1 C1
2 A2 B2 C2
3 NaN B3 C3
4 NaN B4 C4
df1 = pd.DataFrame({'A': ['A1', 'A2'], 'B': ['B1', 'B2'], 'C': ['C1', 'C2']})
df2 = pd.DataFrame({'B': ['B3', 'B4'], 'C': ['C3', 'C4'], 'D': ['D1', 'D2']})
pd.concat([df1, df2], sort=False)[df1.columns]
yields the desired result.
OR...
pd.concat([df1, df2], sort=False).reindex(df1.columns, axis=1)
Output:
A B C
1 A1 B1 C1
2 A2 B2 C2
3 NaN B3 C3
4 NaN B4 C4
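For reference, here is a self-contained sketch of the two reindex placements discussed above (before and after the concat). The explicit index values and the D values are assumptions, chosen so the result lines up with the output shown in the question:

```python
import pandas as pd

# Index labels 1..4 are assumed here to mirror the printed frames in the question
df1 = pd.DataFrame({'A': ['A1', 'A2'], 'B': ['B1', 'B2'], 'C': ['C1', 'C2']},
                   index=[1, 2])
df2 = pd.DataFrame({'B': ['B3', 'B4'], 'C': ['C3', 'C4'], 'D': ['D3', 'D4']},
                   index=[3, 4])

# Option 1: restrict df2 to df1's columns first, then concatenate
out1 = pd.concat([df1, df2.reindex(columns=df1.columns)])

# Option 2: concatenate everything, then keep only df1's columns
out2 = pd.concat([df1, df2], sort=False).reindex(columns=df1.columns)

print(out1)
print(out2)
```

Both options drop column D and leave NaN in column A for the rows that come from df2.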
I have 3 dataframes:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])
df2 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'E': ['E0', 'E1', 'E2', 'E3']},
                   index=[0, 1, 2, 3])
df3 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'F': ['F0', 'F1', 'F2', 'F3']},
                   index=[0, 1, 2, 3])
I want to combine them together to get the following results:
A B C D E F
0 A0 B0 C0 D0 E0 F0
1 A1 B1 C1 D1 E1 F1
2 A2 B2 C2 D2 E2 F2
3 A3 B3 C3 D3 E3 F3
When I try to combine them, I keep getting:
A B C D A E A F
0 A0 B0 C0 D0 A0 E0 A0 F0
1 A1 B1 C1 D1 A1 E1 A1 F1
2 A2 B2 C2 D2 A2 E2 A2 F2
3 A3 B3 C3 D3 A3 E3 A3 F3
The common column (A) is duplicated once for each dataframe used in the concat call. I have tried various combinations on:
df4 = pd.concat([df1, df2, df3], axis=1, sort=False)
Some variations have been disastrous while some keep giving the undesired result. Any suggestions would be much appreciated. Thanks.
Try
df4 = (pd.concat((df.set_index('A') for df in (df1,df2,df3)), axis=1)
.reset_index()
)
Output:
A B C D E F
0 A0 B0 C0 D0 E0 F0
1 A1 B1 C1 D1 E1 F1
2 A2 B2 C2 D2 E2 F2
3 A3 B3 C3 D3 E3 F3
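An alternative to the set_index approach, if you would rather think of this as a join on the shared column, is to fold the frames together with repeated pd.merge calls. This is a different technique from the answer above, shown only as a sketch on shortened versions of the frames:

```python
from functools import reduce

import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']})
df2 = pd.DataFrame({'A': ['A0', 'A1'], 'E': ['E0', 'E1']})
df3 = pd.DataFrame({'A': ['A0', 'A1'], 'F': ['F0', 'F1']})

# Merge the frames pairwise on the shared column 'A', left to right
df4 = reduce(lambda left, right: pd.merge(left, right, on='A', how='outer'),
             [df1, df2, df3])
print(df4)
```

Unlike concat along axis=1, this matches rows by the values in A rather than by index position, so it also works when the frames are not in the same row order.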
I have a dataframe that looks like this:
df = pd.DataFrame({'key': ['K0', 'K0', 'K0', 'K1'],'cat': ['C0', 'C0', 'C1', 'C1'],'B': ['A0', 'A1', 'A2', 'A3']})
df
Out[15]:
key cat B
0 K0 C0 A0
1 K0 C0 A1
2 K0 C1 A2
3 K1 C1 A3
Is it possible to convert it to:
key cat B
0 K0 C0 A0
1 A1
2 K0 C1 A2
3 K1 C1 A3
I want to avoid showing the same key and cat values over and over; key should reappear once cat changes.
It's for an excel purpose so I need it to be compatible with:
style.apply(f)
to_excel()
You can use duplicated over a subset of the columns to look for duplicate values:
cols = ['key', 'cat']
df.loc[df.duplicated(subset=cols), cols] = ''
key cat B
0 K0 C0 A0
1 A1
2 K0 C1 A2
3 K1 C1 A3
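Note that duplicated blanks every later occurrence of a (key, cat) pair, even when other rows sit in between. If only consecutive repeats should be hidden, one possible variation is to compare each row with the row above it. A sketch, with an extra fifth row added to the question's data (an assumption, to show the difference):

```python
import pandas as pd

df = pd.DataFrame({'key': ['K0', 'K0', 'K0', 'K1', 'K0'],
                   'cat': ['C0', 'C0', 'C1', 'C1', 'C0'],
                   'B':   ['A0', 'A1', 'A2', 'A3', 'A4']})
cols = ['key', 'cat']

# True where a row repeats the (key, cat) pair of the row directly above it
same_as_previous = df[cols].eq(df[cols].shift()).all(axis=1)

out = df.copy()
out.loc[same_as_previous, cols] = ''
print(out)
```

Here the last row keeps its K0 and C0 because the row above it is different, whereas df.duplicated(subset=cols) would have blanked it.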
I am really struggling to understand the "left_index" and "right_index" arguments in pandas.merge. I read the documentation, searched around, experimented with various settings and tried to understand, but I am still confused. Consider this example:
left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
'key2': ['K0', 'K1', 'K0', 'K1'],
'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
'key2': ['K0', 'K0', 'K0', 'K0'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3'],
'E': [1,2,3,4]})
Now, when I run the following command:
pd.merge(left, right, left_on=['key2', 'key1'], right_on=['key1', 'key2'], how='outer', indicator=True, left_index=True)
I get:
key1_x key2_x A B key1_y key2_y C D E _merge
0 K0 K0 A0 B0 K0 K0 C0 D0 1.0 both
1 K0 K1 A1 B1 K1 K0 C1 D1 2.0 both
2 K0 K1 A1 B1 K1 K0 C2 D2 3.0 both
3 K1 K0 A2 B2 NaN NaN NaN NaN NaN left_only
3 K2 K1 A3 B3 NaN NaN NaN NaN NaN left_only
3 NaN NaN NaN NaN K2 K0 C3 D3 4.0 right_only
However, running the same with right_index=True gives an error. Same if I introduce both. More interestingly, running the following merge gives a very unexpected result
pd.merge(left, right, on=['key1', 'key2'],how='outer', validate = 'one_to_many', indicator=True, left_index = True, right_index = True)
Result is:
key1 key2 A B C D E _merge
0 K0 K0 A0 B0 C0 D0 1 both
1 K0 K1 A1 B1 C1 D1 2 both
2 K1 K0 A2 B2 C2 D2 3 both
3 K2 K1 A3 B3 C3 D3 4 both
As you can see, all information for right frame for key1 and key2 is completely lost.
Please help me understand the purpose and function of these arguments. Thank you.
Merging happens in a couple of ways:
Column-Column Merge: Use left_on, right_on and how.
Example:
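# Both are column-column merges. Note that the first pairs left key2 with
# right key1 (and left key1 with right key2), so the matched rows differ.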
pd.merge(left, right, left_on=['key2', 'key1'], right_on=['key1', 'key2'], how = 'outer')
pd.merge(left, right, on=['key1', 'key2'], how='outer', indicator=True)
Index-Index Merge: Set left_index and right_index to True or use on and use how.
Example:
pd.merge(left, right, how = 'inner', right_index = True, left_index = True)
# If you make matching unique multi-indexes for both data frames you can do
# pd.merge(left, right, how = 'inner', on = ['indexname1', 'indexname2'])
# In your data frames, your keys contain duplicate values, so you can't do this
# In general, a column with duplicate values does not make a good key
Column-Index Merge: Use left_on + right_index or left_index + right_on and how.
Note: The values in the index and in left_on must be of matching types. If your index is an integer and your left_on column is a string, you get an error. Also, the number of index levels must match the number of left_on columns.
Example:
# If how not specified, inner join is used
pd.merge(left, right, right_on=['E'], left_index = True, how = 'outer')
# Gives error because left_on is string and right_index is integer
pd.merge(left, right, left_on=['key1'], right_index = True, how = 'outer')
# This gave you an error because left_on has two levels but right_index has only one.
pd.merge(left, right, left_on=['key2', 'key1'], right_on=['key1', 'key2'], how='outer', indicator=True, right_index=True)
You mixed up the different types of merges, which is what gave the weird results.
If you can't see conceptually how the merge is going to happen, chances are the computer isn't going to do any better.
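The three merge styles described above can be sketched side by side on a pair of small frames with unique keys. The frames and the column names k, x and y are made up for illustration:

```python
import pandas as pd

left = pd.DataFrame({'k': ['a', 'b', 'c'], 'x': [1, 2, 3]})
right = pd.DataFrame({'k': ['a', 'b', 'd'], 'y': [10, 20, 40]})

# Column-column: match on the values of column 'k' in both frames
col_col = pd.merge(left, right, on='k', how='inner')

# Index-index: move 'k' into the index on both sides, then match index labels
idx_idx = pd.merge(left.set_index('k'), right.set_index('k'),
                   left_index=True, right_index=True, how='inner')

# Column-index: left column 'k' against the right frame's index
# (both hold strings, so the key types match)
col_idx = pd.merge(left, right.set_index('k'),
                   left_on='k', right_index=True, how='inner')

print(col_col)
print(idx_idx)
print(col_idx)
```

In each call, exactly one key specification is given per side, which is what keeps the result predictable.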
If I understand merge correctly, you should pick only one key option for each side (i.e. do not pass left_on=['x'] and left_index=True at the same time). Otherwise merge cannot tell which key it should actually use, and the result can look arbitrary, as your examples show. I have not checked the pandas source in detail, and this behavior may differ between versions. Here is a small experiment.
>>> left
key1 key2 A B
0 K0 K0 A0 B0
1 K0 K1 A1 B1
2 K1 K0 A2 B2
3 K2 K1 A3 B3
>>> right
key1 key2 C D E
0 K0 K0 C0 D0 1
1 K1 K0 C1 D1 2
2 K1 K0 C2 D2 3
3 K2 K0 C3 D3 4
(1) merge using ['key1', 'key2']
>>> pd.merge(left, right, on=['key1', 'key2'], how='outer')
key1 key2 A B C D E
0 K0 K0 A0 B0 C0 D0 1.0
1 K0 K1 A1 B1 NaN NaN NaN
2 K1 K0 A2 B2 C1 D1 2.0
3 K1 K0 A2 B2 C2 D2 3.0
4 K2 K1 A3 B3 NaN NaN NaN
5 K2 K0 NaN NaN C3 D3 4.0
(2) Set ['key1', 'key2'] as left index and merge it using the index and keys
>>> left = left.set_index(['key1', 'key2'])
>>> pd.merge(left, right, left_index=True, right_on=['key1', 'key2'], how='outer').reset_index(drop=True)
A B key1 key2 C D E
0 A0 B0 K0 K0 C0 D0 1.0
1 A1 B1 K0 K1 NaN NaN NaN
2 A2 B2 K1 K0 C1 D1 2.0
3 A2 B2 K1 K0 C2 D2 3.0
4 A3 B3 K2 K1 NaN NaN NaN
5 NaN NaN K2 K0 C3 D3 4.0
(3) Further set ['key1', 'key2'] as right index and merge it using the index
>>> right = right.set_index(['key1', 'key2'])
>>> pd.merge(left, right, left_index=True, right_index=True, how='outer').reset_index()
key1 key2 A B C D E
0 K0 K0 A0 B0 C0 D0 1.0
1 K0 K1 A1 B1 NaN NaN NaN
2 K1 K0 A2 B2 C1 D1 2.0
3 K1 K0 A2 B2 C2 D2 3.0
4 K2 K0 NaN NaN C3 D3 4.0
5 K2 K1 A3 B3 NaN NaN NaN
Please note that (1)(2)(3) above are showing the same results, and even if ['key1', 'key2'] are set as index, you can still use left_on = ['key1', 'key2'] instead of left_index=True.
Now, if you really want to merge on both ['key1', 'key2'] and the index, one way to achieve this, starting again from the original left and right (before set_index was applied), is:
>>> pd.merge(left.reset_index(), right.reset_index(), on=['index', 'key1', 'key2'], how='outer')
index key1 key2 A B C D E
0 0 K0 K0 A0 B0 C0 D0 1.0
1 1 K0 K1 A1 B1 NaN NaN NaN
2 2 K1 K0 A2 B2 C2 D2 3.0
3 3 K2 K1 A3 B3 NaN NaN NaN
4 1 K1 K0 NaN NaN C1 D1 2.0
5 3 K2 K0 NaN NaN C3 D3 4.0
If you have read this far, you should now be able to achieve the above in several different ways.
Hope this helps.
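The statement that (1), (2) and (3) give the same rows can also be checked programmatically. The sketch below rebuilds the frames from the question; normalise and keys are hypothetical helper names introduced only so the three results can be compared in a common column and row order:

```python
import pandas as pd

left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3'],
                      'E': [1, 2, 3, 4]})
keys = ['key1', 'key2']

def normalise(df):
    """Put columns and rows in one fixed order so results can be compared."""
    cols = keys + ['A', 'B', 'C', 'D', 'E']
    return df[cols].sort_values(keys + ['C']).reset_index(drop=True)

# (1) keys as columns on both sides
m1 = pd.merge(left, right, on=keys, how='outer')
# (2) keys as index on the left, columns on the right
m2 = pd.merge(left.set_index(keys), right,
              left_index=True, right_on=keys, how='outer')
# (3) keys as index on both sides
m3 = pd.merge(left.set_index(keys), right.set_index(keys),
              left_index=True, right_index=True, how='outer').reset_index()

for m in (m1, m2, m3):
    print(normalise(m))
```

Each call names exactly one key source per side (columns or index), which is the rule of thumb stated above.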
I'm a little stuck, can you please help me with this. I've simplified the problem I'm facing to the following:
Input
Desired Output
I know how to handle the case where the dictionaries in col. c have same keys.
You can create a DataFrame with the constructor, reshape it with stack, and finally join it back to the original:
df1 = (pd.DataFrame(df.c.values.tolist())
.stack()
.reset_index(level=1)
.rename(columns={0:'val','level_1':'key'}))
print (df1)
key val
0 c00 v00
0 c01 v01
1 c10 v10
2 c20 v20
2 c21 v21
2 c22 v22
df = df.drop(columns='c').join(df1).reset_index(drop=True)
print (df)
a b key val
0 a0 b0 c00 v00
1 a0 b0 c01 v01
2 a1 b1 c10 v10
3 a2 b2 c20 v20
4 a2 b2 c21 v21
5 a2 b2 c22 v22
Here is one way:
import numpy as np
import pandas as pd
from itertools import chain
df = pd.DataFrame([['a0', 'b0', {'c00': 'v00', 'c01': 'v01'}],
['a1', 'b1', {'c10': 'v10'}],
['a2', 'b2', {'c20': 'v20', 'c21': 'v21', 'c22': 'v22'}] ],
columns=['a', 'b', 'c'])
# first convert 'c' to list of tuples
df['c'] = df['c'].apply(lambda x: list(x.items()))
lens = list(map(len, df['c']))
# create dataframe
df_out = pd.DataFrame({'a': np.repeat(df['a'].values, lens),
'b': np.repeat(df['b'].values, lens),
'c': list(chain.from_iterable(df['c'].values))})
# unpack tuple
df_out = df_out.join(df_out['c'].apply(pd.Series))\
               .rename(columns={0: 'key', 1: 'val'}).drop(columns='c')
# a b key val
# 0 a0 b0 c00 v00
# 1 a0 b0 c01 v01
# 2 a1 b1 c10 v10
# 3 a2 b2 c20 v20
# 4 a2 b2 c21 v21
# 5 a2 b2 c22 v22
Here is my solution:
import pandas as pd
t=pd.DataFrame([['a0','b0',{'c00':'v00','c01':'v01'}],['a1','b1',{'c10':'v10'}],['a2','b2',{'c20':'v20','c21':'v21','c22':'v22'}]],columns=['a','b','c'])
l2=[]
# one output row per (key, value) pair in each dict of column 'c'
for i in t.index:
    for j in t.loc[i,'c']:
        l2+=[[t.loc[i,'a'],t.loc[i,'b'],j,t.loc[i,'c'][j]]]
t2=pd.DataFrame(l2,columns=['a','b','key','val'])
where 't' is your input DataFrame, however you obtain it, and 't2' is the result.
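On pandas 0.25 or later, DataFrame.explode offers another route to the same reshaping. This is not one of the approaches above, just a sketch assuming the same input frame; exploded, pairs and out are names introduced here for illustration:

```python
import pandas as pd

df = pd.DataFrame([['a0', 'b0', {'c00': 'v00', 'c01': 'v01'}],
                   ['a1', 'b1', {'c10': 'v10'}],
                   ['a2', 'b2', {'c20': 'v20', 'c21': 'v21', 'c22': 'v22'}]],
                  columns=['a', 'b', 'c'])

# Turn each dict into a list of (key, value) tuples,
# then give every tuple its own row
exploded = (df.assign(c=df['c'].apply(lambda d: list(d.items())))
              .explode('c')
              .reset_index(drop=True))

# Split the tuples into two named columns and drop the intermediate column
pairs = pd.DataFrame(exploded['c'].tolist(),
                     columns=['key', 'val'], index=exploded.index)
out = pd.concat([exploded.drop(columns='c'), pairs], axis=1)
print(out)
```

Because explode repeats the other columns for each list element, there is no need to compute the repeat counts by hand.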