Python Update two dataframes with identical columns and a few differing rows

I am joining two data frames that have the same columns, and I want to update the first dataframe with values from the second. However, my code creates additional columns instead of updating the existing ones.
My code:
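import numpy as np
import pandas as pd

# Use real missing values (np.nan), not the string "NaN", so they can be filled later
left = pd.DataFrame({"key": ["K0", "K1", "K2", "K3"],
                     "A": [np.nan, np.nan, np.nan, np.nan],
                     "B": ["B0", "B1", "B2", "B3"]})
right = pd.DataFrame({"key": ["K1", "K2", "K3"],
                      "A": ["C1", "C2", "C3"],
                      "B": ["B1", "B2", "B3"]})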
result = pd.merge(left, right, on="key",how='left')
Present output:
result =
key A_x B_x A_y B_y
0 K0 NaN B0 NaN NaN
1 K1 NaN B1 C1 B1
2 K2 NaN B2 C2 B2
3 K3 NaN B3 C3 B3
Expected output:
result =
key B A
0 K0 B0 NaN
1 K1 B1 C1
2 K2 B2 C2
3 K3 B3 C3

Use combine_first:
result = left.set_index("key").combine_first(right.set_index("key")).reset_index()
print(result)
Output
key A B
0 K0 NaN B0
1 K1 C1 B1
2 K2 C2 B2
3 K3 C3 B3
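One caveat worth spelling out: combine_first only fills values that pandas treats as missing (np.nan or None). The sketch below assumes the placeholders arrive as the literal string "NaN" and converts them with DataFrame.mask before combining. The conversion step is an illustrative assumption, not part of the answer above.

```python
import pandas as pd

left = pd.DataFrame({"key": ["K0", "K1", "K2", "K3"],
                     "A": ["NaN", "NaN", "NaN", "NaN"],  # strings, not real missing values
                     "B": ["B0", "B1", "B2", "B3"]})
right = pd.DataFrame({"key": ["K1", "K2", "K3"],
                      "A": ["C1", "C2", "C3"],
                      "B": ["B1", "B2", "B3"]})

# Replace every cell equal to the string "NaN" with a real missing value
left_clean = left.mask(left == "NaN")

# Align both frames on "key"; left's values win, right only fills the gaps
result = (left_clean.set_index("key")
          .combine_first(right.set_index("key"))
          .reset_index())
print(result)
```

K0 has no counterpart in right, so its A value stays missing.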

Related

merge two dataframes with common keys and adding unique columns

I have read through the pandas guide, especially the merge and join sections, but still cannot figure it out.
Basically, this is what I want to do: Let's say we have two data frames:
import numpy as np
import pandas as pd

left = pd.DataFrame(
    {"key": ["K0", "K1", "K2", "K3"],
     "A": ["A0", "A1", "A2", "A3"],
     "C": ["B0", "B1", np.nan, np.nan]})
right = pd.DataFrame(
    {"key": ["K2"],
     "A": ["A8"],
     "D": ["D3"]})
I want to merge them on "key" and update the values, filling gaps where necessary and replacing old values where the right frame has new ones. The result should look like this:
key A C D
0 K0 A0 B0 NaN
1 K1 A1 B1 NaN
2 K2 A8 NaN D3
3 K3 A3 NaN NaN
You can use combine_first with set_index to accomplish your goal here. Call it on right so that right's values take priority and left only fills the gaps:
right.set_index('key').combine_first(left.set_index('key')).reset_index()
Output:
key A C D
0 K0 A0 B0 NaN
1 K1 A1 B1 NaN
2 K2 A8 NaN D3
3 K3 A3 NaN NaN
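The direction of the call matters, because the frame combine_first is called on keeps its own non-missing values. A minimal sketch on the question's data (the names right_wins and left_wins are only illustrative):

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({"key": ["K0", "K1", "K2", "K3"],
                     "A": ["A0", "A1", "A2", "A3"],
                     "C": ["B0", "B1", np.nan, np.nan]}).set_index("key")
right = pd.DataFrame({"key": ["K2"], "A": ["A8"], "D": ["D3"]}).set_index("key")

right_wins = right.combine_first(left)  # right's non-missing values are kept
left_wins = left.combine_first(right)   # left's non-missing values are kept

print(right_wins.loc["K2", "A"])  # A8
print(left_wins.loc["K2", "A"])   # A2
```

In both cases column D is added for K2, since neither frame overrides a value that the other one lacks.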

Add rows from one dataframe to another based on missing values in a given column pandas

I have been searching a long time for an answer but could not find it. I have two dataframes, target and backup, which have the same columns. For a given column, I want to add to target every row of backup whose value in that column does not appear in target. The most straightforward solution is:
import pandas as pd
import numpy as np
target = pd.DataFrame({
"key1": ["K1", "K2", "K3", "K5"],
"A": ["A1", "A2", "A3", np.nan],
"B": ["B1", "B2", "B3", "B5"],
})
backup = pd.DataFrame({
"key1": ["K1", "K2", "K3", "K4", "K5"],
"A": ["A1", "A", "A3", "A4", "A5"],
"B": ["B1", "B2", "B3", "B4", "B5"],
})
merged = target.copy()
for item in backup.key1.unique():
    if item not in target.key1.unique():
        merged = pd.concat([merged, backup.loc[backup.key1 == item]])
merged.reset_index(drop=True, inplace=True)
giving
key1 A B
0 K1 A1 B1
1 K2 A2 B2
2 K3 A3 B3
3 K5 NaN B5
4 K4 A4 B4
I have tried several approaches using only pandas methods, but none of them works.
pandas concat
# Does not work: it creates duplicate rows, and drop_duplicates() only removes exact duplicates, so rows that differ between the frames both remain (compare the K2 rows with A2 vs A, and the K5 rows with NaN vs A5)
pd.concat([target, backup]).drop_duplicates()
key1 A B
0 K1 A1 B1
1 K2 A2 B2
2 K3 A3 B3
3 K5 NaN B5
1 K2 A B2
3 K4 A4 B4
4 K5 A5 B5
pandas merge
# Does not work because the backup would overwrite data in the target -- NaN
pd.merge(target, backup, how="right")
key1 A B
0 K1 A1 B1
1 K2 A B2
2 K3 A3 B3
3 K4 A4 B4
4 K5 A5 B5
Importantly, it is not a duplicate of this post since I do not want to have a new column and more importantly, the values are not NaN in target, they are simply not there. Furthermore, if then I would use what is proposed for merging the columns, the NaN in the target would be replaced by the value in backup which is unwanted.
It is not a duplicate of this post which uses the combine_first pandas because in that case the NaN is filled by the value from the backup which is wrong:
target.combine_first(backup)
key1 A B
0 K1 A1 B1
1 K2 A2 B2
2 K3 A3 B3
3 K5 A4 B5
4 K5 A5 B5
Lastly,
target.join(backup, on=["key1"])
gives me an annoying
ValueError: You are trying to merge on object and int64 columns. If you wish to proceed you should use pd.concat
which I really do not get since both are pure strings and the proposed solution does not work.
So I would like to ask, what am I missing? How can I do it using some pandas methods? Thanks a lot.
Use concat with only those backup rows whose key1 does not exist in target.key1, selected by boolean indexing with Series.isin:
merged = pd.concat([target, backup[~backup.key1.isin(target.key1)]])
print(merged)
key1 A B
0 K1 A1 B1
1 K2 A2 B2
2 K3 A3 B3
3 K5 NaN B5
3 K4 A4 B4
Alternatively, pass the subset parameter to drop_duplicates(). With the default keep='first', the row from target is kept whenever a key appears in both frames:
pd.concat([target, backup]).drop_duplicates(subset = "key1")
which gives output:
key1 A B
0 K1 A1 B1
1 K2 A2 B2
2 K3 A3 B3
3 K5 NaN B5
3 K4 A4 B4
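Both outputs above carry the index label 3 twice. A minimal sketch of the same two approaches, assuming a fresh 0..n-1 index is wanted, passes ignore_index=True to concat (the names missing and alt are only illustrative):

```python
import numpy as np
import pandas as pd

target = pd.DataFrame({"key1": ["K1", "K2", "K3", "K5"],
                       "A": ["A1", "A2", "A3", np.nan],
                       "B": ["B1", "B2", "B3", "B5"]})
backup = pd.DataFrame({"key1": ["K1", "K2", "K3", "K4", "K5"],
                       "A": ["A1", "A", "A3", "A4", "A5"],
                       "B": ["B1", "B2", "B3", "B4", "B5"]})

# Rows of backup whose key1 is absent from target (only K4 here)
missing = backup[~backup["key1"].isin(target["key1"])]
merged = pd.concat([target, missing], ignore_index=True)

# Same rows via drop_duplicates: the first occurrence of each key (target's) is kept
alt = (pd.concat([target, backup], ignore_index=True)
       .drop_duplicates(subset="key1")
       .reset_index(drop=True))

print(merged)
```

The NaN in target's K5 row is left alone in both variants, because whole rows are selected and no values are filled in.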

Understanding the FutureWarning on using join_axes when concatenating with Pandas

I have two DataFrames:
df1:
A B C
1 A1 B1 C1
2 A2 B2 C2
df2:
B C D
3 B3 C3 D3
4 B4 C4 D4
Columns B and C are identical for both.
I'd like to concatenate them vertically and keep the columns of the first DataFrame:
pd.concat([df1, df2], join_axes=[df1.columns]):
A B C
1 A1 B1 C1
2 A2 B2 C2
3 NaN B3 C3
4 NaN B4 C4
This works, but raises a
FutureWarning: The join_axes-keyword is deprecated. Use .reindex or .reindex_like on the result to achieve the same functionality.
I couldn't find (either in the documentation or through Google) how to "Use .reindex or .reindex_like on the result to achieve the same functionality".
Colab notebook illustrating issue: https://colab.research.google.com/drive/13EBq2z0Nh05JY7ovrdnLGtfeqdKVvZq0
As the warning suggests, use reindex:
pd.concat([df1,df2.reindex(columns=df1.columns)])
Out[286]:
A B C
1 A1 B1 C1
2 A2 B2 C2
3 NaN B3 C3
4 NaN B4 C4
df1 = pd.DataFrame({'A': ['A1', 'A2'], 'B': ['B1', 'B2'], 'C': ['C1', 'C2']}, index=[1, 2])
df2 = pd.DataFrame({'B': ['B3', 'B4'], 'C': ['C3', 'C4'], 'D': ['D3', 'D4']}, index=[3, 4])
pd.concat([df1, df2], sort=False)[df1.columns]
yields the desired result.
OR...
pd.concat([df1, df2], sort=False).reindex(df1.columns, axis=1)
Output:
A B C
1 A1 B1 C1
2 A2 B2 C2
3 NaN B3 C3
4 NaN B4 C4
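Read literally, "use .reindex on the result" means concatenating first and then restricting the columns. A minimal runnable sketch of that reading, with the frames rebuilt from the question's printed tables (the index values 1 to 4 are taken from that printout):

```python
import pandas as pd

df1 = pd.DataFrame({"A": ["A1", "A2"], "B": ["B1", "B2"], "C": ["C1", "C2"]},
                   index=[1, 2])
df2 = pd.DataFrame({"B": ["B3", "B4"], "C": ["C3", "C4"], "D": ["D3", "D4"]},
                   index=[3, 4])

# Outer concat keeps A, B, C and D; reindex then keeps only df1's columns, in df1's order
result = pd.concat([df1, df2], sort=False).reindex(columns=df1.columns)
print(result)
```

Column D is dropped, and A is missing for the rows that came from df2.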

Concat two dataframes with common columns [duplicate]

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 3 years ago.
I have two dataframes with the same columns. Only one column has different values. I want to concatenate the two without duplication.
df2 = pd.DataFrame({'key': ['K0', 'K1', 'K2'],'cat': ['C0', 'C1', 'C2'],'B': ['B0', 'B1', 'B2']})
df1 = pd.DataFrame({'key': ['K0', 'K1', 'K2'],'cat': ['C0', 'C1', 'C2'],'B': ['A0', 'A1', 'A2']})
df1
Out[630]:
key cat B
0 K0 C0 A0
1 K1 C1 A1
2 K2 C2 A2
df2
Out[631]:
key cat B
0 K0 C0 B0
1 K1 C1 B1
2 K2 C2 B2
I tried:
result = pd.concat([df1, df2], axis=1)
result
Out[633]:
key cat B key cat B
0 K0 C0 A0 K0 C0 B0
1 K1 C1 A1 K1 C1 B1
2 K2 C2 A2 K2 C2 B2
The desired output:
key cat B_df1 B_df2
0 K0 C0 A0 B0
1 K1 C1 A1 B1
2 K2 C2 A2 B2
NOTE: I could drop duplicates afterwards and rename columns but that doesn't seem efficient
pd.merge will do the job
pd.merge(df1,df2, on=['key','cat'])
Output
key cat B_x B_y
0 K0 C0 A0 B0
1 K1 C1 A1 B1
2 K2 C2 A2 B2
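The default suffixes are _x and _y. To get the column names from the desired output directly, a minimal sketch using merge's suffixes parameter:

```python
import pandas as pd

df1 = pd.DataFrame({"key": ["K0", "K1", "K2"],
                    "cat": ["C0", "C1", "C2"],
                    "B": ["A0", "A1", "A2"]})
df2 = pd.DataFrame({"key": ["K0", "K1", "K2"],
                    "cat": ["C0", "C1", "C2"],
                    "B": ["B0", "B1", "B2"]})

# The overlapping non-key column B gets one suffix per input frame
result = pd.merge(df1, df2, on=["key", "cat"], suffixes=("_df1", "_df2"))
print(result)
```

This avoids a separate rename step after the merge.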

Understanding the "left_index" and "right_index" arguments in pandas merge

I am really struggling to understand the "left_index" and "right_index" arguments in pandas.merge. I read the documentation, searched around, and experimented with various settings, but I am still confused. Consider this example:
left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
'key2': ['K0', 'K1', 'K0', 'K1'],
'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
'key2': ['K0', 'K0', 'K0', 'K0'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3'],
'E': [1,2,3,4]})
Now, when I run the following command:
pd.merge(left, right, left_on=['key2', 'key1'], right_on=['key1', 'key2'], how='outer', indicator=True, left_index=True)
I get:
key1_x key2_x A B key1_y key2_y C D E _merge
0 K0 K0 A0 B0 K0 K0 C0 D0 1.0 both
1 K0 K1 A1 B1 K1 K0 C1 D1 2.0 both
2 K0 K1 A1 B1 K1 K0 C2 D2 3.0 both
3 K1 K0 A2 B2 NaN NaN NaN NaN NaN left_only
3 K2 K1 A3 B3 NaN NaN NaN NaN NaN left_only
3 NaN NaN NaN NaN K2 K0 C3 D3 4.0 right_only
However, running the same with right_index=True gives an error. Same if I introduce both. More interestingly, running the following merge gives a very unexpected result
pd.merge(left, right, on=['key1', 'key2'],how='outer', validate = 'one_to_many', indicator=True, left_index = True, right_index = True)
Result is:
key1 key2 A B C D E _merge
0 K0 K0 A0 B0 C0 D0 1 both
1 K0 K1 A1 B1 C1 D1 2 both
2 K1 K0 A2 B2 C2 D2 3 both
3 K2 K1 A3 B3 C3 D3 4 both
As you can see, all information from the right frame for key1 and key2 is completely lost.
Please help me understand the purpose and function of these arguments. Thank you.
Merging happens in a couple of ways:
Column-Column Merge: Use left_on, right_on and how.
Example:
# Both calls pair key1 with key1 and key2 with key2; the second also adds the _merge column
pd.merge(left, right, left_on=['key1', 'key2'], right_on=['key1', 'key2'], how='outer')
pd.merge(left, right, on=['key1', 'key2'], how='outer', indicator=True)
Index-Index Merge: Set left_index and right_index to True (or use on with the index level names), together with how.
Example:
pd.merge(left, right, how = 'inner', right_index = True, left_index = True)
# If you build matching, unique multi-indexes for both data frames you can do
# pd.merge(left, right, how = 'inner', on = ['indexname1', 'indexname2'])
# In your data frames the keys contain duplicate values, so you can't do this
# In general, a column with duplicate values does not make a good key
Column-Index Merge: Use left_on + right_index or left_index + right_on and how.
Note: The values in the index and in the left_on column must have matching types. If your index holds integers and your left_on column holds strings, you get an error. The number of index levels must also match the number of key columns.
Example:
# If how not specified, inner join is used
pd.merge(left, right, right_on=['E'], left_index = True, how = 'outer')
# Raises an error because the left_on column holds strings while the right index holds integers
pd.merge(left, right, left_on=['key1'], right_index = True, how = 'outer')
# This gave you an error because left_on names two key columns but the right index has only one level
pd.merge(left, right, left_on=['key2', 'key1'], right_on=['key1', 'key2'], how='outer', indicator=True, right_index=True)
You mixed up the different kinds of merge, which is what produced the odd results.
If you can't see how the merging is going to happen conceptually, chances are a computer isn't going to do any better.
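The three kinds of merge can be compared side by side on a smaller pair of frames with unique keys. This is only a sketch; the frames and the names col_col, idx_idx and col_idx are made up for illustration:

```python
import pandas as pd

left = pd.DataFrame({"key": ["K0", "K1", "K2"], "A": ["A0", "A1", "A2"]})
right = pd.DataFrame({"key": ["K0", "K1", "K3"], "C": ["C0", "C1", "C3"]})

# Column-column: both frames carry the key as an ordinary column
col_col = pd.merge(left, right, on="key", how="inner")

# Index-index: both frames carry the key as their index
idx_idx = pd.merge(left.set_index("key"), right.set_index("key"),
                   left_index=True, right_index=True, how="inner")

# Column-index: left's "key" column is matched against right's index
col_idx = pd.merge(left, right.set_index("key"),
                   left_on="key", right_index=True, how="inner")

print(col_col)
print(idx_idx)
print(col_idx)
```

Each call names exactly one key source per side, and all three match the same two keys, K0 and K1.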
If I understand the behavior of merge correctly, you should pick only one option for each side (that is, do not pass left_on=['x'] and left_index=True at the same time). Otherwise merge cannot tell which key it should actually use, and the outcome depends on implementation details, as your examples show. (I have not checked the pandas source in detail, and the behavior may differ between versions.) Here is a small experiment.
>>> left
key1 key2 A B
0 K0 K0 A0 B0
1 K0 K1 A1 B1
2 K1 K0 A2 B2
3 K2 K1 A3 B3
>>> right
key1 key2 C D E
0 K0 K0 C0 D0 1
1 K1 K0 C1 D1 2
2 K1 K0 C2 D2 3
3 K2 K0 C3 D3 4
(1) merge using ['key1', 'key2']
>>> pd.merge(left, right, on=['key1', 'key2'], how='outer')
key1 key2 A B C D E
0 K0 K0 A0 B0 C0 D0 1.0
1 K0 K1 A1 B1 NaN NaN NaN
2 K1 K0 A2 B2 C1 D1 2.0
3 K1 K0 A2 B2 C2 D2 3.0
4 K2 K1 A3 B3 NaN NaN NaN
5 K2 K0 NaN NaN C3 D3 4.0
(2) Set ['key1', 'key2'] as left index and merge it using the index and keys
>>> left = left.set_index(['key1', 'key2'])
>>> pd.merge(left, right, left_index=True, right_on=['key1', 'key2'], how='outer').reset_index(drop=True)
A B key1 key2 C D E
0 A0 B0 K0 K0 C0 D0 1.0
1 A1 B1 K0 K1 NaN NaN NaN
2 A2 B2 K1 K0 C1 D1 2.0
3 A2 B2 K1 K0 C2 D2 3.0
4 A3 B3 K2 K1 NaN NaN NaN
5 NaN NaN K2 K0 C3 D3 4.0
(3) Further set ['key1', 'key2'] as right index and merge it using the index
>>> right = right.set_index(['key1', 'key2'])
>>> pd.merge(left, right, left_index=True, right_index=True, how='outer').reset_index()
key1 key2 A B C D E
0 K0 K0 A0 B0 C0 D0 1.0
1 K0 K1 A1 B1 NaN NaN NaN
2 K1 K0 A2 B2 C1 D1 2.0
3 K1 K0 A2 B2 C2 D2 3.0
4 K2 K0 NaN NaN C3 D3 4.0
5 K2 K1 A3 B3 NaN NaN NaN
Please note that (1), (2) and (3) above produce the same rows, and that even when ['key1', 'key2'] are set as the index, you can still use left_on=['key1', 'key2'] instead of left_index=True.
Now, if you really want to merge on ['key1', 'key2'] together with the index, one way is to start again from the original frames (with their default integer index) and turn the index into a column:
>>> pd.merge(left.reset_index(), right.reset_index(), on=['index', 'key1', 'key2'], how='outer')
index key1 key2 A B C D E
0 0 K0 K0 A0 B0 C0 D0 1.0
1 1 K0 K1 A1 B1 NaN NaN NaN
2 2 K1 K0 A2 B2 C2 D2 3.0
3 3 K2 K1 A3 B3 NaN NaN NaN
4 1 K1 K0 NaN NaN C1 D1 2.0
5 3 K2 K0 NaN NaN C3 D3 4.0
If you have read this far, you should now know several ways to achieve the above.
Hope this helps.
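The claim that (1), (2) and (3) return the same rows can be checked programmatically. The sketch below rebuilds the question's frames and compares the text columns regardless of row order. The helper as_rows is hypothetical and exists only for this comparison:

```python
import pandas as pd

left = pd.DataFrame({"key1": ["K0", "K0", "K1", "K2"],
                     "key2": ["K0", "K1", "K0", "K1"],
                     "A": ["A0", "A1", "A2", "A3"],
                     "B": ["B0", "B1", "B2", "B3"]})
right = pd.DataFrame({"key1": ["K0", "K1", "K1", "K2"],
                      "key2": ["K0", "K0", "K0", "K0"],
                      "C": ["C0", "C1", "C2", "C3"],
                      "D": ["D0", "D1", "D2", "D3"],
                      "E": [1, 2, 3, 4]})
keys = ["key1", "key2"]


def as_rows(df):
    """Hypothetical helper: sorted tuples of the text columns, NaN made comparable."""
    cols = keys + ["A", "B", "C", "D"]
    return sorted(df[cols].fillna("<missing>").itertuples(index=False, name=None))


# (1) keys as columns on both sides
m1 = pd.merge(left, right, on=keys, how="outer")
# (2) keys as index on the left, as columns on the right
m2 = pd.merge(left.set_index(keys), right,
              left_index=True, right_on=keys, how="outer")
# (3) keys as index on both sides
m3 = pd.merge(left.set_index(keys), right.set_index(keys),
              left_index=True, right_index=True, how="outer").reset_index()

print(as_rows(m1) == as_rows(m2) == as_rows(m3))
```

Only the row order and the resulting index differ between the three variants, which is why the rows are sorted before comparing.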
