What I want to do
I have two pandas DataFrames, df1 and df2, with the same columns.
Every index in df2 also appears in df1, but some indices appear only in df1.
For rows whose index appears in both df1 and df2, use the row from df2.
For rows whose index appears only in df1, use the row from df1.
In short: replace values of df1 with values of df2 based on a MultiIndex.
import pandas as pd
index_names = ['index1', 'index2']
columns = ['column1', 'column2']
data1 = [[1, 2], [2, 3], [3, 4], [4, 5], [5, 6]]
index1 = [['i1', 'i1', 'i1', 'i2', 'i2'], ['A', 'B', 'C', 'B', 'C']]
df1 = pd.DataFrame(data1, index=pd.MultiIndex.from_arrays(index1, names=index_names), columns=columns)
print(df1)
## OUTPUT
# column1 column2
#index1 index2
#i1 A 1 2
# B 2 3
# C 3 4
#i2 B 4 5
# C 5 6
data2 = [[11, 12], [12, 13]]
index2 = [['i2', 'i1'], ['C', 'C']]
df2 = pd.DataFrame(data2, index=pd.MultiIndex.from_arrays(index2, names=index_names), columns=columns)
print(df2)
## OUTPUT
# column1 column2
#index1 index2
#i2 C 11 12
#i1 C 12 13
## DO SOMETHING!
## EXPECTED OUTPUT
# column1 column2
#index1 index2
#i1 A 1 2
# B 2 3
# C 12 13 # REPLACED!
#i2 B 4 5
# C 11 12 # REPLACED!
Environment
Python 3.10.5
Pandas 1.4.3
You can use direct assignment via .loc or a call to .update:
>>> df3 = df1.copy()
>>> df3.update(df2)
>>> df3
column1 column2
index1 index2
i1 A 1.0 2.0
B 2.0 3.0
C 12.0 13.0
i2 B 4.0 5.0
C 11.0 12.0
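The .loc route mentioned at the top of this answer can be sketched like this (the assignment aligns df2's rows on the MultiIndex; this is one possible form, not the only one):

```python
import pandas as pd

index_names = ['index1', 'index2']
df1 = pd.DataFrame([[1, 2], [2, 3], [3, 4], [4, 5], [5, 6]],
                   index=pd.MultiIndex.from_arrays(
                       [['i1', 'i1', 'i1', 'i2', 'i2'], ['A', 'B', 'C', 'B', 'C']],
                       names=index_names),
                   columns=['column1', 'column2'])
df2 = pd.DataFrame([[11, 12], [12, 13]],
                   index=pd.MultiIndex.from_arrays(
                       [['i2', 'i1'], ['C', 'C']], names=index_names),
                   columns=['column1', 'column2'])

df3 = df1.copy()
# select the rows of df3 labelled by df2's index and assign df2 into them;
# pandas aligns the right-hand side on index and columns
df3.loc[df2.index, :] = df2
```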
You can try pd.merge() on df2 and then fill the missing values from df1 with .fillna():
pd.merge(df2, df1, left_index=True, right_index=True, how='right', suffixes=('', '_df1')).fillna(df1).iloc[:,:2]
I assume that there is a typo in your question and it should be:
index2 = [['i2', 'i1'], ['C', 'C']]
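For this shape of problem, pandas also offers DataFrame.combine_first, which takes values from the caller and falls back to the argument where they are missing; note the result index is sorted and values may be upcast to float, much as with update. A sketch using the question's data:

```python
import pandas as pd

index_names = ['index1', 'index2']
df1 = pd.DataFrame([[1, 2], [2, 3], [3, 4], [4, 5], [5, 6]],
                   index=pd.MultiIndex.from_arrays(
                       [['i1', 'i1', 'i1', 'i2', 'i2'], ['A', 'B', 'C', 'B', 'C']],
                       names=index_names),
                   columns=['column1', 'column2'])
df2 = pd.DataFrame([[11, 12], [12, 13]],
                   index=pd.MultiIndex.from_arrays(
                       [['i2', 'i1'], ['C', 'C']], names=index_names),
                   columns=['column1', 'column2'])

# df2's values win wherever present; everything else comes from df1
out = df2.combine_first(df1)
```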
Related
There are two dataframes, and one might have fewer columns than the other. For instance,
import pandas as pd
import numpy as np
df = pd.DataFrame({
'col1': ['A', 'B'],
'col2': [2, 9],
'col3': [0, 1]
})
df1 = pd.DataFrame({
'col1': ['G'],
'col2': [3]
})
I would like to combine these two dataframes, assigning some given value like -100 to the missing entries. How can I perform this kind of combination?
You could reindex the DataFrames first to "preserve" the dtypes; then concatenate:
cols = df.columns.union(df1.columns)
out = pd.concat([d.reindex(columns=cols, fill_value=-100) for d in [df, df1]],
ignore_index=True)
Output:
col1 col2 col3
0 A 2 0
1 B 9 1
2 G 3 -100
Use concat with DataFrame.fillna:
df = pd.concat([df, df1], ignore_index=True).fillna(-100)
print (df)
col1 col2 col3
0 A 2 0.0
1 B 9 1.0
2 G 3 -100.0
If you need the same dtypes, add DataFrame.astype (the dtype mapping is built with pd.concat, since Series.append was removed in pandas 2.0):
d = pd.concat([df.dtypes, df1.dtypes]).to_dict()
df = pd.concat([df, df1], ignore_index=True).fillna(-100).astype(d)
print (df)
col1 col2 col3
0 A 2 0
1 B 9 1
2 G 3 -100
Two sample dataframes with different index values, but identical column names and order:
df1 = pd.DataFrame([[1, '', 3], ['', 2, '']], columns=['A', 'B', 'C'], index=[2,4])
df2 = pd.DataFrame([['', 4, ''], [5, '', 6]], columns=['A', 'B', 'C'], index=[7,9])
df1
   A  B  C
2  1     3
4     2
df2
   A  B  C
7     4
9  5     6
I know how to concat the two dataframes, but that gives this:
   A  B  C
2  1     3
4     2
omitting the non-matching indexes from the other df. The result I am trying to achieve is:
A B C
0 1 4 3
1 5 2 6
I want to combine the rows in the same position from each df so that missing values in one df are replaced by the corresponding value in the other.
concat and merge have not been up to the job, I have found.
I assume I need identical indexes in each df corresponding to the rows I want to merge into one. But, so far, no luck getting it to come out correctly. Any pandas transformational wisdom is appreciated.
This merge attempt did not do the trick:
df1.merge(df2, on='A', how='outer')
The solutions below were all offered before I edited the question. My fault there, I neglected to point out that my actual data has different indexes in the two dataframes.
Let us try mask:
out = df1.mask(df1 == '', df2)
Out[428]:
A B C
0 1 4 3
1 5 2 6
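Note that mask aligns on index labels, so with the differing indexes mentioned in the question's edit, a positional reset is needed first. A sketch (using the df2 values shown in the question's output):

```python
import pandas as pd

df1 = pd.DataFrame([[1, '', 3], ['', 2, '']], columns=['A', 'B', 'C'], index=[2, 4])
df2 = pd.DataFrame([['', 4, ''], [5, '', 6]], columns=['A', 'B', 'C'], index=[7, 9])

# drop the original labels so the two frames align by position
a = df1.reset_index(drop=True)
b = df2.reset_index(drop=True)

# wherever a holds an empty string, take the value from b instead
out = a.mask(a == '', b)
```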
for i in range(df1.shape[0]):
    for j in range(df1.shape[1]):
        if df1.iloc[i, j] == "":
            df1.iloc[i, j] = df2.iloc[i, j]
print(df1)
A B C
0 1 4 3
1 5 2 6
Since the indexes of your two dataframes are different, it's easier to give them the same index.
index = [i for i in range(len(df1))]
df1.index = index
df2.index = index
ddf = df1.replace('', np.nan).fillna(df2)
If df1 and df2 have different amounts of data, this still works.
df1 = pd.DataFrame([[1, '', 3], ['', 2, ''],[7,8,9],[10,11,12]], columns=['A', 'B', 'C'],index=[7,8,9,10])
index1 = [i for i in range(len(df1))]
index2 = [i for i in range(len(df2))]
df1.index = index1
df2.index = index2
df1.replace('', np.nan).fillna(df2)
You get:
Out[17]:
A B C
0 1.0 5 3.0
1 4 2.0 6
2 7.0 8.0 9.0
3 10.0 11.0 12.0
Let's suppose I have the following dataframe:
df = pd.DataFrame({'id': [1, 2, 3, 4, 5], 'val': [0, 0, 0, 0, 0]})
I want to modify the column val with values from other dataframes like these:
df1 = pd.DataFrame({'id': [2, 3], 'val': [1, 1]})
df2 = pd.DataFrame({'id': [1, 5], 'val': [2, 2]})
I need a function merge_values_into_df that would work in the way to provide the following result:
df = merge_values_into_df(df1, on='id', field='val')
df = merge_values_into_df(df2, on='id', field='val')
print(df)
id val
0 1 2
1 2 1
2 3 1
3 4 0
4 5 2
I need an efficient (by CPU and memory) solution because I want to apply the approach to huge dataframes.
Use DataFrame.update, converting id to the index in all the DataFrames:
df = df.set_index('id')
df1 = df1.set_index('id')
df2 = df2.set_index('id')
df.update(df1)
df.update(df2)
df = df.reset_index()
print (df)
id val
0 1 2.0
1 2 1.0
2 3 1.0
3 4 0.0
4 5 2.0
You can concat all the dataframes and drop duplicated ids, keeping the last occurrence of each id.
out = pd.concat([df, df1, df2]).drop_duplicates('id', keep='last')
print(out.sort_values('id', ignore_index=True))
# Output
id val
0 1 2
1 2 1
2 3 1
3 4 0
4 5 2
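The update approach can be wrapped into the requested helper. A sketch: the name and calling pattern come from the question, but the explicit df parameter is an assumption (the question's calls pass only the overriding dataframe):

```python
import pandas as pd

def merge_values_into_df(df, other, on, field):
    """Overwrite df[field] with other[field] for rows whose `on` values match."""
    tmp = df.set_index(on)
    # update aligns on the index (here: the `on` column) and ignores NaN
    tmp.update(other.set_index(on)[[field]])
    return tmp.reset_index()

df = pd.DataFrame({'id': [1, 2, 3, 4, 5], 'val': [0, 0, 0, 0, 0]})
df1 = pd.DataFrame({'id': [2, 3], 'val': [1, 1]})
df2 = pd.DataFrame({'id': [1, 5], 'val': [2, 2]})

df = merge_values_into_df(df, df1, on='id', field='val')
df = merge_values_into_df(df, df2, on='id', field='val')
```

Note that update upcasts val to float, as in the answer above; add .astype(int) at the end if the original dtype matters.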
I have 2 dfs:
df1
a b
0 1 2
1 3 4
df2
c d
0 5 4
1 2 3
After concat, I get weird column names:
[In:]
df3=pd.concat([df1, df2], axis=1)
[Out:]
a b (c,) (d,)
0 1 2 5 4
1 3 4 2 3
df2 previously had tuples in its columns; maybe that's the reason.
If I check the dtypes, I get int64 for all columns.
If I just had to rename the columns it would not be a problem, but operating with these columns brings up a problem with their dimension.
Does anyone understand the issue?
You can flatten the column index using a list comprehension:
df3.columns = [x for t in df3.columns.to_list() for x in t]
Example:
>>> df1 = pd.DataFrame({'a':[1, 3], 'b':[2, 4]})
>>> df2 = pd.DataFrame([[5, 4],[2, 3]], columns = pd.MultiIndex(levels=[[ 'c', 'd']], codes=[[0, 1]]))
>>> df3 = pd.concat([df1, df2], axis=1)
>>> df3
a b (c,) (d,)
0 1 2 5 4
1 3 4 2 3
>>> df3.columns = [x for t in df3.columns.to_list() for x in t]
>>> df3
a b c d
0 1 2 5 4
1 3 4 2 3
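One caveat: the comprehension above iterates over each column label, and it only works for the plain labels 'a' and 'b' because they are single characters (iterating a string yields its characters). A version that is safe for multi-character names might check for tuples explicitly; a sketch, with hypothetical multi-character names and the mixed labels built directly:

```python
import pandas as pd

# columns as they come out of the concat above: a mix of plain strings and 1-tuples
df3 = pd.DataFrame([[1, 2, 5, 4], [3, 4, 2, 3]],
                   columns=['alpha', 'beta', ('c',), ('d',)])

# join tuple labels into strings, leave plain string labels untouched
df3.columns = [''.join(c) if isinstance(c, tuple) else c for c in df3.columns]
```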
Flatten your column headers:
df1 = pd.DataFrame({'a':[1, 3], 'b':[2, 4]})
df2 = pd.DataFrame([[5, 4],[2, 3]], columns = pd.MultiIndex(levels=[[ 'c', 'd']], codes=[[0, 1]]))
df2.columns = df2.columns.map(''.join)
df3 = pd.concat([df1, df2], axis=1)
df3
Output:
a b c d
0 1 2 5 4
1 3 4 2 3
I have two data frames:
df1
A1 B1
1 a
2 s
3 d
and
df2
A1 B1
1 a
2 x
3 d
I want to compare df1 and df2 on column B1. The column A1 can be used to join. I want to know:
Which rows are different in df1 and df2 with respect to column B1?
Whether there is a mismatch in the values of column A1; for example, whether df2 is missing some values that are present in df1 and vice versa. And if so, which ones?
I tried using merge and join but that is not what I am looking for.
I've edited the raw data to illustrate the case of A1 keys in one dataframe but not the other.
When doing your merge, you want to specify an 'outer' merge so that you can see these items with an A1 key in one dataframe but not the other.
I've included the suffixes '_1' and '_2' to indicate the dataframe source (_1 = df1 and _2 = df2) of column B1.
df1 = pd.DataFrame({'A1': [1, 2, 3, 4], 'B1': ['a', 'b', 'c', 'd']})
df2 = pd.DataFrame({'A1': [1, 2, 3, 5], 'B1': ['a', 'd', 'c', 'e']})
df3 = df1.merge(df2, how='outer', on='A1', suffixes=['_1', '_2'])
df3['check'] = df3.B1_1 == df3.B1_2
>>> df3
A1 B1_1 B1_2 check
0 1 a a True
1 2 b d False
2 3 c c True
3 4 d NaN False
4 5 NaN e False
To check for missing A1 keys in df1 and df2:
# A1 value missing in `df1`
>>> df3[df3.B1_1.isnull()]
A1 B1_1 B1_2 check
4 5 NaN e False
# A1 value missing in `df2`
>>> df3[df3.B1_2.isnull()]
A1 B1_1 B1_2 check
3 4 d NaN False
EDIT
Thanks to @EdChum (the source of all Pandas knowledge...).
df3 = df1.merge(df2, how='outer', on='A1', suffixes=['_1', '_2'], indicator=True)
df3['check'] = df3.B1_1 == df3.B1_2
>>> df3
A1 B1_1 B1_2 _merge check
0 1 a a both True
1 2 b d both False
2 3 c c both True
3 4 d NaN left_only False
4 5 NaN e right_only False
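Building on the indicator merge, the two kinds of mismatch can then be pulled out with boolean masks (a sketch using the same df1/df2):

```python
import pandas as pd

df1 = pd.DataFrame({'A1': [1, 2, 3, 4], 'B1': ['a', 'b', 'c', 'd']})
df2 = pd.DataFrame({'A1': [1, 2, 3, 5], 'B1': ['a', 'd', 'c', 'e']})
df3 = df1.merge(df2, how='outer', on='A1', suffixes=['_1', '_2'], indicator=True)
df3['check'] = df3.B1_1 == df3.B1_2

# rows where B1 differs (this also catches rows missing from one side,
# because NaN never compares equal)
mismatches = df3[~df3['check']]

# rows whose A1 key exists in only one of the dataframes
one_side_only = df3[df3['_merge'] != 'both']
```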