df1 = pd.DataFrame({'A':[3,5,2,5], 'B':['w','x','y','z'], 'C':['0','0','0','0']})
df2 = pd.DataFrame({'B':['w','x','y','z'],'C':['1','2','3','4'], 'D':[10,20,30,40]})
I'm trying to merge df1 and df2 on B and keep all A B C and D columns:
A B C D
0 3.0 w 1 10.0
1 5.0 x 2 20.0
2 2.0 y 3 30.0
3 5.0 z 4 40.0
I've tried df1.merge(df2, how='outer', on='B')
A B C_x C_y D
0 3 w 0 1 10
1 5 x 0 2 20
2 2 y 0 3 30
3 5 z 0 4 40
which is almost what I want, but need C in df2 to replace C in df1. How can I achieve that?
If you don't want C from the lefthand side at all you could simply drop it before the merge:
df1 = pd.DataFrame({'A':[3,5,2,5], 'B':['w','x','y','z'], 'C':['0','0','0','0']})
df2 = pd.DataFrame({'B':['w','x','y','z'],'C':['1','2','3','4'], 'D':[10,20,30,40]})
result = pd.merge(
df1.drop('C', axis=1),
df2,
how='outer',
on='B')
A B C D
0 3 w 1 10
1 5 x 2 20
2 2 y 3 30
3 5 z 4 40
Edit:
However, if you're wanting to combine in cases where you don't have C from df2 you could utilize combine_first():
df1 = pd.DataFrame({'A':[3,5,2,5,6], 'B':['w','x','y','z','q'], 'C':['0','0','0','0','88']})
df2 = pd.DataFrame({'B':['w','x','y','z'],'C':['1','2','3','4'], 'D':[10,20,30,40]})
result_2 = pd.merge(df1, df2, how='outer', on='B')
result_2['C'] = result_2['C_y'].combine_first(result_2['C_x'])
result_2.drop(['C_x', 'C_y'], axis=1, inplace=True)
A B D C
0 3 w 10.0 1
1 5 x 20.0 2
2 2 y 30.0 3
3 5 z 40.0 4
4 6 q 88
here is one way to do it, by choosing the column you need to include in the merge
df1[['A','B']].merge(df2,
on='B',
how='outer')
A B C D
0 3 w 1 10
1 5 x 2 20
2 2 y 3 30
3 5 z 4 40
I have two DataFrames:
df = pd.DataFrame({'A':[1,2],
'B':[3,4]})
A B
0 1 3
1 2 4
df2 = pd.DataFrame({'A':[3,2,1],
'C':[5,6,7]})
A C
0 3 5
1 2 6
2 1 7
and I want to merge in a way that the column 'A' add the different values between DataFrames but merge the duplicates.
Desired output:
A B C
0 3 NaN 5
1 2 4 6
2 1 3 7
You can use combine_first:
df2 = df2.combine_first(df)
Output:
A B C
0 1 3.0 5
1 2 4.0 6
2 3 NaN 7
I have a df
a b c
0 3 0
1 1 4
2 3 3
4 4 1
I want to compare a and b to c. If a value in the same row is equal to c I want 'nan' in a and/or b.
Like that:
a b c
nan 3 0
1 1 4
2 nan 3
4 4 1
We can use to_numpy with DataFrame.mask for this:
eqs = df.loc[:, :'b'].eq(df['c'].to_numpy()[:, None])
df.loc[:, :'b'] = df.loc[:, :'b'].mask(eqs)
a b c
0 NaN 3.0 0
1 1.0 1.0 4
2 2.0 NaN 3
3 4.0 4.0 1
I want to fill the NaN values on my dataframe on column c with the mean for only rows who has as category B, and ignore the others.
print (df)
Category b c
0 A 1 5.0
1 C 1 NaN
2 A 1 4.0
3 B 2 NaN
4 A 2 1.0
5 B 2 Nan
6 C 1 3.0
7 C 1 2.0
8 B 1 NaN
So what I'm doing for the moment is :
df.c = df.c.fillna(df.c.mean())
But it fill all the NaN values, while I want only to fill the 3rd, 5th and the 8th rows who had category value equal to B.
Combine fillna with slicing assignment
df.loc[df.Category.eq('B'), 'c'] = (df.loc[df.Category.eq('B'), 'c'].
fillna(df.c.mean()))
Out[736]:
Category b c
0 A 1 5.0
1 C 1 NaN
2 A 1 4.0
3 B 2 3.0
4 A 2 1.0
5 B 2 3.0
6 C 1 3.0
7 C 1 2.0
8 B 1 3.0
Or a direct assignment with 2 masks
pandas.DataFrame.eq is the element wise equality operator.
df.loc[df.Category.eq('B') & df.c.isna(), 'c'] = df.c.mean()
Out[745]:
Category b c
0 A 1 5.0
1 C 1 NaN
2 A 1 4.0
3 B 2 3.0
4 A 2 1.0
5 B 2 3.0
6 C 1 3.0
7 C 1 2.0
8 B 1 3.0
This would be the answer for your question:
df.c = df.apply(
lambda row: row['c'].fillna(df.c.mean()) if row['Category']=='B' else row['c'] ,axis=1)
What is the best way to add the contents of two dataframes, which have mostly equivalent indices:
df1:
A B C
A 0 3 1
B 3 0 2
C 1 2 0
df2:
A B C D
A 0 1 1 0
B 1 0 3 2
C 1 3 0 0
D 0 2 0 0
df1 + df2 =
A B C D
A 0 4 2 0
B 4 0 5 2
C 2 5 0 0
D 0 2 0 0
You can also concat both the dataframes since concatenation (by default) happens by index.
# sample dataframe
df1 = pd.DataFrame({'a': [1,2,3], 'b':[2,3,4]}, index=['a','c','e'])
df2 = pd.DataFrame({'a': [10,20], 'b':[11,22]}, index=['b','d'])
new_df= pd.concat([df1, df2]).sort_index()
print(new_df)
a b
a 1 2
b 10 11
c 2 3
d 20 22
e 3 4
I think you can just add:
In [625]: df1.add(df2,fill_value=0)
Out[625]:
A B C D
A 0.0 4.0 2.0 0.0
B 4.0 0.0 5.0 2.0
C 2.0 5.0 0.0 0.0
D 0.0 2.0 0.0 0.0