Pandas: Merge two dataframe columns - python

Consider two dataframes:
df_a = pd.DataFrame([
['a', 1],
['b', 2],
['c', NaN],
], columns=['name', 'value'])
df_b = pd.DataFrame([
['a', 1],
['b', NaN],
['c', 3],
['d', 4]
], columns=['name', 'value'])
So looking like
# df_a
name value
0 a 1
1 b 2
2 c NaN
# df_b
name value
0 a 1
1 b NaN
2 c 3
3 d 4
I want to merge these two dataframes and fill in the NaN values of the value column with the existing values in the other column. In other words, I want out:
# DESIRED RESULT
name value
0 a 1
1 b 2
2 c 3
3 d 4
Sure, I can do this with a custom .map or .apply, but I want a solution that uses merge or the like, not writing a custom merge function. How can this be done?

I think you can use combine_first:
print (df_b.combine_first(df_a))
name value
0 a 1.0
1 b 2.0
2 c 3.0
3 d 4.0
Or fillna:
print (df_b.fillna(df_a))
name value
0 a 1.0
1 b 2.0
2 c 3.0
3 d 4.0
Solution with update is not so common as combine_first:
df_b.update(df_a)
print (df_b)
name value
0 a 1.0
1 b 2.0
2 c 3.0
3 d 4.0

Related

Update a DataFrame with duplicate destination

I would like to update a dataframe with another one but with multiple "destination". Here is an example
df1 = pd.DataFrame({'name':['A', 'B', 'C', 'A'], 'category':['X', 'X', 'Y', 'Y'], 'value1':[None, 1, None, None], 'value2':[None, 10, None, None]})
name category value1 value2
0 A X NaN NaN
1 B X 1.0 10.0
2 C Y NaN NaN
3 A Y NaN NaN
df2 = pd.DataFrame({'name':['A', 'C'], 'value1':[2, 3], 'value2':[11, 12]})
name value1 value2
0 A 2 11
1 C 3 12
And the desired result would be
name category value1 value2
0 A X 2.0 11.0
1 B X 1.0 10.0
2 C Y 3.0 12.0
3 A Y 2.0 11.0
I don't think pd.update works since there are two time 'A' in my first DataFrame.
pd.merge creates other columns and I think there is probably a more elegant way than to merge these columns manually after their creation
Thanks in advance for your help!
You can use fillna after mapping the column A in df1 with the corresponding values from df2:
mapping = df2.set_index('name')['value']
df1['value'] = df1['value'].fillna(df1['name'].map(mapping))
If you want to map multiple columns:
mapping = df2.set_index('name')
for col in mapping:
df1[col] = df1[col].fillna(df1['name'].map(mapping[col]))
Alternatively you can try merge:
df = df1.merge(df2, on='name', how='left', suffixes=['', '_r'])
df.groupby(df.columns.str.rstrip('_r'), axis=1, sort=False).first()
name category value1 value2
0 A X 2.0 11.0
1 B X 1.0 10.0
2 C Y 3.0 12.0
3 A Y 2.0 11.0

Pandas merge dataframes with shared column, fillna in left with right

I am trying to merge two dataframes and replace the nan in the left df with the right df, I can do it with three lines of code as below, but I want to know if there is a better/shorter way?
# Example data (my actual df is ~500k rows x 11 cols)
df1 = pd.DataFrame({'a': [1,2,3,4], 'b': [0,1,np.nan, 1], 'e': ['a', 1, 2,'b']})
df2 = pd.DataFrame({'a': [1,2,3,4], 'b': [np.nan, 1, 0, 1]})
# Merge the dataframes...
df = df1.merge(df2, on='a', how='left')
# Fillna in 'b' column of left df with right df...
df['b'] = df['b_x'].fillna(df['b_y'])
# Drop the columns no longer needed
df = df.drop(['b_x', 'b_y'], axis=1)
The problem confusing merge is that both dataframes have a 'b' column, but the left and right versions have NaNs in mismatched places. You want to avoid getting unwanted multiple 'b' columns 'b_x', 'b_y' from merge in the first place:
slice the non-shared columns 'a','e' from df1
do merge(df2, 'left'), this will pick up 'b' from the right dataframe (since it only exists in the right df)
finally do df1.update(...) , this will update the NaNs in the column 'b' taken from df2 with df1['b']
Solution:
df1.update(df1[['a', 'e']].merge(df2, 'left'))
df1
a b e
0 1 0.0 a
1 2 1.0 1
2 3 0.0 2
3 4 1.0 b
Note: Because I used merge(..., how='left'), I preserve the row order of the calling dataframe. If my df1 had values of a that were not in order
a b e
0 1 0.0 a
1 2 1.0 1
2 4 1.0 b
3 3 NaN 2
The result would be
df1.update(df1[['a', 'e']].merge(df2, 'left'))
df1
a b e
0 1 0.0 a
1 2 1.0 1
2 4 1.0 b
3 3 0.0 2
Which is as expected.
Further...
If you want to be more explicit when there may be more columns involved
df1.update(df1.drop('b', 1).merge(df2, 'left', 'a'))
Even Further...
If you don't want to update the dataframe, we can use combine_first
Quick
df1.combine_first(df1[['a', 'e']].merge(df2, 'left'))
Explicit
df1.combine_first(df1.drop('b', 1).merge(df2, 'left', 'a'))
EVEN FURTHER!...
The 'left' merge may preserve order but NOT the index. This is the ultra conservative approach:
df3 = df1.drop('b', 1).merge(df2, 'left', on='a').set_index(df1.index)
df1.combine_first(df3)
Short version
df1.b.fillna(df1.a.map(df2.set_index('a').b),inplace=True)
df1
Out[173]:
a b e
0 1 0.0 a
1 2 1.0 1
2 3 0.0 2
3 4 1.0 b
Since you mentioned there will be multiple columns
df = df1.combine_first(df1[['a']].merge(df2, on='a', how='left'))
df
Out[184]:
a b e
0 1 0.0 a
1 2 1.0 1
2 3 0.0 2
3 4 1.0 b
Also we can pass to fillna with df
df1.fillna(df1[['a']].merge(df2, on='a', how='left'))
Out[185]:
a b e
0 1 0.0 a
1 2 1.0 1
2 3 0.0 2
3 4 1.0 b
Only if the indices are alligned (important note), we can use update:
df1['b'].update(df2['b'])
a b e
0 1 0.0 a
1 2 1.0 1
2 3 0.0 2
3 4 1.0 b
Or simply fillna:
df1['b'].fillna(df2['b'], inplace=True)
If you're indices are not alligned, see WenNYoBen's answer or comment underneath.
You can mask the data.
original data:
print(df)
one two three
0 1 1.0 1.0
1 2 NaN 2.0
2 3 3.0 NaN
print(df2)
one two three
0 4 4 4
1 4 2 4
2 4 4 3
See below, mask just fills based on condition.
# mask values where isna()
df1[['two','three']] = df1[['two','three']]\
.mask(df1[['two','three']].isna(),df2[['two','three']])
output:
one two three
0 1 1.0 1.0
1 2 2.0 2.0
2 3 3.0 3.0

DataFrame take union of columns and retain find first non-NaN value

Dataframe df has many thousand columns and rows. For a subset of columns that are given in a particular sequence, say columns B, C, E, I want to fill NaN values in B with first non-NaN value found in remaining columns (C, E) searching sequentially. Finally C, E are dropped
Sample df can be built as follows:
import numpy as np
import pandas as pd
df = pd.DataFrame(10*(2+np.random.randn(6, 5)), columns=list('ABCDE'))
df.loc[1, 'B'] = np.nan
df.loc[2, 'B'] = np.nan
df.loc[5, 'B'] = np.nan
df.loc[2, 'C'] = np.nan
df.loc[5, 'C'] = np.nan
df.loc[2, 'D'] = np.nan
df.loc[2, 'E'] = np.nan
df.loc[4, 'E'] = np.nan
df
A B C D E
0 18.161033 6.453597 25.253036 18.542586 20.667311
1 27.629402 NaN 40.654821 22.804547 23.633502
2 15.459256 NaN NaN NaN NaN
3 19.115203 4.002131 14.167508 23.796780 29.557706
4 27.180622 NaN 20.763618 15.923794 NaN
5 17.917170 NaN NaN 21.865184 9.867743
The expected outcome is as follows:
A B D
0 18.161033 6.453597 18.542586
1 27.629402 40.654821 22.804547
2 15.459256 NaN NaN
3 19.115203 4.002131 23.796780
4 27.180622 20.763618 15.923794
5 17.917170 9.867743 21.865184
Here is one way
drop = ['C', 'E']
fill= 'B'
d=dict(zip(df.columns,[fill if x in drop else x for x in df.columns.tolist() ]))
df.groupby(d,axis=1).first()
Out[172]:
A B D
0 14.472915 30.598602 24.528571
1 22.010242 22.215140 15.412039
2 5.383674 NaN NaN
3 38.265940 24.746673 35.367622
4 22.730089 20.244289 27.570413
5 31.216037 15.496690 9.746814
IIUC, use bfill to backfill, then drop to remove unwanted columns.
df.assign(B=df[['B', 'C', 'E']].bfill(axis=1)['B']).drop(['C', 'E'], axis=1)
A B D
0 18.161033 6.453597 18.542586
1 27.629402 40.654821 22.804547
2 15.459256 NaN NaN
3 19.115203 4.002131 23.796780
4 27.180622 20.763618 15.923794
5 17.917170 9.867743 21.865184
Here's a slightly more generalised version of the one above,
to_drop = ['C', 'E']
upd = 'B'
df.update(df[[upd, *to_drop]].bfill(axis=1)[upd]) # in-place
df.drop(to_drop, axis=1) # not in-place, need to assign
A B D
0 18.161033 6.453597 18.542586
1 27.629402 40.654821 22.804547
2 15.459256 NaN NaN
3 19.115203 4.002131 23.796780
4 27.180622 20.763618 15.923794
5 17.917170 9.867743 21.865184

How to plot one column in different graphs?

I have the following problem. I have this kind of a dataframe:
f = pd.DataFrame([['Meyer', 2], ['Mueller', 4], ['Radisch', math.nan], ['Meyer', 2],['Pavlenko', math.nan]])
is there an elegant way to split the DataFrame up in several dataframes by the first column? So, I would like to get a dataframe where first column = 'Müller' and another one for first column = Radisch.
Thanks in advance,
Erik
You can loop by unique values of column A with boolean indexing:
df = pd.DataFrame([['Meyer', 2], ['Mueller', 4],
['Radisch', np.nan], ['Meyer', 2],
['Pavlenko', np.nan]])
df.columns = list("AB")
print (df)
A B
0 Meyer 2.0
1 Mueller 4.0
2 Radisch NaN
3 Meyer 2.0
4 Pavlenko NaN
print (df.A.unique())
['Meyer' 'Mueller' 'Radisch' 'Pavlenko']
for x in df.A.unique():
print(df[df.A == x])
A B
0 Meyer 2.0
3 Meyer 2.0
A B
1 Mueller 4.0
A B
2 Radisch NaN
A B
4 Pavlenko NaN
Then use dict comprehension - get dictionary of DataFrames:
dfs = {x:df[df.A == x].reset_index(drop=True) for x in df.A.unique()}
print (dfs)
{'Meyer': A B
0 Meyer 2.0
1 Meyer 2.0, 'Radisch': A B
0 Radisch NaN, 'Mueller': A B
0 Mueller 4.0, 'Pavlenko': A B
0 Pavlenko NaN}
print (dfs.keys())
dict_keys(['Meyer', 'Radisch', 'Mueller', 'Pavlenko'])
print (dfs['Meyer'])
A B
0 Meyer 2.0
1 Meyer 2.0
print (dfs['Pavlenko'])
A B
0 Pavlenko NaN

in Pandas, how to create a variable that is n for the nth observation within a group?

consider this
df = pd.DataFrame({'B': ['a', 'a', 'b', 'b'], 'C': [1, 2, 6,2]})
df
Out[128]:
B C
0 a 1
1 a 2
2 b 6
3 b 2
I want to create a variable that simply corresponds to the ordering of observations after sorting by 'C' within each groupby('B') group.
df.sort_values(['B','C'])
Out[129]:
B C order
0 a 1 1
1 a 2 2
3 b 2 1
2 b 6 2
How can I do that? I am thinking about creating a column that is one, and using cumsum but that seems too clunky...
I think you can use range with len(df):
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3],
'B': ['a', 'a', 'b'],
'C': [5, 3, 2]})
print df
A B C
0 1 a 5
1 2 a 3
2 3 b 2
df.sort_values(by='C', inplace=True)
#or without inplace
#df = df.sort_values(by='C')
print df
A B C
2 3 b 2
1 2 a 3
0 1 a 5
df['order'] = range(1,len(df)+1)
print df
A B C order
2 3 b 2 1
1 2 a 3 2
0 1 a 5 3
EDIT by comment:
I think you can use groupby with cumcount:
import pandas as pd
df = pd.DataFrame({'B': ['a', 'a', 'b', 'b'], 'C': [1, 2, 6,2]})
df.sort_values(['B','C'], inplace=True)
#or without inplace
#df = df.sort_values(['B','C'])
print df
B C
0 a 1
1 a 2
3 b 2
2 b 6
df['order'] = df.groupby('B', sort=False).cumcount() + 1
print df
B C order
0 a 1 1
1 a 2 2
3 b 2 1
2 b 6 2
Nothing wrong with Jezrael's answer but there's a simpler (though less general) method in this particular example. Just add groupby to JohnGalt's suggestion of using rank.
>>> df['order'] = df.groupby('B')['C'].rank()
B C order
0 a 1 1.0
1 a 2 2.0
2 b 6 2.0
3 b 2 1.0
In this case, you don't really need the ['C'] but it makes the ranking a little more explicit and if you had other unrelated columns in the dataframe then you would need it.
But if you are ranking by more than 1 column, you should use Jezrael's method.

Categories

Resources