I am trying to merge two dataframes and replace the NaN values in the left df with values from the right df. I can do it with the three lines of code below, but I want to know if there is a better/shorter way.
# Example data (my actual df is ~500k rows x 11 cols)
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [0, 1, np.nan, 1], 'e': ['a', 1, 2, 'b']})
df2 = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [np.nan, 1, 0, 1]})
# Merge the dataframes...
df = df1.merge(df2, on='a', how='left')
# Fillna in 'b' column of left df with right df...
df['b'] = df['b_x'].fillna(df['b_y'])
# Drop the columns no longer needed
df = df.drop(['b_x', 'b_y'], axis=1)
The confusing part of the merge is that both dataframes have a 'b' column, with NaNs in mismatched places between the left and right versions. To avoid getting the unwanted duplicate columns 'b_x' and 'b_y' from merge in the first place:
slice the non-shared columns 'a', 'e' from df1
do merge(df2, 'left'); this picks up 'b' from the right dataframe, since after the slice 'b' only exists in the right df
finally do df1.update(...); update overwrites df1 with the non-NaN values of the merged frame, which here fills the NaNs in df1['b'] with the matching 'b' values from df2
Solution:
df1.update(df1[['a', 'e']].merge(df2, 'left'))
df1
a b e
0 1 0.0 a
1 2 1.0 1
2 3 0.0 2
3 4 1.0 b
Note: because I used merge(..., how='left'), the row order of the calling dataframe is preserved. If my df1 had values of 'a' that were not in order:
a b e
0 1 0.0 a
1 2 1.0 1
2 4 1.0 b
3 3 NaN 2
The result would be
df1.update(df1[['a', 'e']].merge(df2, 'left'))
df1
a b e
0 1 0.0 a
1 2 1.0 1
2 4 1.0 b
3 3 0.0 2
Which is as expected.
Further...
If you want to be more explicit when there may be more columns involved:
df1.update(df1.drop(columns='b').merge(df2, 'left', on='a'))
Even Further...
If you don't want to update the dataframe in place, we can use combine_first
Quick
df1.combine_first(df1[['a', 'e']].merge(df2, 'left'))
Explicit
df1.combine_first(df1.drop(columns='b').merge(df2, 'left', on='a'))
EVEN FURTHER!...
The 'left' merge may preserve order but NOT the index. This is the ultra-conservative approach:
df3 = df1.drop(columns='b').merge(df2, 'left', on='a').set_index(df1.index)
df1.combine_first(df3)
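To see why the realignment matters, here is a minimal sketch; the non-default index on df1 is my own illustration, not part of the original example:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [0, 1, np.nan, 1], 'e': ['a', 1, 2, 'b']},
                   index=[10, 11, 12, 13])
df2 = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [np.nan, 1, 0, 1]})

# merge returns a fresh RangeIndex, so realign to df1's labels before
# combine_first, which matches rows by index label
df3 = df1.drop(columns='b').merge(df2, 'left', on='a').set_index(df1.index)
print(df1.combine_first(df3))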
Short version
df1['b'] = df1['b'].fillna(df1['a'].map(df2.set_index('a')['b']))
df1
Out[173]:
a b e
0 1 0.0 a
1 2 1.0 1
2 3 0.0 2
3 4 1.0 b
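Decomposed into steps, this is what the map is doing (an illustrative breakdown using the question's example data, not part of the original answer):
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [0, 1, np.nan, 1], 'e': ['a', 1, 2, 'b']})
df2 = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [np.nan, 1, 0, 1]})

lookup = df2.set_index('a')['b']    # Series mapping key 'a' -> value 'b'
filler = df1['a'].map(lookup)       # look up each df1['a'] in that Series
df1['b'] = df1['b'].fillna(filler)  # fill only where df1['b'] is NaN
print(df1)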
Since you mentioned there will be multiple columns
df = df1.combine_first(df1[['a']].merge(df2, on='a', how='left'))
df
Out[184]:
a b e
0 1 0.0 a
1 2 1.0 1
2 3 0.0 2
3 4 1.0 b
We can also pass a DataFrame to fillna:
df1.fillna(df1[['a']].merge(df2, on='a', how='left'))
Out[185]:
a b e
0 1 0.0 a
1 2 1.0 1
2 3 0.0 2
3 4 1.0 b
Only if the indices are aligned (important note) can we use update:
df1.update(df2[['b']])
a b e
0 1 0.0 a
1 2 1.0 1
2 3 0.0 2
3 4 1.0 b
Or simply fillna:
df1['b'] = df1['b'].fillna(df2['b'])
If your indices are not aligned, see WenNYoBen's answer or the comment underneath.
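If the frames share a key column, one way to align them first is to reindex df2 by that key; this sketch assumes the 'a' key column from the question and is not part of the original answer:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 4, 3], 'b': [0, 1, 1, np.nan]})
df2 = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [np.nan, 1, 0, 1]})

# put df2's rows into df1's key order, then the index-based fillna is safe
df2_aligned = df2.set_index('a').reindex(df1['a']).reset_index()
df1['b'] = df1['b'].fillna(df2_aligned['b'])
print(df1)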
You can mask the data.
original data:
print(df1)
one two three
0 1 1.0 1.0
1 2 NaN 2.0
2 3 3.0 NaN
print(df2)
one two three
0 4 4 4
1 4 2 4
2 4 4 3
See below: mask replaces the values where the condition (here isna()) holds, taking the replacements from df2.
# mask values where isna()
df1[['two','three']] = df1[['two','three']]\
.mask(df1[['two','three']].isna(),df2[['two','three']])
output:
one two three
0 1 1.0 1.0
1 2 2.0 2.0
2 3 3.0 3.0
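For completeness, a self-contained version of this example; the frame construction is my reconstruction from the printed data:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'one': [1, 2, 3],
                    'two': [1.0, np.nan, 3.0],
                    'three': [1.0, 2.0, np.nan]})
df2 = pd.DataFrame({'one': [4, 4, 4],
                    'two': [4, 2, 4],
                    'three': [4, 4, 3]})

# wherever df1 is NaN, take the value from df2 at the same position
cols = ['two', 'three']
df1[cols] = df1[cols].mask(df1[cols].isna(), df2[cols])
print(df1)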
Related
I have a data frame with numeric values, such as
import pandas as pd
df = pd.DataFrame([[1, 2], [3, 4]], columns=['A', 'B'])
and I append a single row with all the column sums
totals = df.sum()
totals.name = 'totals'
df_append = df.append(totals)
Simple enough.
Here are the values of df, totals, and df_append
>>> df
A B
0 1 2
1 3 4
>>> totals
A 4
B 6
Name: totals, dtype: int64
>>> df_append
A B
0 1 2
1 3 4
totals 4 6
Unfortunately, in newer versions of pandas the method DataFrame.append is deprecated and will be removed in a future version. The advice is to replace it with pandas.concat.
Now, using pd.concat naively as follows
df_concat_bad = pd.concat([df, totals])
produces
>>> df_concat_bad
A B 0
0 1.0 2.0 NaN
1 3.0 4.0 NaN
A NaN NaN 4.0
B NaN NaN 6.0
Apparently, with df.append the Series object got interpreted as a row, but with pd.concat it got interpreted as a column.
You cannot fix this by calling pd.concat with axis=1, because that would add the totals as a column:
>>> pd.concat([df, totals], axis=1)
A B totals
0 1.0 2.0 NaN
1 3.0 4.0 NaN
A NaN NaN 4.0
B NaN NaN 6.0
(In this case, the result looks the same as using the default axis=0, because the indexes of df and totals are disjoint, as are their column names.)
How to handle this (elegantly and efficiently)?
The solution is to convert totals (a Series object) to a DataFrame with to_frame(), which yields a single column, and then transpose it with T so it becomes a row:
df_concat_good = pd.concat([df, totals.to_frame().T])
yields the desired
>>> df_concat_good
A B
0 1 2
1 3 4
totals 4 6
I prefer to use df.loc rather than pd.concat() to solve this problem:
df.loc["totals"] = df.sum()
I have a dataframe that looks like this
import pandas as pd
import numpy as np
fff = pd.DataFrame({'group': ['a','a','a','b','b','b','b','c','c'], 'value': [1,2, np.nan, 1,2,3,4, np.nan, np.nan]})
I would like to drop the NaNs by group, but only if all values inside the group are NaN. How could I do that?
Expected output:
fff = pd.DataFrame({'group': ['a','a','a','b','b','b','b'], 'value': [1,2, np.nan, 1,2,3,4]})
You can check 'value' for NaN and use groupby(...).transform('any'):
fff = fff[(~fff['value'].isna()).groupby(fff['group']).transform('any')]
Output:
group value
0 a 1.0
1 a 2.0
2 a NaN
3 b 1.0
4 b 2.0
5 b 3.0
6 b 4.0
Create a boolean Series with isna(), group it on fff['group'] and transform with 'all', then filter out (exclude) the groups for which the result is True:
c = fff['value'].isna()
fff[~c.groupby(fff['group']).transform('all')]
group value
0 a 1.0
1 a 2.0
2 a NaN
3 b 1.0
4 b 2.0
5 b 3.0
6 b 4.0
Another option:
fff["cases"] = fff.groupby("group").cumcount()
fff["null"] = fff["value"].isnull()
fff["cases 2"] = fff.groupby(["group","null"]).cumcount()
fff[~((fff["value"].isnull()) & (fff["cases"] == fff["cases 2"]))][["group","value"]]
Output:
group value
0 a 1.0
1 a 2.0
2 a NaN
3 b 1.0
4 b 2.0
5 b 3.0
6 b 4.0
An addition to the answers already provided: keep only the groups that have at least one non-NaN value, then filter the fff dataframe with the result variable (a filter-based alternative follows below).
mask = fff.groupby("group")["value"].apply(lambda s: s.notna().any())
result = mask[mask].index.tolist()
fff.query("group in @result")
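For comparison, the same filtering can be written with groupby().filter, which keeps a group's rows only when the callable returns True; this alternative is my addition, not part of the original answer:
import numpy as np
import pandas as pd

fff = pd.DataFrame({'group': ['a','a','a','b','b','b','b','c','c'],
                    'value': [1, 2, np.nan, 1, 2, 3, 4, np.nan, np.nan]})

# keep a group only if it has at least one non-NaN value
print(fff.groupby('group').filter(lambda g: g['value'].notna().any()))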
I have a df:
df = pd.DataFrame([[1,1],[3,4],[3,4]], columns=["a", 'b'])
a b
0 1 1
1 3 4
2 3 4
I have to filter this df based on a query. The query can be complex, but here I'm using a simple one:
items = [3,4]
df.query("a in #items and b == 4")
a b
1 3 4
2 3 4
I would like to add some values in new columns, but only to these rows:
configuration = {'c': 'action', "d": "non-action"}
for k, v in configuration.items():
df[k] = v
The rest of the rows should have an empty value or np.nan. So my end df should look like:
a b c d
0 1 1 np.nan np.nan
1 3 4 action non-action
2 3 4 action non-action
The issue is that the query leaves me with a copy of the dataframe, and then I have to somehow merge the two and replace the modified rows by index. How can I do this without replacing rows in the original df by index with the queried ones?
Using combine_first with assign
df.query("a in #items and b == 4").assign(**configuration).combine_first(df)
Out[138]:
a b c d
0 1.0 1.0 NaN NaN
1 3.0 4.0 action non-action
2 3.0 4.0 action non-action
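Note that combine_first upcast 'a' and 'b' to float, because aligning the two frames introduces NaN before combining. If you want to keep the original integer dtypes, a boolean mask with .loc is one alternative; this sketch is my own, not from the original answer:
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 1], [3, 4], [3, 4]], columns=['a', 'b'])
configuration = {'c': 'action', 'd': 'non-action'}
items = [3, 4]

# assign the new columns only on the matching rows; the rest stay NaN
mask = df['a'].isin(items) & (df['b'] == 4)
for k, v in configuration.items():
    df.loc[mask, k] = v
print(df)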
I have 2 data frames with identical columns. Column 'key' will have unique values.
Dataframe 1:
A B key C
0 1 k1 2
1 2 k2 3
2 3 k3 5
Dataframe 2:
A B key C
4 5 k1 2
1 2 k2 3
2 3 k4 5
I would like to update rows in Dataframe 1 with values from Dataframe 2 when the key in Dataframe 2 matches a key in Dataframe 1.
Also, if a key is new, the entire row from Dataframe 2 should be added to Dataframe 1.
The final output dataframe should look like this, with the same columns:
A B key C
4 5 k1 2 --> update
1 2 k2 3 --> no changes
2 3 k3 5 --> no changes
2 3 k4 5 --> new row
I have tried the code below. I need only the 4 columns 'A', 'B', 'key', 'C', without any suffixes after the merge.
df3 = df1.merge(df2,on='key',how='outer')
>>> df3
A_x B_x key C_x A_y B_y C_y
0 0.0 1.0 k1 2.0 4.0 5.0 2.0
1 1.0 2.0 k2 3.0 1.0 2.0 3.0
2 2.0 3.0 k3 5.0 NaN NaN NaN
3 NaN NaN k4 NaN 2.0 3.0 5.0
It seems like you're looking for combine_first.
a = df2.set_index('key')
b = df1.set_index('key')
(a.combine_first(b)
.reset_index()
.reindex(columns=df1.columns))
A B key C
0 4.0 5.0 k1 2.0
1 1.0 2.0 k2 3.0
2 2.0 3.0 k3 5.0
3 2.0 3.0 k4 5.0
try this:
df1 = {'key': ['k1', 'k2', 'k3'], 'A':[0,1,2], 'B': [1,2,3], 'C':[2,3,5]}
df1 = pd.DataFrame(data=df1)
print (df1)
df2 = {'key': ['k1', 'k2', 'k4'], 'A':[4,1,2], 'B': [5,2,3], 'C':[2,3,5]}
df2 = pd.DataFrame(data=df2)
print (df2)
df3 = pd.concat([df1, df2])
df3.drop_duplicates(subset=['key'], keep='last', inplace=True)
df3 = df3.sort_values(by=['key'], ascending=True)
print (df3)
First, set the key column as the index on both frames:
df1.set_index('key', inplace=True)
df2.set_index('key', inplace=True)
Then, combine the dataframes to get all the index keys in place (this will not overwrite the df1 values; see the combine_first documentation):
df1 = df1.combine_first(df2)
The last step is updating the values in df1 with df2 and resetting the index:
df1.update(df2)
df1.reset_index(inplace=True)
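Put together as one runnable sketch; the frame construction is my reconstruction of the question's data:
import pandas as pd

df1 = pd.DataFrame({'A': [0, 1, 2], 'B': [1, 2, 3],
                    'key': ['k1', 'k2', 'k3'], 'C': [2, 3, 5]})
df2 = pd.DataFrame({'A': [4, 1, 2], 'B': [5, 2, 3],
                    'key': ['k1', 'k2', 'k4'], 'C': [2, 3, 5]})

df1.set_index('key', inplace=True)
df2.set_index('key', inplace=True)

df1 = df1.combine_first(df2)  # adds the k4 row without touching df1's values
df1.update(df2)               # overwrites matching rows with df2's values
df1.reset_index(inplace=True)
print(df1)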
Try to concatenate and remove duplicates on the key:
df3 = pd.concat([df1, df2]).drop_duplicates(subset='key', keep='last')
This assumes both dataframes have 'key' set as the index:
df3 = df1.combine_first(df2)
df3.update(df2)
Consider two dataframes:
import numpy as np
import pandas as pd

df_a = pd.DataFrame([
    ['a', 1],
    ['b', 2],
    ['c', np.nan],
], columns=['name', 'value'])
df_b = pd.DataFrame([
    ['a', 1],
    ['b', np.nan],
    ['c', 3],
    ['d', 4]
], columns=['name', 'value'])
So they look like:
# df_a
name value
0 a 1
1 b 2
2 c NaN
# df_b
name value
0 a 1
1 b NaN
2 c 3
3 d 4
I want to merge these two dataframes and fill in the NaN values of the value column with the existing values in the other frame. In other words, I want this output:
# DESIRED RESULT
name value
0 a 1
1 b 2
2 c 3
3 d 4
Sure, I can do this with a custom .map or .apply, but I want a solution that uses merge or the like, not writing a custom merge function. How can this be done?
I think you can use combine_first:
print (df_b.combine_first(df_a))
name value
0 a 1.0
1 b 2.0
2 c 3.0
3 d 4.0
Or fillna:
print (df_b.fillna(df_a))
name value
0 a 1.0
1 b 2.0
2 c 3.0
3 d 4.0
A solution with update is less common than combine_first; note that it modifies df_b in place:
df_b.update(df_a)
print (df_b)
name value
0 a 1.0
1 b 2.0
2 c 3.0
3 d 4.0