I have two dataframes, example:
Df1 -
A B C D
x j 5 2
y k 7 3
z l 9 4
Df2 -
A B C D
z o 1 1
x p 2 1
y q 3 1
I want to subtract columns C and D of Df2 from columns C and D of Df1, matching rows on the key in column A.
I also want to ensure that column B remains untouched, for example:
Df3 -
A B C D
x j 3 1
y k 4 2
z l 8 3
I found an almost perfect answer in the following thread:
Subtracting columns based on key column in pandas dataframe
However, that answer does not cover the case where the primary df contains other columns (such as column B) that should not be used as an index or be involved in the operation.
Is somebody please able to advise?
I was originally using a loop that looks up the value in the other df and subtracts it, but this takes far too long to run with the size of data I am working with.
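For reference, the example frames can be reproduced with this sketch (assuming pandas is imported as pd):
import pandas as pd

Df1 = pd.DataFrame({'A': ['x', 'y', 'z'],
                    'B': ['j', 'k', 'l'],
                    'C': [5, 7, 9],
                    'D': [2, 3, 4]})
Df2 = pd.DataFrame({'A': ['z', 'x', 'y'],
                    'B': ['o', 'p', 'q'],
                    'C': [1, 2, 3],
                    'D': [1, 1, 1]})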
The idea is to specify the column(s) to match on and the column(s) to subtract, move all remaining column names into a MultiIndex, and then subtract:
match = ['A']
cols = ['C','D']
df1 = Df1.set_index(match + Df1.columns.difference(match + cols).tolist())
df = df1.sub(Df2.set_index(match)[cols], level=0).reset_index()
print (df)
A B C D
0 x j 3 1
1 y k 4 2
2 z l 8 3
Or set only the matching column as the index and fill the untouched column(s) back in from the original Df1:
match = ['A']
cols = ['C','D']
df1 = Df1.set_index(match)
df = df1.sub(Df2.set_index(match)[cols], level=0).reset_index().fillna(Df1)
print (df)
A B C D
0 x j 3 1
1 y k 4 2
2 z l 8 3
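An alternative sketch using an explicit merge (assuming each key in column A appears at most once in Df2) gives the same result:
m = Df1.merge(Df2[['A', 'C', 'D']], on='A', how='left', suffixes=('', '_2'))
# subtract the right-hand copies, then drop them
m[['C', 'D']] = m[['C', 'D']].to_numpy() - m[['C_2', 'D_2']].to_numpy()
df = m.drop(columns=['C_2', 'D_2'])
print (df)
   A  B  C  D
0  x  j  3  1
1  y  k  4  2
2  z  l  8  3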
I am trying to merge two dataframes and replace the nan in the left df with the right df, I can do it with three lines of code as below, but I want to know if there is a better/shorter way?
# Example data (my actual df is ~500k rows x 11 cols)
df1 = pd.DataFrame({'a': [1,2,3,4], 'b': [0,1,np.nan, 1], 'e': ['a', 1, 2,'b']})
df2 = pd.DataFrame({'a': [1,2,3,4], 'b': [np.nan, 1, 0, 1]})
# Merge the dataframes...
df = df1.merge(df2, on='a', how='left')
# Fillna in 'b' column of left df with right df...
df['b'] = df['b_x'].fillna(df['b_y'])
# Drop the columns no longer needed
df = df.drop(['b_x', 'b_y'], axis=1)
The problem is that both dataframes have a 'b' column, with NaNs in mismatched places, so a plain merge produces the unwanted duplicate columns 'b_x' and 'b_y'. You want to avoid that in the first place:
slice the non-shared columns 'a', 'e' from df1
merge that with df2 using how='left'; this picks up 'b' from the right dataframe (since 'b' now only exists in the right df)
finally call df1.update(...); this overwrites df1, filling the NaNs in its 'b' column with the non-NaN values from the merged frame
Solution:
df1.update(df1[['a', 'e']].merge(df2, 'left'))
df1
a b e
0 1 0.0 a
1 2 1.0 1
2 3 0.0 2
3 4 1.0 b
Note: Because I used merge(..., how='left'), I preserve the row order of the calling dataframe. If my df1 had values of a that were not in order, say:
a b e
0 1 0.0 a
1 2 1.0 1
2 4 1.0 b
3 3 NaN 2
The result would be
df1.update(df1[['a', 'e']].merge(df2, 'left'))
df1
a b e
0 1 0.0 a
1 2 1.0 1
2 4 1.0 b
3 3 0.0 2
Which is as expected.
Further...
If you want to be more explicit when there may be more columns involved
df1.update(df1.drop('b', axis=1).merge(df2, 'left', 'a'))
Even Further...
If you don't want to update the dataframe, we can use combine_first
Quick
df1.combine_first(df1[['a', 'e']].merge(df2, 'left'))
Explicit
df1.combine_first(df1.drop('b', axis=1).merge(df2, 'left', 'a'))
EVEN FURTHER!...
The 'left' merge may preserve order but NOT the index. This is the ultra conservative approach:
df3 = df1.drop('b', axis=1).merge(df2, 'left', on='a').set_index(df1.index)
df1.combine_first(df3)
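To illustrate why re-attaching the index matters, here is a sketch with a hypothetical non-default index on df1 (same data as the question, pd/np imports assumed). merge returns a fresh 0..n-1 index, so without set_index(df1.index) the combine_first would mis-align the rows:
# hypothetical: df1 carrying a non-default index
df1 = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [0, 1, np.nan, 1], 'e': ['a', 1, 2, 'b']},
                   index=[10, 11, 12, 13])
df2 = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [np.nan, 1, 0, 1]})

df3 = df1.drop('b', axis=1).merge(df2, 'left', on='a').set_index(df1.index)
print(df1.combine_first(df3))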
Short version
df1['b'] = df1['b'].fillna(df1['a'].map(df2.set_index('a')['b']))
df1
Out[173]:
a b e
0 1 0.0 a
1 2 1.0 1
2 3 0.0 2
3 4 1.0 b
Since you mentioned there will be multiple columns
df = df1.combine_first(df1[['a']].merge(df2, on='a', how='left'))
df
Out[184]:
a b e
0 1 0.0 a
1 2 1.0 1
2 3 0.0 2
3 4 1.0 b
We can also pass a DataFrame to fillna:
df1.fillna(df1[['a']].merge(df2, on='a', how='left'))
Out[185]:
a b e
0 1 0.0 a
1 2 1.0 1
2 3 0.0 2
3 4 1.0 b
Only if the indices are aligned (important note) can we use update:
df1.update(df2[['b']])
a b e
0 1 0.0 a
1 2 1.0 1
2 3 0.0 2
3 4 1.0 b
Or simply fillna:
df1['b'] = df1['b'].fillna(df2['b'])
If your indices are not aligned, see WenNYoBen's answer or the comment underneath.
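If they are not aligned, one option (a sketch, assuming 'a' uniquely identifies rows in both frames) is to align on the key first and then fill:
left = df1.set_index('a')
left['b'] = left['b'].fillna(df2.set_index('a')['b'])  # align on 'a', then fill the gaps
df1 = left.reset_index()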
You can mask the data.
Original data:
print(df1)
one two three
0 1 1.0 1.0
1 2 NaN 2.0
2 3 3.0 NaN
print(df2)
one two three
0 4 4 4
1 4 2 4
2 4 4 3
See below; mask simply replaces the values where the condition (here isna()) is True with the corresponding values from df2.
# mask values where isna()
df1[['two','three']] = df1[['two','three']]\
.mask(df1[['two','three']].isna(),df2[['two','three']])
output:
one two three
0 1 1.0 1.0
1 2 2.0 2.0
2 3 3.0 3.0
I have a df:
df = pd.DataFrame([[1,1],[3,4],[3,4]], columns=["a", 'b'])
a b
0 1 1
1 3 4
2 3 4
I have to filter this df based on a query. The query can be complex, but here I'm using a simple one:
items = [3,4]
df.query("a in @items and b == 4")
a b
1 3 4
2 3 4
Only for these rows I would like to add some values in new columns:
configuration = {'c': 'action', "d": "non-action"}
for k, v in configuration.items():
df[k] = v
The rest of the rows should have an empty value or np.nan. So my end df should look like:
a b c d
0 1 1 np.nan np.nan
1 3 4 action non-action
2 3 4 action non-action
The issue is that the query gives me a copy of the dataframe, and I then have to somehow merge the two and put the modified rows back by index. Is there a way to do this without manually replacing the rows in the original df by index with the queried ones?
Using combine_first with assign
df.query("a in @items and b == 4").assign(**configuration).combine_first(df)
Out[138]:
a b c d
0 1.0 1.0 NaN NaN
1 3.0 4.0 action non-action
2 3.0 4.0 action non-action
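If the float upcasting of a and b is unwanted, an alternative sketch (not from the original answer) assigns in place via .loc on the queried index, using the same df, items and configuration as above:
idx = df.query("a in @items and b == 4").index
for k, v in configuration.items():
    df.loc[idx, k] = v  # rows outside idx get NaN in the new columns
print(df)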
I have two dataframes that look like this:
df1: condition
A
A
A
B
B
B
B
df2: condition value
A 1
B 2
I would like to assign to each condition its value, adding a column to df1 in order to obtain:
df1: condition value
A 1
A 1
A 1
B 2
B 2
B 2
B 2
How can I do this? Thank you in advance!
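For anyone wanting to run the answers below, a sketch of the example data (assuming pandas imported as pd):
import pandas as pd

df1 = pd.DataFrame({'condition': ['A', 'A', 'A', 'B', 'B', 'B', 'B']})
df2 = pd.DataFrame({'condition': ['A', 'B'], 'value': [1, 2]})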
Use map with a Series created by set_index if you only need to append one column:
df1['value'] = df1['condition'].map(df2.set_index('condition')['value'])
print (df1)
condition value
0 A 1
1 A 1
2 A 1
3 B 2
4 B 2
5 B 2
6 B 2
Or use merge with a left join if df2 has more columns:
df = df1.merge(df2, on='condition', how='left')
print (df)
condition value
0 A 1
1 A 1
2 A 1
3 B 2
4 B 2
5 B 2
6 B 2
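If df2 could ever contain duplicate keys in condition, merge's validate parameter (a hedged addition, not part of the original answer) makes that assumption explicit:
# raises pandas.errors.MergeError if 'condition' is not unique in df2
df = df1.merge(df2, on='condition', how='left', validate='m:1')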
I have a dataframe df something like this
A B C
1 'x' 15.0
2 'y' NA
3 'z' 25.0
and a dictionary dc something like
dc = {'x':15,'y':35,'z':25}
I want to fill all nulls in column C of the dataframe by looking up the value of column B in the dictionary, so that my dataframe becomes:
A B C
1 'x' 15
2 'y' 35
3 'z' 25
Could anyone please tell me how to do that?
thanks,
Manoj
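A sketch of the setup, with the column dtypes assumed from the tables above:
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, 2, 3],
                   'B': ['x', 'y', 'z'],
                   'C': [15.0, np.nan, 25.0]})
dc = {'x': 15, 'y': 35, 'z': 25}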
You can use fillna with map:
dc = {'x':15,'y':35,'z':25}
df['C'] = df.C.fillna(df.B.map(dc))
df
# A B C
#0 1 x 15.0
#1 2 y 35.0
#2 3 z 25.0
Another option, with np.where (assuming numpy is imported as np):
df['C'] = np.where(df['C'].isnull(), df['B'].apply(lambda x: dc[x]), df['C'])